CN105183722A - Chinese-English bilingual translation corpus alignment method - Google Patents
- Publication number
- CN105183722A (application CN201510592410.4A)
- Authority
- CN
- China
- Prior art keywords
- original
- translated document
- section
- translated
- paragraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for aligning Chinese-English bilingual translation corpora. The method comprises: step 1, acquiring an original file and the corresponding translated file; step 2, dividing the original file and the translated file into paragraphs; step 3, dividing any paragraph of the original file and the corresponding paragraph of the translated file into sentences, aligning the resulting sentences by rule, and associating them; step 4, traversing the paragraph, and proceeding directly to step 6 if the paragraph of the original file and the corresponding paragraph of the translated file contain the same number of sentences, or directly to step 5 if the numbers differ; step 5, merging certain sentences in the original file and/or the translated file; step 6, selecting another undivided paragraph of the original file and the corresponding paragraph of the translated file, and applying steps 3 to 5 to it; step 7, repeating step 6 until all paragraphs are processed; step 8, exporting the aligned file. The method enables existing translated files to be reused.
Description
Technical field
The present invention relates to the field of translation technology and, more specifically, to a method for aligning Chinese-English bilingual translation corpora.
Background
With the continuous progress of science and technology, international exchange is becoming more frequent, the world economy more open, and globalization deeper. Translation of documents between languages is also increasing, especially between English and Chinese. Translated documents touch every aspect of life, in fields such as trade, law, electronics, communications, computing, machinery, chemical engineering, petroleum, medicine, and food.
Translation is a service industry, and a service industry must always be customer-oriented. As translation volumes and document word counts keep growing, improving translation speed to meet client demand is very important. The popularity of CAT (computer-aided translation) technology has greatly improved translation speed. Existing methods for splitting and distributing translation files avoid translating identical paragraphs repeatedly and so improve efficiency to some extent. However, such methods only remove repeated segments within a single file, and since few paragraphs repeat within one file, they do not substantially improve translation efficiency. Meanwhile, the stock of existing translated documents keeps growing, and with it the number of repeated paragraphs, so reusing existing translated documents to improve translation speed is very important.
Summary of the invention
To solve the above technical problem, the present invention provides a method for aligning Chinese-English bilingual translation corpora, which enables existing translated documents to be reused and improves translation efficiency.
The technical solution adopted by the present invention to solve the above problem is:
A method for aligning Chinese-English bilingual translation corpora, characterized by comprising:
Step 1, obtaining an original document and the corresponding translated document;
Step 2, splitting the original document and the translated document into paragraphs;
Step 3, splitting any paragraph of the original document and the corresponding paragraph of the translated document into sentences, aligning the split sentences by rule, and associating them;
Step 4, traversing the paragraph: if the original document and the translated document have the same number of sentences in this paragraph, jumping directly to step 6; if the numbers differ, jumping directly to step 5;
Step 5, merging certain sentences in the original document and/or the translated document;
Step 6, choosing another paragraph of the original document that has not yet been split, together with the corresponding paragraph of the translated document, and processing it according to steps 3 to 5;
Step 7, repeating step 6 until all paragraphs have been processed;
Step 8, exporting the aligned file.
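Steps 2 to 4 above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the source does not specify the splitting rules, so the sketch assumes one paragraph per non-empty line and a sentence boundary after any of the punctuation marks listed in the pattern. The function names are hypothetical.

```python
import re

# Assumed sentence pattern: a run of non-terminal characters followed by
# Chinese or English sentence-final punctuation. The source does not state
# which delimiters its rule uses, and this naive rule also breaks at
# decimal points and abbreviations.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")


def split_paragraphs(text):
    """Step 2 (assumed rule): one paragraph per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_sentences(paragraph):
    """Step 3 (assumed rule): cut after sentence-final punctuation."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


def align_document(original_text, translated_text):
    """Pair sentences paragraph by paragraph (steps 3 and 4).

    Returns (aligned, needs_merge). Paragraphs whose sentence counts
    differ are not paired here; they are handed on to step 5.
    """
    aligned, needs_merge = [], []
    paragraph_pairs = zip(split_paragraphs(original_text),
                          split_paragraphs(translated_text))
    for index, (orig_par, trans_par) in enumerate(paragraph_pairs):
        orig = split_sentences(orig_par)
        trans = split_sentences(trans_par)
        if len(orig) == len(trans):
            aligned.extend(zip(orig, trans))   # equal counts: pair by position
        else:
            needs_merge.append((index, orig, trans))
    return aligned, needs_merge
```

Pairing by position is the simplest reading of "aligning by rule"; a real system would likely need a richer rule set for quotations, abbreviations, and numbered lists.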
The present invention analyzes and splits documents on the basis of existing completed translations, and exports the original document and the translated document according to rules to generate a standard corpus document. This solves the problem of reusing corpus content, improves translation speed, greatly speeds up the production process, shortens corpus file generation time, and improves efficiency. After paragraph and sentence splitting, the original document and its corresponding translated document yield associated sentence pairs, which are exported according to rules so that later translation work can reuse them. In Chinese-English translation, one-to-two and two-to-one sentence correspondences occur; when associating sentences, the affected sentences are merged so that they meet translation requirements, which strengthens the sentence associations and improves translation quality. The method converts raw material into processed material: the original document and its corresponding translated document are converted into a TMX file that can be used directly. The smallest unit in the translation process is the sentence, not the paragraph. When processing a corpus, it is therefore preferable to work at sentence level, so that a sentence can be reused directly or with only minor modification.
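As an illustration of the TMX export mentioned above, the following Python sketch writes aligned sentence pairs as TMX translation units using only the standard library. The language codes, the tool name, and the other header values are assumptions chosen for the example; the header attributes follow a reading of TMX 1.4 and should be checked against the specification before real use.

```python
import xml.etree.ElementTree as ET

# Clark-notation name for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="zh-CN", tgt_lang="en-US"):
    """Serialize (original, translation) sentence pairs as a TMX string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",                # the sentence is the unit
        "o-tmf": "plain",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for original, translation in pairs:
        unit = ET.SubElement(body, "tu")      # one translation unit per pair
        for lang, text in ((src_lang, original), (tgt_lang, translation)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Each aligned pair becomes one `tu` element with two `tuv` variants, which matches the point above that the sentence, not the paragraph, is the unit of reuse.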
Preferably, because sentences in Chinese-English translation may stand in a two-to-one correspondence, step 5 handles this case as follows: search the original document and the translated document for sentences in a two-to-one correspondence, merge the two corresponding sentences in the original document, then move the split sentences that follow the merged sentence in the original document up in sequence and adjust the associations.
Preferably, because sentences in Chinese-English translation may stand in a one-to-two correspondence, step 5 handles this case as follows: search the original document and the translated document for sentences in a one-to-two correspondence, merge the two corresponding sentences in the translated document, then move the split sentences that follow the merged sentence in the translated document up in sequence and adjust the associations.
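The two merge cases can be sketched as list operations in Python. The sketch assumes that a paragraph is held as two parallel lists of sentences, that association is by list position, and that the position of the mismatch is supplied by the caller, since the source does not describe how the correspondence is detected. All names are hypothetical.

```python
def merge_adjacent(sentences, index, joiner=""):
    """Merge sentences[index] with its successor.

    Every later sentence moves up one position, which is how the
    associations with the other list get adjusted.
    """
    if not 0 <= index < len(sentences) - 1:
        raise IndexError("index must point at a sentence that has a successor")
    merged = sentences[index] + joiner + sentences[index + 1]
    return sentences[:index] + [merged] + sentences[index + 2:]


def resolve_two_to_one(original, translated, index):
    """Two original sentences match one translated sentence: merge in the original."""
    return merge_adjacent(original, index), translated


def resolve_one_to_two(original, translated, index):
    """One original sentence matches two translated sentences: merge in the translation."""
    return original, merge_adjacent(translated, index, joiner=" ")
```

After a merge the two lists have moved one step closer in length, so the count check of step 4 can be repeated until the lists pair up by position.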
Preferably, step 4 further comprises: traversing the whole paragraph, searching for specific terms and checking how each is translated, and checking whether the translation is the designated translation; if not, replacing it with the designated translation.
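A minimal Python sketch of this terminology check follows. The source does not say how a non-designated translation is recognized, so the sketch assumes a hypothetical glossary that maps each term in the original to its designated translation plus a list of known variants to be replaced.

```python
def enforce_terminology(pairs, glossary):
    """Replace known variant translations with the designated translation.

    pairs:    list of (original_sentence, translated_sentence)
    glossary: {term: (designated_translation, [variant, ...])}  (assumed shape)
    """
    fixed = []
    for original, translated in pairs:
        for term, (designated, variants) in glossary.items():
            # Only act when the term occurs and the designated
            # translation is missing from the translated sentence.
            if term in original and designated not in translated:
                for variant in variants:
                    translated = translated.replace(variant, designated)
        fixed.append((original, translated))
    return fixed
```

Plain substring replacement is the simplest possible choice; it ignores inflection and word boundaries, which a production glossary check would need to handle.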
In summary, the beneficial effects of the invention are as follows:
On the basis of existing completed translations, the method exports the original document and the corresponding translated document according to rules and converts them into a TMX file that can be used directly, generating a standard corpus document. This solves the problem of reusing corpus content, improves translation speed, greatly speeds up the production process, shortens corpus file generation time, and improves efficiency.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Embodiment
A method for aligning Chinese-English bilingual translation corpora, comprising:
Step 1, obtaining an original document and the corresponding translated document;
Step 2, splitting the original document and the translated document into paragraphs;
Step 3, splitting any paragraph of the original document and the corresponding paragraph of the translated document into sentences, aligning the split sentences by rule, and associating them;
Step 4, traversing the paragraph: if the original document and the translated document have the same number of sentences in this paragraph, jumping directly to step 6; if the numbers differ, jumping directly to step 5;
Step 5, merging certain sentences in the original document and/or the translated document;
Step 6, choosing another paragraph of the original document that has not yet been split, together with the corresponding paragraph of the translated document, and processing it according to steps 3 to 5;
Step 7, repeating step 6 until all paragraphs have been processed;
Step 8, exporting the aligned file.
For the two-to-one case, step 5 comprises: searching the original document and the translated document for sentences in a two-to-one correspondence, merging the two corresponding sentences in the original document, then moving the split sentences that follow the merged sentence in the original document up in sequence and adjusting the associations.
For the one-to-two case, step 5 comprises: searching the original document and the translated document for sentences in a one-to-two correspondence, merging the two corresponding sentences in the translated document, then moving the split sentences that follow the merged sentence in the translated document up in sequence and adjusting the associations.
Step 4 further comprises: traversing the whole paragraph, searching for specific terms and checking how each is translated, and checking whether the translation is the designated translation; if not, replacing it with the designated translation.
The above is illustrated below with a concrete example.
The Chinese original is:
I like doing housework, and I especially love washing the dishes. But once I broke a bowl into pieces.
While clearing up the fragments, I carelessly hurt my finger.
The English translated document is:
I like to do housework, especially love to wash dishes. But once I broke my bowl.
When I was in cleaning up crumbs, accidentally cut my finger.
During alignment, the original document and the translated document are first split according to their paragraph structure. The original document is divided into two paragraphs. The first paragraph is: I like doing housework, and I especially love washing the dishes. But once I broke a bowl into pieces. The corresponding first paragraph of the translated document is: I like to do housework, especially love to wash dishes. But once I broke my bowl.
The second paragraph of the original document is: While clearing up the fragments, I carelessly hurt my finger. The corresponding second paragraph of the translated document is: When I was in cleaning up crumbs, accidentally cut my finger.
Taking the first paragraph as an example, the first paragraphs of the original document and the translated document are split into sentences and associated. In the original document, "I like doing housework, and I especially love washing the dishes." corresponds to the translated sentence "I like to do housework, especially love to wash dishes.", and "But once I broke a bowl into pieces." corresponds to the translated sentence "But once I broke my bowl." The sentence correspondences are then checked. This paragraph contains no one-to-two or two-to-one case, so the second paragraph is processed directly. Finally, the aligned corpus is exported.
Once translated files have been aligned into a corpus, the translation process only needs to run a relevance search on the sentence to be translated, which makes it easy to call up the aligned corpus and improves translation efficiency. For example, suppose the sentence to be translated is: I like doing housework, and I especially love washing the dishes. During translation, the parallel corpus is searched directly by matching degree. When the search finds an exact match for "I like doing housework, and I especially love washing the dishes." in the parallel corpus, its translated sentence is called up directly, which greatly shortens the translation cycle.
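Retrieval by matching degree could look like the following Python sketch. The source does not define how the matching degree is computed; this example substitutes the similarity ratio from the standard library's `difflib.SequenceMatcher`, and the 0.75 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def best_match(query, memory, threshold=0.75):
    """Return (score, original, translation) for the closest stored sentence.

    memory is a list of (original, translation) pairs from the aligned
    corpus. Returns None when nothing reaches the threshold.
    """
    best = None
    for original, translation in memory:
        score = SequenceMatcher(None, query, original).ratio()
        if best is None or score > best[0]:
            best = (score, original, translation)
    if best is not None and best[0] >= threshold:
        return best
    return None
```

An exact match scores 1.0, so its stored translation can be reused as is; lower scores mark candidates that a translator would still need to revise.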
In steps 3 to 7, paragraphs may be split and aligned in arbitrary order, or one after another in paragraph order, that is, the first paragraph, the second paragraph, the third paragraph, and so on until the last paragraph.
As described above, the present invention can be implemented well.
Claims (4)
1. A method for aligning Chinese-English bilingual translation corpora, characterized by comprising:
Step 1, obtaining an original document and the corresponding translated document;
Step 2, splitting the original document and the translated document into paragraphs;
Step 3, splitting any paragraph of the original document and the corresponding paragraph of the translated document into sentences, aligning the split sentences by rule, and associating them;
Step 4, traversing the paragraph: if the original document and the translated document have the same number of sentences in this paragraph, jumping directly to step 6; if the numbers differ, jumping directly to step 5;
Step 5, merging certain sentences in the original document and/or the translated document;
Step 6, choosing another paragraph of the original document that has not yet been split, together with the corresponding paragraph of the translated document, and processing it according to steps 3 to 5;
Step 7, repeating step 6 until all paragraphs have been processed;
Step 8, exporting the aligned file.
2. The method for aligning Chinese-English bilingual translation corpora according to claim 1, characterized in that step 5 comprises: searching the original document and the translated document for sentences in a two-to-one correspondence, merging the two corresponding sentences in the original document, then moving the split sentences that follow the merged sentence in the original document up in sequence and adjusting the associations.
3. The method for aligning Chinese-English bilingual translation corpora according to claim 1, characterized in that step 5 comprises: searching the original document and the translated document for sentences in a one-to-two correspondence, merging the two corresponding sentences in the translated document, then moving the split sentences that follow the merged sentence in the translated document up in sequence and adjusting the associations.
4. The method for aligning Chinese-English bilingual translation corpora according to claim 1, characterized in that step 4 further comprises: traversing the whole paragraph, searching for specific terms and checking how each is translated, and checking whether the translation is the designated translation; if not, replacing it with the designated translation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510592410.4A CN105183722A (en) | 2015-09-17 | 2015-09-17 | Chinese-English bilingual translation corpus alignment method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510592410.4A CN105183722A (en) | 2015-09-17 | 2015-09-17 | Chinese-English bilingual translation corpus alignment method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105183722A true CN105183722A (en) | 2015-12-23 |
Family
ID=54905811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510592410.4A Pending CN105183722A (en) | 2015-09-17 | 2015-09-17 | Chinese-English bilingual translation corpus alignment method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105183722A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106775338A (en) * | 2016-12-23 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method and system by pulling alignment language material |
CN106775339A (en) * | 2016-12-26 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method and system that adjustment language material position is clicked on by pulling |
CN106802753A (en) * | 2016-12-21 | 2017-06-06 | 语联网(武汉)信息技术有限公司 | A kind of language material alignment schemes and system |
CN107436865A (en) * | 2016-05-25 | 2017-12-05 | 阿里巴巴集团控股有限公司 | A kind of word alignment training method, machine translation method and system |
CN110807337A (en) * | 2019-11-01 | 2020-02-18 | 北京中献电子技术开发有限公司 | Patent double sentence pair processing method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002001401A1 (en) * | 2000-06-26 | 2002-01-03 | Onerealm Inc. | Method and apparatus for normalizing and converting structured content |
CN101706777A (en) * | 2009-11-10 | 2010-05-12 | 中国科学院计算技术研究所 | Method and system for extracting resequencing template in machine translation |
CN102043773A (en) * | 2009-10-20 | 2011-05-04 | 张龙哺 | Method and device for forming modularized bilingual sentence pairs |
CN102622340A (en) * | 2012-03-28 | 2012-08-01 | 成都优译信息技术有限公司 | Translated file splitting and distributing method |
CN103530284A (en) * | 2013-09-22 | 2014-01-22 | 中国专利信息中心 | Short sentence segmenting device, machine translation system and corresponding segmenting method and translation method |
CN104360996A (en) * | 2014-11-27 | 2015-02-18 | 武汉传神信息技术有限公司 | Sentence alignment method of bilingual text |
Non-Patent Citations (1)
Title |
---|
薛松: "汉英平行语料库中名词短语对齐算法的研究", 《中国优秀硕博士学位论文全文数据库(硕士)信息科技辑》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107436865A (en) * | 2016-05-25 | 2017-12-05 | 阿里巴巴集团控股有限公司 | A kind of word alignment training method, machine translation method and system |
CN107436865B (en) * | 2016-05-25 | 2020-10-16 | 阿里巴巴集团控股有限公司 | Word alignment training method, machine translation method and system |
CN106802753A (en) * | 2016-12-21 | 2017-06-06 | 语联网(武汉)信息技术有限公司 | A kind of language material alignment schemes and system |
CN106775338A (en) * | 2016-12-23 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method and system by pulling alignment language material |
CN106775339A (en) * | 2016-12-26 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method and system that adjustment language material position is clicked on by pulling |
CN110807337A (en) * | 2019-11-01 | 2020-02-18 | 北京中献电子技术开发有限公司 | Patent double sentence pair processing method and system |
CN110807337B (en) * | 2019-11-01 | 2021-11-12 | 北京中献电子技术开发有限公司 | Patent double sentence pair processing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 610000 B, building 4, building 200, Tianfu five street, Chengdu hi tech Zone, Sichuan; Applicant after: Chengdu excellent translation information technology Limited by Share Ltd; Address before: 610000, No. 1, building 107, 1 West Bauhinia Road, Chengdu hi tech Zone, Sichuan, 6; Applicant before: Chengdu Urelite Information technology Co., Ltd. |
COR | Change of bibliographic data | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20151223 |