CN108733658A

CN108733658A - Institution term Chinese-English translation method

Info

Publication number: CN108733658A
Application number: CN201710779839.3A
Authority: CN
Inventors: 李斌; 杨建华; 汤诗华; 钱丰收; 马宁
Original assignee: Anhui Radio And Television University
Current assignee: Anhui Radio And Television University
Priority date: 2017-09-01
Filing date: 2017-09-01
Publication date: 2018-11-02

Abstract

The invention discloses a Chinese-English translation method for organization names. The specific steps are: obtain the expanded query set corresponding to the organization name entity; retrieve web resources with a new query that includes the expanded set, obtaining mixed bilingual summary resources; extract candidate translations of the organization name entity from the mixed bilingual summary resources and rank them by confidence; and obtain the translation result. The query expansion combines two methods: query construction from entity translation results, and expansion with the translations of co-occurring topic words. When the translation model is built, a greedy algorithm is used to obtain the optimal alignment of each translation pair, which improves the accuracy and efficiency of the subsequent chunk extraction and chunk translation probability calculation. The invention takes the internal structure of organization names into account and builds the translation model with the chunk as the translation unit, focusing on candidate chunk extraction, probability calculation, and a context-free translation decoding algorithm, which reduces the time complexity of translation and improves its accuracy and efficiency.

Description

Chinese-English translation method for organization names

Technical field

The present invention relates to the field of language translation, and in particular to a Chinese-English translation method for organization names.

Background art

Compared with named entities such as person names and place names, organization names have a more complex structure, because an organization name may itself contain person names, place names, or even another organization name. Organization names are usually translated by combining transliteration and free translation, and because of their complex structure a certain amount of word reordering is also required. Translating organization names therefore has to solve not only the problems inherent in ordinary machine translation, such as word selection and word reordering, but also the problem of transliteration and of combining transliteration with free translation. The translation of organization names thus remains a difficult and challenging problem in natural language processing.

At present, research on organization name translation based on local translation models is relatively thorough and mature. Statistical transliteration models solve, to some extent, the transliteration of names that follow transliteration rules, but they are helpless for names that only partly follow those rules or do not follow them at all. Phrase-based, context-sensitive organization name models are improvements on conventional machine translation models; they do not consider the internal structure of organization names and have high time complexity. Translation models that handle the whole organization name (transliteration and free translation together) are not yet mature and have been little studied, so further research is needed.

Summary of the invention

To solve the above technical problems, the present invention proposes a Chinese-English translation method for organization names, with the aim of translating organization names more accurately.

To achieve the above aim, the technical solution of the present invention is as follows:

A Chinese-English translation method for organization names, comprising the following steps:

Step 1: Obtain the expanded query set corresponding to the organization name entity;

Step 2: Retrieve web resources with a new query that includes the expanded set, obtaining mixed bilingual summary resources;

Step 3: Extract candidate translations of the organization name entity from the mixed bilingual summary resources and rank them by confidence;

Step 4: Obtain the translation result.

Preferably, the expanded query set in Step 1 comprises a query constructed from the organization name entity translation results and an expansion query built from the translations of co-occurring topic words.

The query constructed from the organization name entity translation results is obtained by the following steps: build organization name translation pairs; align each translation pair internally; extract chunks according to the calculated translation confidence; generate the chunk-based organization name translation model; and extract the effective information from the translation results.

The expansion query built from co-occurring topic word translations is obtained as follows: submit the source query to a search engine to obtain source-language summaries that contain the source query; use TF-IDF to extract from those summaries the topic words that co-occur with the source query; then look up the translations of these topic words in a bilingual dictionary. These translations form the final expanded set of this method.
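The co-occurring topic word expansion can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the pre-segmented snippets, the toy bilingual dictionary, and the particular TF-IDF weighting (term frequency in snippets containing the query, multiplied by the log inverse document frequency over all snippets) are all assumptions made for the example.

```python
import math
from collections import Counter

def expand_with_topic_words(source_query, snippets, bilingual_dict, top_k=2):
    """Return translations of the top_k topic words that co-occur with source_query.

    snippets: list of token lists (assumed to be word-segmented already).
    bilingual_dict: hypothetical dict mapping a source word to its translations.
    """
    n_docs = len(snippets)
    # Document frequency over every snippet (the IDF part).
    df = Counter(word for doc in snippets for word in set(doc))
    # Term frequency counted only in snippets that contain the source query.
    tf = Counter(
        word
        for doc in snippets if source_query in doc
        for word in doc if word != source_query
    )
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    # Highest score first; ties are broken alphabetically for determinism.
    topic_words = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    expansion = []
    for word in topic_words:
        expansion.extend(bilingual_dict.get(word, []))
    return expansion

# Toy data, invented for illustration only.
query = "安徽大学"
snippets = [
    [query, "大学", "教育", "的"],
    [query, "教育", "的"],
    [query, "教育"],
    ["天气", "的"],
    ["体育", "的"],
]
toy_dict = {"大学": ["university"], "教育": ["education"]}
print(expand_with_topic_words(query, snippets, toy_dict))
```

The frequent function word "的" receives a low score because it appears in almost every snippet, which is the effect the IDF factor is meant to have.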

Preferably, the internal alignment comprises the following steps: use GIZA++, the word alignment tool commonly used in machine translation, to word-align the Chinese-English translation pairs of organization names in both directions (Chinese-English and English-Chinese), and take the intersection of the two alignment results as the alignment anchors; extract candidate strings by extending from each alignment anchor to the left and to the right until the next alignment anchor, the current anchor plus the extended words forming a candidate string; calculate the translation confidence of the bilingual candidate strings; and, for each named entity translation pair, obtain the optimal alignment with a greedy algorithm.

Preferably, the translation confidence is calculated by scoring the obtained translation segments with a method similar to TF-IDF; for a given Chinese string o and English string e, the translation confidence is calculated as follows:

Preferably, chunk extraction uses a context-free translation decoding algorithm. An organization name is divided into three parts: the keyword part, the region or scope modifier part, and the other modifier part. The aligned organization name entity pairs are first split into these three parts, and for the first two kinds of part the derivation position within the whole named entity is retained, which yields a series of derivation rules with corresponding confidences. The translation of a given named entity comprises: chunk splitting, in which the given organization name is split into three parts; and entity derivation translation, in which the parts are translated in the order region or scope modifier part, keyword part, other modifier part. If a part of some kind does not occur in the training corpus, it is translated by conventional machine translation combined with transliteration.

Preferably, the greedy algorithm obtains the optimal alignment as follows: for a specific named entity pair, extract all candidate pairs {c, e} that the pair contains; sort them by score in descending order and store them in the set scoreArray; remove the first element {cc, ee} from scoreArray and update the alignment of the named entity pair according to {cc, ee}; delete all {cc, *} and {*, ee} from scoreArray; repeat the descending sort until scoreArray is empty; the best alignment of the named entity pair is then obtained.

Preferably, the extraction in Step 3 combines a frequency change measure with adjacency information to extract candidate translation strings; the translation similarity, co-occurrence information, length information, and transliteration information between each candidate translation string and the entity to be translated are calculated separately; the feature scores are considered together, the candidates are sorted by combined score, and the ranked translations are output.

The invention has the following advantages:

(1) The present invention studies in depth the combination of a translation model with web-based translation extraction for Chinese-English organization names, and realizes a high-performance organization name translation system that combines translation with the web. The system can mine all possible translations of a given organization name and calculate their confidence, extract the web pages that contain each translation for the user to read, let the user make the final correction of the translation result, and build a Chinese-English organization name translation dictionary on this basis.

(2) The present invention uses a greedy algorithm to obtain the optimal alignment, which improves the accuracy and efficiency of the subsequent chunk extraction and chunk translation probability calculation.

(3) The present invention takes the internal structure of organization names into account and builds the translation model with the chunk as the translation unit, focusing on candidate chunk extraction, probability calculation, and a context-free translation decoding algorithm, which reduces the time complexity of translation and improves its accuracy and efficiency.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below.

Fig. 1 is a flow chart of the translation method disclosed in an embodiment of the present invention;

Fig. 2 is a flow chart of building the translation model disclosed in an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments.

The present invention provides a Chinese-English translation method for organization names. Its working principle is to combine a translation model with web-based translation extraction, realizing a high-performance organization name translation system that combines translation with the web and achieving accurate and efficient translation of organization names.

The present invention is described in further detail below with reference to the embodiments.

As shown in Fig. 1 and Fig. 2, the invention is implemented by the following steps:

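Step 1: Obtain the expanded query set corresponding to the organization name entity;

Step 2: Retrieve web resources with a new query that includes the expanded set, obtaining mixed bilingual summary resources;

Step 3: Extract candidate translations of the organization name entity from the mixed bilingual summary resources and rank them by confidence;

Step 4: Obtain the translation result.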

The expanded query set in Step 1 comprises a query constructed from the organization name entity translation results and an expansion query built from the translations of co-occurring topic words.

A study of the structure and translation characteristics of existing organization names shows that the words inside a Chinese organization name are all content words, each translated into one or more English words, and that apart from "of", "with", "the", "and", and "for", the words in an English organization name are also all content words. The word alignment inside an organization name exhibits a block-wise structure. An internal word alignment method for organization names is therefore established that extends to the left and right of alignment anchors; the key problems to solve are the probability calculation for the internally aligned strings and the selection of the globally optimal alignment.

First, GIZA++, the word alignment tool commonly used in machine translation, is used to word-align the Chinese-English translation pairs of organization names in both directions (Chinese-English and English-Chinese). In English-Chinese alignment, GIZA++ allows each Chinese word (assuming the text has been word-segmented) to correspond to at most one English word; likewise, in the opposite direction each English word may correspond to only one Chinese word. An alignment anchor is a Chinese word and an English word that are aligned to each other in both directions. Second, the method proposed by the present invention is used to optimize the word alignment result of the first step. The method comprises the following steps:

Step 1: Use GIZA++, the word alignment tool commonly used in machine translation, to word-align the Chinese-English translation pairs of organization names in both directions (Chinese-English and English-Chinese), and obtain the alignment anchors from the intersection of the two alignment results;

Step 2: Extract candidate strings: from each alignment anchor, extend to the left and to the right until the next alignment anchor; the current anchor plus the extended words forms a candidate string;

Step 3: Calculate the translation confidence of the bilingual candidate strings;

Step 4: For each named entity translation pair, obtain the optimal alignment with a greedy algorithm.
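Steps 1 and 2 above can be sketched as follows. The sketch assumes the two GIZA++ outputs have already been parsed into sets of (Chinese index, English index) pairs with the same orientation; the function names and the toy alignment are hypothetical, and the reading of "extend until the next anchor" as "every contiguous span that contains the anchor and stops before the neighbouring anchors" is an interpretation, not something the text states explicitly.

```python
def alignment_anchors(zh2en, en2zh):
    """Anchors are the word pairs aligned in both directions (set intersection)."""
    return sorted(set(zh2en) & set(en2zh))

def spans_around(pos, anchor_positions, length):
    """All contiguous (start, end) spans, inclusive, that contain pos and
    stop before the nearest anchors on either side."""
    left = max((p for p in anchor_positions if p < pos), default=-1) + 1
    right = min((p for p in anchor_positions if p > pos), default=length) - 1
    return [(s, e) for s in range(left, pos + 1) for e in range(pos, right + 1)]

def candidate_pairs(anchors, zh_len, en_len):
    """Map each anchor to its candidate (Chinese span, English span) pairs."""
    zh_pos = [i for i, _ in anchors]
    en_pos = [j for _, j in anchors]
    return {
        (i, j): [
            (zs, es)
            for zs in spans_around(i, zh_pos, zh_len)
            for es in spans_around(j, en_pos, en_len)
        ]
        for i, j in anchors
    }

# Toy alignment for a 4-word Chinese name and a 5-word English name (invented).
zh2en = {(0, 0), (1, 4), (2, 3), (3, 1)}
en2zh = {(0, 0), (1, 4), (3, 1), (3, 2)}
anchors = alignment_anchors(zh2en, en2zh)
pairs = candidate_pairs(anchors, zh_len=4, en_len=5)
print(anchors)
print(pairs[(3, 1)])
```

Each candidate pair produced this way would then be scored with the translation confidence of Step 3 before the greedy selection of Step 4.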

The main algorithms in the above steps are as follows:

The translation confidence is calculated by scoring the obtained translation segments with a method similar to TF-IDF. For a given Chinese string o and English string e, the translation confidence is calculated as follows:

The quantities in the formula are: the number of co-occurrences of e and o; the number of categories in which e and o are translations of each other; a length penalty parameter for the Chinese string; the condition that the Chinese segment o is a translation of the English segment e; and N, the number of categories of all English entity segments.

The optimal alignment is obtained as follows. Having calculated the probability of each candidate pair of Chinese string c and English string e, the present invention uses a greedy strategy to obtain the optimal alignment. The specific steps are:

Step 1: For a specific named entity pair, extract all candidate pairs {c, e} that the pair contains;

Step 2: Sort the pairs {c, e} by score in descending order and store them in the set scoreArray;

Step 3: Remove the first element {cc, ee} from scoreArray and update the alignment of the named entity pair according to {cc, ee};

Step 4: Delete all {cc, *} and {*, ee} from scoreArray;

Step 5: Repeat from Step 2 until scoreArray is empty;

Step 6: The best alignment of the named entity pair is obtained.
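The six steps above translate almost directly into code. The following is a minimal sketch under the assumption that the candidate pairs and their scores are supplied as a dictionary; the function name and the toy scores are hypothetical. Because filtering preserves the order of a sorted list, the re-sort of Step 5 is not repeated explicitly.

```python
def greedy_alignment(scored_pairs):
    """scored_pairs: dict mapping (c, e) to a translation confidence score.

    Returns the selected (c, e, score) triples in the order they were chosen.
    """
    # Step 2: sort all candidate pairs by score, highest first.
    score_array = sorted(scored_pairs.items(), key=lambda kv: kv[1], reverse=True)
    alignment = []
    while score_array:  # Step 5: continue until scoreArray is empty
        # Step 3: take the best remaining pair and record it.
        (cc, ee), score = score_array.pop(0)
        alignment.append((cc, ee, score))
        # Step 4: drop every pair that reuses cc or ee.
        score_array = [
            ((c, e), s) for (c, e), s in score_array if c != cc and e != ee
        ]
    return alignment  # Step 6

# Toy scores, invented for illustration.
toy_scores = {
    ("A", "x"): 0.9,
    ("A", "y"): 0.8,
    ("B", "x"): 0.7,
    ("B", "y"): 0.4,
    ("C", "z"): 0.5,
}
for c, e, s in greedy_alignment(toy_scores):
    print(c, e, s)
```

The greedy choice is fast but not guaranteed to maximise the total score: in the toy data, picking ("A", "x") first forces the low-scoring ("B", "y") later.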

The chunk-based organization name translation method builds the translation model with the chunk as the translation unit, and focuses on candidate chunk extraction, probability calculation, and a context-free translation decoding algorithm.

The present invention translates organization names with a synchronous context-free grammar. Specifically, an organization name is divided into three parts: the keyword part, the region or scope modifier part, and the other modifier part. The aligned organization name entity pairs are first split into these three parts, and for the first two kinds of part the derivation position within the whole named entity is retained, which yields a series of derivation rules with corresponding confidences. The translation of a given named entity comprises: chunk splitting, in which the given organization name is split into three parts; and entity derivation translation, in which the parts are translated in the order region or scope modifier part, keyword part, other modifier part. If a part of some kind does not occur in the training corpus, it is translated by conventional machine translation combined with transliteration.

For example, after training, the pair <national safety in production committee, National Committee of Industry Safety> (the left side is the Chinese name, rendered here in English) yields three rules: rule one: <national #, National #>; rule two: <# committee, Committee of #>; rule three: <safety in production, Industry Safety>, together with the corresponding translation probabilities.

The translation of "national safety in production committee" proceeds as follows. Chunk splitting divides the named entity into: region or scope modifier [national], other modifier part [safety in production], and keyword [committee]. The derivation is then: apply rule one: <national safety in production committee, #> -> <national safety in production committee, National #>; apply rule two: <national safety in production committee, National #> -> <national safety in production committee, National Committee of #>; apply rule three: <national safety in production committee, National Committee of #> -> <national safety in production committee, National Committee of Industry Safety>.
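The derivation in this example can be reproduced with a small sketch. The rule table, the function name, and the use of English glosses in place of the Chinese source chunks are assumptions made for illustration; rule confidences and the fallback to conventional machine translation plus transliteration are left out.

```python
# Toy rule table; English glosses stand in for the Chinese source chunks.
RULES = {
    "region": {"national": "National #"},
    "keyword": {"committee": "Committee of #"},
    "other": {"safety in production": "Industry Safety"},
}
# Translation order stated in the text: region/scope, keyword, other modifiers.
ORDER = ("region", "keyword", "other")

def derive_translation(chunks, rules=RULES):
    """Apply one rule per chunk, each time filling the open slot '#'.

    Returns the final target string and the list of intermediate derivations.
    """
    target = "#"
    steps = [target]
    for part in ORDER:
        source_chunk = chunks.get(part)
        if source_chunk is None:
            continue
        # A KeyError here marks a chunk unseen in training, which is where the
        # fallback described in the text would be used instead.
        target = target.replace("#", rules[part][source_chunk], 1)
        steps.append(target)
    return target.replace("#", "").strip(), steps

chunks = {
    "region": "national",
    "other": "safety in production",
    "keyword": "committee",
}
translation, steps = derive_translation(chunks)
print(translation)
print(steps)
```

Each intermediate string in `steps` corresponds to the target side of one derivation step in the example above.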

The query expansion is constructed by using the effective information extracted from the translation results as the internal feature of the entity, and the co-occurring words as its external feature. Because the two expansion methods consider both the internal features of the organization name entity and the co-occurrence information of the web pages in which the entity appears, effective bilingual summary resources can be obtained. Moreover, because bilingual summaries contain relatively few words and organization name entity recognition often introduces errors, the present invention extracts translations directly from the bilingual summaries: it considers the translation information, length information, and transliteration information of each candidate string together and outputs the candidate with the highest combined score as the candidate translation. The present invention extracts effective information from the translation results with a weighted-probability algorithm and combines it with the translations of co-occurring topic words to construct the query expansion.

The choice of query expansion strongly affects the quantity and quality of the bilingual resources obtained. An analysis of the summaries returned for the expanded queries shows that their quality is clearly better than that of the results returned for the source query alone, and that they mostly contain the correct translation of the named entity.

Query construction from the organization name translation results mainly works by counting, over the Top-N translation results, the N minimal translation units (characters or words) with the highest weighted probability, which form the query expansion set constructed by this method. The weighted frequency probability is calculated according to the following formula:

where p(T_i | α) is the confidence of the i-th translation result, and c is a character or word in the results.
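The formula itself is not reproduced in this text, so the following sketch rests on an assumption: the weighted frequency of a unit c is taken to be the sum of the confidences p(T_i | α) of the Top-N results in which c occurs, counted once per occurrence. The function names and the toy results are hypothetical.

```python
from collections import defaultdict

def weighted_unit_scores(top_n_results):
    """top_n_results: list of (units, confidence) pairs, where units is the
    list of minimal translation units of one result and confidence stands for
    p(T_i | alpha). Assumed weighting: sum of confidences per occurrence."""
    scores = defaultdict(float)
    for units, confidence in top_n_results:
        for unit in units:
            scores[unit] += confidence
    return dict(scores)

def expansion_units(top_n_results, n):
    """Return the n units with the highest weighted score."""
    scores = weighted_unit_scores(top_n_results)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))[:n]

# Toy Top-3 translation results with invented confidences.
toy_results = [
    (["National", "Committee", "Safety"], 0.5),
    (["State", "Committee", "Safety"], 0.3),
    (["National", "Council"], 0.2),
]
print(expansion_units(toy_results, 3))
```

Units that recur across several high-confidence results rise to the top, which matches the stated goal of keeping the most reliable parts of the translation results for the expanded query.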

Translation results are then extracted on the basis of the combined results of the query constructed from the organization name entity translation results and the co-occurring topic word expansion query. The effective query expansion is first used to obtain bilingual web pages that contain the translation of the organization name entity. Because organization name entity recognition often introduces errors, no organization name entity recognition is performed on the retrieved bilingual web pages. To extract organization name translations, candidate translation strings are first extracted by combining a frequency change measure with adjacency information. Next, the translation similarity, co-occurrence information, length information, and transliteration information between each candidate translation string and the entity to be translated are calculated separately; the feature scores are considered together, the candidates are sorted by combined score, and the ranked translations are output.
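The final ranking step can be illustrated with a short sketch. The text does not say how the four feature scores are combined, so a weighted sum with equal default weights is assumed here; the feature keys, the function name, and the toy scores are all hypothetical.

```python
FEATURES = ("translation_similarity", "cooccurrence", "length", "transliteration")

def rank_candidates(candidates, weights=None):
    """candidates: dict mapping a candidate string to its feature scores.

    Returns the candidate strings sorted by combined score, best first.
    The linear combination is an assumption, not taken from the source.
    """
    weights = weights or {name: 1.0 for name in FEATURES}

    def combined(features):
        return sum(weights[name] * features.get(name, 0.0) for name in FEATURES)

    return sorted(candidates, key=lambda c: combined(candidates[c]), reverse=True)

# Toy feature scores, invented for illustration.
toy_candidates = {
    "Industry Safety": {
        "translation_similarity": 0.4, "cooccurrence": 0.6,
        "length": 0.2, "transliteration": 0.1,
    },
    "National Committee of Industry Safety": {
        "translation_similarity": 0.9, "cooccurrence": 0.8,
        "length": 0.7, "transliteration": 0.1,
    },
}
print(rank_candidates(toy_candidates))
```

In practice the weights would have to be tuned on held-out data; the equal weights here only keep the example simple.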

Candidate translation strings are extracted with the frequency change measure and adjacency information. The formula is as follows:

where s is a phrase consisting of several words, freq(s) is the frequency of phrase s, x_i is the frequency of any single word in s, the mean of the x_i is the average frequency of all words in s, left_n is the number of distinct words adjacent to s on the left, and right_n is the number of distinct words adjacent to s on the right. For each string in the extracted set of candidate translation strings, the left and right adjacency is calculated.
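Since the formula is not reproduced in this text, the sketch below only computes the quantities that the definitions name (freq(s), the word frequencies x_i and their mean, left_n, and right_n) and makes no claim about how the patent combines them. The function name and the toy corpus are hypothetical.

```python
from collections import Counter

def phrase_statistics(phrase, corpus):
    """phrase: tuple of words; corpus: list of token lists.

    Returns the ingredients named in the text, not the final measure.
    """
    n = len(phrase)
    word_freq = Counter(word for sentence in corpus for word in sentence)
    freq_s = 0
    left_neighbours, right_neighbours = set(), set()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            if tuple(sentence[i:i + n]) == phrase:
                freq_s += 1
                if i > 0:
                    left_neighbours.add(sentence[i - 1])
                if i + n < len(sentence):
                    right_neighbours.add(sentence[i + n])
    x = [word_freq[word] for word in phrase]
    return {
        "freq": freq_s,                      # freq(s)
        "x": x,                              # frequency of each word in s
        "mean_x": sum(x) / n,                # average word frequency
        "left_n": len(left_neighbours),      # distinct left-adjacent words
        "right_n": len(right_neighbours),    # distinct right-adjacent words
    }

# Toy corpus, invented for illustration.
toy_corpus = [
    ["the", "National", "Committee", "of", "Safety"],
    ["a", "National", "Committee", "meets"],
    ["National", "Council"],
]
stats = phrase_statistics(("National", "Committee"), toy_corpus)
print(stats)
```

A phrase whose words have similar frequencies and which is surrounded by many different neighbours is more likely to be a complete unit, which is the intuition behind combining a frequency change measure with adjacency information.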

Claims

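1. A Chinese-English translation method for organization names, characterized in that the method comprises the following steps:

Step 1: Obtain the expanded query set corresponding to the organization name entity;

Step 2: Retrieve web resources with a new query that includes the expanded set, obtaining mixed bilingual summary resources;

Step 3: Extract candidate translations of the organization name entity from the mixed bilingual summary resources and rank them by confidence;

Step 4: Obtain the translation result.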

2. The Chinese-English translation method for organization names according to claim 1, characterized in that the expanded query set in Step 1 comprises a query constructed from the organization name entity translation results and an expansion query built from the translations of co-occurring topic words,

the query constructed from the organization name entity translation results being obtained by the following steps: build organization name translation pairs; align each translation pair internally; extract chunks according to the calculated translation confidence; generate the chunk-based organization name translation model; and extract the effective information from the translation results,

and the expansion query built from co-occurring topic word translations being obtained as follows: submit the source query to a search engine to obtain source-language summaries that contain the source query; use TF-IDF to extract from those summaries the topic words that co-occur with the source query; then look up the translations of these topic words in a bilingual dictionary, these translations forming the final expanded set of this method.

3. The Chinese-English translation method for organization names according to claim 2, characterized in that the internal alignment comprises the following steps: use GIZA++, the word alignment tool commonly used in machine translation, to word-align the Chinese-English translation pairs of organization names in both directions (Chinese-English and English-Chinese), and obtain the alignment anchors from the intersection of the two alignment results; extract candidate strings by extending from each alignment anchor to the left and to the right until the next alignment anchor, the current anchor plus the extended words forming a candidate string; calculate the translation confidence of the bilingual candidate strings; and, for each named entity translation pair, obtain the optimal alignment with a greedy algorithm.

4. The Chinese-English translation method for organization names according to claim 2, characterized in that the translation confidence is calculated by scoring the obtained translation segments with a method similar to TF-IDF, the translation confidence of a given Chinese string o and English string e being calculated as follows:

5. The Chinese-English translation method for organization names according to claim 2, characterized in that chunk extraction uses a context-free translation decoding algorithm; an organization name is divided into three parts, namely the keyword part, the region or scope modifier part, and the other modifier part; the aligned organization name entity pairs are first split into these three parts, and for the first two kinds of part the derivation position within the whole named entity is retained, which yields a series of derivation rules with corresponding confidences; the translation of a given named entity comprises: chunk splitting, in which the given organization name is split into three parts; and entity derivation translation, in which the parts are translated in the order region or scope modifier part, keyword part, other modifier part; and if a part of some kind does not occur in the training corpus, it is translated by conventional machine translation combined with transliteration.

6. The Chinese-English translation method for organization names according to claim 3, characterized in that the greedy algorithm obtains the optimal alignment as follows: for a specific named entity pair, extract all candidate pairs {c, e} that the pair contains; sort them by score in descending order and store them in the set scoreArray; remove the first element {cc, ee} from scoreArray and update the alignment of the named entity pair according to {cc, ee}; delete all {cc, *} and {*, ee} from scoreArray; repeat the descending sort until scoreArray is empty; the best alignment of the named entity pair is then obtained.

7. The Chinese-English translation method for organization names according to claim 1, characterized in that the extraction in Step 3 combines a frequency change measure with adjacency information to extract candidate translation strings; the translation similarity, co-occurrence information, length information, and transliteration information between each candidate translation string and the entity to be translated are calculated separately; the feature scores are considered together, the candidates are sorted by combined score, and the ranked translations are output.