CN108763229A - A machine translation method and device based on characteristic sentence-stem extraction - Google Patents

A machine translation method and device based on characteristic sentence-stem extraction

Info

Publication number
CN108763229A
CN108763229A (application CN201810544842.1A)
Authority
CN
China
Prior art keywords
sentence stem
word
translation
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810544842.1A
Other languages
Chinese (zh)
Other versions
CN108763229B (en)
Inventor
Li Jingjie (李晶洁)
Hu Wenjie (胡文杰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN201810544842.1A priority Critical patent/CN108763229B/en
Publication of CN108763229A publication Critical patent/CN108763229A/en
Application granted granted Critical
Publication of CN108763229B publication Critical patent/CN108763229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a machine translation method and device based on characteristic sentence-stem extraction. Specifically: 1) obtain multi-word sequences from a corpus of language A and identify those whose structure meets the requirements of a sentence stem; 2) determine characteristic sentence stems based on internal cohesion, external boundary independence and textual distribution range, and screen them with a MIN-MAX normalization algorithm and the local-maximum deduplication method; 3) translate the characteristic sentence stems to obtain a characteristic sentence-stem database; 4) input a language-A text to be translated, extract the sentence stems sentence by sentence, look up each stem's translation in the characteristic sentence-stem database, translate the words outside the stem, and combine their translations with the stem's translation according to the word order of target language B to obtain the output. The device comprises a characteristic sentence-stem database unit, a language input unit, a sentence-stem extraction unit, a sentence-stem recognition unit, a translation unit and a combination unit. The machine translation method and device of the present invention offer high translation efficiency and short processing time, and have great application prospects.

Description

A machine translation method and device based on characteristic sentence-stem extraction
Technical field
The invention belongs to the field of machine translation and relates to a machine translation method and device based on characteristic sentence-stem extraction, and in particular to a machine translation method and device that extract characteristic sentence stems from a corpus.
Background technology
From early rule-based translation by dictionary matching, to dictionaries combined with expert linguistic knowledge, and then to statistical machine translation based on corpora, machine translation technology has, with the growth of computing power and the explosive growth of multilingual information, gradually walked out of the ivory tower and begun to provide convenient real-time translation services for ordinary users.
Corpus-based machine translation methods have become the main direction of machine translation research. The corpus-driven Translation Equivalence research method advocated by Sinclair's team arose in exactly this context. Its core idea is that between two (or more) languages there exist translation equivalences: the textual environment of a word in corpus L1 is closely associated with that word's translation equivalent in corpus L2. By identifying the textual environment of a word with a computer, one can determine which word in L2 each occurrence of the word in L1 actually corresponds to.
The steps for building a machine translation model on this basis are as follows: 1) use tools such as WordSmith to retrieve and index evidence in JDEST, describe the form and salient features of sentence stems, and establish the correspondence between form and function; 2) in a parallel corpus, find the Chinese or target-language translations and take the higher-frequency translations as "potential equivalents"; 3) feed the potential equivalents into the Chinese or target-language corpus, examine their formal and functional features, and finally establish the degree of correspondence between the two in context. In this model, a characteristic sentence stem (sentence stem) is a high-frequency, semi-fixed sentence-level sequence that carries out discourse-organizing and stance-expressing functions in an academic-English corpus. It is a special phrase unit at the sub-sentence level: it contains a subject-predicate phrase and is the core of the sentence. Its extraction has always been a technical difficulty of machine translation, especially in the translation field.
In recent years, as computing power keeps rising and corpus resources keep growing, phrase studies have deepened and sentence-stem extraction technology has begun to see the light of day. Existing extraction methods for phrase units fall mainly into two classes: 1) frequency-threshold methods, mainly used to generate preliminary candidate sequences; their advantage is low computational complexity, but their recognition accuracy and recall are relatively low; 2) association-measure methods, which use iterative or combined judgments and can extend extraction to multi-word sequences, improving recognition accuracy to some extent. The problem is that, when translating academic-English texts, more than half of the multi-word sequences extracted by existing association-measure methods are technical terms or noun phrases, sequences of a single syntactic structure exceed 95%, and cross-structural sequences of the sentence-stem kind, especially characteristic sentence stems, are rare. A sentence stem differs from a technical term or noun phrase: its internal association is relatively low and its boundaries are hard to determine, so existing term-extraction methods cannot be used directly for recognizing and judging sentence stems. Although phrase-unit extraction methods have made some progress, the above methods only extract simple phrases and cannot meet the actual demand of machine translation for extracting discourse-level sentence stems.
Therefore, how to automatically recognize and extract characteristic sentence stems from massive data and then perform machine translation has become a pressing problem to be solved.
Invention content
The purpose of the invention is to overcome the defects of low cross-language translation quality and low accuracy in the prior art, and to provide a machine translation method and device based on characteristic sentence-stem extraction with precise stem extraction, a small processing load, good cross-language translation quality and high accuracy. The present invention exploits the properties of characteristic sentence stems, extracting them to improve machine translation. It proposes deduplication using a MIN-MAX normalization algorithm and a method for extracting characteristic sentence stems, thereby improving the machine translation method and the machine translation device and effectively improving machine translation quality.
In order to achieve the above object, the technical solution adopted by the present invention is:
A machine translation method based on characteristic sentence-stem extraction: first input the language-A text to be translated, then extract the sentence stem of each sentence in turn and look up the stem's translation in the characteristic sentence-stem database while translating the words outside the stem, and finally combine the translations of the words outside the stem with the stem's translation according to the word order of target language B to obtain the output;
The characteristic sentence-stem database is built in the following steps:
(1) obtain multi-word sequences from a corpus of language A;
(2) identify, among the multi-word sequences, those whose structure meets the requirements of a sentence stem;
(3) determine the characteristic sentence stems among the structurally qualifying sequences based on internal cohesion, external boundary independence and textual distribution range;
(4) screen the characteristic sentence stems with the MIN-MAX normalization algorithm and the local-maximum deduplication method;
(5) translate the screened characteristic sentence stems into target language B and record each stem with its translation to obtain the characteristic sentence-stem database.
As preferred technical solution:
In a machine translation method based on characteristic sentence-stem extraction as described above, the multi-word sequences are obtained as follows: first obtain an untagged academic text corpus of language A and tag the text for part of speech with tagging software; then linearly segment the tagged text to obtain a number of sequences, generating the set of multi-word sequences of 2 to 7 words; finally preprocess the segmented linear sequences to obtain the multi-word sequences. The preprocessing includes deleting garbled characters, deleting punctuation inside sequences and counting the frequency of each sequence.
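The segmentation and preprocessing step above can be sketched as follows. This is a minimal illustration in Python (the patent does not prescribe an implementation language); the helper name and the punctuation set are assumptions:

```python
from collections import Counter

def multi_word_sequences(tokens, min_n=2, max_n=7):
    """Linearly segment a token list into every n-gram of min_n..max_n
    words, drop any window containing sentence-internal punctuation
    (the 'delete in-sequence punctuation' step), and count frequencies."""
    punct = {".", ",", ";", ":", "?", "!"}
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i:i + n])
            if punct.isdisjoint(window):
                counts[window] += 1
    return counts

tokens = "the results show that the results show that it works".split()
print(multi_word_sequences(tokens)[("the", "results", "show", "that")])  # prints 2
```

In practice the tokens would carry POS tags from the tagging software; the frequency table produced here is the initial candidate-sequence database used by the later screening steps.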
In a machine translation method based on characteristic sentence-stem extraction as described above, language A and target language B are chosen as two of English, Chinese, French, German, Italian and Japanese;
When language A is English, the POS tagging uses the C7 tagset of the tagging software, or TreeTagger; when language A is Chinese, the tagging software is ICTCLAS; when language A is French, German or Italian, the tagging software is TreeTagger; when language A is Japanese, the tagging software is MeCab. Language A is POS-tagged with existing tagging software in every case, but the scope of protection of the present invention is not limited to these: other tagging software not listed here is equally applicable to the present invention, and language A is likewise not limited to the above; other languages that can be POS-tagged, such as Russian, Portuguese and Spanish, are equally applicable, with suitable tagging software chosen for the tagging.
In a machine translation method based on characteristic sentence-stem extraction as described above, identifying the sequences whose structure meets the requirements of a sentence stem proceeds as follows: first search the multi-word sequences for stem sequences containing a subject-predicate phrase; then handle separately the cases not covered by the subject-predicate search in which the predicate is omitted (such as "if possible"); in the process of extracting subject-predicate phrases, the positions of verbs and nouns in the sentence are delimited in combination with the distributional characteristics of parts of speech in each clause. This step extracts the multi-word sequences whose structure meets the requirements of a sentence stem; the stem sequences with a subject-predicate phrase include both the with-subject type and the without-subject type.
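A crude sketch of such a structural gate over POS tags, assuming C7-style tag prefixes (N... for nouns, PP... for personal pronouns, V... for verbs); the prefixes are illustrative stand-ins, not the patent's actual delimitation rules:

```python
def has_subject_predicate(pos_tags):
    """Keep a sequence only if a noun or pronoun tag is later followed by
    a verb tag, approximating the subject-predicate requirement."""
    noun_seen = False
    for tag in pos_tags:
        if tag.startswith(("N", "PP")):
            noun_seen = True
        elif tag.startswith("V") and noun_seen:
            return True
    return False

print(has_subject_predicate(["AT", "NN1", "VVZ"]))  # prints True
```

The predicate-omitted cases (such as "if possible") would be handled by separate patterns outside this gate.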
In a machine translation method based on characteristic sentence-stem extraction as described above, determining the characteristic sentence stems among the structurally qualifying sequences based on internal cohesion, external boundary independence and textual distribution range proceeds as follows:
Internal cohesion, external boundary independence and the textual-distribution parameter are combined to evaluate, from a statistical point of view, how typical a stem is in academic discourse;
The salience of the extracted stem sequences is evaluated with three indexes: computing internal cohesion, measuring boundary independence, and setting the textual-distribution parameter. The specific steps are:
1) Compute internal cohesion;
1.1) following pseudo-binarization theory, treat an n-word sequence (n >= 2) as a set of pseudo two-word sequences, so that multi-word sequences become measurable and comparable;
1.2) for an n-word sequence, choose its n-1 split points without repetition and compute one by one the attraction MI_i between the two sides of each split point; MI_i represents the internal cohesion of that partition, where 1 <= i <= n-1 and i is a possible split point inside the n-word sequence;
1.3) use the probability-expectation weighting method to compute the occurrence probability of each pseudo two-word sequence's MI value as its weight;
1.4) sum all the weighted MI values; the formula is as follows:
MI = P(MI_1)MI_1 + P(MI_2)MI_2 + P(MI_3)MI_3 + ... + P(MI_{n-1})MI_{n-1} = Σ_i P(MI_i)MI_i
where P(MI_i) denotes the probability of MI_i;
The formula for the n-word sequence MI(W), adjusted by the probability-expectation weighting method, is as follows:
MI(W) = Σ_{i=1}^{n-1} P(MI_i) · log_2[ P(W) / ( P(w_1,...,w_i) · P(w_{i+1},...,w_n) ) ]
where W = {w_1, w_2, w_3, ..., w_n}; i is a possible split point inside sequence W, dividing W into the two parts w_1,...,w_i and w_{i+1},...,w_n, with 1 <= i <= n-1 and n >= 2; w_1, w_2, ..., w_n are respectively the first, second, ..., n-th component words of sequence W; P(W) is the observed occurrence probability of sequence W; P(w_1,...,w_i) and P(w_{i+1},...,w_n) are the observed occurrence probabilities of the sequences {w_1,...,w_i} and {w_{i+1},...,w_n}, and their product is the theoretical expectation of the occurrence probability of W when the split point is i. In the probability-expectation weighting method of 1.3), sequence W must be converted into n-1 pseudo two-word sequences: each of the n-1 possible split points i inside W divides W into w_1,...,w_i and w_{i+1},...,w_n, forming one pseudo two-word sequence.
2) Measure boundary independence;
The present invention measures the boundary independence of a stem with entropy; boundary entropy measures the randomness of a sequence's boundaries. The larger the boundary entropy, the greater the uncertainty at the sequence's boundary, the higher its independence, and the more likely it is to be a conventionalized chunk;
The specific steps are:
2.1) for each candidate stem sequence W, automatically generate the two sets of left-boundary and right-boundary collocates: the set A = {a_k | k a positive integer} of all words occurring in the position adjacent to the left of W, where a_k is the k-th such word counted from left to right, and the set B = {b_k | k a positive integer} of all words occurring in the position adjacent to the right of W, where b_k is the k-th such word counted from left to right;
2.2) compute each stem's left boundary entropy H(W)_left and right boundary entropy H(W)_right; the formulas are as follows:
H(W)_left = -Σ_{a∈A} P(aW|W) log_2 P(aW|W)
H(W)_right = -Σ_{b∈B} P(Wb|W) log_2 P(Wb|W)
where P(aW|W) denotes the conditional probability that word a occurs at the left boundary of sequence W, and P(Wb|W) the conditional probability that word b occurs at the right boundary of W;
2.3) improving on the algorithm of 2.2), combine the left and right boundary entropies to compute the stem's final boundary entropy H(W), where F(W) denotes the total frequency with which sequence W occurs;
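The left and right boundary entropies of step 2.2) are ordinary Shannon entropies over the collocate distributions of step 2.1). A minimal sketch, called once per boundary with the observed collocate counts:

```python
import math

def boundary_entropy(collocate_counts):
    """H = -sum p * log2(p) over the conditional distribution of the
    words observed at one boundary of a candidate stem W."""
    total = sum(collocate_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in collocate_counts.values())

# Two equally likely left collocates give one bit of boundary uncertainty.
print(boundary_entropy({"we": 2, "they": 2}))  # prints 1.0
```

A stem that always follows the same word gets entropy 0 (a dependent boundary), while a stem preceded by many different words gets a high value (an independent boundary), matching the interpretation in the text.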
3) Set the textual-distribution parameter;
The textual distribution range (D) is the number of texts in which a given stem occurs; the present invention adds the textual-distribution parameter (text dispersion) as an evaluation index to ensure that the distribution of a stem is not overly concentrated, a practice endorsed by several academic authors;
In summary, thresholds are set for three parameters in total to delimit function stems: internal cohesion (MI) (threshold 1.8), boundary independence (H) (threshold 0.5) and textual distribution range (D) (threshold 2). When all three attributes of a stem exceed their thresholds, i.e. internal cohesion MI(W) is greater than 1.8, final boundary entropy H(W) is greater than 0.5 and textual distribution is greater than 2, the sequence is confirmed as a characteristic sentence stem and extracted.
In a machine translation method based on characteristic sentence-stem extraction as described above, screening the multiple characteristic sentence stems with the MIN-MAX normalization algorithm and the local-maximum deduplication method proceeds as follows:
First, normalize internal cohesion MI(W) and final boundary entropy H(W) with the MIN-MAX normalization algorithm to obtain the deduplication weight parameter;
In the deduplication procedure, the candidate deduplication weight parameters fall into three classes: 1. MI (the internal-cohesion value); 2. H (the boundary entropy); 3. MI*H (internal cohesion combined with boundary entropy);
Internal cohesion and boundary entropy act jointly on the deduplication weight parameter, so the two together determine its size;
The present invention selects the third deduplication weight parameter and preprocesses the internal cohesion MI values and boundary entropy H values with the MIN-MAX normalization algorithm. Internal cohesion and boundary entropy are each normalized: a linear transformation maps both values into the interval 0~1, so that, without changing the internal properties of the data, the influence of each factor on the weight is balanced, the two results are kept in proportion, and no single value has a decisive influence on the outcome merely because it is much larger. The formulas of the MIN-MAX normalization algorithm are as follows:
MI_j' = (MI_j - MI_min) / (MI_max - MI_min)
H_j' = (H_j - H_min) / (H_max - H_min)
where MI_j' is the normalized internal cohesion MI(W); MI_max and MI_min are respectively the maximum and minimum of internal cohesion MI(W); MI_j is the internal cohesion MI(W) of characteristic stem j; H_j' is the normalized final boundary entropy H(W); H_max and H_min are respectively the maximum and minimum of final boundary entropy H(W); H_j is the final boundary entropy H(W) of characteristic stem j. Multiplying MI_j' by H_j' gives the deduplication weight parameter GI.
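The normalization and the class-3 weight MI*H can be sketched directly from the formulas above; the function names are illustrative:

```python
def min_max(values):
    """Linearly rescale a list of scores into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def gi_weights(mi_values, h_values):
    """Deduplication weight GI = MI' * H': both scores are normalized
    first so that neither dominates the product."""
    return [m * h for m, h in zip(min_max(mi_values), min_max(h_values))]

print(gi_weights([1.8, 3.0, 4.2], [0.5, 1.0, 1.5]))  # roughly [0.0, 0.25, 1.0]
```

Without the rescaling, MI values (often in the range of several bits) would swamp the boundary entropies in the product, which is exactly the imbalance the text says the normalization prevents.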
Then, the extracted characteristic stems are screened according to the local-maximum deduplication method;
The local-maximum (Localmaxs) deduplication method: a stem is compared only with its (n-1)-word subsequences and (n+1)-word supersequences, where n is the number of words the stem contains, an (n-1)-word subsequence is a stem sequence of n-1 words contained in the stem, and an (n+1)-word supersequence is a stem sequence of n+1 words that contains the stem. All extracted candidate stem sequences are deduplicated by local maximum, deleting the overlapping strings of different lengths produced by repeated segmentation and ensuring that every extracted characteristic stem is an independent unit with no overlap with other (n-1)- and (n+1)-word sequences;
The specific formulas of the local-maximum deduplication method are as follows:
GI(S_n) > GI(S_{n+1})  if n = 2;
GI(S_n) >= GI(S_{n-1}) ∨ GI(S_n) > GI(S_{n+1})  if 7 > n > 2;
GI(S_n) >= GI(S_{n-1})  if n = 7;
where S_n denotes a characteristic sentence stem containing n words;
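The three cases of the local-maximum rule can be sketched as a predicate over one candidate stem; `gi` is a hypothetical mapping from stems to their GI weights, and the caller supplies the (n-1)- and (n+1)-word neighbours:

```python
def is_local_max(gi, stem, sub_stems, super_stems):
    """Apply the three cases above to one stem of n words (2 <= n <= 7):
    keep it only if its GI weight dominates its (n-1)-word subsequences
    (>=) and/or its (n+1)-word supersequences (>)."""
    n = len(stem)
    beats_super = all(gi[stem] > gi[s] for s in super_stems)
    beats_sub = all(gi[stem] >= gi[s] for s in sub_stems)
    if n == 2:
        return beats_super
    if n == 7:
        return beats_sub
    return beats_sub or beats_super
```

The boundary cases exist because a 2-word stem has no stem subsequence and a 7-word stem has no supersequence within the 2-to-7-word extraction range.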
For screening the extracted characteristic stems the present invention is not limited to the local-maximum (Localmaxs) deduplication method; the global-maximum (Globalmaxs) deduplication method is equally applicable and may be chosen according to actual demand.
The global-maximum (Globalmaxs) deduplication method: a stem is compared with all of its subsequences and supersequences of 2 to 7 words, where a subsequence is any 2-to-7-word stem sequence contained in the stem and a supersequence is any 2-to-7-word stem sequence that contains it. The extracted candidate stem sequences are deduplicated by global maximum, deleting the overlapping strings of different lengths produced by repeated segmentation and ensuring that every extracted function stem is an independent unit with no overlap with other stems. The specific formulas are as follows:
GI(S_n) > GI(S_{super-string})  if n = 2;
GI(S_n) >= GI(S_{sub-string}) ∨ GI(S_n) > GI(S_{super-string})  if 7 > n > 2;
GI(S_n) >= GI(S_{sub-string})  if n = 7;
where S_n denotes a characteristic sentence stem containing n words, S_{sub-string} a subsequence of S_n, and S_{super-string} a supersequence of S_n.
In a machine translation method based on characteristic sentence-stem extraction as described above, looking up the stem's translation in the characteristic sentence-stem database means comparing the stem with the characteristic stems in the database: if the stem matches a characteristic stem in the database, that stem's stored translation is taken as the stem's translation. If the stem matches no characteristic stem in the database, each phrase of the stem is translated separately and the phrases are then combined according to the target-language word order to obtain the stem's translation.
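The lookup-and-combine step can be sketched as below. The patent leaves the word-order rules of target language B abstract, so `order` here is a hypothetical list of slots, and the toy English-to-French entries are likewise illustrative:

```python
def translate_sentence(stem, outside_words, stem_db, word_db, order):
    """If the extracted stem is in the characteristic-stem database, use
    its stored translation; otherwise translate the stem word by word.
    Words outside the stem are translated individually and merged in
    according to `order`, a list of ('stem', 0) / ('out', i) slots."""
    if stem in stem_db:
        stem_tr = stem_db[stem]
    else:
        stem_tr = " ".join(word_db[w] for w in stem)
    out_tr = [word_db[w] for w in outside_words]
    pieces = [stem_tr if kind == "stem" else out_tr[idx]
              for kind, idx in order]
    return " ".join(pieces)

stem_db = {("this", "shows", "that"): "cela montre que"}
word_db = {"clearly": "clairement"}
print(translate_sentence(("this", "shows", "that"), ["clearly"],
                         stem_db, word_db, [("stem", 0), ("out", 0)]))
# prints: cela montre que clairement
```

The database hit keeps the semi-fixed stem as one translation unit, which is the efficiency gain the method claims over word-by-word translation.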
The present invention also provides a device applying the machine translation method based on characteristic sentence-stem extraction described above, comprising a characteristic sentence-stem database unit, a language input unit, a sentence-stem extraction unit, a sentence-stem recognition unit, a translation unit and a combination unit;
The characteristic sentence-stem database unit comprises an input subunit, a core processing subunit and a database subunit. The input subunit obtains the multi-word sequences. The core processing subunit comprises a segmentation-and-statistics module, a threshold screening module and a deduplication module. The segmentation-and-statistics module consists mainly of a segmentation submodule and a statistics submodule: the segmentation submodule identifies the sequences whose structure meets the requirements of a sentence stem, and the statistics submodule computes the internal cohesion, external boundary independence and textual distribution range of those sequences. The threshold screening module extracts the characteristic sentence stems from the structurally qualifying sequences. The deduplication module consists mainly of a normalization submodule and a deduplication submodule: the normalization submodule processes the internal-cohesion values and boundary entropies with the MIN-MAX normalization algorithm, and the deduplication submodule screens the characteristic stems with the local-maximum deduplication method. The database subunit translates the screened characteristic stems into target language B and records each stem with its translation;
The language input unit inputs the language-A text to be translated;
The sentence-stem extraction unit extracts the stem of each sentence of the language-A text in turn;
The sentence-stem recognition unit looks up the stem's translation in the characteristic sentence-stem database;
The translation unit translates the words outside the stem into target language B;
The combination unit combines the stem's translation with the translations of the words outside the stem to obtain the output;
The language input unit, sentence-stem extraction unit, sentence-stem recognition unit and combination unit are connected in sequence; the sentence-stem extraction unit, translation unit and combination unit are connected in sequence; within the characteristic sentence-stem database unit, the input subunit, segmentation submodule, statistics submodule, threshold screening module, normalization submodule, deduplication submodule and database subunit are connected in sequence; the database subunit is connected to the sentence-stem recognition unit.
In a device as described above, the language input subunit includes a path selection module: the user may freely choose the path of the input file and the path of the output file, and the software automatically creates an ExtractingOutput folder under the chosen output path to hold the resulting target files. The segmentation-and-statistics module is responsible for generating the initial candidate-sequence database. In the segmentation submodule the user may set the length and range of stem segmentation as needed: the length of the extracted stem sequences may be chosen freely between 2 and 7 words, and the software linearly segments the stems in the input file according to the range set by the user, ultimately generating multi-word sequences of different lengths. In the statistics submodule the software automatically computes the internal cohesion MI values and boundary entropy H values, records the occurrence frequency and text position of the left and right neighboring words of each sequence, and stores them in the corresponding files. The threshold screening module includes parameter setting and screening: after stem extraction and threshold computation, the user may set the sizes of the three parameters, and the software automatically selects all language-performative stems within the parameter ranges. In the normalization submodule the user may choose whether the MI and H values need to be normalized (To Normalise), and the product of MI and H is computed to obtain the deduplication weight parameter. If normalization is selected, the software applies the MIN-MAX normalization method to transform the MI and H values linearly into the interval 0~1, reducing as far as possible the harmful effect during screening of an excessive gap between the values; if non-normalization is selected, the software uses the original MI and H values. The final result display page comprises four parts. The stem display box: located at the top of the interface, it highlights the stem currently selected by the user and its corresponding POS codes. The stem information table: located on the left of the result interface, it shows 7 columns of data: the speech-act stem selected by the user, the stem's POS codes, the deduplication weight value, the stem frequency, the mutual-information value, the boundary entropy value and the textual distribution value. The text selection combo box: located at the right of the interface. The text display box: located on the right side of the result interface, it shows the original text content of the selected stem and the context of each of its occurrences. As for output, the output files are the processed files sorted by processing time under the specified path, in txt format.
In a device as described above, the parameters set in the parameter setting and screening submodule are the thresholds of internal cohesion MI(W), final boundary entropy H(W) and textual distribution range D.
Mechanism of the invention:
The present invention first introduces internal cohesion, external boundary independence and textual distribution range to evaluate the recognized multi-word sequences and select the characteristic sentence stems from them; it then innovatively applies the MIN-MAX normalization algorithm to normalize the internal-cohesion values and boundary entropies of the characteristic stems, screens the stems with the local-maximum deduplication method, and translates them to obtain the characteristic sentence-stem database; finally, language-A texts are translated on the basis of that database.
Wherein normalized can both retain the property between initial data to the greatest extent, can also control each parameter Balance to the influence for extracting result, the present invention screen to obtain characteristic sentence and do to lack, and are remarkably improved treatment effeciency, reduction processing Time.Method using the present invention, under the conditions of same computing environment, the processing time of 1,000,000 words is only 2 minutes, and 5,000,000 The processing time of word is also only 12 minutes (Computer models:HP348G3, processor:CoreTMi7-6500U CPU@ 2.50GHz 2.60GHz, memory:8.00GB system type:64 bit manipulation systems).In addition, the device of the invention is with higher Flexibility and reliability, the different parameters that can be inputted according to user carry out calculation processing, user can be according to needing select Corresponding text path is selected without specifying fixed path, device that can carry out unlimited number to identical pending text Extraction operation, if having existed identical destination file, the device can the result that can check of automatic prompt and inquiry make Whether user covers.
Advantageous effects:

(1) The machine translation method based on characteristic sentence stem extraction of the present invention has high translation efficiency and short processing time, and has great application prospects;

(2) The machine translation device based on characteristic sentence stem extraction of the present invention is flexible and reliable; the user can set the parameters and paths according to actual conditions.
Description of the drawings:

Fig. 1 is a flow chart of the establishment of the characteristic-stem database of the present invention;

Fig. 2 is a schematic diagram of the possible discrete points inside an n-word sequence (n ≥ 2);

Fig. 3 is a translation flow chart of the machine translation method based on characteristic sentence stem extraction of the present invention;

Fig. 4 is a structural diagram of the machine translation device based on characteristic sentence stem extraction of the present invention;

wherein "*" denotes a possible discrete point.
Specific embodiments

The invention will be further elucidated below with reference to specific embodiments. It should be understood that these embodiments merely illustrate the present invention and do not limit its scope. In addition, it should be understood that, after reading the teachings of the present invention, those skilled in the art may make various changes or modifications to the invention, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
A machine translation method based on characteristic sentence stem extraction comprises the following steps:

(1) Establish the characteristic-stem database; the steps are shown in Fig. 1:

1.1) Obtain multi-word sequences from a language A corpus:

An untagged language A text corpus is obtained first, and the text is part-of-speech tagged; the tagged text is then segmented, yielding several sequences and generating the set of multi-word sequences of 2 to 7 words; the segmented linear sequences are then preprocessed to obtain the multi-word sequences. Preprocessing comprises deleting garbled characters, deleting punctuation inside sequences and counting the frequency of each sequence. When language A is English, the text is part-of-speech tagged using the C7 tagset of the tagging software or TreeTagger; when language A is Chinese, the tagging software is ICTCLAS; when language A is French, German or Italian, the tagging software is TreeTagger; when language A is Japanese, the tagging software is MeCab.
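For illustration only, the segmentation, preprocessing and frequency-counting of step 1.1) can be sketched in Python as follows (the function names and toy token list are hypothetical, and the part-of-speech tagging with C7/TreeTagger/ICTCLAS/MeCab is omitted):

```python
import re
from collections import Counter

def preprocess(tokens):
    """Drop garbled characters and punctuation-only tokens (step 1.1 cleanup)."""
    return [t.lower() for t in tokens if re.fullmatch(r"[\w'-]+", t)]

def build_ngram_counts(tokens, min_n=2, max_n=7):
    """Count every contiguous sequence of 2 to 7 words and its frequency."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus: the punctuation token is removed, then 2- to 7-grams are counted.
tokens = preprocess("the results show that , the results show that".split())
counts = build_ngram_counts(tokens)
```

A `Counter` keyed by word tuples is a natural stand-in for the initial candidate sequence database that the word segmentation and statistics module generates.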
1.2) Identify, among the multi-word sequences, the sequences whose structure meets the requirements of a sentence stem:

Stem sequences containing a subject-predicate phrase are searched for first among the multi-word sequences; then the cases not covered by the above subject-predicate collocation categories, in which the predicate is omitted (such as "if possible"), are handled individually. During extraction of the subject-predicate phrases, the positions of verbs and nouns in the sentence are constrained in combination with the distribution characteristics of parts of speech in each clause. Through this step, the multi-word sequences that structurally meet the requirements of a sentence stem are extracted; the stem sequences containing a subject-predicate phrase include the with-subject type and the without-subject type.
1.3) Determine the characteristic sentence stems among the structurally qualified sequences on the basis of internal cohesion, outer-boundary independence and text-distribution range, as follows:

1.3.1) Calculate the internal cohesion:

1.3.1.1) According to the pseudo-binary sequence theory, convert the n-word sequence into pseudo-binary (two-word) sequences, n ≥ 2;

1.3.1.2) For the n-word sequence, choose the n−1 discrete points without repetition and calculate one by one the attraction MI_i between the two sides of each discrete point; MI_i represents the internal cohesion of that partial sequence, 1 ≤ i ≤ n−1, where i is a possible discrete point inside the n-word sequence;

1.3.1.3) Use the probability-expectation weighting method to calculate the probability of occurrence of each pseudo-binary sequence's MI value and weight it accordingly;

1.3.1.4) Sum all of the weighted MI values; the formula is as follows:

MI = P(MI_1)MI_1 + P(MI_2)MI_2 + P(MI_3)MI_3 + … + P(MI_(n−1))MI_(n−1) = Σ P(MI_i)MI_i

where P(MI_i) denotes the probability of MI_i;

The calculation formula for MI(W) of the n-word sequence, after adjustment by the probability-expectation weighting method, is as follows:

where W denotes the n-word sequence, W = {w_1, w_2, w_3, …, w_n}; i is a possible discrete point inside the sequence W, dividing W into the two parts w_1, w_2, …, w_i and w_(i+1), …, w_n, with 1 ≤ i ≤ n−1 and n ≥ 2; w_1, w_2, …, w_n are respectively the first, second, …, n-th constituent words of the sequence W; w_1, w_2, …, w_i denotes the first part of the pseudo-binary sequence divided by discrete point i, and w_(i+1), …, w_n denotes its second part; the possible discrete points inside an n-word sequence (n ≥ 2) are shown schematically in Fig. 2; P(W) is the observed probability of occurrence of the sequence W, P(w_1, w_2, …, w_i) is the observed probability of occurrence of the sequence {w_1, w_2, …, w_i}, P(w_(i+1), …, w_n) is the observed probability of occurrence of the sequence {w_(i+1), …, w_n}, and the remaining term is the theoretical expectation of the probability of occurrence of the corresponding sequence W when the discrete point is i;
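Since the formula image for MI(W) is not reproduced in this text, the following Python sketch assumes the usual pointwise-mutual-information form log2(P(W) / (P(left)·P(right))) for each discrete point and applies the probability-expectation weighting described above; all probabilities and weights in the example are illustrative:

```python
import math

def internal_cohesion(seq_prob, part_probs, split_weights):
    """
    Expectation-weighted internal cohesion MI(W) of an n-word sequence.

    seq_prob      -- observed probability P(W) of the whole sequence
    part_probs    -- part_probs[i] = (P(w1..wi), P(w(i+1)..wn)) for split i
    split_weights -- P(MI_i), one weight per discrete point (sums to 1)
    """
    mi = 0.0
    for i, (p_left, p_right) in enumerate(part_probs):
        # Attraction across discrete point i (assumed PMI form; the patent's
        # exact formula image is not reproduced in the text).
        mi_i = math.log2(seq_prob / (p_left * p_right))
        mi += split_weights[i] * mi_i
    return mi

# Toy 3-word sequence: two discrete points, equally weighted.
mi = internal_cohesion(0.01, [(0.05, 0.08), (0.04, 0.1)], [0.5, 0.5])
```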
1.3.2) Measure the boundary independence:

1.3.2.1) For each candidate stem sequence W, automatically generate the two sets of left-boundary collocations and right-boundary collocations: the set A = {a_k | k is a positive integer} of all words occurring in the position immediately to the left of W, where a_k is the k-th word, counted from left to right, occurring immediately to the left of W; and the set B = {b_k | k is a positive integer} of all words occurring in the position immediately to the right of W, where b_k is the k-th word, counted from left to right, occurring immediately to the right of W;

1.3.2.2) Calculate the left-boundary maximum entropy H(W)_left and the right-boundary maximum entropy H(W)_right of each stem; the formulas are as follows:

where P(aW|W) denotes the conditional probability of the word a occurring at the left boundary of the sequence W, and P(Wb|W) is the conditional probability of the word b occurring at the right boundary of W;

1.3.2.3) Calculate the ultimate boundary entropy H(W) of the stem; the formula is as follows:

where F(W) denotes the total frequency of occurrence of the sequence W;
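The left- and right-boundary entropies of step 1.3.2.2) amount to Shannon entropies over the boundary-collocation sets A and B; the combination into the ultimate H(W) via F(W) is not reproduced here, so only the per-side entropy is sketched, with illustrative neighbour counts:

```python
import math
from collections import Counter

def boundary_entropy(neighbor_counts):
    """Shannon entropy of the words seen adjacent to one side of a sequence W."""
    total = sum(neighbor_counts.values())
    h = 0.0
    for c in neighbor_counts.values():
        p = c / total  # conditional probability P(aW | W) (or P(Wb | W))
        h -= p * math.log2(p)
    return h

left = Counter({"the": 2, "a": 2})   # words seen immediately left of W
right = Counter({"of": 4})           # words seen immediately right of W
h_left, h_right = boundary_entropy(left), boundary_entropy(right)
```

A uniform left context gives maximal entropy (high boundary independence), while a single fixed right neighbour gives zero entropy, suggesting W is a fragment of a longer unit on that side.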
When all three of the above attributes of a stem sequence are above their thresholds, i.e. the internal cohesion MI(W) is greater than 1.8, the ultimate boundary entropy H(W) is greater than 0.5 and the text-distribution threshold is greater than 2, the sequence is confirmed as a characteristic sentence stem and extracted;

1.4) Screen the characteristic stems again on the basis of the MIN-MAX normalization algorithm and the local-maximum deduplication method:

1.4.1) Normalize the internal-cohesion and boundary-entropy values on the basis of the MIN-MAX normalization algorithm to obtain the deduplication parameter;

The formula of the MIN-MAX normalization algorithm is as follows:

where MI_j' is the normalized internal cohesion MI(W); MI_max and MI_min are respectively the maximum and minimum of the internal cohesion MI(W); MI_j is the internal cohesion MI(W) of characteristic stem j; H_j' is the normalized ultimate boundary entropy H(W); H_max and H_min are respectively the maximum and minimum of the ultimate boundary entropy H(W); and H_j is the ultimate boundary entropy H(W) of characteristic stem j. Multiplying MI_j' by H_j' yields the deduplication parameter GI. The extracted characteristic stems are then screened according to the local-maximum deduplication method;
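Step 1.4.1) can be sketched as follows (the three candidate scores are illustrative): MIN-MAX normalization rescales the MI(W) and H(W) columns onto [0, 1], and the deduplication parameter GI is their elementwise product:

```python
def min_max(values):
    """MIN-MAX normalization: rescale a list of scores onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

mi_scores = [1.9, 2.5, 4.1]  # internal cohesion MI(W) of three candidate stems
h_scores = [0.6, 1.4, 1.0]   # ultimate boundary entropy H(W) of the same stems

mi_n, h_n = min_max(mi_scores), min_max(h_scores)
gi = [m * h for m, h in zip(mi_n, h_n)]  # deduplication parameter GI = MI' * H'
```

Because both factors now lie on the same [0, 1] scale, neither MI(W) nor H(W) dominates GI merely through its larger numeric range, which is the balancing effect described above.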
1.4.2) Screen the extracted characteristic stems according to the local-maximum deduplication method;

The specific formulas are as follows:

GI(S_n) > GI(S_(n+1))   if n = 2;

GI(S_n) ≥ GI(S_(n−1)) ∨ GI(S_n) > GI(S_(n+1))   if 7 > n > 2;

GI(S_n) ≥ GI(S_(n−1))   if n = 7;

where GI(S_n) denotes the deduplication parameter of a characteristic stem containing n words, GI(S_(n+1)) denotes that of a characteristic stem containing n+1 words, GI(S_(n−1)) denotes that of a characteristic stem containing n−1 words, and S_n denotes a characteristic stem containing n words;
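The three cases of the local-maximum deduplication rule can be written directly as a predicate (a sketch: how overlapping stems of adjacent lengths are paired up is left to the caller, and the function name is illustrative):

```python
def keep_stem(n, gi_n, gi_prev=None, gi_next=None):
    """
    Local-maximum deduplication rule for a stem of n words (2 <= n <= 7):
    keep a stem only if its GI is not dominated by the overlapping stem
    one word shorter (gi_prev) or one word longer (gi_next).
    """
    if n == 2:                        # shortest stems: compare upward only
        return gi_n > gi_next
    if n == 7:                        # longest stems: compare downward only
        return gi_n >= gi_prev
    return gi_n >= gi_prev or gi_n > gi_next   # 2 < n < 7: either suffices
```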
1.5) Translate the screened characteristic stems into the target language and record each characteristic stem together with its translation, obtaining the characteristic-stem database.

The translation flow of the present invention is shown in Fig. 3; the specific steps are as described in steps (2)-(5):

(2) Input the language A text to be translated;

(3) Extract the sentence stems of the language A text sentence by sentence;

(4) Look up the translation of each stem in the characteristic-stem database, specifically:

Compare the stem with the characteristic stems in the characteristic-stem database. If the stem is identical to a characteristic stem in the database, the translation of that characteristic stem is the translation of the stem; if the stem is not identical to any characteristic stem in the database, each phrase constituting the stem is translated separately, and the phrase translations are then combined according to the word order of the target language B to obtain the translation of the stem;

(5) Translate the words outside the stem, then combine the translations of the out-of-stem words with the translation of the stem according to the word order of the target language B to obtain the translated text.
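Steps (3)-(5) can be sketched as a single lookup-and-assemble function; the stem, phrase and word dictionaries are illustrative stand-ins for the characteristic-stem database, and the combination step simply appends, whereas the patent reorders according to the word order of the target language B:

```python
def translate_sentence(stem, extra_words, stem_db, phrase_db, word_db):
    """Steps (3)-(5): stem lookup, phrase-level fallback, out-of-stem merge."""
    if stem in stem_db:
        # Step (4), exact match: reuse the recorded stem translation.
        stem_tr = stem_db[stem]
    else:
        # Step (4), no match: translate each phrase of the stem and combine.
        stem_tr = " ".join(phrase_db[p] for p in stem.split())
    # Step (5): translate out-of-stem words and append (placeholder word order).
    return " ".join([stem_tr] + [word_db[w] for w in extra_words])
```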
A device using the above machine translation method, whose structural diagram is shown in Fig. 4, comprises a characteristic-stem database unit, a language input unit, a stem extraction unit, a stem recognition unit, a translation unit and a combination unit;

The characteristic-stem database unit comprises an input subunit, a core processing subunit and a database subunit;

The input subunit is used for obtaining the multi-word sequences and comprises a path selection module for selecting the input and output paths;

The core processing subunit comprises a word segmentation and statistics module, a threshold screening module and a deduplication module;

The word segmentation and statistics module is responsible for generating the initial candidate sequence database and consists mainly of a word segmentation submodule and a statistics submodule; the word segmentation submodule is used for identifying the sequences whose structure meets the requirements of a sentence stem, and the statistics submodule is used for calculating the internal cohesion, outer-boundary independence and text-distribution range of those sequences;

The threshold screening module is used for extracting the characteristic stems from the structurally qualified sequences and comprises a parameter setting and screening submodule; the parameters set in the parameter setting and screening submodule are the thresholds of the internal cohesion, the boundary entropy and the text-distribution range;

The deduplication module consists mainly of a normalization submodule and a deduplication submodule; the normalization submodule processes the internal-cohesion and boundary-entropy values on the basis of the MIN-MAX normalization algorithm, and the deduplication submodule screens the characteristic stems according to the local-maximum deduplication method. The database subunit is used for translating the screened characteristic stems into the target language B and recording each characteristic stem and its translation;

The language input unit is used for inputting the language A text to be translated;

The stem extraction unit is used for extracting the sentence stems of the language A text sentence by sentence;

The stem recognition unit looks up the translation of each stem in the characteristic-stem database;

The translation unit translates the words outside the stem into the target language B;

The combination unit combines the translation of the stem with the translations of the out-of-stem words to obtain the translated text;

The language input unit, the stem extraction unit, the stem recognition unit and the combination unit are connected in sequence; the stem extraction unit, the translation unit and the combination unit are connected in sequence; the characteristic-stem database unit is connected to the stem recognition unit; within the characteristic-stem database unit, the input subunit, the word segmentation submodule, the statistics submodule, the threshold screening module, the normalization submodule, the deduplication submodule and the database subunit are connected in sequence, and the database subunit is connected to the stem recognition unit.
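The unit wiring above can be mirrored as a minimal class (the names, the injected stem extractor and the word-order handling are illustrative simplifications of the units in Fig. 4):

```python
class TranslationDevice:
    """
    Minimal sketch of Fig. 4: input -> stem extraction -> stem recognition
    (database lookup) + out-of-stem translation -> combination.
    """
    def __init__(self, stem_db, word_db, extract_stem):
        self.stem_db = stem_db            # characteristic-stem database unit
        self.word_db = word_db            # translation unit's word table
        self.extract_stem = extract_stem  # stem extraction unit (injected)

    def translate(self, sentence):
        stem, rest = self.extract_stem(sentence)
        stem_tr = self.stem_db.get(stem, stem)      # stem recognition unit
        rest_tr = [self.word_db.get(w, w) for w in rest]
        return " ".join([stem_tr] + rest_tr)        # combination unit
```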

Claims (10)

1. A machine translation method based on characteristic sentence stem extraction, characterized in that: a language A text to be translated is input first; the sentence stems of the language A text are then extracted sentence by sentence; the translation of each stem is then looked up in a characteristic-stem database while the words outside the stem are translated; and finally the translations of the out-of-stem words are combined with the translation of the stem according to the word order of a target language B to obtain the translated text;

The characteristic-stem database is established by the following steps:

(1) obtaining multi-word sequences from a language A corpus;

(2) identifying, among the multi-word sequences, the sequences whose structure meets the requirements of a sentence stem;

(3) determining the characteristic sentence stems among the structurally qualified sequences on the basis of internal cohesion, outer-boundary independence and text-distribution range;

(4) screening the characteristic stems again on the basis of the MIN-MAX normalization algorithm and the local-maximum deduplication method;

(5) translating the screened characteristic stems into the target language B and recording each characteristic stem together with its translation, obtaining the characteristic-stem database.
2. The machine translation method based on characteristic sentence stem extraction according to claim 1, characterized in that obtaining the multi-word sequences specifically comprises: first obtaining an untagged language A text corpus and part-of-speech tagging the text with tagging software; then linearly segmenting the tagged text, obtaining several sequences and generating the set of multi-word sequences of 2 to 7 words; and then preprocessing the segmented linear sequences to obtain the multi-word sequences, the preprocessing comprising deleting garbled characters, deleting punctuation inside sequences and counting the frequency of each sequence.
3. The machine translation method based on characteristic sentence stem extraction according to claim 2, characterized in that the language A and the target language B are two of English, Chinese, French, German, Italian and Japanese;

when the language A is English, the part-of-speech tagging uses the C7 tagset of the tagging software or TreeTagger; when the language A is Chinese, the tagging software is ICTCLAS; when the language A is French, German or Italian, the tagging software is TreeTagger; when the language A is Japanese, the tagging software is MeCab.
4. The machine translation method based on characteristic sentence stem extraction according to claim 1, characterized in that identifying the sequences whose structure meets the requirements of a sentence stem comprises: first searching, among the multi-word sequences, for stem sequences containing a subject-predicate phrase; and then individually handling the cases in which the predicate is omitted; the stem sequences containing a subject-predicate phrase include the with-subject type and the without-subject type.
5. The machine translation method based on characteristic sentence stem extraction according to claim 1, characterized in that determining the characteristic sentence stems among the structurally qualified sequences on the basis of internal cohesion, outer-boundary independence and text-distribution range specifically comprises:

1) calculating the internal cohesion:

1.1) according to the pseudo-binary sequence theory, converting the n-word sequence into pseudo-binary (two-word) sequences, n ≥ 2;

1.2) for the n-word sequence W, choosing the n−1 discrete points without repetition and calculating one by one the attraction MI_i between the two sides of each discrete point, MI_i representing the internal cohesion of that partial sequence, 1 ≤ i ≤ n−1, where i is a possible discrete point inside the n-word sequence;

1.3) using the probability-expectation weighting method to calculate the probability of occurrence of each pseudo-binary sequence's MI_i value and weighting it accordingly;

1.4) summing all of the weighted MI_i values; the formula is as follows:

MI = P(MI_1)MI_1 + P(MI_2)MI_2 + P(MI_3)MI_3 + … + P(MI_(n−1))MI_(n−1) = Σ P(MI_i)MI_i

where P(MI_i) denotes the probability of MI_i;

adjusted by the probability-expectation weighting method, the calculation formula for the internal cohesion MI(W) of the n-word sequence is as follows:

where W = {w_1, w_2, w_3, …, w_n}; i is a possible discrete point inside the sequence W, dividing W into the two parts w_1, w_2, …, w_i and w_(i+1), …, w_n, with 1 ≤ i ≤ n−1 and n ≥ 2; w_1, w_2, …, w_n are respectively the first, second, …, n-th constituent words of the sequence W; P(W) is the observed probability of occurrence of the sequence W, P(w_1, w_2, …, w_i) is the observed probability of occurrence of the sequence {w_1, w_2, …, w_i}, P(w_(i+1), …, w_n) is the observed probability of occurrence of the sequence {w_(i+1), …, w_n}, and the remaining term is the theoretical expectation of the probability of occurrence of the corresponding sequence W when the discrete point is i;

2) measuring the boundary independence:

2.1) for each n-word sequence W, automatically generating the two sets of left-boundary collocations and right-boundary collocations: the set A = {a_k | k is a positive integer} of all words occurring in the position immediately to the left of W, where a_k is the k-th word, counted from left to right, occurring immediately to the left of W; and the set B = {b_k | k is a positive integer} of all words occurring in the position immediately to the right of W, where b_k is the k-th word, counted from left to right, occurring immediately to the right of W;

2.2) calculating the left-boundary maximum entropy H(W)_left and the right-boundary maximum entropy H(W)_right of each n-word sequence; the formulas are as follows:

where P(aW|W) denotes the conditional probability of the word a occurring at the left boundary of the sequence W, and P(Wb|W) is the conditional probability of the word b occurring at the right boundary of W;

2.3) calculating the ultimate boundary entropy H(W) of the n-word sequence; the formula is as follows:

where F(W) denotes the total frequency of occurrence of the sequence W;

3) setting the text-distribution threshold;

when all three of the above attributes of an n-word sequence W are above their thresholds, i.e. the internal cohesion MI(W) is greater than 1.8, the ultimate boundary entropy H(W) is greater than 0.5 and the text-distribution threshold is greater than 2, the sequence is confirmed as a characteristic sentence stem and extracted.
6. The machine translation method based on characteristic sentence stem extraction according to claim 1, characterized in that screening the characteristic stems again on the basis of the MIN-MAX normalization algorithm and the local-maximum deduplication method specifically comprises:

normalizing the internal cohesion MI(W) and the ultimate boundary entropy H(W) on the basis of the MIN-MAX normalization algorithm to obtain the deduplication parameter, and then screening the extracted characteristic stems according to the local-maximum deduplication method;

the formula of the MIN-MAX normalization algorithm is as follows:

where MI_j' is the normalized internal cohesion MI(W); MI_max and MI_min are respectively the maximum and minimum of the internal cohesion MI(W); MI_j is the internal cohesion MI(W) of characteristic stem j; H_j' is the normalized ultimate boundary entropy H(W); H_max and H_min are respectively the maximum and minimum of the ultimate boundary entropy H(W); H_j is the ultimate boundary entropy H(W) of characteristic stem j; and multiplying MI_j' by H_j' yields the deduplication parameter GI;

the formulas of the local-maximum deduplication method are as follows:

GI(S_n) > GI(S_(n+1))   if n = 2;

GI(S_n) ≥ GI(S_(n−1)) ∨ GI(S_n) > GI(S_(n+1))   if 7 > n > 2;

GI(S_n) ≥ GI(S_(n−1))   if n = 7;

where GI(S_n) denotes the deduplication parameter of a characteristic stem containing n words, GI(S_(n+1)) denotes that of a characteristic stem containing n+1 words, GI(S_(n−1)) denotes that of a characteristic stem containing n−1 words, and S_n denotes a characteristic stem containing n words.
7. The machine translation method based on characteristic sentence stem extraction according to claim 1, characterized in that looking up the translation of the stem in the characteristic-stem database means comparing the stem with the characteristic stems in the characteristic-stem database: if the stem is identical to a characteristic stem in the database, the translation of that characteristic stem is the translation of the stem.
8. A device using the machine translation method based on characteristic sentence stem extraction according to any one of claims 1 to 7, characterized by comprising a characteristic-stem database unit, a language input unit, a stem extraction unit, a stem recognition unit, a translation unit and a combination unit;

the characteristic-stem database unit comprises an input subunit, a core processing subunit and a database subunit; the input subunit is used for obtaining the multi-word sequences; the core processing subunit comprises a word segmentation and statistics module, a threshold screening module and a deduplication module; the word segmentation and statistics module consists mainly of a word segmentation submodule and a statistics submodule, the word segmentation submodule being used for identifying the sequences whose structure meets the requirements of a sentence stem and the statistics submodule being used for calculating the internal cohesion, outer-boundary independence and text-distribution range of those sequences; the threshold screening module is used for extracting the characteristic stems from the structurally qualified sequences; the deduplication module consists mainly of a normalization submodule and a deduplication submodule, the normalization submodule processing the internal-cohesion and boundary-entropy values on the basis of the MIN-MAX normalization algorithm and the deduplication submodule screening the characteristic stems according to the local-maximum deduplication method; the database subunit is used for translating the screened characteristic stems into the target language B and recording each characteristic stem and its translation;

the language input unit is used for inputting the language A text to be translated;

the stem extraction unit is used for extracting the sentence stems of the language A text sentence by sentence;

the stem recognition unit looks up the translation of each stem in the characteristic-stem database;

the translation unit translates the words outside the stem into the target language B;

the combination unit combines the translation of the stem with the translations of the out-of-stem words to obtain the translated text;

the language input unit, the stem extraction unit, the stem recognition unit and the combination unit are connected in sequence; the stem extraction unit, the translation unit and the combination unit are connected in sequence; within the characteristic-stem database unit, the input subunit, the word segmentation submodule, the statistics submodule, the threshold screening module, the normalization submodule, the deduplication submodule and the database subunit are connected in sequence, and the database subunit is connected to the stem recognition unit.
9. The device according to claim 8, characterized in that the input subunit comprises a path selection module for selecting the input and output paths, the word segmentation and statistics module is responsible for generating the initial candidate sequence database, and the threshold screening module comprises a parameter setting and screening submodule.
10. The device according to claim 9, characterized in that the parameters set in the parameter setting and screening submodule are the thresholds of the internal cohesion MI(W), the ultimate boundary entropy H(W) and the text-distribution range.
CN201810544842.1A 2018-05-31 2018-05-31 Machine translation method and device based on characteristic sentence stem extraction Active CN108763229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810544842.1A CN108763229B (en) 2018-05-31 2018-05-31 Machine translation method and device based on characteristic sentence stem extraction


Publications (2)

Publication Number Publication Date
CN108763229A true CN108763229A (en) 2018-11-06
CN108763229B CN108763229B (en) 2020-06-12


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112764535A (en) * 2021-01-08 2021-05-07 温州职业技术学院 System for realizing multi-language information exchange

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101206643A (en) * 2006-12-21 2008-06-25 中国科学院计算技术研究所 Translation method syncretizing sentential form template and statistics mechanical translation technique
CN103631772A (en) * 2012-08-29 2014-03-12 阿里巴巴集团控股有限公司 Machine translation method and device


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
WEI et al.: "A new computing method for extracting contiguous phraseological sequences from academic text corpora", International Journal of Corpus Linguistics *
MENG Xuguang et al.: "Evaluation and Differentiation of the Resource and Environment Carrying Capacity of Important Economic Zones in Western China", 31 December 2014 *
Nick Bessis et al.: "Big Data and Internet of Things: A Roadmap for Smart Environments", National Defense Industry Press, 31 July 2017
LI Jingjie et al.: "Functional sentence stems in academic English texts: extraction methods and frequency distribution", Foreign Language Teaching and Research *
XIE Xinzhou et al.: "Website Commercial Value Evaluation Report 2016", Huaxia Publishing House, 30 November 2016




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant