CN101187924A

CN101187924A - Method and system for obtaining word pair translation from bilingual sentence

Info

Publication number: CN101187924A
Application number: CNA2007101782909A
Authority: CN
Inventors: 高立琦; 刘挺; 王海洲
Original assignee: Harbin Institute of Technology; Beijing Kingsoft Software Co Ltd; Beijing Jinshan Digital Entertainment Technology Co Ltd
Current assignee: Harbin Institute of Technology; Beijing Kingsoft Software Co Ltd; Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date: 2007-11-28
Filing date: 2007-11-28
Publication date: 2008-05-28
Anticipated expiration: 2027-11-28
Also published as: CN100524293C

Abstract

The invention provides a method for obtaining word-pair translations from bilingual sentence pairs. The method comprises the following steps: A. receiving an entry to be processed; B. retrieving candidate bilingual sentence pairs from an index resource library according to the entry; C. selecting two bilingual sentence pairs from the retrieval result and obtaining the longest common substring of the two sentences that are in the same language as the entry; D. judging whether that substring is consistent with the entry and, if it is not, selecting another two bilingual sentence pairs from the retrieval result and repeating step C; if it is, E. obtaining the longest common substring of the corresponding sentences of the two bilingual sentence pairs. Using an index reduces the data-processing workload and improves the efficiency of obtaining translations. The invention also provides a system for obtaining word-pair translations from bilingual sentence pairs.

Description

Method and system for obtaining word-pair translations from bilingual sentence pairs

Technical field

The present invention relates to the field of language information conversion, and in particular to a method and system for obtaining word-pair translations from bilingual sentence pairs.

Background technology

With globalization, more and more cultural exchange takes place worldwide, and more and more Chinese people communicate with foreigners in English or other languages. Native speakers of Chinese, when speaking or writing a foreign language, often do not know which expression is the idiomatic one, how a foreign name should be spelled, or how a fixed Chinese collocation should be translated into the foreign language. Foreigners encounter the same problems when they use Chinese. The traditional solution is to consult manually compiled dictionaries. Such dictionaries are highly reliable, but they are costly to build and infrequently updated, so they cannot include translations of new words in time.

With the rapid development of the Internet and information technology, new methods of building bilingual dictionaries have appeared in the computer field. They no longer depend on traditional manual compilation, which improves efficiency, allows new words to be added frequently, and is convenient for users. Existing methods for building bilingual dictionaries automatically fall mainly into two classes: methods based on pattern matching and methods based on word alignment. A pattern-matching method extracts text of a particular form according to a specific pattern (template); the "bracket explanation" type is one such pattern, and the "single-line explanation" type is another. Taking the bracket explanation type as an example, suppose the text to be processed is a Chinese sentence of the form "the mineral water (mineral water) of this brand is of fine quality ...", where the term before the brackets is Chinese and the term inside them is English. According to the bracket pattern, the translation pair "mineral water - mineral water" (Chinese term, English term) can be extracted. The advantage of pattern matching is that it can extract new words and their translations as they appear on web pages, and the dictionary grows with the number of pages processed. The drawback is equally clear: data on the Internet is of very uneven quality, so the translation pairs obtained from a fixed pattern are not all of high quality. With the bracket explanation type, for instance, the content of some brackets is not a translation of the preceding text, so the extracted "translation pair" is plainly wrong. The method also requires considerable post-processing, such as removing redundant and interfering information. Its accuracy is therefore usually limited by the quality of the web pages.

Methods based on word alignment: word alignment means identifying, in a bilingual text (for example Chinese and English), the words that are translations of each other. There are several word alignment methods, including rule-based, statistical, and dictionary-based methods. The most widely used and most advanced in the prior art is statistical word alignment. Its basic principle is to compute the "translation probability" between the words of a bilingual sentence pair; the probability is derived from statistical machine translation model theory and takes several iterations to compute. Once a word alignment is available, translation phrases can be extracted with the diagonal method. The diagonal method arranges the two-way alignments (for example Chinese-to-English and English-to-Chinese) into a matrix (see Fig. 1), in which a position that holds a value indicates an alignment. In the example of Fig. 1, inspecting the diagonal shows that "industrial training centre" and "industrial training centers" are translations of each other.

The translation results produced by statistical word alignment are not necessarily "phrases" in any real sense; they may be strings such as "are of the". Another drawback is that, because global information is considered (the probabilities are computed over many iterations), a small error can cause other phrases to be misaligned. In the example above, if "training" is aligned with "industrial", then "center" is likely to be aligned with "training", and the error propagates. So although statistical word alignment is more advanced than earlier methods, it requires many iterations, processes a large amount of data, takes a long time, and must pass over all the bilingual sentence pairs several times before the final result is determined. For a collection of 3 million sentence pairs, for example, processing on a server usually takes 3 to 4 days to produce a result, and alignment errors may still occur and reduce the accuracy of the translations.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and system for obtaining word-pair translations from bilingual sentence pairs that improve both the efficiency of generating translations and the accuracy of the results.

To solve the above problem, the invention discloses a method for obtaining word-pair translations from bilingual sentence pairs, comprising the steps of:

A. receiving an entry to be processed;

B. retrieving candidate bilingual sentence pairs from an index resource library according to the entry;

C. selecting two bilingual sentence pairs from the retrieval result, and obtaining the longest common substring of the two sentences that are in the same language as the entry;

D. judging whether the substring is consistent with the entry; if not, selecting another two bilingual sentence pairs from the retrieval result and repeating step C; if so:

E. obtaining the longest common substring of the corresponding sentences of the two bilingual sentence pairs.

Further, the method also comprises:

F. repeating step C until every combination of two bilingual sentence pairs has been processed;

G. sorting all the longest common substrings obtained by frequency from high to low, and determining the candidate substrings according to a predetermined threshold.

Further, the method also comprises:

obtaining bilingual sentence pair resources;

preprocessing the bilingual sentence pair resources;

building an index from the preprocessed bilingual sentence pairs to form the index resource library.

The index is built as follows:

the bilingual sentence pairs are indexed using the inverted index method.

Further, after the longest common substring of the corresponding sentences of the bilingual sentence pairs is obtained, the method also comprises:

inserting the substring into a translation list;

tidying, sorting, and screening the translations;

outputting the processed translations.

Further, after the entry to be processed is received, the method also comprises the step of:

performing word segmentation on the entry.

Further, after the candidate bilingual sentence pairs are retrieved from the index resource library, the method also comprises the step of:

filtering the candidate bilingual sentence pairs with a word-string containment algorithm to form a more accurate retrieval result.

The invention also discloses another method for obtaining word-pair translations from bilingual sentence pairs, comprising:

receiving a Chinese entry to be processed;

retrieving candidate bilingual sentence pairs from the index resource library according to the Chinese entry;

selecting two bilingual sentence pairs from the retrieval result, and obtaining the longest common substring of the Chinese sentences of the two pairs;

judging whether the substring is consistent with the entry; if not, selecting another two bilingual sentence pairs from the retrieval result and repeating the previous step; if so:

obtaining the longest common substring of the English sentences of the two bilingual sentence pairs.

Further, after the Chinese entry to be processed is received, the method also comprises the step of:

performing word segmentation on the Chinese entry.

The invention also discloses a further method for obtaining word-pair translations from bilingual sentence pairs, comprising:

receiving an English entry to be processed;

retrieving candidate bilingual sentence pairs from the index resource library according to the English entry;

selecting two bilingual sentence pairs from the retrieval result, and obtaining the longest common substring of the English sentences of the two pairs;

obtaining the longest common substring of the Chinese sentences of the two bilingual sentence pairs.

The longest common substring of the English sentences of the two pairs is obtained with an improved longest common substring algorithm.

Further, the method also comprises:

obtaining bilingual sentence pair resources;

preprocessing the bilingual sentence pair resources;

The index is built as follows:

The invention also discloses a system for obtaining word-pair translations from bilingual sentence pairs, comprising:

a receiving unit, configured to receive an entry to be processed;

a retrieval unit, configured to retrieve candidate bilingual sentence pairs from the index resource library according to the entry;

a substring acquiring unit, configured to select two bilingual sentence pairs from the retrieval result and obtain the longest common substring of the two sentences that are in the same language as the entry;

a judging unit, configured to judge whether the substring is consistent with the entry and, if not, to select another two bilingual sentence pairs from the retrieval result and invoke the substring acquiring unit;

a first generation unit, configured to obtain the longest common substring of the corresponding sentences of the two bilingual sentence pairs.

Further, the system also comprises an index generation unit, which comprises:

an acquiring unit, configured to obtain bilingual sentence pair resources;

a processing unit, configured to preprocess the bilingual sentence pair resources;

a second generation unit, configured to build an index from the preprocessed bilingual sentence pairs to form the index resource library.

Further, the system also comprises:

a word segmentation unit, configured to perform word segmentation on the entry to be processed.

Further, the system also comprises:

a filtering unit, configured to filter the candidate bilingual sentence pairs with a word-string containment algorithm to form a more accurate retrieval result.

Further, the system also comprises:

a translation processing unit, configured to tidy, sort, and screen the translations;

a translation output unit, configured to output the processed translations.

Further, the system also comprises:

a second judging unit, configured to judge whether every combination of two bilingual sentence pairs in the retrieval result has been processed;

a translation generation unit, configured to sort all the longest common substrings obtained by frequency from high to low, determine the candidate substrings according to a predetermined threshold, and output the candidate substrings as the word-pair translation.

Compared with the prior art, the present invention has the following advantages:

The invention uses an index to reduce the data-processing workload. It does not need to process all the bilingual sentence pairs several times: for each entry to be processed, retrieval narrows the work to a small number of bilingual sentence pairs related to the entry, from which the translation is obtained, so translations are obtained more efficiently. Moreover, because only local information is examined, the method avoids the interference that affects traditional statistical word alignment when it examines global information, so the translations obtained are more accurate.

Description of drawings

Fig. 1 is a schematic diagram of the matrix formed by two-way word alignments in the prior art;

Fig. 2 is a flowchart of a first embodiment of the method of the invention for obtaining word-pair translations from bilingual sentence pairs;

Fig. 3 is a flowchart of the method of building the index resource library in an embodiment;

Fig. 4 is a flowchart of a second embodiment of the method of the invention for obtaining word-pair translations from bilingual sentence pairs;

Fig. 5 is a flowchart of a third embodiment of the method of the invention for obtaining word-pair translations from bilingual sentence pairs;

Fig. 6 is a flowchart of a fourth embodiment of the method of the invention for obtaining word-pair translations from bilingual sentence pairs;

Fig. 7 is a structural block diagram of a first embodiment of the system of the invention for obtaining word-pair translations from bilingual sentence pairs;

Fig. 8 is a structural block diagram of a third embodiment of the system of the invention for obtaining word-pair translations from bilingual sentence pairs.

Embodiment

To make the above objects, features, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the drawings and specific embodiments.

The invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multicomputer systems, and distributed computing environments that include any of the above systems or devices.

The invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. The invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.

A word pair or entry in the present invention can be a word, a phrase, or several phrases.

The invention is applicable to conversion between any two languages, such as Chinese-English, Chinese-Korean, German-English, or German-French conversion. For ease of understanding, Chinese-English conversion is used as the example throughout. It should be understood that this does not limit the application scenarios of the invention; other languages can be handled on the same principle.

Fig. 2 shows a flowchart of a first embodiment of the method of the invention for obtaining word-pair translations from bilingual sentence pairs, comprising the steps of:

Step 201: receive the entry to be processed.

The entry can be a word, a phrase, or several phrases. It can be Chinese or English, or of course another language such as Japanese, Korean, German, or French; the corresponding translation can be obtained on the same principle.

Step 203: retrieve candidate bilingual sentence pairs from the index resource library according to the entry.

When the entry is a single word, no processing is needed, and the received entry is used directly as the search target in the index resource library.

When the entry is a phrase, or in other cases where processing is needed, the following step is also performed before step 203:

Step 202: perform word segmentation on the entry.

As is well known, English is written in units of words separated by spaces, whereas Chinese is written in units of characters, and the characters of a sentence must be joined together to express a meaning. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. A computer can tell from the spaces that "student" is a word, but it cannot easily tell that the two characters "学" and "生" together form one word. Cutting a sequence of Chinese characters into meaningful words is Chinese word segmentation. For example, segmenting "我是一个学生" gives: 我 是 一个 学生.

Some common Chinese word segmentation methods are introduced below:

1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to some strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds and a word is identified. Word segmentation systems in practical use all treat this mechanical segmentation only as an initial step and use various other kinds of language information to further improve segmentation accuracy.

2. Segmentation based on feature scanning or marker cutting: words with obvious features are identified and cut out of the string first. With these words as breakpoints, the original string is divided into shorter strings that are then segmented mechanically, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: the rich part-of-speech information helps segmentation decisions, and during tagging the segmentation result is in turn checked and adjusted, which improves segmentation accuracy.

3. Segmentation based on understanding: the computer simulates a person's understanding of the sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. Such a system usually has three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Coordinated by the master control part, the segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences and uses it to resolve segmentation ambiguity; that is, it simulates the way a person understands a sentence. This method requires a large amount of linguistic knowledge and information.

4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects fairly well how likely they are to form a word. The frequency of each combination of adjacent characters in a corpus can therefore be counted, and their mutual information computed from the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the characters are bound together. When it exceeds a threshold, the character group can be considered to form a word. This method only requires counting character-group frequencies in the corpus and needs no segmentation dictionary.
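The statistical method in item 4 can be illustrated with a minimal sketch. It assumes the usual pointwise form of mutual information, log(P(x, y) / (P(x) * P(y))), which the text does not spell out; the function name `adjacent_pmi` and the tiny corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

def adjacent_pmi(corpus):
    """Pointwise mutual information of each adjacent character pair.

    `corpus` is a list of unsegmented strings (assumed input format).
    PMI(x, y) = log(P(x, y) / (P(x) * P(y))), estimated from raw counts.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for text in corpus:
        char_counts.update(text)                 # single-character counts
        pair_counts.update(zip(text, text[1:]))  # adjacent-pair counts
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi

# Toy corpus, invented for illustration only.
scores = adjacent_pmi(["学生学生", "学习"])
print(scores[("学", "生")] > scores[("生", "学")])  # True
```

A pair whose score exceeds a chosen threshold would be treated as a candidate word; the threshold value itself is a tuning decision that the text leaves open.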

The purpose of using an index is to reduce the scale of computation and improve efficiency. The invention adopts the inverted index method. Take the entry "interdepending" as an example: after word segmentation it becomes the two words "mutually" and "dependence", which are then looked up in the inverted index. Suppose the sentences in which "mutually" occurs are {5, 99, 101, 238, 1185, 1382, 1497} and the sentences in which "dependence" occurs are {7, 11, 99, 238, 1100, 1382}. Taking the intersection of the two sets shows that the sentences in which "mutually" and "dependence" both occur are {99, 238, 1382}.
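The lookup in this example amounts to intersecting posting lists. A minimal sketch follows, assuming the index is a plain dictionary from word to sentence numbers; the function name `sentences_containing_all` is hypothetical.

```python
def sentences_containing_all(index, words):
    """Return the sorted sentence numbers that contain every word in `words`.

    `index` maps a word to the list of sentence numbers in which it occurs
    (an assumed in-memory layout, not one prescribed by the text).
    """
    postings = [set(index.get(word, ())) for word in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Posting lists taken from the example above.
index = {
    "mutually": [5, 99, 101, 238, 1185, 1382, 1497],
    "dependence": [7, 11, 99, 238, 1100, 1382],
}
print(sentences_containing_all(index, ["mutually", "dependence"]))  # [99, 238, 1382]
```

Only the three sentences in the result need to be examined further, instead of the whole collection.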

Further, after the preliminary retrieval the result can be processed again. For example, by also using position information, the contexts in which "mutually" and "dependence" occur can narrow the range further. Using the inverted index in this way effectively reduces the amount of data to process and improves efficiency.

Further, the candidate bilingual sentence pairs are filtered with a word-string containment algorithm to form a more accurate retrieval result. For example, if the entry is "interdepending" and the Chinese sentence reads "... interdependence and dependence ...", the sentence is retrieved by the index but does not satisfy the word-string containment algorithm, so it must be filtered out.
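The text does not define the word-string containment algorithm. The sketch below assumes it means that the segmented entry must occur in the sentence as one contiguous run of words, which is what the example suggests; the function name `contains_word_string` and the token lists are hypothetical.

```python
def contains_word_string(sentence_words, entry_words):
    """True if `entry_words` occurs as a contiguous run inside `sentence_words`."""
    sentence_words = list(sentence_words)
    entry_words = list(entry_words)
    n = len(entry_words)
    if n == 0:
        return False
    return any(
        sentence_words[i:i + n] == entry_words
        for i in range(len(sentence_words) - n + 1)
    )

entry = ["mutually", "dependence"]
# Both words present and adjacent: the sentence is kept.
print(contains_word_string(["they", "mutually", "dependence", "now"], entry))   # True
# Both words present but separated: the sentence is filtered out.
print(contains_word_string(["mutually", "helpful", "and", "dependence"], entry))  # False
```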

Step 204: select two bilingual sentence pairs from the retrieval result, and obtain the longest common substring of the two sentences that are in the same language as the entry.

When the entry is Chinese, the longest common substring of the Chinese sentences of the two pairs is obtained; when the entry is English, the longest common substring of the English sentences is obtained. In other words, the longest common substring is taken over the sentences in the same language as the entry.

From the qualifying bilingual sentence pairs, two pairs are selected. The common substring of the two Chinese sentences is obtained with the longest common substring (LCS) algorithm, and the common substring of the two English sentences is obtained with the improved longest common substring (ILCS) algorithm. LCS finds the longest common substring of two strings. It uses a matrix to record, for every pair of positions in the two strings, whether the two characters match: 1 if they match, 0 otherwise. It then finds the longest run of 1s along a diagonal; the positions of that run give the position of the longest matching substring. The improved longest common substring algorithm is described in detail below.
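The matrix procedure described here can be sketched as follows. As a common variant, the sketch stores the length of the diagonal run ending at each cell instead of a plain 0/1 flag, so the longest run can be read off while the matrix is filled; the function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by strings `a` and `b`."""
    best_len = 0  # length of the longest diagonal run found so far
    best_end = 0  # index in `a` just past the end of that run
    # run[i][j] = length of the diagonal run of matches ending at a[i-1], b[j-1]
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                if run[i][j] > best_len:
                    best_len = run[i][j]
                    best_end = i
    return a[best_end - best_len:best_end]

print(longest_common_substring("abcdef", "zcdem"))  # cde
```

The comparison is character by character, which suits Chinese text; for English the improved algorithm compares whole words instead.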

Step 205: judge whether the substring is consistent with the entry to be processed; if not, repeat step 204; if so, go to step 206.

When the entry is Chinese, judge whether the longest common substring of the Chinese sentences of the two pairs is consistent with the entry. If not, repeat step 204: select another two bilingual sentence pairs and obtain the substring of their Chinese sentences. If so, go to step 206. When the entry is English, the same judgment is made on the longest common substring of the two English sentences: if it is not consistent with the entry, repeat step 204 with another two pairs; if it is, go to step 206.

Step 206: obtain the longest common substring of the corresponding sentences of the two bilingual sentence pairs.

When the entry is consistent with the longest common substring of the same-language sentences, the longest common substring of the corresponding sentences of the two pairs is obtained. For Chinese-English sentence pairs with a Chinese entry, this is the longest common substring of the English sentences; for German-French sentence pairs with a German entry, it is the longest common substring of the French sentences.

For Chinese, the substring is obtained with the longest common substring algorithm; for English, German, and other languages whose sentences do not need word segmentation, it is obtained with the improved longest common substring algorithm. The substring of the two corresponding sentences is the translation of the entry.
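Steps 204 to 206, applied to every combination of two candidate pairs, can be sketched as below. This is a simplified illustration under stated assumptions: both sides are already tokenized, the common run is compared token by token on both sides (the text uses character-level LCS for Chinese), and the sentence pairs are invented toy data. The names `common_run` and `candidate_translations` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def common_run(a, b):
    """Longest contiguous run of tokens shared by token lists `a` and `b`."""
    best_len, best_end = 0, 0
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                if run[i][j] > best_len:
                    best_len, best_end = run[i][j], i
    return a[best_end - best_len:best_end]

def candidate_translations(entry_tokens, sentence_pairs):
    """Count target-side common runs over all two-pair combinations.

    `sentence_pairs` is a list of (source_tokens, target_tokens) tuples.
    """
    counts = Counter()
    for (src1, tgt1), (src2, tgt2) in combinations(sentence_pairs, 2):
        # Steps 204/205: the source-side common run must equal the entry.
        if common_run(src1, src2) != list(entry_tokens):
            continue
        # Step 206: the target-side common run is a candidate translation.
        translation = common_run(tgt1, tgt2)
        if translation:
            counts[" ".join(translation)] += 1
    return counts.most_common()

# Invented toy data, for illustration only.
pairs = [
    (["两", "国", "相互", "依赖"], ["the", "two", "countries", "depend", "on", "each", "other"]),
    (["他们", "相互", "依赖"], ["they", "depend", "on", "each", "other"]),
    (["经济", "上", "相互", "依赖"], ["economies", "depend", "on", "each", "other", "too"]),
]
print(candidate_translations(["相互", "依赖"], pairs))  # [('depend on each other', 3)]
```

Because each combination looks only at two sentence pairs, an error in one combination does not spread to the others; it simply contributes a low-frequency candidate.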

The technical scheme of this embodiment presupposes that the index resource library already exists; building an index resource library of bilingual sentence pairs is a precondition of the invention. How the index resource library is built is described in detail below. Fig. 3 shows a flowchart of the method of building the index resource library, comprising the steps of:

Step 301: obtain bilingual sentence pair resources.

There are many ways to obtain bilingual sentence pair resources: they can be collected online from the Internet, entered manually, or obtained in many other ways, which the invention does not enumerate or limit.

Step 302: preprocess the bilingual sentence pair resources.

The purpose of preprocessing is to normalize the text and remove useless or interfering information. The specific preprocessing is determined by actual needs. In the embodiment of the invention it mainly comprises: full-width/half-width conversion for Chinese, automatic Chinese word segmentation, English tokenization, English case normalization, and filtering of garbled encoding.
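Part of this preprocessing can be sketched for the English side. The sketch is an assumption-laden illustration, not the procedure of the embodiment: it uses Unicode NFKC normalization as a stand-in for full-width/half-width conversion, drops the replacement character U+FFFD and other non-printable characters as a stand-in for garbled-text filtering, and tokenizes with a simple regular expression. Chinese word segmentation is not covered. The function name `preprocess_english` is hypothetical.

```python
import re
import unicodedata

def preprocess_english(sentence):
    """Normalize an English sentence and split it into lowercase tokens."""
    # NFKC folds full-width letters and punctuation to their half-width forms.
    text = unicodedata.normalize("NFKC", sentence)
    # Case normalization.
    text = text.lower()
    # Drop U+FFFD (typical decoding debris) and non-printable characters.
    text = "".join(
        ch for ch in text
        if ch.isspace() or (ch.isprintable() and ch != "\ufffd")
    )
    # Words (with an optional apostrophe suffix) or single punctuation marks.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text)

print(preprocess_english("Ｈe once lived in Shanghai."))
# ['he', 'once', 'lived', 'in', 'shanghai', '.']
```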

Step 303: build an index from the preprocessed bilingual sentence pairs to form the index resource library.

There are many ways to build an index, such as the inverted index method and the hashing mask method. The embodiment of the invention preferably uses the inverted index method; the process of building an inverted index is introduced below by example.

Suppose there are two sentences, 1 and 2:

Sentence 1: Tom lives in Guangzhou, I live in Guangzhou too.

Sentence 2: He once lived in Shanghai.

1) Because an inverted index is built and queried by keyword, the keywords of the two sentences must be obtained first, through the following processing:

a. First identify all the words in the string, that is, perform word segmentation. Segmentation techniques were introduced above and are not described again here.

b. Words such as "in", "once", and "too" carry no practical meaning, and in Chinese, characters such as "的" and "是" usually have no specific meaning either; such words that do not represent a concept are filtered out.

c. A query for "He" is usually expected to also find sentences containing "he" or "HE", so all words are converted to a single case.

d. A query for "live" is usually expected to also find sentences containing "lives" or "lived", so "lives" and "lived" are reduced to "live".

e. Punctuation marks in a sentence usually do not represent a concept and can also be filtered out.

After this processing, the keywords of sentence 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]; the keywords of sentence 2 are: [he] [live] [shanghai].

2) With the keywords available, the inverted index can be built. The correspondence above is from a sentence number to all the keywords in that sentence. The inverted index reverses it, mapping each keyword to the numbers of all sentences that contain it. After inversion, sentences 1 and 2 become:

Keyword      Sentence number
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1

Knowing only which sentences a keyword occurs in is usually not enough; the number of occurrences and the positions within the sentence are also needed. There are generally two kinds of position: a) character position, which records which character of the sentence the word starts at (its advantage is fast locating when keywords are highlighted); b) keyword position, which records which keyword of the sentence the word is (its advantages are a smaller index and fast phrase queries). The inverted index here records the keyword position.

After the frequency and position information is added, the index structure becomes:

Keyword      Sentence number [frequency]    Positions
guangzhou    1[2]                           3, 6
he           2[1]                           1
i            1[1]                           4
live         1[2], 2[1]                     2, 5, 2
shanghai     2[1]                           3
tom          1[1]                           1

Take the row for "live" as an example of this index structure. "live" occurs twice in sentence 1 and once in sentence 2. What do its positions "2, 5, 2" mean? They are read together with the sentence numbers and frequencies: "live" occurs twice in sentence 1, so "2, 5" are its two positions in sentence 1; it occurs once in sentence 2, so the remaining "2" means that "live" is the second keyword of sentence 2.

Once the index is built in this way, finding which sentences contain "live" only requires reading the sentence numbers recorded for that keyword, namely 1 and 2.

By setting up the index resources bank and, helping quick retrieval, raise the efficiency in conjunction with index technology.
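The inverted index with frequencies and keyword positions described above can be sketched as follows. This is a minimal illustration only: the function name build_inverted_index and the dictionary layout are assumptions made for the example and are not taken from the patent.

```python
from collections import defaultdict

def build_inverted_index(sentences):
    """Map each keyword to {sentence number: [1-based keyword positions]}.

    The frequency of a keyword in a sentence is the length of its position list.
    (Hypothetical helper; the data layout is an assumption for illustration.)
    """
    index = defaultdict(dict)
    for sent_no, keywords in sentences.items():
        for pos, kw in enumerate(keywords, start=1):
            index[kw].setdefault(sent_no, []).append(pos)
    return dict(index)

# The two example sentences after keyword extraction.
sentences = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(sentences)

# Looking up a keyword yields the sentence numbers and positions directly.
live_postings = index["live"]
```

Here live_postings holds sentence 1 with positions 2 and 5 and sentence 2 with position 2, which matches the row for live in the table.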

In the embodiments of the present invention, the improved longest common substring algorithm (iLCS) is an algorithm for matching substrings of English strings word by word. The algorithm is described as follows:

Input: sentences s₁, s₂

Output: the longest common word string c

Cut words to produce the word sequences: v₁ ← segment(s₁), v₂ ← segment(s₂)
Record the word counts: m ← length(v₁), n ← length(v₂)
Initialize: L[0..m, 0..n] ← 0, CL[0..m, 0..n] ← 0, total_len ← 0, len ← 0, answer ← 0, c ← ""
for i ← 1 to m
    for j ← 1 to n
        if v₁[i-1] ≠ v₂[j-1] then
            L[i, j] ← 0; CL[i, j] ← 0
        else
            L[i, j] ← word_length(v₁[i-1]) + L[i-1, j-1]
            CL[i, j] ← 1 + CL[i-1, j-1]
            if L[i, j] ≥ total_len then
                total_len ← L[i, j]
                len ← CL[i, j]
                answer ← i
for i ← 0 to len-1
    c ← c + v₁[answer-len+i] + " "
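The pseudocode above can be turned into a short runnable sketch. The function name ilcs, whitespace splitting as the word-cutting step, and joining the result with single spaces are assumptions made for this example. As in the pseudocode, L accumulates the character length of the matched run and CL its word count, so the run with the most characters is selected.

```python
def ilcs(s1, s2):
    """Word-level longest common substring, following the iLCS pseudocode.

    Assumption: words are obtained by splitting on whitespace.
    """
    v1, v2 = s1.split(), s2.split()
    m, n = len(v1), len(v2)
    # L[i][j]: character length of the common run ending at v1[i-1], v2[j-1]
    # CL[i][j]: number of words in that run
    L = [[0] * (n + 1) for _ in range(m + 1)]
    CL = [[0] * (n + 1) for _ in range(m + 1)]
    total_len, run_words, answer = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if v1[i - 1] == v2[j - 1]:
                L[i][j] = len(v1[i - 1]) + L[i - 1][j - 1]
                CL[i][j] = 1 + CL[i - 1][j - 1]
                if L[i][j] >= total_len:
                    total_len = L[i][j]
                    run_words = CL[i][j]
                    answer = i  # index just past the last word of the run
    return " ".join(v1[answer - run_words:answer])

common = ilcs("I live in Guangzhou now", "he will live in Guangzhou")
```

For these two sentences common is the word string "live in Guangzhou". Comparing whole words rather than characters prevents a match from starting or ending in the middle of an English word.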

After the translation corresponding to the entry to be processed has been produced, the following steps may also be performed to obtain a better result:

Insert the substring into the translation list.

Collate, sort and screen the translations.

Collation removes superfluous leading and trailing symbols, such as punctuation and spaces, from the extracted translations. Sorting counts the number of times each translation occurs in the list and then orders the translations from the highest count to the lowest. The present invention treats only translations with identical character strings as the same translation. The criterion is of course not limited to this: for example, words that differ only in letter case may be treated as identical, words with the same base form may be treated as identical, or certain articles (such as "the" and "a") may be ignored when deciding whether two words are equal. All of these are applicable to the present invention.

Screening can be done in several ways, and the present invention preferably adopts the following two. The first filters translations with a "stop word list"; the list can be specified manually and generally contains common function words or function word combinations such as "the", "of" and "of the". The second screens according to the ranking and the ranking score, rejecting the part that falls below a certain value or a certain percentage.

Output the processed translations.
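The collation, sorting and stop-word screening steps above might look like the following sketch. The function name rank_translations, the set of stripped characters and the contents of STOP_WORDS are illustrative assumptions; only exact string equality is used to decide that two translations are the same, as the text states.

```python
from collections import Counter

# Illustrative stop word list; in practice it would be specified manually.
STOP_WORDS = {"the", "of", "of the"}

def rank_translations(candidates, stop_words=STOP_WORDS,
                      strip_chars=" \t.,;:!?\"'"):
    """Strip edge punctuation/spaces, drop stop words, sort by count (descending)."""
    cleaned = [c.strip(strip_chars) for c in candidates]
    counts = Counter(c for c in cleaned if c and c not in stop_words)
    return counts.most_common()

ranked = rank_translations(
    ["computer science,", " computer science", "of the", "computing", "the"]
)
```

With this input, ranked lists "computer science" with a count of 2 followed by "computing" with a count of 1; the stop word entries are discarded.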

Referring to Fig. 4, a flow chart of a second embodiment of the method of the present invention for obtaining word-pair translations from bilingual sentence pairs is shown. In this embodiment the entry to be processed is Chinese, and the method comprises the following steps:

Step 401: receive the Chinese entry to be processed.

The entry may be a word group, several word groups, a single word, or a phrase.

Step 403: retrieve candidate bilingual sentence pairs from the index resource pool according to the Chinese entry to be processed.

When the entry to be processed is a single Chinese word, it does not need to be processed; the received entry is used directly as the target of the search in the index resource pool.

When the entry to be processed is a word group, a phrase, or another case that requires processing, the method further comprises, before step 403:

Step 402: perform word segmentation on the Chinese entry to be processed. The word segmentation technique has been described in detail above and, for brevity, is not repeated here.

After the candidate bilingual sentence pairs have been retrieved from the index resource pool according to the Chinese entry to be processed, further processing may be performed to improve efficiency and accuracy. The method may therefore also comprise the step of filtering the candidate bilingual sentence pairs: the entry to be processed is required to be a word string of the Chinese sentence in the bilingual sentence pair (checked with a word-string containment algorithm), and pairs that do not satisfy this requirement are filtered out.
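The text does not spell out the word-string containment check, so the following is only one plausible reading: the segmented entry must appear as a contiguous run of words in the segmented sentence. The names contains_word_string and filter_candidates, and the representation of a pair as a tuple of token lists, are assumptions for this sketch.

```python
def contains_word_string(sentence_words, entry_words):
    """Return True if entry_words occurs as a contiguous run in sentence_words."""
    k = len(entry_words)
    if k == 0:
        return False
    return any(
        sentence_words[i:i + k] == entry_words
        for i in range(len(sentence_words) - k + 1)
    )

def filter_candidates(pairs, entry_words, side):
    """Keep pairs whose sentence on the given side (0 or 1) contains the entry."""
    return [p for p in pairs if contains_word_string(p[side], entry_words)]

# Each pair is (Chinese tokens, English tokens); the data is made up for illustration.
pairs = [
    (["我", "住", "在", "广州"], ["i", "live", "in", "guangzhou"]),
    (["他", "喜欢", "生活"], ["he", "likes", "life"]),
]
kept = filter_candidates(pairs, ["live", "in"], side=1)
```

Only the first pair survives the filter, because "live in" is a contiguous word string of its English sentence. An index lookup alone would also return sentences where the entry's words occur but are not adjacent, which is why the extra check gives a more accurate result.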

Step 404: select 2 bilingual sentence pairs from the retrieval result, obtain the longest common substring of the Chinese sentences of the 2 pairs, and proceed to step 405.

Step 405: judge whether the substring is consistent with the entry to be processed. If it is not, select 2 bilingual sentence pairs from the retrieval result again and repeat step 404; if it is, proceed to step 406.

Suppose the 2 selected sentence pairs are (c1, e1) and (c2, e2). First judge whether the longest common substring of c1 and c2 (obtained with the LCS algorithm) is consistent with the Chinese entry to be processed. If it is not, select 2 sentence pairs again and repeat step 404. If it is, proceed to step 406.

Step 406: obtain the longest common substring of the English sentences of the 2 bilingual sentence pairs.

The English substring of the 2 qualifying bilingual sentence pairs is obtained with the improved longest common substring algorithm (iLCS); this substring is the English translation of the Chinese entry to be processed.
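Steps 404 to 406 can be sketched end to end as below. Everything named here (longest_common_run, translate_chinese_entry, the sample pairs) is hypothetical. As a simplification, the English side picks the common run with the most words, whereas the iLCS pseudocode weights runs by character length; the Chinese side is compared character by character.

```python
from itertools import combinations

def longest_common_run(a, b):
    """Longest contiguous run shared by two sequences (strings or lists)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def translate_chinese_entry(entry, pairs):
    """Return the English common word string of the first qualifying two pairs."""
    for (c1, e1), (c2, e2) in combinations(pairs, 2):
        # Steps 404/405: the Chinese common substring must equal the entry.
        if longest_common_run(c1, c2) == entry:
            # Step 406: word-level common substring of the English sentences.
            return " ".join(longest_common_run(e1.split(), e2.split()))
    return None

# Made-up candidate pairs (Chinese sentence, English sentence).
pairs = [
    ("我在上海工作", "i work in shanghai"),
    ("我住在广州", "i live in guangzhou"),
    ("他去过广州", "he has been to guangzhou"),
]
translation = translate_chinese_entry("广州", pairs)
```

The first two combinations are rejected because their Chinese common substring is not the entry; the third yields "guangzhou" as the translation.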

The technical solution of this embodiment presupposes that the index resource pool already exists; building the index resource pool of bilingual sentence pairs is a precondition of the present invention. The embodiments of the present invention may therefore also include the step of building the index resource pool of bilingual sentence pairs. The detailed process has been described above and is not repeated here.

Referring to Fig. 5, a flow chart of a third embodiment of the method of the present invention for obtaining word-pair translations from bilingual sentence pairs is shown. In this embodiment the entry to be processed is English, and the method comprises the following steps:

Step 501: receive the English entry to be processed.

Step 503: retrieve candidate bilingual sentence pairs from the index resource pool according to the English entry to be processed.

When the entry to be processed is a single English word, it does not need to be processed; the received entry is used directly as the target of the search in the index resource pool.

When the entry to be processed is a word group, a phrase, or another case that requires processing, the method further comprises, before step 503:

Step 502: perform word segmentation on the English entry to be processed. Because English words are separated by spaces, segmenting an English phrase is easy to implement.

Step 504: select 2 bilingual sentence pairs from the retrieval result, obtain the longest common substring of the English sentences of the 2 pairs, and proceed to step 505.

The longest common substring of the English sentences of the 2 bilingual sentence pairs is obtained with the improved longest common substring algorithm (iLCS).

Step 505: judge whether the substring is consistent with the English entry to be processed. If it is not, select 2 bilingual sentence pairs from the retrieval result again and repeat step 504; if it is, proceed to step 506.

Suppose the 2 selected sentence pairs are (c1, e1) and (c2, e2). First judge whether the longest common substring of e1 and e2 is consistent with the English entry to be processed. If it is not, select 2 sentence pairs again and repeat step 504. If it is, proceed to step 506.

Step 506: obtain the longest common substring of the Chinese sentences of the 2 bilingual sentence pairs.

The substring of the Chinese sentences of the 2 qualifying bilingual sentence pairs is obtained with the longest common substring algorithm (LCS); this substring is the Chinese translation of the English entry to be processed.

Referring to Fig. 6, a flow chart of a fourth embodiment of the method of the present invention for obtaining word-pair translations from bilingual sentence pairs is shown. It differs from the first embodiment of the method in that a plurality of substrings is obtained and the several substrings with the highest frequency are selected as the best word-pair translations for output. The method comprises the following steps:

Step 601: receive the entry to be processed.

Step 603: retrieve candidate bilingual sentence pairs from the index resource pool according to the entry to be processed.

When the entry to be processed is a word group, a phrase, or another case that requires processing, the method further comprises, before step 603:

Step 602: perform word segmentation on the entry to be processed.

Step 604: select 2 bilingual sentence pairs from the retrieval result, and obtain the longest common substring of the sentences in the 2 pairs that are in the same language as the entry to be processed.

Step 605: judge whether the substring is consistent with the entry to be processed. If it is not, repeat step 604; if it is, proceed to step 606.

When the entry to be processed is Chinese, judge whether the longest common substring of the Chinese sentences of the 2 bilingual sentence pairs is consistent with the entry. If it is not, repeat step 604: select two bilingual sentence pairs again and obtain the substring of their Chinese sentences. If it is, proceed to step 606. When the entry to be processed is English, judge whether the longest common substring of the 2 English sentences is consistent with the entry. If it is not, repeat step 604: select two bilingual sentence pairs again and obtain the substring of their English sentences. If it is, proceed to step 606.

Step 606: obtain the longest common substring of the corresponding sentences (those in the other language) of the 2 bilingual sentence pairs.

Step 607: judge whether every combination of 2 bilingual sentence pairs in the retrieval result has been processed.

If every combination of two bilingual sentence pairs has been processed, the procedure ends. If some combinations remain unprocessed, steps 604, 605 and 606 are repeated until all bilingual sentence pairs in the retrieval result have been handled. If the retrieval result contains N bilingual sentence pairs in total, N*(N-1)/2 combinations must be processed.
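The count N*(N-1)/2 is the number of unordered ways to choose 2 pairs out of N. A small check, in which the candidate labels and the function name pair_count are placeholders for illustration:

```python
from itertools import combinations

def pair_count(n):
    """Number of unordered combinations of 2 items out of n."""
    return n * (n - 1) // 2

# Four placeholder candidates stand in for retrieved bilingual sentence pairs.
candidates = ["pair_a", "pair_b", "pair_c", "pair_d"]
all_combinations = list(combinations(candidates, 2))
```

For 4 candidates, all_combinations holds 6 entries, which equals pair_count(4). The quadratic growth is one reason why filtering the candidates before this step helps efficiency.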

Step 608: determine the candidate substrings and output them as the word-pair translations.

All the longest common substrings that have been produced are sorted by frequency from high to low, and a threshold is preset. When the frequency of a substring is greater than or equal to the threshold, it is taken as a candidate substring and output as a word-pair translation. When the frequency is below the preset threshold, the substring is not necessarily an accurate word-pair translation, and other handling may be applied, such as discarding it without output. The preset threshold may be any natural number, such as 2 or 3.

The technical solution of this embodiment presupposes that the index resource pool already exists; building the index resource pool of bilingual sentence pairs is a precondition of the present invention. Building the index resource pool has been described in detail above and is not repeated here.
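The frequency threshold of step 608 can be sketched as follows. The function name select_translations and the sample substrings are assumptions; the default threshold of 2 is one of the example values the text mentions.

```python
from collections import Counter

def select_translations(substrings, threshold=2):
    """Sort substrings by frequency (descending) and keep those at or above threshold."""
    counts = Counter(substrings)
    return [(s, n) for s, n in counts.most_common() if n >= threshold]

# Made-up longest common substrings collected over all combinations.
found = ["guangzhou", "guangzhou", "in guangzhou",
         "guangzhou", "in guangzhou", "to"]
selected = select_translations(found, threshold=2)
```

Here "guangzhou" (3 occurrences) and "in guangzhou" (2 occurrences) are kept, while "to" (1 occurrence) falls below the threshold and is discarded.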

Referring to Fig. 7, a structural block diagram of a first embodiment of the system of the present invention for obtaining word-pair translations from bilingual sentence pairs is shown. The system comprises:

Receiving unit 701, configured to receive the entry to be processed.

Retrieval unit 702, configured to retrieve candidate bilingual sentence pairs from the index resource pool according to the entry to be processed.

Substring acquiring unit 703, configured to select 2 bilingual sentence pairs from the retrieval result and obtain the longest common substring of the sentences in the 2 pairs that are in the same language as the entry to be processed.

Judging unit 704, configured to judge whether the substring is consistent with the entry to be processed and, if it is not, to select 2 bilingual sentence pairs from the retrieval result again and call the substring acquiring unit.

First generation unit 705, configured to obtain the longest common substring of the corresponding sentences of the 2 bilingual sentence pairs.

The working principle and working process of the system are introduced below:

Receiving unit 701 receives the entry to be processed. The entry may be a word group, several word groups, a single word, or a phrase. It may be Chinese or English and may of course also be in another language, such as Japanese, Korean, German or French. Retrieval unit 702 retrieves candidate bilingual sentence pairs from the index resource pool according to the entry received by the receiving unit.

Substring acquiring unit 703 selects 2 bilingual sentence pairs from the retrieval result and obtains the longest common substring of the sentences in the 2 pairs that are in the same language as the entry to be processed. When the entry is Chinese, the longest common substring of the Chinese sentences of the pairs is obtained; when the entry is English, the longest common substring of the English sentences of the pairs is obtained. From the qualifying bilingual sentence pairs, 2 pairs are selected; the common substring of the Chinese sentences of the 2 pairs is obtained with the longest common substring algorithm (LCS), and the common substring of the English sentences of the 2 pairs is obtained with the improved longest common substring algorithm (iLCS).

LCS is an algorithm for finding the longest common substring of two character strings. A matrix records, for every pair of positions in the two strings, whether the two characters match: 1 if they match and 0 otherwise. The longest diagonal sequence of 1s is then found, and its position is the position of the longest matching substring.

Judging unit 704 judges whether the substring is consistent with the entry to be processed. If it is not, 2 bilingual sentence pairs are selected from the retrieval result again and the substring acquiring unit is called to obtain again the longest common substring of the sentences in the 2 pairs that are in the same language as the entry. If judging unit 704 judges that the substring is consistent with the entry to be processed, first generation unit 705 obtains the longest common substring of the corresponding sentences of the 2 bilingual sentence pairs.
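The matrix-based LCS procedure described above (a 0/1 match matrix followed by a search for the longest diagonal run of 1s) can be sketched literally as below. The function name lcs_by_match_matrix is an assumption for this example; production implementations usually fold the two phases into a single dynamic-programming pass.

```python
def lcs_by_match_matrix(a, b):
    """Character-level longest common substring via an explicit 0/1 match matrix."""
    # match[i][j] is 1 when a[i] equals b[j], otherwise 0.
    match = [[1 if ca == cb else 0 for cb in b] for ca in a]
    best_len, best_start = 0, 0
    for i in range(len(a)):
        for j in range(len(b)):
            # Only begin counting at the first cell of a diagonal run of 1s.
            starts_run = match[i][j] and (
                i == 0 or j == 0 or not match[i - 1][j - 1]
            )
            if starts_run:
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and match[i + k][j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_start = k, i
    return a[best_start:best_start + best_len]

shared = lcs_by_match_matrix("我住在广州", "他去过广州")
```

For the two example sentences, shared is "广州", the longest diagonal run of matching characters.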

The technical solution of this embodiment presupposes that the index resource pool already exists; building the index resource pool of bilingual sentence pairs is a precondition of the present invention.

A second embodiment of the system of the present invention for obtaining word-pair translations from bilingual sentence pairs comprises, in addition to the receiving unit, retrieval unit, substring acquiring unit, judging unit and first generation unit, a second judging unit and a translation generation unit. The second judging unit is configured to judge whether every combination of 2 bilingual sentence pairs in the retrieval result has been processed. If unprocessed bilingual sentence pairs remain, the substring acquiring unit, judging unit and first generation unit are called again. The translation generation unit is configured to select among all the longest common substrings obtained: when the frequency of a substring is greater than or equal to the preset threshold, the substring is output as a word-pair translation; when the frequency is below the preset threshold, the substring is not output as a translation.

Referring to Fig. 8, a structural block diagram of a third embodiment of the system of the present invention for obtaining word-pair translations from bilingual sentence pairs is shown. In addition to the receiving unit, retrieval unit, substring acquiring unit, judging unit and first generation unit, the third embodiment comprises an index generation unit, which comprises:

Acquiring unit 801, configured to obtain bilingual sentence pair resources.

Processing unit 802, configured to preprocess the bilingual sentence pair resources.

Second generation unit 803, configured to build an index from the preprocessed bilingual sentence pairs to form the index resource pool.

There are many methods for building an index, such as the inverted index method and the hashing mask method. The embodiments of the present invention preferably use the inverted index method.

A fourth embodiment of the system of the present invention for obtaining word-pair translations from bilingual sentence pairs may comprise, in addition to the receiving unit, retrieval unit, substring acquiring unit, judging unit, first generation unit and index generation unit: a word segmentation unit, configured to perform word segmentation on the entry to be processed; a filter unit, configured to filter the candidate bilingual sentence pairs according to the word-string containment algorithm to form a more accurate retrieval result; a translation processing unit, configured to collate, sort and screen the translations; and a translation output unit, configured to output the processed translations.

It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related description in the other embodiments.

The method and system provided by the present invention for obtaining word-pair translations from bilingual sentence pairs have been described in detail above. Specific examples have been used herein to set out the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

1. A method for obtaining word-pair translations from bilingual sentence pairs, characterized by comprising:

A. receiving an entry to be processed;

B. retrieving candidate bilingual sentence pairs from an index resource pool according to the entry to be processed;

C. selecting 2 bilingual sentence pairs from the retrieval result, and obtaining the longest common substring of the sentences in the 2 pairs that are in the same language as the entry to be processed;

D. judging whether the substring is consistent with the entry to be processed; if it is not, selecting 2 bilingual sentence pairs from the retrieval result again and repeating step C; if it is, then:

E. obtaining the longest common substring of the corresponding sentences of the 2 bilingual sentence pairs.
2. the method for claim 1 is characterized in that, also comprises:

F, repeating step C, D, E, until any 2 groups of bilingual sentences to all processed;

G, to the described whole Longest Common Substrings that obtain, sort from high to low according to frequency, determine candidate's substring according to predetermined threshold, exporting described candidate's substring is that speech is to translation.
3. The method of claim 1 or 2, characterized by further comprising:

obtaining bilingual sentence pair resources;

preprocessing the bilingual sentence pair resources;

building an index from the preprocessed bilingual sentence pairs to form the index resource pool.
4. The method of claim 3, characterized in that the index is built as follows:

the inverted index method is used to build the index for the bilingual sentence pairs.
5. the method for claim 1 is characterized in that, receives also to comprise step behind the pending entry:

Described pending entry is carried out word segmentation processing.
6. the method for claim 1 is characterized in that, from the index resources bank, retrieve the bilingual sentence of candidate to after, also comprise step:

Comprising algorithm according to word string, to filter the bilingual sentence of described candidate right, forms more accurate result for retrieval.
7. A method for obtaining word-pair translations from bilingual sentence pairs, characterized by comprising:

receiving a Chinese entry to be processed;

retrieving candidate bilingual sentence pairs from an index resource pool according to the Chinese entry to be processed;

selecting 2 bilingual sentence pairs from the retrieval result, and obtaining the longest common substring of the Chinese sentences of the 2 bilingual sentence pairs;

judging whether the substring is consistent with the entry to be processed; if it is not, selecting 2 bilingual sentence pairs from the retrieval result again and repeating the previous step; if it is, then:

obtaining the longest common substring of the English sentences of the 2 bilingual sentence pairs.
8. The method of claim 7, characterized in that, after the Chinese entry to be processed is received, the method further comprises the step of:

performing word segmentation on the Chinese entry to be processed.
9. A method for obtaining word-pair translations from bilingual sentence pairs, characterized by comprising:

receiving an English entry to be processed;

retrieving candidate bilingual sentence pairs from an index resource pool according to the English entry to be processed;

selecting 2 bilingual sentence pairs from the retrieval result, and obtaining the longest common substring of the English sentences of the 2 bilingual sentence pairs;

judging whether the substring is consistent with the entry to be processed; if it is not, selecting 2 bilingual sentence pairs from the retrieval result again and repeating the previous step; if it is, then:

obtaining the longest common substring of the Chinese sentences of the 2 bilingual sentence pairs.
10. The method of claim 9, characterized in that:

the longest common substring of the English sentences of the 2 bilingual sentence pairs is obtained according to an improved longest common substring algorithm.
11. The method of claim 7 or 9, characterized by further comprising:

obtaining bilingual sentence pair resources;

preprocessing the bilingual sentence pair resources;

building an index from the preprocessed bilingual sentence pairs to form the index resource pool.
12. The method of claim 11, characterized in that the index is built as follows:

the inverted index method is used to build the index for the bilingual sentence pairs.
13. A system for obtaining word-pair translations from bilingual sentence pairs, characterized by comprising:

a receiving unit, configured to receive an entry to be processed;

a retrieval unit, configured to retrieve candidate bilingual sentence pairs from an index resource pool according to the entry to be processed;

a substring acquiring unit, configured to select 2 bilingual sentence pairs from the retrieval result and obtain the longest common substring of the sentences in the 2 pairs that are in the same language as the entry to be processed;

a judging unit, configured to judge whether the substring is consistent with the entry to be processed and, if it is not, to select 2 bilingual sentence pairs from the retrieval result again and call the substring acquiring unit;

a first generation unit, configured to obtain the longest common substring of the corresponding sentences of the 2 bilingual sentence pairs.
14. The system of claim 13, characterized by further comprising an index generation unit, the index generation unit comprising:

an acquiring unit, configured to obtain bilingual sentence pair resources;

a processing unit, configured to preprocess the bilingual sentence pair resources;

a second generation unit, configured to build an index from the preprocessed bilingual sentence pairs to form the index resource pool.
15. The system of claim 13 or 14, characterized by further comprising:

a word segmentation unit, configured to perform word segmentation on the entry to be processed.
16. The system of claim 13 or 14, characterized by further comprising:

a filter unit, configured to filter the candidate bilingual sentence pairs according to a word-string containment algorithm to form a more accurate retrieval result.
17. The system of claim 13 or 14, characterized by further comprising:

a translation processing unit, configured to collate, sort and screen the translations;

a translation output unit, configured to output the processed translations.
18. The system of claim 13, characterized by further comprising:

a second judging unit, configured to judge whether every combination of 2 bilingual sentence pairs in the retrieval result has been processed;

a translation generation unit, configured to sort all the longest common substrings obtained by frequency from high to low, determine candidate substrings according to a predetermined threshold, and output the candidate substrings as the word-pair translations.