CN104346325B - Information processing method and information processing device - Google Patents

Information processing method and information processing device

Publication number
CN104346325B
Authority
CN
China
Prior art keywords
word unit
translation rule
multi word
translation
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310325244.2A
Other languages
Chinese (zh)
Other versions
CN104346325A (en)
Inventor
郑仲光
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201310325244.2A priority Critical patent/CN104346325B/en
Publication of CN104346325A publication Critical patent/CN104346325A/en
Application granted granted Critical
Publication of CN104346325B publication Critical patent/CN104346325B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides an information processing method and an information processing device. The information processing method includes: identifying a multi-word unit in information; looking up, in a translation rule database, the translation rules matched respectively by the most similar multi-word unit of the multi-word unit and by its associated word strings, where the associated word strings include all substrings of the multi-word unit and the multi-word units that partially overlap it; determining the score of each translation rule according to its relationship with the translation rule matched by the most similar multi-word unit; and determining the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition. Because the translation of a multi-word unit that has no matching translation rule is determined with reference to the translation rule of its most similar multi-word unit, the readability of the translation is improved.

Description

Information processing method and device
Technical field
The application relates to the field of natural language processing, and in particular to an information processing method and device for machine translation.
Background art
Statistical machine translation is a natural language processing technique that automatically converts one language into another by means of learned translation rules and a given algorithm.
However, in some specific fields, such as scientific and technical literature, multi-word units occur frequently and often have no matching translation rule. Such multi-word units are then usually translated word by word, which causes translation errors and harms the readability of the translation.
Summary of the invention
Embodiments of the application provide an information processing method and device that use the most similar multi-word unit to process a multi-word unit that has no matching translation rule, improving the readability of the translation.
An embodiment of the application provides an information processing method, including: identifying a multi-word unit in information; looking up, in a translation rule database, the translation rules matched respectively by the most similar multi-word unit of the multi-word unit and by its associated word strings, the associated word strings including all substrings of the multi-word unit and the multi-word units that partially overlap the multi-word unit; determining the score of each translation rule according to its relationship with the translation rule matched by the most similar multi-word unit; and determining the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
Another embodiment of the application provides an information processing device, including: a multi-word unit identification module configured to identify a multi-word unit in information; a translation rule lookup module configured to look up, in a translation rule database, the translation rules matched respectively by the most similar multi-word unit of the multi-word unit and by its associated word strings, the associated word strings including all substrings of the multi-word unit and the multi-word units that partially overlap the multi-word unit; a score determination module configured to determine the score of each translation rule according to its relationship with the translation rule matched by the most similar multi-word unit; and a translation result determination module configured to determine the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
In the application, a multi-word unit that has no matching translation rule is translated with reference to the translation rule of its most similar multi-word unit, which improves the readability of the translation.
Brief description of the drawings
The above and other objects, features and advantages of the application will be more readily understood from the following description of its embodiments with reference to the accompanying drawings. The drawings illustrate the application by way of example and not of limitation. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals.
Fig. 1 to Fig. 3 illustrate parallel corpora of a machine translation system;
Fig. 4 is a schematic diagram of labeling information with an annotator;
Fig. 5 is a schematic flowchart of embodiment one of the translation method for multi-word units;
Fig. 6 is a schematic flowchart of embodiment two of the translation method for multi-word units;
Fig. 7 is a flowchart of an application example of the information processing method provided by the application;
Fig. 8 illustrates the translation result of the information in the application example shown in Fig. 7;
Fig. 9 is a schematic structural diagram of the information processing device provided by an embodiment of the application; and
Fig. 10 is a schematic structural diagram of the computing device provided by an embodiment of the application.
Detailed description
Embodiments of the application are now described more fully with reference to the accompanying drawings. The example embodiments are provided so that the application is thorough and fully conveys its scope of protection to those skilled in the art. Numerous specific details, such as examples of particular components and devices, are set forth to provide a thorough understanding of the embodiments. It will be apparent to those skilled in the art that the example embodiments can be implemented in many different forms without necessarily using these specific details, and the details should therefore not be construed as limiting the scope of the application. In addition, elements and features described in one drawing or one embodiment of the application can be combined with elements and features shown in one or more other drawings or embodiments. In some example embodiments, well-known processes, structures and techniques are not described in detail in the drawings and the description, for the sake of clarity.
The information processing method provided by the application is described below with reference to the drawings and specific embodiments. The description first covers how multi-word units are extracted and then how the extracted multi-word units are translated.
Embodiment one of the extraction method for multi-word units
This embodiment provides an extraction method for multi-word units, as follows.
A bilingual parallel corpus is obtained from the corpus of the machine translation system. A parallel corpus consists of paired information obtained by aligning two languages at the level of chapters, paragraphs, sentences and so on. Taking Chinese and English as the two languages, for example, a parallel corpus such as that shown in Fig. 1 is obtained, together with the word-level translation correspondences between the Chinese text and the English text shown in Fig. 2, where arrows indicate the correspondences between words. These word correspondences, as well as the English-side part-of-speech tagging and English multi-word unit extraction described below, can be produced manually or by a machine running a corresponding program; no limitation is imposed here.
Because English takes the word as its basic linguistic unit and is less ambiguous than Chinese, multi-word units can be obtained by analyzing the English side, and the multi-word units on the Chinese side can then be determined through the English-Chinese translation correspondences. As shown in Fig. 3, part-of-speech tags are added on the English side, where, for example, "VV" denotes a verb, "NN" a noun, "P" a preposition, "DT" an article, "VBG" a gerund and "JJ" an adjective. Word strings consisting of one or more consecutive words tagged NN (noun) are then selected, yielding an English word string such as "polymeric cyanoacrylate film". In other words, analyzing the English side yields word strings made up of several English words, that is, the multi-word units on the English side.
Next, the Chinese-side word string corresponding to each English-side multi-word unit is obtained. For example, "polynitriles/base/acrylic acid/ester/film", which corresponds to "polymeric cyanoacrylate film", is obtained, and this Chinese-side word string is taken as a multi-word unit.
In this way a large number of matched multi-word unit pairs can be obtained from the corpus of the machine translation system, for example <polymeric cyanoacrylate film, polynitriles/base/acrylic acid/ester/film>.
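The English-side selection and the projection onto the Chinese side can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the tag sequence, the word alignment and the Chinese-side tokens (written here as the glosses used in the text) are all hypothetical.

```python
from typing import Dict, List, Tuple


def noun_runs(tags: List[str], noun_tag: str = "NN") -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of maximal runs of noun_tag."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags + ["<END>"]):  # sentinel closes a trailing run
        if tag == noun_tag and start is None:
            start = i
        elif tag != noun_tag and start is not None:
            runs.append((start, i))
            start = None
    return runs


def project(span: Tuple[int, int],
            alignment: Dict[int, List[int]],
            target_tokens: List[str]) -> List[str]:
    """Collect the target-side tokens aligned to any source word in the span."""
    start, end = span
    idx = sorted({j for i in range(start, end) for j in alignment.get(i, [])})
    return [target_tokens[j] for j in idx]


# Hypothetical example data (tags and alignment are assumptions).
en_tokens = ["the", "polymeric", "cyanoacrylate", "film"]
en_tags = ["DT", "NN", "NN", "NN"]
zh_tokens = ["polynitriles", "base", "acrylic acid", "ester", "film"]
alignment = {1: [0], 2: [1, 2, 3], 3: [4]}  # English index -> Chinese indices

pairs = [
    (" ".join(en_tokens[s:e]), "/".join(project((s, e), alignment, zh_tokens)))
    for s, e in noun_runs(en_tags)
]
```

With this data, `pairs` holds one matched pair of the kind the text describes, which could then be added to the extractor's training set.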
A training set for a Chinese multi-word unit extractor is built from the large number of matched multi-word unit pairs obtained, and the extractor is trained on this training set so that it is able to extract Chinese multi-word units from information. The multi-word unit extractor can be implemented with any available classifier and trained in any manner.
After training, the multi-word unit extractor can be used to extract multi-word units from Chinese information.
Note that the application uses a Chinese-English parallel corpus as an example and takes Chinese as the language to be processed, but the application is not limited to this. Conversely, when English is the language to be processed, the Chinese multi-word units can be determined first when building the training set, and the English multi-word units then obtained by mapping. A multi-word unit extractor trained on such a training set can be used to extract multi-word units from English information. The scheme can of course be applied to any pair of languages that are translated into each other.
In addition, the Chinese multi-word units in the training set can also be labeled directly in the Chinese corpus, manually or by machine, without using a parallel corpus.
In addition, in the above embodiment, because of differences between languages, the Chinese "multi-word unit" corresponding to an English multi-word unit may not be a real multi-word unit and may contain only one word.
Therefore, in a variant of this embodiment, the multi-word units undergo further screening, for example removing multi-word units that contain only one word, so as to obtain multi-word units of high reliability. For convenience, the multi-word units obtained after this further screening are referred to below as credible multi-word units.
In addition, in the above embodiment, differences between languages or between extraction methods may also lead to unsuitable multi-word units being extracted. For example, "memory" together with an extra word may be extracted as one multi-word unit, where the extra word should be deleted.
Thus, in another variant of this embodiment, other conditions can be set, alone or in combination with the above screening condition. For example, the extracted multi-word units can be screened further as follows:
A stop-word list is set up for the corpus. The list contains, for example, words that have no substantial effect on translation or that are very common and unlikely to cause translation errors. The stop words are set by the user according to actual needs, for example high-frequency collocations such as "one/kind" and "this/invention" that are unlikely to cause translation errors;
For a multi-word unit of two words, the multi-word unit is excluded if both words are in the stop-word list;
For a multi-word unit of three or more words, if a boundary word of the multi-word unit is a stop word, that boundary word is deleted and the remaining multi-word unit is taken as a credible multi-word unit; if no boundary word is a stop word, the multi-word unit is taken directly as a credible multi-word unit. The first case can be restricted further, so that the remaining multi-word unit is taken as a credible multi-word unit only when it occurs among the extracted multi-word units more often than a set threshold. For example, for the extracted multi-word unit "/short of money/anti-/ agent", its leftmost and rightmost words are examined. A boundary word is found to be a stop word, that is, a word in the stop-word list, so it is removed first. It is then checked whether the remaining part "short of money/anti-/ agent" occurs among the extracted multi-word units. If it does, and its count exceeds the set threshold, for example 3, the multi-word unit "short of money/anti-/ agent" left after removing the boundary word is considered a credible multi-word unit.
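The screening rules above can be sketched as a single function. This is an illustrative reading, not the patent's implementation: the name `screen_unit`, the choice to strip at most one stop word from each end, the rejection of remainders shorter than two words, and the sample data are assumptions.

```python
from collections import Counter
from typing import Optional, Set, Tuple

Unit = Tuple[str, ...]


def screen_unit(unit: Unit,
                stop_words: Set[str],
                extracted: Counter,
                min_count: int = 3) -> Optional[Unit]:
    """Return the credible multi-word unit, or None if the unit is rejected."""
    if len(unit) < 2:
        return None  # a single word is not a multi-word unit
    if len(unit) == 2:
        # Two-word units are excluded only when both words are stop words.
        return None if all(w in stop_words for w in unit) else unit
    # Three or more words: examine the leftmost and rightmost words.
    start, end = 0, len(unit)
    if unit[start] in stop_words:
        start += 1
    if unit[end - 1] in stop_words:
        end -= 1
    if (start, end) == (0, len(unit)):
        return unit  # no boundary stop word: credible as it is
    remaining = unit[start:end]
    # Keep the trimmed unit only if it was itself extracted often enough.
    if len(remaining) >= 2 and extracted[remaining] > min_count:
        return remaining
    return None


# Hypothetical data for illustration.
stop_words = {"the", "of"}
extracted = Counter({("polymer", "film"): 5})
credible = screen_unit(("the", "polymer", "film"), stop_words, extracted)
```

Here `credible` is the trimmed unit, because the remainder occurs more than `min_count` times among the extracted units.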
In the above variants, screening the multi-word units yields a training set made up of credible multi-word units.
In another variant of this embodiment, the information added to the training set is further processed so that the proportion of multi-word units contained in the information in the training set, such as sentences, exceeds a preset minimum proportion.
First a multi-word unit set Lt is built. This set can be built from the multi-word units obtained in the above embodiment, or from the credible multi-word units obtained in its variants.
Once the multi-word unit set Lt has been obtained, it is used to select sentences from the corpus to build the training set of the multi-word unit annotator, as follows.
Sentences in the corpus are selected according to the proportion of multi-word units in each sentence, where the multi-word units counted are those in the multi-word unit set.
For example, a sentence in the corpus is added to the training set of the multi-word unit extractor when it meets the following condition:
Formula 1.0: $\dfrac{\sum \mathrm{wordnum}(\mathrm{MWU\_in\_Lt})}{\mathrm{wordnum}(\mathrm{sentence})} > k$
Here, wordnum(*) denotes the number of words in the word string "*"; MWU_in_Lt denotes a multi-word unit that occurs in Lt; sentence denotes the sentence; and k is a preset threshold whose value can be from 0.1 to 0.3, for example 0.2.
That is, a minimum is set for the proportion of a sentence's total word count that is made up of words in multi-word units, and sentences above this proportion can be added to the training set of the multi-word unit annotator.
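The sentence selection just described can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and counting each covered word position once (so that overlapping units are not double counted) is an assumption not stated in the text.

```python
from typing import List, Sequence, Tuple

Unit = Tuple[str, ...]


def mwu_word_ratio(sentence: Sequence[str], units: Sequence[Unit]) -> float:
    """Share of the sentence's words that lie inside a multi-word unit of Lt."""
    covered = [False] * len(sentence)
    for unit in units:
        n = len(unit)
        for i in range(len(sentence) - n + 1):
            if tuple(sentence[i:i + n]) == tuple(unit):
                for p in range(i, i + n):
                    covered[p] = True
    return sum(covered) / len(sentence) if sentence else 0.0


def select_sentences(sentences: List[List[str]],
                     units: Sequence[Unit],
                     k: float = 0.2) -> List[List[str]]:
    """Keep the sentences whose multi-word unit ratio is above the threshold k."""
    return [s for s in sentences if mwu_word_ratio(s, units) > k]


# Hypothetical data for illustration.
lt = [("polymeric", "cyanoacrylate", "film")]
s1 = ["cover", "the", "wound", "with", "polymeric", "cyanoacrylate", "film", "now"]
s2 = ["the", "film", "is", "thin"]
training_sentences = select_sentences([s1, s2], lt, k=0.2)
```

In this example three of the eight words of `s1` belong to a unit in `lt`, so `s1` passes the threshold of 0.2 while `s2` does not.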
Embodiment two of the extraction method for multi-word units
The application also provides another extraction method for multi-word units, which relies on a multi-word unit annotator. The annotator can label the beginning word, middle words and ending word of each multi-word unit in the information, and the machine translation system can determine a multi-word unit from its consecutive beginning word, middle words and ending word. The annotator can be trained in a manner similar to the previous embodiment, except that in the training set obtained in the previous embodiment each word of a multi-word unit must additionally be labeled with its position in the multi-word unit, for example beginning word, middle word or ending word.
For example, the sentences in the training set can be segmented into words and the multi-word units in them labeled by position. For the sentence "covering the wound of skin surface with the polycyanoacrylate film with broad-spectrum anti-microbial activity", the result of adding multi-word unit labels is shown in Fig. 4, where n denotes a word outside any multi-word unit, b the beginning of a multi-word unit, m the middle of a multi-word unit and e the end of a multi-word unit.
Then, using this training set, the annotator is trained with a machine learning model such as a conditional random field (CRF), so that it can label whether each word in the information belongs to a multi-word unit and, if so, its position in the multi-word unit.
After the multi-word unit annotator has been trained, it is used to process the information received by the machine translation system and to identify beginning words, middle words and ending words in the information. A consecutive beginning word, at least one middle word and ending word, or a consecutive beginning word and ending word (with no middle word), are then determined to be a multi-word unit.
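Turning the annotator's labels into multi-word units can be sketched as below. The function name and the label sequence are assumptions made for illustration (the labels in the figures are not reproduced in this text); only the b/m/e/n label scheme comes from the description.

```python
from typing import List, Tuple


def decode_units(tokens: List[str], labels: List[str]) -> List[Tuple[str, ...]]:
    """Collect each consecutive b, m..., e sequence as one multi-word unit."""
    units: List[Tuple[str, ...]] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "b":
            current = [token]          # start a new candidate unit
        elif label == "m" and current:
            current.append(token)      # extend the open unit
        elif label == "e" and current:
            current.append(token)      # close the unit and keep it
            units.append(tuple(current))
            current = []
        else:
            current = []               # "n" or an out-of-order label resets
    return units


# Token glosses from the application example; the labels are hypothetical.
tokens = ["peroxide", "removing", "agent", "include", "meat", "bean", "bandit", "fat"]
labels = ["b", "m", "e", "n", "b", "m", "m", "e"]
units = decode_units(tokens, labels)
```

A beginning word followed directly by an ending word is also accepted, matching the case in which there is no middle word.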
It should be noted that various extraction methods for multi-word units exist in the prior art. Multi-word units can also be obtained with such prior-art extraction methods instead of the ones provided by the embodiments of the application, and the translation method provided below can still be used to translate them. The extraction methods provided by the embodiments of the application are given only as examples and not as limitations.
Translation method for multi-word units
After multi-word units have been extracted from the information by any of the above extraction methods, the extracted multi-word units are translated.
In the application, when a multi-word unit is translated, the translation rules matched respectively by its most similar multi-word unit and by its associated word strings are looked up in the translation rule database. The most similar multi-word unit is either the multi-word unit itself or a multi-word unit whose similarity to it meets a preset condition. For example, when a translation rule matching the multi-word unit exists, the most similar multi-word unit is the multi-word unit itself; when no such translation rule exists, the most similar multi-word unit is another multi-word unit whose similarity to the multi-word unit meets the preset condition (for convenience, such a multi-word unit is referred to below as a similar multi-word unit). The associated word strings include all substrings of the multi-word unit (including the multi-word unit itself) and the multi-word units that partially overlap it.
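As a small illustration of the first kind of associated word string, the sketch below enumerates all contiguous word-level substrings of a multi-word unit, including the unit itself. Treating "substring" as a contiguous run of words is an assumption, the function name is hypothetical, and partially overlapping multi-word units are not covered here.

```python
from typing import List, Sequence, Tuple


def word_substrings(unit: Sequence[str]) -> List[Tuple[str, ...]]:
    """All contiguous word-level substrings of the unit, including the unit itself."""
    n = len(unit)
    return [tuple(unit[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


candidates = word_substrings(("peroxide", "removing", "agent"))
```

Each element of `candidates` is a word string that could be looked up in the translation rule database.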
The translation of multi-word units is described below through different embodiments.
Embodiment (one) of the translation method for multi-word units
This embodiment provides a translation method for multi-word units which, as shown in Fig. 5, includes:
Step S501: look up, in the translation rule database, the translation rules matched respectively by the multi-word unit and by its associated word strings.
The translation rule database is a database, built in advance in the machine translation system, that stores translation rules, including translation rules for multi-word units and translation rules for single words.
Step S502: judge whether a translation rule matching the multi-word unit exists.
If it exists, perform step S503; otherwise, perform step S504.
Step S503: determine the translation result of the multi-word unit according to the translation rule matching the multi-word unit.
This step can be carried out as in the prior art when a translation rule matching the multi-word unit exists, for example by using the matched translation rule directly to determine the translation result of the multi-word unit; it is not described further here.
Step S504: look up a similar multi-word unit of the multi-word unit in the translation rule database and obtain the translation rule matched by the similar multi-word unit, denoted s.
The similarity used to determine a similar multi-word unit meeting the preset condition can be defined, without limitation, by whether the two units share one or more identical words. For example, for a multi-word unit of two words, the similarity condition is that the similar multi-word unit contains one identical word; for a multi-word unit of three or more words, the condition is that it contains two or more identical words, or more than half of the words. The application is not limited to this.
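The example similarity conditions can be sketched as a predicate. This is one possible reading, not the patent's implementation: the function name and the `majority` switch between the two alternatives for longer units are assumptions, as is comparing words as sets.

```python
from typing import Sequence


def is_similar(unit: Sequence[str],
               candidate: Sequence[str],
               majority: bool = False) -> bool:
    """Check the example similarity condition between a unit and a candidate."""
    if tuple(unit) == tuple(candidate):
        return False  # a similar unit is a unit other than the unit itself
    shared = len(set(unit) & set(candidate))
    if len(unit) == 2:
        return shared >= 1             # two-word unit: one identical word
    if majority:
        return shared > len(unit) / 2  # alternative: more than half of the words
    return shared >= 2                 # three or more words: two or more identical words


# Hypothetical candidates for illustration.
unit = ("peroxide", "removing", "agent")
similar = is_similar(unit, ("oxide", "removing", "agent"))
```

A real system would apply such a predicate to the source sides of the multi-word unit rules in the translation rule database to find candidates for the similar multi-word unit.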
Step S505: for each translation rule matched by each substring of the multi-word unit, determine the score of the translation rule according to its relationship with the translation rule of the similar multi-word unit.
This embodiment provides, for example, the following way of determining the score of each translation rule from that relationship:
For each translation rule <s, ti>, count the number of times the translation rule occurs in the reference translation rule, which in this embodiment is the translation rule matched by the similar multi-word unit; this count is c1(s, ti);
Count the total number of times that all translation rules matching the source-language part s of the translation rule occur in the translation rule matched by the similar multi-word unit; this total is Sum(c1(s, ti'));
Determine the score of each translation rule according to c1(s, ti) and Sum(c1(s, ti')).
Here the score of a translation rule can be determined according to the following formula:
Formula 2.0
Here, score(s, ti) denotes the score of a translation rule; c(s, ti) denotes the number of times the translation rule occurs in the reference translation rule, which in this step is c1(s, ti) above; s denotes, for example, a Chinese multi-word unit; and ti denotes one translation result, for example an English translation result when the method is applied to Chinese-to-English translation.
Sum(c(s, ti')) denotes the total number of times that all translation rules matching the part s of the translation rule occur in the corresponding reference translation rule, which in this step is Sum(c1(s, ti')) above.
m and n are preset score coefficients, each of which can take a value from 0 to 1.
The score of a translation rule can be determined as follows:
When the value of c(s, ti) is 0, the score of the translation rule is a preset default value, for example -1;
When the value of c(s, ti) is not 0, the score of the translation rule is calculated according to Formula 2.0.
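The two-case scoring procedure can be sketched as follows. Because Formula 2.0 itself is not reproduced in this text, the sketch takes the formula as a caller-supplied function; the relative-frequency function used in the example is only a stand-in and is not Formula 2.0 (it ignores the coefficients m and n). The function names, rule representation and counts are hypothetical.

```python
from typing import Callable, Dict, Tuple

Rule = Tuple[str, str]  # (source-language part s, translation ti)


def score_rules(counts: Dict[Rule, int],
                formula: Callable[[int, int], float],
                default: float = -1.0) -> Dict[Rule, float]:
    """Score each rule from c(s, ti) and Sum(c(s, ti')) over rules sharing s."""
    totals: Dict[str, int] = {}
    for (s, _t), c in counts.items():
        totals[s] = totals.get(s, 0) + c
    return {
        rule: default if c == 0 else formula(c, totals[rule[0]])
        for rule, c in counts.items()
    }


def relative_frequency(c: int, total: int) -> float:
    """Stand-in scoring function for illustration only; NOT Formula 2.0."""
    return c / total


# Hypothetical counts of each rule in the reference translation rule.
counts = {
    ("removing", "remove"): 0,
    ("removing", "scavenging"): 3,
    ("removing", "clearing"): 1,
}
scores = score_rules(counts, relative_frequency)
```

Rules that never occur in the reference translation rule receive the default value, and all other rules are scored by whatever formula is passed in.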
It should be understood that the above way of determining the score of each translation rule from its relationship with the translation rule of the similar multi-word unit is given only as an example and not as a limitation; other variations or changes can be made on this basis, and the application is not limited to this.
Step S506: determine the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
For example, the translation result of the multi-word unit is determined according to the translation rules whose scores exceed a threshold, or according to the translation rule with the highest score.
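The two example selection conditions can be sketched as one helper. The function name, the rule representation and the sample scores are hypothetical; how the selected rules are combined into the final translation is left to the decoder and is not shown.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, str]  # (source-language part, translation)


def select_rules(scores: Dict[Rule, float],
                 threshold: Optional[float] = None) -> List[Rule]:
    """Return the rules above the threshold, or the top-scoring rules if no threshold is given."""
    if not scores:
        return []
    if threshold is not None:
        return [rule for rule, s in scores.items() if s > threshold]
    best = max(scores.values())
    return [rule for rule, s in scores.items() if s == best]


# Hypothetical scores for illustration.
scores = {
    ("removing", "scavenging"): 0.75,
    ("removing", "clearing"): 0.25,
    ("removing", "remove"): -1.0,
}
top = select_rules(scores)
above_zero = select_rules(scores, threshold=0.0)
```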
Embodiment (two) of the translation method for multi-word units
This embodiment provides a translation method for multi-word units which, as shown in Fig. 6, includes:
Step S601: look up, in the translation rule database, the translation rules matched respectively by the multi-word unit and by its associated word strings.
For this step, refer to step S501 above; it is not described again here.
Step S602: judge whether a translation rule matching the multi-word unit exists.
If it exists, perform step S603; otherwise, perform step S604.
Step S603: determine the translation result of the multi-word unit according to the translation rule matching the multi-word unit and the translation rules matched by the associated word strings of the multi-word unit.
This step can be carried out as follows:
Set the score of the translation rule matching the multi-word unit to a preset reward score, that is, a first predetermined score; and
For each translation rule matched by each substring of the multi-word unit, determine the score of the translation rule according to its relationship with the translation rule matching the multi-word unit;
Determine the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
The preset reward score is, for example, +1, or another value greater than the default value mentioned above.
The score of each translation rule can be determined from its relationship with the translation rule matching the multi-word unit as follows:
For each translation rule <s, ti>, count the number of times the translation rule occurs in the reference translation rule, that is, c2(s, ti); here the reference translation rule is the translation rule matching the multi-word unit;
Count the total number of times that all translation rules matching the substring corresponding to the translation rule (that is, its source-language part s) occur in the translation rule matching the multi-word unit, that is, Sum(c2(s, ti'));
Determine the score of each translation rule according to c2(s, ti) and Sum(c2(s, ti')).
The score can be determined from c2(s, ti) and Sum(c2(s, ti')) in the same way as with Formula 2.0 in the previous embodiment, where c(s, ti) in Formula 2.0 is c2(s, ti) above and Sum(c(s, ti')) is Sum(c2(s, ti')) above. The other aspects of applying Formula 2.0 are similar to the previous embodiment and are not repeated here.
For steps S604 to S606, refer to steps S504 to S506 above; a detailed description is omitted here. Note that, when a translation rule matching the multi-word unit exists, step S606 uses the relative magnitudes of the reward score and the calculated scores of the other translation rules to decide whether the translation result of the multi-word unit is determined from the translation rule matching the multi-word unit or from other translation rules.
Embodiment (three) of the translation method for multi-word units
The translation method provided by this embodiment adds the following steps to embodiment (one) or (two):
The following operations are performed after step S501 or S601:
Judge whether the substring of each translation rule destroys the boundary of the multi-word unit;
If the result is yes, set the score of the translation rule matched by the substring to a preset penalty score, that is, a second predetermined score;
If the result is no, continue with the next step, that is, step S502 or S602.
Whether the substring of a translation rule destroys the boundary of the multi-word unit can be judged as follows: for each candidate rule, whether the rule destroys the boundary of the multi-word unit is judged from its matching range.
For example, a position is assigned to each word in the sentence containing the multi-word unit, using natural numbers in ascending order. In "peroxide/removing/agent/include/meat/bean/bandit/fat" the positions of the words are "1/2/3/4/5/6/7/8" in order, that is, "peroxide" corresponds to 1, "removing" to 2, and so on.
Let spr[i, j] be the matching range of a translation rule and spmwu[g, h] the matching range of the multi-word unit. Then:
If (1) i < g and g < j < h, or (2) j > h and g < i < h, the translation rule is considered to destroy the boundary of the multi-word unit;
If g < i ≤ j ≤ h or g ≤ i ≤ j < h, the translation rule is considered a partial translation rule of the multi-word unit, that is, it does not destroy the boundary of the multi-word unit.
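The two range conditions translate directly into predicates. The sketch below copies the inequalities as stated in the text; the function names and the example ranges are assumptions made for illustration.

```python
def destroys_boundary(i: int, j: int, g: int, h: int) -> bool:
    """True if rule range [i, j] crosses one boundary of the unit range [g, h]."""
    return (i < g and g < j < h) or (j > h and g < i < h)


def is_partial_rule(i: int, j: int, g: int, h: int) -> bool:
    """True if rule range [i, j] lies inside the unit range [g, h] as a proper part."""
    return (g < i <= j <= h) or (g <= i <= j < h)


# Word positions follow the example sentence: the unit
# "peroxide/removing/agent" occupies the range [1, 3].
g, h = 1, 3
crossing = destroys_boundary(2, 4, g, h)   # a rule spanning positions 2 to 4
inside = is_partial_rule(1, 2, g, h)       # a rule spanning positions 1 to 2
```

A rule covering positions 2 to 4 starts inside the unit and ends beyond it, so it is flagged; a rule covering positions 1 to 2 stays inside the unit and counts as a partial translation rule.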
The preset penalty score is a small value, so that the machine translation system will not determine the translation result of the multi-word unit from a translation rule that destroys the boundary.
Embodiment (four) of the translation method for multi-word units
The translation method provided by this embodiment is similar to embodiment (three), except that the judgment of whether the boundary is destroyed is made at a different time: in this embodiment the judgment can be made before step S505 or S605. For other details, refer to embodiment (three); they are not described again here.
Application example of the information processing method
The information processing method provided by this application is described below, taking machine translation of the information "peroxide/removing/agent/include/meat/bean/bandit/fat" as an example.
For convenience of description, this example uses embodiment (Two) of the multi-word unit extraction method described above to identify the multi-word units in the information, and embodiment (Three) of the translation method for multi-word units to translate them. Note that other combinations of the above embodiments can also be used to process the information; the process is analogous to this example and is not repeated here. Moreover, this application example is for descriptive purposes only and does not limit the specific applications of the invention.
Fig. 7 shows a flowchart of the application example of the information processing method provided by this application, including:
Step S701: an annotator is used to identify the multi-word units in the information "peroxide/removing/agent/include/meat/bean/bandit/fat", yielding the multi-word units "peroxide/removing/agent" and "meat/bean/bandit/fat".
The annotator's labeling of the information "peroxide/removing/agent/include/meat/bean/bandit/fat" is shown in Fig. 8. The multi-word units "peroxide/removing/agent" and "meat/bean/bandit/fat" are obtained from consecutive beginning words, middle words and ending words of multi-word units.
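How consecutive labels are turned into multi-word units can be illustrated with a short Python sketch. Fig. 8 is not reproduced here, so the tag names "B" (beginning word), "M" (middle word), "E" (ending word) and "O" (any other word), as well as the function name, are assumptions made for illustration only.

```python
def extract_mwus(words, tags):
    """Collect runs labeled B (M)* E as multi-word units.

    Returns (start, end, words) triples with 1-based inclusive positions.
    """
    units = []
    start = None
    for pos, tag in enumerate(tags, start=1):
        if tag == "B":
            start = pos                      # open a new candidate unit
        elif tag == "E" and start is not None:
            units.append((start, pos, words[start - 1:pos]))
            start = None                     # unit closed
        elif tag != "M":
            start = None                     # any other tag breaks the run
    return units


words = ["peroxide", "removing", "agent", "include",
         "meat", "bean", "bandit", "fat"]
tags = ["B", "M", "E", "O", "B", "M", "M", "E"]  # hypothetical annotator output
units = extract_mwus(words, tags)
```

With this hypothetical labeling, `units` holds the two units of the example, occupying positions 1 to 3 and 5 to 8.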
Step S702: translation rules that match, respectively, the multi-word units and their associated word strings are looked up in the translation rule database.
In this example, as shown in Table 1, translation rules are found that match, respectively, the multi-word unit "peroxide/removing/agent" and its associated substrings "peroxide", "peroxide/removing", "peroxide/removing/agent", "peroxide/X1", "X1/removing" and "removing/X1/include"; translation rules are also found that match, respectively, the substrings associated with the multi-word unit "meat/bean/bandit/fat", namely "include/meat", "include/X1", "meat", "bean" and "fat".
Table 1
Step S703: for each candidate rule, whether the translation rule destroys the boundary of a multi-word unit is judged from its matching range. If the translation rule destroys a boundary, the score of the rule is set to a penalty value; otherwise, the following processing continues.
Here, the positions of the words in the information "peroxide/removing/agent/include/meat/bean/bandit/fat" are preset, for example, as "1/2/3/4/5/6/7/8" in order, so the ranges of the multi-word units "peroxide/removing/agent" and "meat/bean/bandit/fat" are "1/2/3" and "5/6/7/8", respectively. For the translation rules in Table 1 above, ranges [2,4] and [4,5] destroy boundary 3 of the multi-word unit "peroxide/removing/agent" and boundary 5 of the multi-word unit "meat/bean/bandit/fat", respectively, so the scores of translation rules (6) to (8) are set to a penalty value, e.g., -1.
Step S704: the machine translation system finds the translation rule that matches the multi-word unit "peroxide/removing/agent" and sets the score of that translation rule to a reward score, e.g., 1.
Step S705: the machine translation system calculates the scores of the translation rules matched by the substrings associated with the multi-word unit "peroxide/removing/agent".
Taking rule (2) as an example, its score is calculated according to formula 2.0. Since rule (2) does not occur in the rule matched by the multi-word unit, i.e., rule (3), c2(s, ti) = 0. A default value is used in this case, i.e., the score of rule (2) is set to the default value -1.
For rules (1) and (4), the number of times each occurs in the translation rule matched by the multi-word unit is 1, i.e., c2(s, ti) = 1, and the corresponding substring has no other translation rules, so Sum(c2(s, ti')) = 1. The score is calculated according to formula 2.0: assuming m = n = 1, score(s, ti) = 1.
Step S706: the machine translation system finds no translation rule matching the multi-word unit "meat/bean/bandit/fat", and therefore calculates the score of the translation rule of each substring according to the translation rule of its similar multi-word unit "meat/bean/cool/fat".
For example, if the translation rule of the similar multi-word unit "meat/bean/cool/fat" of "meat/bean/bandit/fat" is <meat/bean/cool/fat, myristyl ester>, the scores of the substrings "meat", "bean" and "fat" in Table 1 are calculated according to formula 2.0.
Taking the substring "meat" as an example, its translation rule <meat, meat> does not occur in the translation rule <meat/bean/cool/fat, myristyl ester> of the similar multi-word unit, so c1(s, ti) = 0 and its score is set to the default value -1.
The case of the substring "bean" is similar to that of "meat"; its score is set to the default value -1.
For the substring "fat", its translation rule <fat, ester> occurs once in the translation rule <meat/bean/cool/fat, myristyl ester> of the similar multi-word unit, so c1(s, ti) = 1, and the substring "fat" has no other translation rules, so Sum(c1(s, ti')) = 1. The score of the translation rule of the substring "fat" is calculated according to formula 2.0: assuming m = n = 1, score(s, ti) = 1.
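The scoring in steps S705 and S706 can be approximated by the Python sketch below. Formula 2.0 is not reproduced in this passage, so the sketch assumes a simple ratio, count^m / Sum^n, with the default value -1 when the count is 0; this reproduces the values of the example but may differ from the actual formula. The way an "occurrence" is counted (a contiguous match on both the source and the target side, with no handling of variables such as X1) and all function names are likewise assumptions.

```python
def count_subsequence(needle, haystack):
    """Count positions where list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return sum(1 for k in range(len(haystack) - n + 1)
               if haystack[k:k + n] == needle)


def occurrences(rule, reference_rule):
    """Assumed count c(s, t): both sides of `rule` must occur in the reference."""
    src, tgt = rule
    ref_src, ref_tgt = reference_rule
    return min(count_subsequence(src.split("/"), ref_src.split("/")),
               count_subsequence(tgt.split(), ref_tgt.split()))


def score_rules(candidates, reference_rule, m=1.0, n=1.0, default=-1.0):
    """Score every rule of one substring against the reference rule."""
    counts = [occurrences(rule, reference_rule) for rule in candidates]
    total = sum(counts)  # plays the role of Sum(c(s, ti'))
    return [default if c == 0 else (c ** m) / (total ** n) for c in counts]


reference = ("meat/bean/cool/fat", "myristyl ester")
fat_scores = score_rules([("fat", "ester")], reference)
meat_scores = score_rules([("meat", "meat")], reference)
```

Here `fat_scores` is `[1.0]` and `meat_scores` is `[-1.0]`, matching the scores given above for the substrings "fat" and "meat".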
Step S707: the machine translation system determines the translation results of the multi-word units according to the highest-scoring translation rules, obtaining the translation result "Peroxide scavenger includes ester".
For the multi-word unit "peroxide/removing/agent", three translation rules with a score of 1 were obtained in the processing above; processing according to these three translation rules yields the translation result "Peroxide scavenger".
For the multi-word unit "meat/bean/bandit/fat", processing according to the highest-scoring translation rule among its associated substrings, i.e., <fat, ester>, determines that the translation result of the multi-word unit "meat/bean/bandit/fat" is "ester".
The translations of the other words in the information are combined with the translation results of the multi-word units above to obtain the translation result of the information, "Peroxide scavenger includes ester".
In this example, the translation result of a multi-word unit is determined according to the highest-scoring translation rule, but the embodiments of this application are not limited to this; for example, the translation result of a multi-word unit can be determined by processing according to the translation rules whose scores are at or above a threshold.
In addition, steps S705 and S706 above have no required order; they are described separately only for clarity, which does not limit the order in which they are carried out.
In this application, the translation of a multi-word unit that has no matching translation rule is determined according to the translation rule of a similar multi-word unit; referring to the translation rule of the similar multi-word unit improves the readability of the translation.
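The two selection policies mentioned for step S707, highest score or score at or above a threshold, can be sketched as follows. The function name and the data layout (a list of (rule, score) pairs) are hypothetical, and the threshold comparison is assumed to be inclusive.

```python
def select_rules(scored_rules, threshold=None):
    """Pick the rules used to translate a multi-word unit.

    scored_rules is a list of (rule, score) pairs. With no threshold, all
    rules sharing the highest score are kept; otherwise every rule whose
    score is at or above the threshold is kept.
    """
    if not scored_rules:
        return []
    if threshold is None:
        best = max(score for _, score in scored_rules)
        return [rule for rule, score in scored_rules if score == best]
    return [rule for rule, score in scored_rules if score >= threshold]


scored = [(("meat", "meat"), -1.0), (("fat", "ester"), 1.0)]
chosen = select_rules(scored)
```

For the substrings of "meat/bean/bandit/fat" in the example, `chosen` contains only the rule <fat, ester>.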
Embodiment of the information processing device
Based on the same technical concept as the method embodiments above, the embodiments of this application also provide an information processing device which, as shown in Fig. 9, includes: a multi-word unit identification module 10, configured to identify a multi-word unit in information; a translation rule lookup module 20, configured to look up, in a translation rule database, translation rules that match, respectively, the most similar multi-word unit of the multi-word unit and its associated word strings, the associated word strings including all substrings of the multi-word unit and multi-word units that partially overlap the multi-word unit; a score determining module 30, configured to determine the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit; and a translation result determining module 40, configured to determine the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
The multi-word unit identification module 10 can identify the multi-word units in the information by the method in embodiment (One) or (Two) of the multi-word unit extraction method described above.
For example, the multi-word unit identification module 10 can include a labeling unit 11, configured to label, based on machine learning, the beginning word, middle word and ending word of a multi-word unit in a sentence; and a multi-word unit determining unit 21, configured to determine a multi-word unit from the consecutive beginning word, middle word and ending word labeled by the labeling unit.
The score determining module 30 can determine scores in the manner described in any of embodiments (One) to (Three) of the translation method for multi-word units above.
For example, the score determining module 30 can be configured to include: a first count statistics unit 31, configured to count the number of times each translation rule occurs in the translation rule matched by the most similar multi-word unit, i.e., the first count; a second count statistics unit 32, configured to compute the sum of the numbers of times that all translation rules matched by the substring corresponding to each translation rule occur in the translation rule matched by the most similar multi-word unit; and a score determining subunit 33, configured to determine the score of each translation rule according to the first count and the sum. The score determining subunit 33 can also be configured to, when the most similar multi-word unit is the multi-word unit itself, set the score of the translation rule matched by the multi-word unit as a substring of itself to the first predetermined score.
The score determining subunit 33 can determine the score of each translation rule according to the ratio between the first count and the sum, for example by reference to formula 2.0 in the method embodiments above. In addition, when the first count is 0, the score determining subunit 33 can directly set the score of the corresponding translation rule to a default value.
In addition, the information processing device can also include a boundary judging module 50, which performs the judgment, described in the method embodiments above, of whether an associated word string destroys the boundary of a multi-word unit. Correspondingly, when the judgment result of the boundary judging module 50 is yes, the score determining module 30 can set the score of the translation rule matched by the associated word string to a predetermined score.
Computing device for implementing the device and method of this application
All modules and units in the above device can be configured by software, firmware, hardware, or a combination thereof. The specific means or manner of configuration is well known to those skilled in the art and is not described here. Where implementation is by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure (e.g., the general-purpose computer 1100 shown in Fig. 10), and the computer can perform various functions when the various programs are installed.
In Fig. 10, a central processing unit (CPU) 1101 performs various processes according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores, as needed, the data required when the CPU 1101 performs the various processes. The CPU 1101, the ROM 1102 and the RAM 1103 are connected to one another via a bus 1104. An input/output interface 1105 is also connected to the bus 1104.
The following components are connected to the input/output interface 1105: an input section 1106 (including a keyboard, a mouse, etc.), an output section 1107 (including a display, such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.), a storage section 1108 (including a hard disk, etc.), and a communication section 1109 (including a network interface card such as a LAN card, a modem, etc.). The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 can also be connected to the input/output interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read from it is installed into the storage section 1108 as needed.
Where the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1111.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 1111 shown in Fig. 10, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1111 include magnetic disks (including floppy disks (registered trademark)), optical discs (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)) and semiconductor memories. Alternatively, the storage medium can be the ROM 1102, a hard disk contained in the storage section 1108, or the like, which stores the program and is distributed to the user together with the device containing it.
This application also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the method according to any embodiment of this application, or part of the processing of that method, can be performed.
Correspondingly, a storage medium carrying the program product that stores the machine-readable instruction code is also included in the disclosure of the invention. The storage medium includes, but is not limited to, floppy disks, optical discs, magneto-optical disks, memory cards, memory sticks, and the like.
It should be noted that the terminology used here is only for describing particular embodiments and is not intended to limit this application. The singular forms "a" and "the" used here are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "including", when used in this specification, specifies the presence of the stated features, integers, operations, steps, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, operations, steps, elements, components and/or combinations thereof.
All equivalents of the corresponding structures, materials, acts and elements defined as "means or step plus function" in the claims are intended to include any structure, material or act for performing the function in combination with other claimed elements. The description of this application is given for purposes of illustration and description, and is not intended to be exhaustive or to limit this application to the form disclosed. Many modifications and variations will occur to those skilled in the art without departing from the scope and spirit of this application. The embodiments were chosen and described in order to best explain the principles and practical application of this application, and to enable others skilled in the art to understand this application in its various embodiments with the various modifications suited to the particular use contemplated.
From the above description, the embodiments of the invention provide the following technical solutions.
Supplementary note 1. An information processing method, including:
identifying a multi-word unit in information;
looking up, in a translation rule database, translation rules that match, respectively, the most similar multi-word unit of the multi-word unit and its associated word strings, the associated word strings including all substrings of the multi-word unit and multi-word units that partially overlap the multi-word unit;
determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit;
determining the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
Supplementary note 2. The method according to supplementary note 1, wherein determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit includes:
counting the number of times each translation rule occurs in the translation rule matched by the most similar multi-word unit, i.e., the first count;
computing the sum of the numbers of times that all translation rules matched by the substring corresponding to each translation rule occur in the translation rule matched by the most similar multi-word unit;
determining the score of each translation rule according to the first count and the sum.
Supplementary note 3. The method according to supplementary note 2, wherein determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit also includes:
when the most similar multi-word unit is the multi-word unit itself, setting the score of the translation rule matched by the multi-word unit as a substring of itself to a first predetermined score.
Supplementary note 4. The method according to supplementary note 2, wherein determining the score of each translation rule according to the first count and the sum includes:
determining the score of each translation rule according to the ratio between the first count and the sum.
Supplementary note 5. The method according to supplementary note 4, wherein determining the score of each translation rule according to the first count and the sum also includes:
when the value of the first count is 0, setting the score of the translation rule to a default value.
Supplementary note 6. The method according to supplementary note 1, also including, before determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit:
judging whether the associated word string corresponding to each translation rule destroys the boundary of the multi-word unit;
if the judgment result is yes, setting the score of the translation rule matched by the associated word string to a second predetermined score.
Supplementary note 7. The method according to supplementary note 1, wherein identifying the multi-word unit in the information includes:
using an annotator based on machine learning to label the beginning word, middle word and ending word of a multi-word unit in a sentence;
determining a multi-word unit from a consecutive beginning word, middle word and ending word, or from a consecutive beginning word and ending word.
Supplementary note 8. A training method for a multi-word unit annotator, including:
extracting the multi-word units in sentences and building a multi-word unit set;
when the proportion of multi-word units in a sentence of the corpus meets a preset condition, adding the sentence to the training set of the multi-word unit annotator;
training the multi-word unit annotator using the training set.
Supplementary note 9. The method according to supplementary note 8, wherein the sentences and the multi-word units are in a first language, the corpus also includes sentences in a second language corresponding to the sentences in the first language, and extracting the multi-word units in the sentences includes:
obtaining the multi-word units in the corresponding sentence in the second language;
obtaining the multi-word units in the first language that correspond to the multi-word units in the second language.
Supplementary note 10. The method according to supplementary note 9, wherein building the multi-word unit set includes:
excluding multi-word units that include only one word;
for a multi-word unit that includes two words, excluding the multi-word unit if both words are in a preset stop-word list, and otherwise adding the multi-word unit to the multi-word unit set;
for a multi-word unit that includes three or more words, if a boundary word of the multi-word unit is a stop word, deleting the boundary word and adding the remaining multi-word unit to the multi-word unit set; if the boundary words of the multi-word unit are not stop words, adding the multi-word unit to the multi-word unit set.
Supplementary note 11. The method according to supplementary note 10, wherein adding the remaining multi-word unit to the multi-word unit set includes:
if the remaining multi-word unit is present among the extracted multi-word units and its number of occurrences is greater than a set threshold, adding the remaining multi-word unit to the multi-word unit set.
Supplementary note 12. An information processing device, including:
a multi-word unit identification module, configured to identify a multi-word unit in information;
a translation rule lookup module, configured to look up, in a translation rule database, translation rules that match, respectively, the most similar multi-word unit of the multi-word unit and its associated word strings, the associated word strings including all substrings of the multi-word unit and multi-word units that partially overlap the multi-word unit;
a score determining module, configured to determine the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit;
a translation result determining module, configured to determine the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
Supplementary note 13. The information processing device according to supplementary note 12, wherein the score determining module includes:
a first count statistics unit, configured to count the number of times each translation rule occurs in the translation rule matched by the most similar multi-word unit, i.e., the first count;
a second count statistics unit, configured to compute the sum of the numbers of times that all translation rules matched by the substring corresponding to each translation rule occur in the translation rule matched by the most similar multi-word unit;
a score determining subunit, configured to determine the score of each translation rule according to the first count and the sum.
Supplementary note 14. The information processing device according to supplementary note 13, wherein the score determining subunit is configured to: when the most similar multi-word unit is the multi-word unit itself, set the score of the translation rule matched by the multi-word unit as a substring of itself to a first predetermined score.
Supplementary note 15. The information processing device according to supplementary note 13, wherein the score determining subunit is configured to: determine the score of each translation rule according to the ratio between the first count and the sum.
Supplementary note 16. The information processing device according to supplementary note 15, wherein the score determining subunit is configured to: when the value of the first count is 0, set the score of the translation rule to a default value.
Supplementary note 17. The information processing device according to supplementary note 12, also including a boundary judging module, configured to judge whether the associated word string corresponding to each translation rule destroys the boundary of the multi-word unit;
the score determining module being configured to: when the judgment result of the boundary judging module is yes, set the score of the translation rule matched by the associated word string to a second predetermined score.
Supplementary note 18. The information processing device according to supplementary note 12, wherein the multi-word unit identification module includes:
a labeling unit, for labeling, based on machine learning, the beginning word, middle word and ending word of a multi-word unit in a sentence;
a multi-word unit determining unit, for determining a multi-word unit from the consecutive beginning word, middle word and ending word labeled by the labeling unit.
Supplementary note 19. A program product storing machine-readable instruction code, wherein the instruction code, when read and executed by a machine, performs the method according to any one of supplementary notes 1 to 7.
Supplementary note 20. A program product storing machine-readable instruction code, wherein the instruction code, when read and executed by a machine, performs the method according to any one of supplementary notes 8 to 11.
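The construction of the multi-word unit set described in notes 10 and 11 above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented procedure: the function name, the English sample words and the stop-word list are hypothetical; units of three or more words are trimmed by at most one stop word at each end; and "greater than a set threshold" is read as a strict comparison on the number of occurrences among the extracted units.

```python
from collections import Counter


def build_mwu_set(extracted, stop_words, threshold=0):
    """Filter extracted multi-word units into a multi-word unit set."""
    counts = Counter(tuple(unit) for unit in extracted)
    result = set()
    for unit in counts:
        if len(unit) < 2:
            continue                          # single words are excluded
        if len(unit) == 2:
            if not all(w in stop_words for w in unit):
                result.add(unit)              # keep unless both are stop words
            continue
        trimmed = unit
        if trimmed[0] in stop_words:
            trimmed = trimmed[1:]             # drop a leading stop word
        if trimmed[-1] in stop_words:
            trimmed = trimmed[:-1]            # drop a trailing stop word
        if trimmed == unit:
            result.add(unit)                  # no boundary stop word
        elif len(trimmed) >= 2 and counts[trimmed] > threshold:
            result.add(trimmed)               # remainder seen often enough
    return result


extracted = [
    ["the", "peroxide", "scavenger"],
    ["peroxide", "scavenger"],
    ["peroxide", "scavenger"],
    ["of", "the"],
    ["ester"],
]
mwu_set = build_mwu_set(extracted, stop_words={"the", "of"}, threshold=1)
```

In this sample, only the unit ("peroxide", "scavenger") survives: the three-word unit is trimmed to it, the stop-word pair is excluded, and the single word is dropped.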

Claims (10)

1. An information processing method, including:
identifying a multi-word unit in information;
looking up, in a translation rule database, translation rules that match, respectively, the most similar multi-word unit of the multi-word unit and its associated word strings, the associated word strings including all substrings of the multi-word unit and multi-word units that partially overlap the multi-word unit;
determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit;
determining the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
2. The method according to claim 1, wherein determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit includes:
counting the number of times each translation rule occurs in the translation rule matched by the most similar multi-word unit, i.e., the first count;
computing the sum of the numbers of times that all translation rules matched by the substring corresponding to each translation rule occur in the translation rule matched by the most similar multi-word unit;
determining the score of each translation rule according to the first count and the sum.
3. The method according to claim 2, wherein determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit also includes:
when the most similar multi-word unit is the multi-word unit itself, setting the score of the translation rule matched by the multi-word unit as a substring of itself to a first predetermined score.
4. The method according to claim 2, wherein determining the score of each translation rule according to the first count and the sum includes:
determining the score of each translation rule according to the ratio between the first count and the sum.
5. The method according to claim 4, wherein determining the score of each translation rule according to the first count and the sum also includes:
when the value of the first count is 0, setting the score of the translation rule to a default value.
6. The method according to claim 1, also including, before determining the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit:
judging whether the associated word string corresponding to each translation rule destroys the boundary of the multi-word unit;
if the judgment result is yes, setting the score of the translation rule matched by the associated word string to a second predetermined score.
7. The method according to claim 1, wherein identifying the multi-word unit in the information includes:
using an annotator based on machine learning to label the beginning word, middle word and ending word of a multi-word unit in a sentence;
determining a multi-word unit from a consecutive beginning word, middle word and ending word, or from a consecutive beginning word and ending word.
8. An information processing device, including:
a multi-word unit identification module, configured to identify a multi-word unit in information;
a translation rule lookup module, configured to look up, in a translation rule database, translation rules that match, respectively, the most similar multi-word unit of the multi-word unit and its associated word strings, the associated word strings including all substrings of the multi-word unit and multi-word units that partially overlap the multi-word unit;
a score determining module, configured to determine the score of each translation rule according to the relationship between that translation rule and the translation rule matched by the most similar multi-word unit;
a translation result determining module, configured to determine the translation result of the multi-word unit according to the translation rules whose scores meet a preset condition.
9. The information processing device according to claim 8, wherein the score determining module includes:
a first count statistics unit, configured to count the number of times each translation rule occurs in the translation rule matched by the most similar multi-word unit, i.e., the first count;
a second count statistics unit, configured to compute the sum of the numbers of times that all translation rules matched by the substring corresponding to each translation rule occur in the translation rule matched by the most similar multi-word unit;
a score determining subunit, configured to determine the score of each translation rule according to the first count and the sum.
10. The information processing device according to claim 9, wherein the score determining subunit is configured to: when the most similar multi-word unit is the multi-word unit itself, set the score of the translation rule matched by the multi-word unit as a substring of itself to a first predetermined score.
CN201310325244.2A 2013-07-30 2013-07-30 Information processing method and information processing device Expired - Fee Related CN104346325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310325244.2A CN104346325B (en) 2013-07-30 2013-07-30 Information processing method and information processing device

Publications (2)

Publication Number Publication Date
CN104346325A CN104346325A (en) 2015-02-11
CN104346325B true CN104346325B (en) 2017-05-10

Family

ID=52501958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310325244.2A Expired - Fee Related CN104346325B (en) 2013-07-30 2013-07-30 Information processing method and information processing device

Country Status (1)

Country Link
CN (1) CN104346325B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214166A (en) * 2010-04-06 2011-10-12 三星电子(中国)研发中心 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
US8180624B2 (en) * 2007-09-05 2012-05-15 Microsoft Corporation Fast beam-search decoding for phrasal statistical machine translation

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3813911B2 (en) * 2002-08-22 2006-08-23 株式会社東芝 Machine translation system, machine translation method, and machine translation program
CN1567297A (en) * 2003-07-03 2005-01-19 中国科学院声学研究所 Method for extracting multi-word translation equivalent cells from bilingual corpus automatically
CN102227723B (en) * 2008-11-27 2013-10-09 国际商业机器公司 Device and method for supporting detection of mistranslation
US8326599B2 (en) * 2009-04-21 2012-12-04 Xerox Corporation Bi-phrase filtering for statistical machine translation
CN102023970A (en) * 2009-09-14 2011-04-20 株式会社东芝 Method and device for acquiring language model probability and method and device for constructing language model
CN102859515B (en) * 2010-02-12 2016-01-13 谷歌公司 Compound word splits
US9552355B2 (en) * 2010-05-20 2017-01-24 Xerox Corporation Dynamic bi-phrases for statistical machine translation

Also Published As

Publication number Publication date
CN104346325A (en) 2015-02-11

Similar Documents

Publication Publication Date Title
CN109299480B (en) Context-based term translation method and device
Biran et al. Putting it simply: a context-aware approach to lexical simplification
Alzahrani et al. Fuzzy semantic-based string similarity for extrinsic plagiarism detection
JP5356197B2 (en) Word semantic relation extraction device
CN105808711B (en) A kind of system and method that the concept based on text semantic generates model
CN105893410A (en) Keyword extraction method and apparatus
JP5544602B2 (en) Word semantic relationship extraction apparatus and word semantic relationship extraction method
JP5216063B2 (en) Method and apparatus for determining categories of unregistered words
CN103324621B (en) A kind of Thai text spelling correcting method and device
JP2013502643A (en) Structured data translation apparatus, system and method
Al-Omari et al. Arabic light stemmer (ARS)
Pandey et al. An unsupervised Hindi stemmer with heuristic improvements
Gentile et al. Explore and exploit. Dictionary expansion with human-in-the-loop
Ferrés et al. An adaptable lexical simplification architecture for major Ibero-Romance languages
CN112989808A (en) Entity linking method and device
Abidi et al. An automatic learning of an algerian dialect lexicon by using multilingual word embeddings
CN110929507B (en) Text information processing method, device and storage medium
CN111339778B (en) Text processing method, device, storage medium and processor
CN108763258B (en) Document theme parameter extraction method, product recommendation method, device and storage medium
CN104346325B (en) Information processing method and information processing device
JP2013222418A (en) Passage division method, device and program
Aziz et al. A hybrid model for spelling error detection and correction for Urdu language
JP5239161B2 (en) Language analysis system, language analysis method, and computer program
CN114446422A (en) Medical record marking method, system and corresponding equipment and storage medium
EP3203384A1 (en) Method, device, and computer program for providing a definition or a translation of a word belonging to a sentence as a function of neighbouring words and of databases

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170510

Termination date: 20180730
