CN102789461A - Establishing device and method for multilingual dictionary - Google Patents

Publication number
CN102789461A
CN102789461A CN2011101302344A CN201110130234A CN102789461A CN 102789461 A CN102789461 A CN 102789461A CN 2011101302344 A CN2011101302344 A CN 2011101302344A CN 201110130234 A CN201110130234 A CN 201110130234A CN 102789461 A CN102789461 A CN 102789461A
Authority
CN
China
Prior art keywords
word
dictionary
translation
multilingual
senses
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101302344A
Other languages
Chinese (zh)
Inventor
张洁
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN2011101302344A
Publication of CN102789461A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a multilingual dictionary construction device, which may comprise a monolingual dictionary module, a keyword extraction module, a bilingual dictionary module and a translation confirmation module. The monolingual dictionary module selects a word from a preset monolingual dictionary and obtains the definition of each sense of the word; the keyword extraction module extracts keywords from the definitions; the bilingual dictionary module looks up the translation words of the word in a preset bilingual dictionary, one language of which is the same as the language of the monolingual dictionary; and the translation confirmation module calculates the similarity between the translation words and, respectively, the word and the keywords, so as to select from the translation words the final translation for each sense of the word and generate the multilingual dictionary. The invention further provides a multilingual dictionary construction method. With the device and the method, a multilingual dictionary can be built automatically, the manpower and material resources needed for dictionary construction are saved, the accuracy of the generated multilingual dictionary is ensured, and compilation can be completed from ordinary monolingual and bilingual dictionaries.

Description

Multilingual dictionary construction device and multilingual dictionary construction method
Technical field
The present invention relates to the technical field of information fusion and resource restructuring, and in particular to a multilingual dictionary construction device and a multilingual dictionary construction method.
Background art
A monolingual dictionary, which contains the headwords, part-of-speech information, definitions and example sentences of one particular language, embodies a wealth of monolingual expert knowledge. A multilingual dictionary, which contains headwords, part-of-speech information, definitions and example sentences that are translations of one another in two or more languages, embodies linguistic knowledge contributed jointly by monolingual and multilingual experts.
Monolingual dictionaries are the basis of multilingual dictionaries, and multilingual dictionaries play an even more important role in practical communication between different languages. How to use monolingual dictionaries to build large-scale, high-precision multilingual dictionaries is important for natural language processing applications such as machine translation and cross-language retrieval.
At present there are two main approaches to building multilingual dictionaries: methods based on expert knowledge and methods based on statistical knowledge.
The method based on expert knowledge, that is, traditional lexicography, relies on the manual work of experts in the field to compile the multilingual dictionary. Its shortcomings are that the process requires a great deal of labor and a long production cycle, that lexicographers' working standards are hard to reconcile, and that no unified standard can be applied to splitting and merging entries.
Among the methods based on statistical knowledge, some use large-scale multilingual parallel corpora to learn pairs of mutually translated words; some use several bilingual dictionaries; and some use electronic dictionaries and translation tools to translate a monolingual dictionary directly into a multilingual one and then apply statistical knowledge for disambiguation, correcting errors that may arise during translation. The shortcoming of these methods is that they require large-scale dictionary or corpus resources from which to extract statistical information. Moreover, with current disambiguation techniques, the alignment accuracy of entries is lower than in multilingual dictionaries built from expert knowledge.
The prior art also includes a method for automatically translating the concepts in WordNet (an English lexical knowledge base) into Chinese. An English word may have several senses, and each sense may be translated into several Chinese words, so this method translates vocabulary at the granularity of the sense. The most common approach to vocabulary translation is to use bilingual dictionary resources, including online dictionaries. For the same sense of the same word, different dictionaries may give different translations; to obtain sense translations that include more Chinese synonyms, these translations need to be merged. This method can also be used to build dictionaries, but its drawback is that it depends too heavily on WordNet, and the resulting multilingual dictionary must be laid out in the WordNet format.
Therefore, a new way of building multilingual dictionaries is needed that saves the manpower and material resources consumed by dictionary construction, ensures the accuracy of the generated multilingual dictionary, and is widely applicable, so that compilation can be completed from ordinary monolingual and bilingual dictionaries alone.
Summary of the invention
The technical problem to be solved by the invention is to provide a new way of building multilingual dictionaries that saves the manpower and material resources consumed by dictionary construction, ensures the accuracy of the generated multilingual dictionary, and is widely applicable, so that compilation can be completed from ordinary monolingual and bilingual dictionaries alone.
In view of this, the invention provides a multilingual dictionary construction device, which may comprise: a monolingual dictionary module, which selects a word from a preset monolingual dictionary, obtains the definition of each sense of the word and the part of speech of that sense, and constructs a feature vector comprising the headword, the part of speech and the sense; a keyword extraction module, which extracts keywords from the definition; a bilingual dictionary module, which looks up all translation words of the word in a preset bilingual dictionary, one language of the bilingual dictionary being the same as the language of the monolingual dictionary; and a translation confirmation module, which calculates the similarity between each translation word and, respectively, the word and the keywords, so as to select from the translation words the final translation for each sense of the word and generate the multilingual dictionary. Specifically, in this technical solution the similarities between each translation word and the word and keywords are combined by weighted averaging, and the translation word with the larger resulting value is selected as the final translation. This enlarges the set against which the word to be translated and the translation words are compared, and thereby resolves ambiguity in the translation process more accurately.
In the above technical solution, preferably, the bilingual dictionary module may filter out, according to the part of speech of each sense of the word, the translation words whose part of speech differs. Because translation words with a different part of speech are mostly unsuitable, they can be filtered out in advance, which improves the efficiency of building the multilingual dictionary.
In the above technical solution, preferably, the keyword extraction module may segment the definition into words, extract candidate keywords from the resulting words according to word frequency and part of speech, and calculate the similarity between the candidate keywords and the candidate translation words, for use in selecting keywords from the candidate keywords. Specifically, words that have the same part of speech as the word and whose frequency is below a certain value can be extracted (filtering out common words such as "of" and "on"); similarity is then calculated, and the words whose similarity exceeds a certain value are selected as keywords (which at this point are effectively synonyms). Many ways of selecting synonyms are known, and the selection is not limited to the way described in this solution.
In the above technical solution, preferably, the device further comprises a stop-word list module, which builds a stop-word list and records in it the words whose frequency in a preset monolingual corpus exceeds a predetermined threshold; the bilingual dictionary module uses the stop-word list to select, from the words obtained by segmentation, the words whose frequency does not exceed the predetermined threshold as candidate keywords.
In the above technical solution, preferably, when the word has only one sense, the translation confirmation module directly takes the translation word as the final translation of the word. Because a word with only one sense cannot give rise to ambiguity during translation, the final translation can be determined directly, which keeps the construction of the multilingual dictionary efficient.
The invention also provides a multilingual dictionary construction method, which may comprise: step 102, selecting a word from a preset monolingual dictionary, obtaining the definition of each sense of the word and the part of speech of that sense, and constructing a feature vector comprising the headword, the part of speech and the sense; step 104, extracting keywords from the definition; step 106, looking up all translation words of the word in a preset bilingual dictionary, one language of the bilingual dictionary being the same as the language of the monolingual dictionary; and step 108, calculating the similarity between each translation word and, respectively, the word and the keywords, so as to select from the translation words the final translation for each sense of the word and generate the multilingual dictionary. Specifically, in this technical solution the similarities between each translation word and the word and keywords are combined by weighted averaging, and the translation word with the larger resulting value is selected as the final translation. In this way, ambiguity in the translation process can be resolved effectively.
In the above technical solution, preferably, step 106 may further comprise: filtering out, according to the part of speech of each sense of the word, the translation words whose part of speech differs. Because translation words with a different part of speech are mostly unsuitable, they can be filtered out in advance, which improves the efficiency of building the multilingual dictionary.
In the above technical solution, preferably, step 104 may comprise: segmenting the definition into words and extracting candidate keywords from the resulting words according to word frequency and part of speech; and calculating the similarity between the candidate keywords and the candidate translation words, for use in selecting keywords from the candidate keywords. Specifically, words that have the same part of speech as the word and whose frequency is below a certain value can be extracted (filtering out common words such as "of" and "on"); similarity is then calculated, and the words whose similarity exceeds a certain value are selected as keywords (which at this point are effectively synonyms). Many ways of selecting synonyms are known, and the selection is not limited to the way described in this solution.
In the above technical solution, preferably, the method further comprises, before step 104: building a stop-word list and recording in it the words whose frequency in a preset monolingual corpus exceeds a predetermined threshold. Extracting candidate keywords according to word frequency in step 104 then comprises: using the stop-word list to select, from the words obtained by segmentation, the words whose frequency does not exceed the predetermined threshold as candidate keywords.
In the above technical solution, preferably, the method further comprises: when the word has only one sense, directly taking the translation word as the final translation of the word. Because a word with only one sense cannot give rise to ambiguity during translation, the final translation can be determined directly, which keeps the construction of the multilingual dictionary efficient.
With the above technical solutions, an automatic multilingual dictionary construction device and an automatic multilingual dictionary construction method can be realized that save the manpower and material resources consumed by dictionary construction, ensure the accuracy of the generated multilingual dictionary, and are widely applicable, so that compilation can be completed from ordinary monolingual and bilingual dictionaries alone.
Brief description of the drawings
Fig. 1 is a flowchart of a multilingual dictionary construction method according to an embodiment of the invention;
Fig. 2 is a block diagram of a multilingual dictionary construction device according to an embodiment of the invention;
Fig. 3 is a schematic diagram of a multilingual dictionary construction method according to an embodiment of the invention;
Fig. 4 is a schematic diagram of a multilingual dictionary construction method according to an embodiment of the invention.
Detailed description of the embodiments
In order that the above objects, features and advantages of the invention may be understood more clearly, the invention is described in further detail below with reference to the drawings and embodiments.
Many specific details are set forth in the following description to provide a thorough understanding of the invention. The invention can, however, also be implemented in ways other than those described here, and is therefore not limited to the specific embodiments disclosed below.
Fig. 1 is a flowchart of a multilingual dictionary construction method according to an embodiment of the invention.
As shown in Fig. 1, the invention provides a multilingual dictionary construction method, which may comprise: step 102, selecting a word from a preset monolingual dictionary, obtaining the definition of each sense of the word and the part of speech of that sense, and constructing a feature vector comprising the headword, the part of speech and the sense (or the definition of that sense); step 104, extracting keywords from the definition; step 106, looking up all translation words of the word in a preset bilingual dictionary, one language of the bilingual dictionary being the same as the language of the monolingual dictionary; and step 108, calculating the similarity between each translation word and, respectively, the word and the keywords, so as to select from the translation words the final translation for each sense of the word and generate the multilingual dictionary. Specifically, in this technical solution the similarities between each translation word and the word and keywords are combined by weighted averaging, and the translation word with the larger resulting value is selected as the final translation. In this way, ambiguity in the translation process can be resolved effectively.
In the above technical solution, step 106 may further comprise: filtering out, according to the part of speech of each sense of the word, the translation words whose part of speech differs. Because translation words with a different part of speech are mostly unsuitable, they can be filtered out in advance, which improves the efficiency of building the multilingual dictionary.
In the above technical solution, step 104 may comprise: segmenting the definition into words and extracting candidate keywords from the resulting words according to word frequency and part of speech; and calculating the similarity between the candidate keywords and the candidate translation words, for use in selecting keywords from the candidate keywords. Specifically, words that have the same part of speech as the word and whose frequency is below a certain value can be extracted (filtering out common words such as "of" and "on"); similarity is then calculated, and the words whose similarity exceeds a certain value are selected as keywords (which at this point are effectively synonyms). Many ways of selecting synonyms are known, and the selection is not limited to the way described in this solution.
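The keyword selection in step 104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_keywords`, the `(word, tag)` input format and the `similarity` callable are introduced here for illustration, and the sketch scores each candidate against the headword, as the worked example for "light" later in the description does.

```python
def extract_keywords(headword, head_pos, tagged_definition,
                     stop_list, similarity, threshold):
    """Select keywords from a segmented, POS-tagged definition.

    tagged_definition: list of (word, tag) pairs (assumed input format).
    similarity: callable (word_a, word_b) -> float, supplied by the caller.
    """
    # Keep words that are not stop words and share the headword's part of speech.
    candidates = [
        word for word, tag in tagged_definition
        if word not in stop_list and tag == head_pos and word != headword
    ]
    # Keep only the candidates that are similar enough to the headword.
    return [w for w in candidates if similarity(headword, w) > threshold]


# Toy similarity table (made-up values, for illustration only).
toy_scores = {("light", "tool"): 0.1, ("light", "illumination"): 0.6}
toy_similarity = lambda x, y: toy_scores.get((x, y), 0.0)

definition = [("a", "a"), ("tool", "n"), ("of", "p"), ("illumination", "n")]
keywords = extract_keywords("light", "n", definition,
                            {"a", "of"}, toy_similarity, 0.5)
print(keywords)
```

Passing the similarity measure in as a parameter keeps the sketch independent of any particular measure; co-occurrence frequency, pointwise mutual information or the Dice coefficient could each be plugged in.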
In the above technical solution, the method further comprises, before step 104: building a stop-word list and recording in it the words whose frequency in a preset monolingual corpus exceeds a predetermined threshold. Extracting candidate keywords according to word frequency in step 104 then comprises: using the stop-word list to select, from the words obtained by segmentation, the words whose frequency does not exceed the predetermined threshold as candidate keywords.
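A frequency-based stop-word list of this kind might be built as in the following sketch. The function names, the toy corpus and the threshold value are assumptions chosen for illustration; a real corpus and threshold would be far larger.

```python
from collections import Counter


def build_stop_list(corpus_sentences, threshold):
    """Return the words whose corpus frequency exceeds the threshold."""
    counts = Counter(word for sentence in corpus_sentences for word in sentence)
    return {word for word, freq in counts.items() if freq > threshold}


def remove_stop_words(tokens, stop_list):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_list]


# Toy monolingual corpus, already segmented into words.
corpus = [
    ["a", "tool", "of", "illumination"],
    ["a", "lamp", "of", "brass"],
    ["a", "piece", "of", "light", "wood"],
]
stop_list = build_stop_list(corpus, threshold=2)   # "a" and "of" occur 3 times
print(remove_stop_words(["a", "tool", "of", "illumination"], stop_list))
```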
In the above technical solution, the method further comprises: when the word has only one sense, directly taking the translation word as the final translation of the word. Because a word with only one sense cannot give rise to ambiguity during translation, the final translation can be determined directly, which keeps the construction of the multilingual dictionary efficient.
Fig. 2 is a block diagram of a multilingual dictionary construction device according to an embodiment of the invention.
As shown in Fig. 2, the invention also provides a multilingual dictionary construction device 200, which may comprise: a monolingual dictionary module 202, which selects a word from a preset monolingual dictionary, obtains the definition of each sense of the word and the part of speech of that sense, and constructs a feature vector comprising the headword, the part of speech and the sense; a keyword extraction module 204, which extracts keywords from the definition; a bilingual dictionary module 206, which looks up all translation words of the word in a preset bilingual dictionary, one language of the bilingual dictionary being the same as the language of the monolingual dictionary; and a translation confirmation module 208, which calculates the similarity between each translation word and, respectively, the word and the keywords, so as to select from the translation words the final translation for each sense of the word and generate the multilingual dictionary. Specifically, in this technical solution the similarities between each translation word and the word and keywords are combined by weighted averaging, and the translation word with the larger resulting value is selected as the final translation. In this way, ambiguity in the translation process can be resolved effectively.
In the above technical solution, the bilingual dictionary module 206 may filter out, according to the part of speech of each sense of the word, the translation words whose part of speech differs. Because translation words with a different part of speech are mostly unsuitable, they can be filtered out in advance, which improves the efficiency of building the multilingual dictionary.
In the above technical solution, the keyword extraction module 204 may segment the definition into words, extract candidate keywords from the resulting words according to word frequency and part of speech, and calculate the similarity between the candidate keywords and the candidate translation words, for use in selecting keywords from the candidate keywords. Specifically, words that have the same part of speech as the word and whose frequency is below a certain value can be extracted (filtering out common words such as "of" and "on"); similarity is then calculated, and the words whose similarity exceeds a certain value are selected as keywords (which at this point are effectively synonyms). Many ways of selecting synonyms are known, and the selection is not limited to the way described in this solution.
In the above technical solution, the device further comprises a stop-word list module 210, which builds a stop-word list and records in it the words whose frequency in a preset monolingual corpus exceeds a predetermined threshold; the bilingual dictionary module 206 uses the stop-word list to select, from the words obtained by segmentation, the words whose frequency does not exceed the predetermined threshold as candidate keywords.
In the above technical solution, when the word has only one sense, the translation confirmation module 208 directly takes the translation word as the final translation of the word. Because a word with only one sense cannot give rise to ambiguity during translation, the final translation can be determined directly, which keeps the construction of the multilingual dictionary efficient.
Fig. 3 is a schematic diagram of a multilingual dictionary construction method according to an embodiment of the invention.
Fig. 3 illustrates the multilingual dictionary construction method. The principle is as follows:
First, a word is selected from the monolingual dictionary. If the word is monosemous, its translation can be obtained unambiguously from the corresponding bilingual dictionaries I, II and III. If the word is polysemous, step 302 is performed: keywords are extracted from the definition of each sense of the word (these can serve as an expanded set of synonyms of the word). Step 304 is then performed: the similarity between each candidate translation and, respectively, the word and the keywords is calculated, and the similarity determines whether a candidate translation found in the dictionary is taken as the translation equivalent of the word.
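The branching between monosemous and polysemous words can be sketched as below. All names here are hypothetical: `lookup` stands for a bilingual dictionary query, `disambiguate` stands for steps 302 and 304, and the romanized strings are stand-ins for Chinese translation words.

```python
def select_translations(word, senses, lookup, disambiguate):
    """Map each sense of a word to its translation words.

    lookup: callable word -> list of candidate translations.
    disambiguate: callable (word, sense, candidates) -> chosen translations.
    """
    candidates = lookup(word)
    if len(senses) == 1:
        # A monosemous word cannot be ambiguous: accept all candidates directly.
        return {senses[0]: list(candidates)}
    # A polysemous word needs per-sense disambiguation (steps 302 and 304).
    return {sense: disambiguate(word, sense, candidates) for sense in senses}


# Toy bilingual dictionary and a toy disambiguator (illustrative data only).
toy_dictionary = {"photon": ["guangzi"], "light": ["deng", "qing"]}
toy_choice = {"a tool of illumination": "deng", "little physical weight": "qing"}
toy_disambiguate = lambda w, sense, cands: [c for c in cands if c == toy_choice[sense]]

print(select_translations("photon", ["a particle of light"],
                          toy_dictionary.get, toy_disambiguate))
print(select_translations("light", list(toy_choice),
                          toy_dictionary.get, toy_disambiguate))
```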
With reference to Fig. 4, an embodiment of automatic multilingual dictionary construction is described below, taking the automatic generation of an English-Chinese multilingual dictionary from an English monolingual dictionary as an example.
1. Suppose that, for the monolingual dictionary in the figure, BD is the bilingual dictionary from an English word EW to an arbitrary language T (Chinese in the following example). The English word EW is classified as follows:
If its entry contains only one sense, it is monosemous.
If its entry contains two or more senses, it is polysemous.
2. If the English word EW is monosemous, its translation in the target language can be found unambiguously using any English-Chinese bilingual word list or electronic dictionary.
3. If the English word EW is polysemous, the procedure shown in Fig. 4 is required:
Step 402: for each sense, look up the corresponding word EW, part of speech EP and dictionary definition EE in the English dictionary, and build a feature vector (also called a triple) for each sense: {EW, EP1, EE1}, {EW, EP2, EE2}, {EW, EP3, EE3}, ..., {EW, EPn, EEn};
Step 404: using any bilingual word list, find all headwords TW, parts of speech TP and definitions TE in language T that correspond to the English word EW, and build several feature vectors (also called triples): {TW1, TP1, TE1}, {TW2, TP2, TE2}, {TW3, TP3, TE3}, ..., {TWm, TPm, TEm};
Step 406, part-of-speech matching: for the English triple {EW, EPX, EEX} of any sense, remove the translation candidates whose part of speech differs from EPX and keep those whose part of speech is the same. The English definition of that sense and the retained candidates in language T are then processed as follows:
Step 408, keyword extraction, specifically comprises:
Step 4082: in a monolingual corpus, add the words whose frequency is higher than a certain threshold to the stop-word list;
Step 4084: perform word segmentation and part-of-speech tagging on the definition;
Step 4086: in the definition of the English word, after removing the words in the stop-word list, select the words with the same part of speech as the English word as candidate words;
Step 4088: calculate the similarity between the English word and each candidate word, and take the words whose similarity is higher than a certain threshold as keywords of the definition of the English word, which is equivalent to expanding the synonyms of the English word.
Step 408 may alternatively adopt the following technical solution:
If the English word is a noun, extract the keyword KW from the definition: if the definition contains a preposition or a relative conjunction, the keyword is located before the first preposition or relative conjunction; if it does not, the keyword is the last noun;
If the English word is an adjective, extract the keywords KW from the definition: the keywords are the nouns, verbs and adjectives in the definition;
If the English word is a verb, extract the keywords KW from the definition: the keywords are the nouns and verbs in the definition;
If the English word is an adverb, extract the keywords KW from the definition: the keywords are the verbs and adjectives in the definition;
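The part-of-speech rules listed above might be coded roughly as follows. This is a sketch under assumptions: the tag names (`n`, `v`, `adj`, `adv`, `p` for preposition, `c` for conjunction, `a` for article) are illustrative, and the noun rule is read here as "the last noun before the first preposition or conjunction", which is one possible interpretation of the text.

```python
# Parts of speech accepted as keywords for each non-noun headword class.
KEYWORD_TAGS = {
    "adj": {"n", "v", "adj"},
    "v": {"n", "v"},
    "adv": {"v", "adj"},
}


def rule_keywords(head_pos, tagged_definition):
    """Pick keywords from a list of (word, tag) pairs by part-of-speech rules."""
    if head_pos == "n":
        # Cut the definition at the first preposition or conjunction, if any.
        cut = next(
            (i for i, (_, tag) in enumerate(tagged_definition) if tag in {"p", "c"}),
            len(tagged_definition),
        )
        nouns = [word for word, tag in tagged_definition[:cut] if tag == "n"]
        return nouns[-1:]  # last noun before the cut, or an empty list
    wanted = KEYWORD_TAGS.get(head_pos, set())
    return [word for word, tag in tagged_definition if tag in wanted]


noun_definition = [("a", "a"), ("tool", "n"), ("of", "p"), ("illumination", "n")]
adj_definition = [("moving", "v"), ("easily", "adv"), ("and", "c"), ("quickly", "adv")]
print(rule_keywords("n", noun_definition))
print(rule_keywords("adj", adj_definition))
```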
Step 410: calculate the similarity between each candidate translation of the sense and, respectively, the English word EW and the keywords KW in its definition EE. The similarity between an English word and a candidate word can be obtained from measures such as the co-occurrence frequency in a corpus, pointwise mutual information or the Dice coefficient. Finally, the translation equivalent of the English sense is determined according to the similarity values.
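The triples of steps 402 and 404 and the part-of-speech filter of step 406 can be sketched with a small data structure. The class name `Triple`, the function name `pos_filter` and the romanized target headwords are assumptions made for illustration; the romanized strings merely stand in for Chinese words.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Feature vector for one sense: headword, part of speech, definition."""
    word: str
    pos: str
    definition: str


def pos_filter(source, candidates):
    """Keep the candidate triples whose part of speech matches the source sense."""
    return [c for c in candidates if c.pos == source.pos]


# Source-language triples for "light" (step 402).
source_senses = [
    Triple("light", "n", "a tool of illumination"),
    Triple("light", "adj", "moving easily and quickly"),
    Triple("light", "adj", "little physical weight or density"),
]
# Target-language triples (step 404); headwords are placeholder romanizations.
target_entries = [
    Triple("deng", "n", "a tool used for illumination"),
    Triple("qing", "adj", "of little weight"),
    Triple("qingkuai", "adv", "agile in movement"),
]

kept = pos_filter(source_senses[0], target_entries)
print([t.word for t in kept])
```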
Following the above steps, take the English word "light" as an example. In the English monolingual dictionary, "light" is polysemous and has the following word forms and definitions, from which the following feature vectors (triples) can be built: {light, n, a tool of illumination}, {light, adj, moving easily and quickly}, {light, adj, little physical weight or density}.
Looking up "light" in the English-Chinese bilingual word list yields the Chinese words glossed here as "lamp", "light (in weight)" and "brisk". From a Chinese dictionary, the following feature vectors (triples) can be built: {lamp, n, a tool used for illumination}, {light (in weight), adj, of little weight}, {brisk, adv, agile in movement}.
For {light, n, a tool of illumination}, the correct translation equivalent can be selected from "lamp", "light (in weight)" and "brisk" as follows:
First, according to the principle of part-of-speech correspondence, the words whose part of speech does not match are deleted from the candidate list. "Light (in weight)" and "brisk" are thus both deleted;
Second, automatic word segmentation and part-of-speech tagging are applied to the definition: a/a tool/n of/p illumination/n;
Third, the very frequent article and preposition can be removed using the stop-word list, leaving "tool" and "illumination" in the definition. Both words are nouns (n), like "light", so they are taken as keyword candidates;
Fourth, the similarity between "light" and each of "tool" and "illumination" is calculated to determine the final keywords. Suppose the word whose similarity exceeds the threshold is "illumination"; "illumination" is then added to the synonym set of "light".
In the 5th step, carry out similarity with " lamp " respectively with " light, illumination " and calculate.Relevant similarity is calculated; Adopt present existing technology; For example in bilingual Parallel Corpus, represent English morphology EW and the common number of times that occurs of Chinese morphology TW, represent the English morphology to occur and the absent variable number of times of Chinese morphology with b with a; Represent the English morphology not occur and the number of times of Chinese morphology appearance with c, represent two all absent variable number of times of morphology with d.N=a+b+c+d uses x 2Statistical value calculates the degree of correlation of morphology EW and morphology TW:
$$\chi^2(EW, TW) = \frac{n \times (a \times d - b \times c)^2}{(a + b) \times (a + c) \times (b + d) \times (c + d)}$$
The pointwise mutual information formula can also be used:
$$MI(EW, TW) = \log_2 \frac{n \times a}{(a + b) \times (a + c)}$$
Or the Dice coefficient:
$$DICE(EW, TW) = \frac{2a}{(a + b) + (a + c)}$$
MI, DICE, χ², and the commonly used co-occurrence frequency can all represent the degree of association.
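As an illustration, the three association measures can be computed directly from the four contingency counts a, b, c, d defined above. This is a minimal sketch, not the patent's implementation: the function names are hypothetical, and the formulas are the standard textbook forms (squared `a*d - b*c` numerator for χ², sum of marginals in the Dice denominator).

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 co-occurrence table."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:  # a degenerate table carries no association evidence
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def pmi(a, b, c, d):
    """Pointwise mutual information, base 2."""
    n = a + b + c + d
    if a == 0:  # the two words never co-occur
        return float("-inf")
    return math.log2(n * a / ((a + b) * (a + c)))


def dice(a, b, c, d):
    """Dice coefficient; d is accepted only to keep a uniform signature."""
    denom = (a + b) + (a + c)
    return 2 * a / denom if denom else 0.0


# Toy counts (invented): the two words always appear together.
print(chi_square(10, 0, 0, 10), pmi(10, 0, 0, 10), dice(10, 0, 0, 10))
```

With counts where the words are statistically independent, such as a = b = c = d = 5, χ² and MI both come out as zero, which is the behavior one would expect from an association score.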
Based on the association scores, the translation equivalent for a specific sense of the English word EW is selected from the candidate words, for example as follows:
In the English-Chinese bilingual corpus, compute the co-occurrence and non-co-occurrence frequencies of "light" and "lamp", then compute the Dice, MI, or χ² coefficient. Next, compute the co-occurrence and non-co-occurrence frequencies of "illumination" and "lamp", and again compute the Dice, MI, or χ² coefficient. Finally, take the weighted average of these similarities as the similarity between the "light" entry, which includes the paraphrase keyword "illumination", and the "lamp" entry. The similarities between "light" with its expanded vocabulary and the other entries, such as "nimbly" and "light (in weight)", are computed in the same way and compared, and the word with the highest similarity among "lamp", "light (in weight)", and "nimbly" is selected as the appropriate translation of "light".
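The selection procedure just described can be sketched as follows. The function names, the equal default weights, and the toy scores are assumptions made for illustration; the text does not specify how the weights are chosen, and the English glosses merely stand in for the Chinese candidate words.

```python
def entry_similarity(source_terms, candidate, assoc, weights=None):
    """Weighted average of assoc(term, candidate) over the source word
    and its expansion keywords (for example "light" and "illumination")."""
    if weights is None:
        weights = [1.0] * len(source_terms)  # assumption: equal weights
    total = sum(weights)
    return sum(w * assoc(t, candidate)
               for w, t in zip(weights, source_terms)) / total


def pick_translation(source_terms, candidates, assoc, weights=None):
    """Return the candidate with the highest weighted-average similarity."""
    return max(candidates,
               key=lambda c: entry_similarity(source_terms, c, assoc, weights))


# Invented association scores standing in for Dice, MI, or chi-square
# values that would be computed from a bilingual corpus.
TOY_SCORES = {
    ("light", "lamp"): 0.6, ("illumination", "lamp"): 0.8,
    ("light", "light_in_weight"): 0.7, ("illumination", "light_in_weight"): 0.1,
    ("light", "nimbly"): 0.3, ("illumination", "nimbly"): 0.05,
}


def toy_assoc(term, candidate):
    return TOY_SCORES.get((term, candidate), 0.0)


best = pick_translation(["light", "illumination"],
                        ["lamp", "light_in_weight", "nimbly"], toy_assoc)
print(best)  # lamp
```

In this toy data the keyword "illumination" is what separates "lamp" from "light_in_weight": on the headword alone the latter would score higher, which is the ambiguity the keyword expansion is meant to resolve.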
According to the technical solution of the present invention, a method and a device for automatically constructing a multilingual dictionary can be realized. They use the entry, part-of-speech, and paraphrase information in each monolingual dictionary, together with language templates corresponding to the part of speech and the paraphrase information, to find the keywords in the paraphrase. By comparing the similarity of the keywords and the paraphrase words, translation ambiguity is eliminated and translation pairs are found, realizing automatic construction of the multilingual dictionary. Compared with traditional methods, the technical solution has the following characteristics: it exploits information inside a single language, such as the paraphrase information in the dictionary and the characteristics of dictionary definition language, extracts words highly related to the entry, and performs similarity expansion, which enlarges the set of comparable word information between the word to be translated and the translation words. Traditional methods mostly concentrate on the relation between an entry and external entries, relying on finding links between one entry and another.
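The part-of-speech filtering and stop word steps of the worked example can be sketched as below. This is only an illustrative sketch: the tag names, the tiny stop word set, and the function names are assumptions, and a real system would obtain the tagged paraphrase from a word segmenter and part-of-speech tagger.

```python
# Assumed stop word list; the text builds it from high-frequency corpus words.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in"}


def filter_by_pos(candidates, pos):
    """Keep only the translation candidates whose part of speech matches."""
    return [word for word, tag in candidates if tag == pos]


def keyword_candidates(tagged_paraphrase, pos, stop_words=STOP_WORDS):
    """Drop stop words, then keep paraphrase tokens sharing the entry's
    part of speech as keyword candidates."""
    return [tok for tok, tag in tagged_paraphrase
            if tok not in stop_words and tag == pos]


# English glosses stand in for the three Chinese candidate words.
candidates = [("lamp", "n"), ("light_in_weight", "adj"), ("nimbly", "adv")]
tagged = [("a", "a"), ("tool", "n"), ("of", "p"), ("illumination", "n")]

print(filter_by_pos(candidates, "n"))    # ['lamp']
print(keyword_candidates(tagged, "n"))   # ['tool', 'illumination']
```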
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (10)

1. A multilingual dictionary construction device, characterized by comprising:
A monolingual dictionary module, which selects a word from a preset monolingual dictionary, obtains the paraphrase of each sense corresponding to said word and the part of speech corresponding to that sense, and constructs a feature vector, said feature vector comprising the entry, the part of speech, and the sense;
A keyword extraction module, which extracts keywords from said paraphrase;
A bilingual dictionary module, which looks up all translation words corresponding to said word in a preset bilingual dictionary, wherein one language of said bilingual dictionary is the same as the language of said monolingual dictionary;
A translation confirmation module, which calculates the similarity of said translation words to said word and to said keywords respectively, so as to select from said translation words the final translation word of said word for said each sense, and generates said multilingual dictionary.
2. The multilingual dictionary construction device according to claim 1, characterized in that said bilingual dictionary module filters out, from said translation words, the words whose part of speech differs from the part of speech of said each sense corresponding to said word.
3. The multilingual dictionary construction device according to claim 1 or 2, characterized in that said keyword extraction module segments said paraphrase into words, extracts candidate keywords from the words obtained by segmentation according to word frequency and part of speech, and calculates the similarity between said candidate keywords and said candidate translation words, for use in selecting said keywords from said candidate keywords.
4. The multilingual dictionary construction device according to claim 3, characterized by further comprising:
A stop word list module, which builds a stop word list and records in said stop word list the words whose frequency in a preset monolingual corpus exceeds a predetermined threshold; said bilingual dictionary module uses said stop word list to select, from the words obtained by said segmentation, the words whose frequency does not exceed said predetermined threshold as said candidate keywords.
5. The multilingual dictionary construction device according to claim 4, characterized in that, when said word has only one sense, said translation confirmation module directly takes said translation word as the final translation word of said word.
6. A multilingual dictionary construction method, characterized by comprising:
Step 102: selecting a word from a preset monolingual dictionary, obtaining the paraphrase of each sense corresponding to said word and the part of speech corresponding to that sense, and constructing a feature vector, said feature vector comprising the entry, the part of speech, and the sense;
Step 104: extracting keywords from said paraphrase;
Step 106: looking up all translation words corresponding to said word in a preset bilingual dictionary, wherein one language of said bilingual dictionary is the same as the language of said monolingual dictionary;
Step 108: calculating the similarity of said translation words to said word and to said keywords respectively, so as to select from said translation words the final translation word of said word for said each sense, and generating said multilingual dictionary.
7. The multilingual dictionary construction method according to claim 6, characterized in that said step 106 further comprises:
Filtering out, from said translation words, the words whose part of speech differs from the part of speech of said each sense corresponding to said word.
8. The multilingual dictionary construction method according to claim 6 or 7, characterized in that said step 104 comprises:
Segmenting said paraphrase into words, and extracting candidate keywords from the words obtained by segmentation according to word frequency and part of speech;
Calculating the similarity between said candidate keywords and said candidate translation words, for use in selecting said keywords from said candidate keywords.
9. The multilingual dictionary construction method according to claim 8, characterized in that, before said step 104, the method further comprises:
Building a stop word list, and recording in said stop word list the words whose frequency in a preset monolingual corpus exceeds a predetermined threshold;
In said step 104, extracting said candidate keywords according to said word frequency comprises:
Using said stop word list to select, from the words obtained by said segmentation, the words whose frequency does not exceed said predetermined threshold as said candidate keywords.
10. The multilingual dictionary construction method according to claim 9, characterized by further comprising:
When said word has only one sense, directly taking said translation word as the final translation word of said word.
CN2011101302344A 2011-05-19 2011-05-19 Establishing device and method for multilingual dictionary Pending CN102789461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101302344A CN102789461A (en) 2011-05-19 2011-05-19 Establishing device and method for multilingual dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101302344A CN102789461A (en) 2011-05-19 2011-05-19 Establishing device and method for multilingual dictionary

Publications (1)

Publication Number Publication Date
CN102789461A true CN102789461A (en) 2012-11-21

Family

ID=47154865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101302344A Pending CN102789461A (en) 2011-05-19 2011-05-19 Establishing device and method for multilingual dictionary

Country Status (1)

Country Link
CN (1) CN102789461A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101441643A (en) * 2007-11-22 2009-05-27 英业达股份有限公司 System and method for generating digital word stock
CN101571852A (en) * 2008-04-28 2009-11-04 富士通株式会社 Dictionary generating device and information retrieving device
US20100299132A1 (en) * 2009-05-22 2010-11-25 Microsoft Corporation Mining phrase pairs from an unstructured resource

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955457A (en) * 2014-05-20 2014-07-30 陈北宗 Machine-aided literature translation program
CN105912523A (en) * 2016-04-06 2016-08-31 苏州大学 Word meaning marking method and device
CN105912523B (en) * 2016-04-06 2019-07-19 苏州大学 A kind of word sense tagging method and apparatus
CN106326401A (en) * 2016-08-22 2017-01-11 联想(北京)有限公司 Industry subject term obtaining method, and subject-free term bank building method and device
CN107168958A (en) * 2017-05-15 2017-09-15 北京搜狗科技发展有限公司 A kind of interpretation method and device
CN107766337A (en) * 2017-09-25 2018-03-06 沈阳航空航天大学 Translation Forecasting Methodology based on deep semantic association
CN108563643A (en) * 2018-03-27 2018-09-21 常熟鑫沐奇宝软件开发有限公司 A kind of polysemy interpretation method based on artificial intelligence knowledge mapping
CN108563643B (en) * 2018-03-27 2021-10-01 常熟鑫沐奇宝软件开发有限公司 Artificial intelligence knowledge graph-based word polysemous translation method
CN108829685A (en) * 2018-05-07 2018-11-16 内蒙古工业大学 A kind of illiteracy Chinese inter-translation method based on single language training
CN109408814B (en) * 2018-09-30 2020-08-07 中国地质大学(武汉) Chinese-English cross-language vocabulary representation learning method and system based on paraphrase primitive words
CN109408814A (en) * 2018-09-30 2019-03-01 中国地质大学(武汉) Across the language vocabulary representative learning method and system of China and Britain based on paraphrase primitive word
CN109902673A (en) * 2019-01-28 2019-06-18 北京明略软件系统有限公司 Table Header information identification and method for sorting, system, terminal and storage medium in table
CN111310481A (en) * 2020-01-19 2020-06-19 百度在线网络技术(北京)有限公司 Speech translation method, device, computer equipment and storage medium
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN111597826B (en) * 2020-05-15 2021-10-01 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN112101016A (en) * 2020-11-05 2020-12-18 广州云趣信息科技有限公司 Word segmentation device obtaining method and device and electronic equipment
CN112101016B (en) * 2020-11-05 2021-03-23 广州云趣信息科技有限公司 Word segmentation device obtaining method and device and electronic equipment
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121121