CN101425087A - Method and system for constructing dictionary - Google Patents

Method and system for constructing dictionary

Info

Publication number
CN101425087A
Authority
CN
China
Prior art keywords
chinese
translation
sentence
alphabet
foreign
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008102224266A
Other languages
Chinese (zh)
Inventor
李志恒
李新娟
包塔
邓毅
周枫
周杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NET EASE YOUDAO INFORMATION TECHNOLOGY (BEIJING) Co Ltd
Netease Youdao Information Technology Beijing Co Ltd
Original Assignee
NET EASE YOUDAO INFORMATION TECHNOLOGY (BEIJING) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NET EASE YOUDAO INFORMATION TECHNOLOGY (BEIJING) Co Ltd
Priority to CNA2008102224266A
Publication of CN101425087A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a method and a system for constructing a dictionary. The method comprises the following steps: extracting, from a large number of web pages, foreign-language text that matches a predetermined pattern together with the Chinese text before and/or after it; determining, from the extracted Chinese text, the Chinese text whose number of occurrences reaches or exceeds a predetermined count as the Chinese definition of the foreign-language text; and building an index over the Chinese text and the foreign-language text it defines. The invention improves the efficiency of dictionary construction and allows the dictionary to include as many entries as possible.

Description

Method and system for constructing a dictionary
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and system for constructing a dictionary.
Background
With the development of Internet technology, more and more websites offer online translation services. Through such a service a user can look up the foreign-language expression for a Chinese term, or the Chinese expression for a foreign-language term.
How capable an online translation service is depends on whether the vocabulary in its dictionary is rich and accurate, and the vocabulary in existing Chinese-foreign dictionaries generally depends on manual entry and editing.
In researching and practicing the prior art, the inventors found the following problem:
In existing online translation, the Chinese-foreign dictionary depends on manual entry and editing. This imposes a heavy workload and low efficiency on dictionary construction, and limits the amount of lexical information the dictionary can include.
Summary of the invention
The purpose of the embodiments of the invention is to provide a method and system for constructing a dictionary, so that the dictionary is built automatically and includes as many entries as possible.
To solve the above technical problem, the method and system provided by the embodiments of the invention are implemented as follows:
A method for constructing a dictionary comprises:
Extracting, from a massive number of web pages, foreign-language text that matches a predetermined pattern and the Chinese text before and/or after it;
Determining, among the Chinese text extracted before and after the foreign-language text, the identical Chinese text whose number of occurrences reaches or exceeds a predetermined count as the Chinese definition of the foreign-language text;
Building an index over the Chinese text and the foreign-language text it defines.
Preferably, the foreign-language text that matches a predetermined pattern may comprise:
Foreign-language text placed in parentheses; or,
A Chinese expression in a predetermined format.
Preferably, after the index is built, the method may further comprise:
When a query request is received, looking up the translation of the query word in the index.
Preferably, before the index is built over the Chinese text and the foreign-language text it defines, the method further comprises:
Extracting bilingual Chinese-foreign word and phrase lists from the massive number of web pages.
Preferably, the extraction process may further comprise:
Filtering out or correcting wrong translations among the candidate translations according to the errors that commonly affect words or phrases on web pages, gathering together the different foreign-language translations of the same Chinese word, and merging identical translations of the same Chinese or foreign-language word or phrase.
Preferably, extracting the bilingual Chinese-foreign word and phrase lists may further comprise:
For the sentences in the extracted bilingual word and phrase lists, filtering out or correcting wrong translations among the candidate translations according to the errors that commonly affect sentences on web pages, gathering together the different foreign-language translations of the same Chinese sentence, and merging identical translations of the same Chinese or foreign-language sentence.
Preferably, before the index is built over the Chinese text and the foreign-language text it defines, the method may further comprise:
Extracting, from the massive number of web pages, paragraphs in which Chinese and foreign-language text alternate, identifying among them the paragraphs that are translations of each other, and parsing the mutually corresponding sentences out of those paragraphs.
Preferably, extracting the paragraphs in which Chinese and foreign-language text alternate may further comprise:
For the sentences in the extracted bilingual word and phrase lists, filtering out or correcting wrong translations among the candidate translations according to the errors that commonly affect sentences on web pages, gathering together the different foreign-language translations of the same Chinese sentence, and merging identical translations of the same Chinese or foreign-language sentence.
A system for constructing a dictionary comprises:
A bilingual fragment extraction unit, configured to extract, from a massive number of web pages, foreign-language text that matches a predetermined pattern and the Chinese text before and/or after it;
A definition determination unit, configured to determine the Chinese text whose number of occurrences among the Chinese text extracted before and after the foreign-language text reaches or exceeds a predetermined count as the Chinese definition of the foreign-language text;
An index building unit, configured to build an index over the Chinese text and the foreign-language text it defines.
Preferably, the foreign-language text that matches a predetermined pattern may comprise:
Foreign-language text placed in parentheses; or,
A Chinese expression in a predetermined format.
Preferably, the system may further comprise:
A query unit, configured to look up the translation of the query word in the index when a query request is received.
Preferably, the system may further comprise:
A bilingual word and phrase list extraction unit, configured to extract bilingual Chinese-foreign word and phrase lists from the massive number of web pages;
Correspondingly, the index building unit is configured to build an index over the Chinese text and the foreign-language text it defines.
Preferably, the system may further comprise:
A vocabulary optimization unit, configured to filter out or correct wrong translations among the candidate translations according to the errors that commonly affect words or phrases on web pages, gather together the different foreign-language translations of the same Chinese word, and then merge identical translations of the same Chinese or foreign-language word or phrase.
Preferably, the system may further comprise:
A sentence pair optimization unit, configured to, for the sentences in the extracted bilingual word and phrase lists, filter out or correct wrong translations among the candidate translations according to the errors that commonly affect sentences on web pages, gather together the different foreign-language translations of the same Chinese sentence, and merge identical translations of the same Chinese or foreign-language sentence.
Preferably, the system may further comprise:
A bilingual paragraph extraction unit, configured to extract, from the massive number of web pages, paragraphs in which Chinese and foreign-language text alternate, identify among them the paragraphs that are translations of each other, and parse the mutually corresponding sentences out of those paragraphs.
Preferably, the system may further comprise:
A sentence pair optimization unit, configured to, for the sentences in the extracted bilingual word and phrase lists, filter out or correct wrong translations among the candidate translations according to the errors that commonly affect sentences on web pages, gather together the different foreign-language translations of the same Chinese sentence, and merge identical translations of the same Chinese or foreign-language sentence.
As can be seen from the technical solution provided by the above embodiments of the invention, foreign-language text that matches a predetermined pattern and the Chinese text before and/or after it are extracted from a massive number of web pages; the Chinese text whose number of occurrences among the extracted Chinese text reaches or exceeds a predetermined count is determined as the Chinese definition of the foreign-language text; and an index is built over the Chinese text and the foreign-language text it defines. The dictionary is thus constructed automatically. Chinese and foreign-language words and phrases that are translations of each other are harvested from the web pages of the Internet, which holds a massive amount of information, so manual entry and editing are avoided, the efficiency of dictionary construction improves, and the dictionary includes as many entries as possible.
Description of drawings
Fig. 1 is a flowchart of the first method embodiment of the invention;
Fig. 2a and Fig. 2b are examples of two bilingual word and phrase lists;
Fig. 3 is a flowchart of the second method embodiment of the invention;
Fig. 4 is a flowchart of the third method embodiment of the invention;
Fig. 5 is a flowchart of the fourth method embodiment of the invention;
Fig. 6 is a block diagram of the first system embodiment of the invention;
Fig. 7 is a block diagram of the second system embodiment of the invention;
Fig. 8 is a block diagram of the third system embodiment of the invention;
Fig. 9 is a block diagram of the fourth system embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention provide a method and system for constructing a dictionary.
To help those skilled in the art better understand the solution of the invention, the embodiments are described in further detail below with reference to the drawings.
The first method embodiment is introduced first. Fig. 1 shows its flow. As shown in Fig. 1, the first method embodiment comprises:
S101: Extract, from a massive number of web pages, foreign-language text that matches a predetermined pattern and the Chinese text before and/or after it.
The predetermined pattern may include at least the following two kinds:
(1) Text placed in parentheses.
For example, the following text, in which a pair of parentheses encloses foreign-language text:
... Hong Kong University and Hong Kong Chinese University (CUHK)
Present at Hong Kong Chinese University (CUHK) ...
He and Hong Kong Chinese University (CUHK) ...
In this text, CUHK is enclosed in parentheses and is preceded by Chinese text, for example "Hong Kong University and Hong Kong Chinese University", "present at Hong Kong Chinese University", and "he and Hong Kong Chinese University".
As another example, the following text, in which a pair of parentheses encloses foreign-language text:
... show The Lord of the Rings (The Lord of the Rings).
In this text, The Lord of the Rings is enclosed in parentheses and is preceded by a run of Chinese text, "... show The Lord of the Rings".
(2) A Chinese expression in a predetermined format.
For example, for an expression of the form "Chinese ..., English ...", the following text:
Chinese title: " The Lord of the Rings ", English title: " The Lord of the Rings ".
The patterns listed in (1) and (2) are only examples. Other predetermined patterns are possible, and they can be made explicit by defining them in advance.
This step extracts the foreign-language text that matches a predetermined pattern in the massive number of web pages on the Internet, together with a run of Chinese text before and after it. Note that methods for crawling massive numbers of web pages already exist in the prior art, for example current Internet search technology, so the extraction from web pages in this step can be built on them.
In some cases, what is extracted is the foreign-language text matching the predetermined pattern and the Chinese text after it, for example when an English word matching the pattern is followed by Chinese text. Chinese text may also appear both before and after the English word. Both cases should be taken into account.
Note that the foreign-language text extracted in this step and the Chinese text around it are very likely to be translations of each other. This step first extracts text of this type from the web pages, and the Chinese definition of the foreign-language text is determined afterwards.
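The two example patterns of S101 can be sketched with regular expressions. This is a minimal illustration, not the patent's implementation: the function names, the character ranges, the `中文名`/`英文名` labels, and the Chinese sample strings are all assumptions chosen to mirror the translated examples above.

```python
import re

# Assumed pattern (1): a run of Chinese characters immediately followed by
# Latin-script text inside ASCII or full-width parentheses.
PAREN_PATTERN = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'\-]*?)\s*[)）]"
)

# Assumed pattern (2): a fixed "Chinese title: 《...》, English title: 《...》" format.
LABELED_PATTERN = re.compile(
    r"中文名[:：]\s*《([^》]+)》\s*[,，]\s*英文名[:：]\s*《([^》]+)》"
)

def extract_parenthesized(text):
    """Return (chinese_text_before, foreign_text) pairs for pattern (1)."""
    return [(m.group(1), m.group(2)) for m in PAREN_PATTERN.finditer(text)]

def extract_labeled(text):
    """Return (chinese_title, foreign_title) pairs for pattern (2)."""
    return [(m.group(1), m.group(2)) for m in LABELED_PATTERN.finditer(text)]

if __name__ == "__main__":
    print(extract_parenthesized("他和香港中文大学(CUHK)合作"))
    print(extract_labeled("中文名:《魔戒》,英文名:《The Lord of the Rings》"))
```

Note that pattern (1) returns the whole run of Chinese text before the parenthesis, including words that are not part of the definition. Narrowing that run down is the job of S102.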
S102: Among the Chinese text extracted before and after the foreign-language text, determine the identical Chinese text whose number of occurrences reaches or exceeds a predetermined count as the Chinese definition of the foreign-language text.
As mentioned above, the foreign-language text extracted in S101 and the Chinese text around it are very likely to be translations of each other. S102 determines the Chinese translation of the foreign-language text from among all the candidate Chinese translations.
Take the example in S101. For the foreign-language text CUHK, "Hong Kong University" and "Hong Kong Chinese University" both appear before it in the web pages; "Hong Kong Chinese University" appears 3 times and "Hong Kong University" appears once. If the predetermined count is 2, the 3 occurrences of "Hong Kong Chinese University" exceed it, so "Hong Kong Chinese University" is determined as the Chinese definition of CUHK.
In example (1) of S101 above:
... Hong Kong University and Hong Kong Chinese University (CUHK)
Present at Hong Kong Chinese University (CUHK) ...
He and Hong Kong Chinese University (CUHK) ...
The first example also contains a word meaning "and" before the university name, the second a word meaning "at", and the third another word meaning "and". These words are not Chinese text that is identical across the three examples before and/or after the matching foreign-language text; the three words differ from one another. Differing words are therefore not considered, and only the text up to the identical "Hong Kong Chinese University" is taken into account.
Note that a given foreign-language text may have more than one Chinese definition. It may have several, which may differ in part of speech or carry different meanings in different contexts. In S102, the Chinese text whose number of occurrences among the extracted Chinese text reaches the predetermined count is determined as the Chinese definition. If several Chinese texts reach the predetermined count, several Chinese definitions are obtained for the foreign-language text. This collects all of its Chinese definitions and keeps the definitions in the dictionary rich.
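One way to realize the counting in S102 is sketched below. The patent does not say how candidate boundaries are chosen, so this sketch assumes that the definition ends immediately before the foreign-language text and counts every suffix of each preceding Chinese run; `choose_definitions`, `min_count`, and the sample strings are hypothetical.

```python
from collections import Counter

def choose_definitions(contexts, min_count):
    """Pick the Chinese definitions of one foreign-language term.

    contexts: Chinese strings found immediately before the term on
    different pages. A suffix shared by at least min_count contexts is a
    candidate; only maximal candidates are returned.
    """
    counts = Counter()
    for ctx in contexts:
        # Suffixes of one context all have different lengths, so each
        # distinct suffix is counted once per context.
        for start in range(len(ctx)):
            counts[ctx[start:]] += 1
    frequent = [s for s, n in counts.items() if n >= min_count]
    # Drop candidates that are merely the tail of a longer candidate.
    return sorted(
        s for s in frequent
        if not any(other != s and other.endswith(s) for other in frequent)
    )

if __name__ == "__main__":
    contexts = ["香港大学及香港中文大学", "现任职于香港中文大学", "他和香港中文大学"]
    print(choose_definitions(contexts, min_count=2))
```

With a threshold of 2, the differing words before the university name each occur once and are discarded, while the shared text survives. If two unrelated suffixes both reach the threshold, both are returned, which matches the case of a term with several definitions.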
Once the Chinese translation of a foreign-language text has been determined, the translation from the foreign-language text to Chinese is available; from it, the translation from that Chinese text back to the foreign-language text can obviously be obtained as well.
Only Chinese and English are used as examples here, but Chinese can obviously be paired with other languages. Depending on the languages used in the extracted web pages, S101 and S102 can also determine definition correspondences between two foreign languages.
In existing web pages, adjacent Chinese and foreign-language text are not necessarily definitions of each other. S102 provides a way to determine the definition relation, which improves the accuracy and reliability of the translation service.
S103: Build an index over the Chinese text and the foreign-language text it defines.
In this step the index may be an inverted index. Building an inverted index is an existing method in the prior art and is not described here.
Those skilled in the art will appreciate that once the index is built, the dictionary has been constructed.
In addition, S102 may specifically further comprise:
S102A: Filter out or correct wrong translations among the candidate translations according to the errors that commonly affect words or phrases on web pages, gather together the different foreign-language translations of the same Chinese word, and then merge identical translations of the same Chinese or foreign-language word or phrase.
Common errors include misspelled words, spaces or other garbled characters inserted in the middle of a word, missing spaces between the words of a phrase, and missing or miswritten Chinese characters. One way to handle such errors is to compare the different Chinese definitions found for the same English text: if some of the definitions are identical, only those identical definitions are kept, because they are the most likely to be correct. Conversely, the different English definitions found for the same Chinese text can be handled in the same way.
There are many ways to judge whether two definitions are identical; a common one is to compute the edit distance between the two strings. For example, if the edit distance between several Chinese definitions of the same English text is below a preset threshold, they are judged identical. Several ways to compute edit distance have been published in existing papers. Similarly, for several English definitions of the same Chinese text, the edit distances between the English strings are computed and identical definitions are judged against a preset threshold.
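As an illustration of the threshold test, the sketch below uses the standard Levenshtein distance (insertions, deletions, substitutions, each at cost 1). The patent does not name a specific edit distance, so this choice, the function names, and the threshold value in the example are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

def same_definition(a, b, threshold):
    """Treat two definitions as identical if their distance is below threshold."""
    return edit_distance(a, b) < threshold

if __name__ == "__main__":
    # A missing space between the words of a phrase is one edit away.
    print(edit_distance("green tea", "greentea"))
    print(same_definition("green tea", "greentea", threshold=2))
```

A fixed absolute threshold is the simplest reading of the text; a threshold relative to string length would be an alternative design, since one edit matters more in a two-character Chinese word than in a long English phrase.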
In practice, because the Internet contains a massive number of web pages, the amount of data to process is very large. This step generally may not be computable on a single computer, and a distributed map-reduce computation may be used to gather together the different foreign-language definitions of the same Chinese text.
S102A addresses the large amount of noise that may arise while constructing the dictionary, that is, the large number of wrong translations.
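The map-reduce grouping mentioned for S102A can be illustrated in a single process. This sketch only imitates the map, shuffle, and reduce phases in memory; it does not use any distributed framework, and the function names and sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs; the Chinese text is the grouping key."""
    chinese, foreign = record
    yield chinese, foreign

def reduce_phase(chinese, foreign_values):
    """Merge identical translations and return the distinct ones, sorted."""
    return chinese, sorted(set(foreign_values))

def run_job(records):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())

if __name__ == "__main__":
    records = [
        ("魔戒", "The Lord of the Rings"),
        ("魔戒", "Lord of the Rings"),
        ("魔戒", "The Lord of the Rings"),
    ]
    print(run_job(records))
```

In a real cluster the shuffle would route all values for one Chinese key to the same reducer, which is what lets each reducer compare and merge the translations of that key independently.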
After S103, the method may further comprise:
S104: When a query request is received, look up the translation of the query word in the index.
When a query word in one language is entered, the translation in the configured target language can be found through the index, based on the vocabulary stored in the dictionary.
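The indexing step S103 and the query step S104 can be sketched together. This is a simplified in-memory stand-in for an inverted index: it keys whole terms on both sides to entry ids, whereas a production inverted index would typically also tokenize. `build_index`, `lookup`, and the sample entries are assumptions.

```python
from collections import defaultdict

def build_index(entries):
    """entries: list of (foreign, chinese) pairs.

    Returns a mapping from each term (either side) to the ids of the
    entries that contain it. Foreign terms are lowercased.
    """
    index = defaultdict(set)
    for entry_id, (foreign, chinese) in enumerate(entries):
        index[foreign.lower()].add(entry_id)
        index[chinese].add(entry_id)
    return index

def lookup(entries, index, query):
    """Return the translations of query, in either direction."""
    key = query.lower()
    results = set()
    for entry_id in index.get(key, ()):
        foreign, chinese = entries[entry_id]
        # Return the side of the entry that the query did not match.
        results.add(chinese if foreign.lower() == key else foreign)
    return sorted(results)

if __name__ == "__main__":
    entries = [("CUHK", "香港中文大学"), ("The Lord of the Rings", "魔戒")]
    index = build_index(entries)
    print(lookup(entries, index, "cuhk"))
    print(lookup(entries, index, "魔戒"))
```

Indexing both sides of each entry is what makes the dictionary usable in both directions, as the text notes when it says the reverse translation follows from the forward one.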
The second method embodiment is introduced next. In addition to all the steps of the first method embodiment, it may further comprise, before S103:
S105: Extract bilingual Chinese-foreign word and phrase lists from the massive number of web pages.
Many web pages that present Chinese alongside a foreign language contain bilingual word and phrase lists. Fig. 2a and Fig. 2b each show an example of such a list.
In Fig. 2a, the web page shows a list of Chinese and English dish names. Each row can be split in two by script, Chinese and foreign, and the Chinese and foreign-language parts appear in the same order in every row. In such a two-column list, the two entries in the same row are usually translations of each other. S105 extracts this Chinese-English correspondence from the web page.
In Fig. 2b, the web page shows a Chinese-English word table with two columns: in every row the left column holds the Chinese word or phrase and the right column holds the foreign-language one. The two entries in the same row are clearly translations of each other, and S105 extracts this Chinese-English correspondence from the web page.
Note that when this method embodiment includes S105, S105 can be processed in parallel with S101 and S102. After S105 is executed, the flow can enter S103, where an index is built over the Chinese text and the corresponding foreign-language text in the bilingual word and phrase lists extracted in S105. Then, in S104, when a query request is received, the translation of the query word is looked up in the index.
In addition, S105 may specifically further comprise:
S102A: For the words or phrases in the extracted bilingual word and phrase lists, filter out or correct wrong translations among the candidate translations according to the errors that commonly affect words or phrases on web pages, gather together the different foreign-language translations of the same Chinese word, and merge identical translations of the same Chinese or foreign-language word or phrase.
For the implementation of this step, see the explanation of S102A above.
Similarly, S102A here addresses the large amount of noise that may arise while constructing the dictionary, that is, the large number of wrong translations.
In addition, after S105 and before S103, the second method embodiment may further comprise:
S106: For the sentences in the extracted bilingual word and phrase lists, filter out or correct wrong translations among the candidate translations according to the errors that commonly affect sentences on web pages, gather together the different foreign-language translations of the same Chinese sentence, and merge identical translations of the same Chinese or foreign-language sentence.
As with S102A above, common errors include misspelled words, spaces or other garbled characters inserted in the middle of a word, missing spaces between the words of a phrase, and missing or miswritten Chinese characters. One way to handle such errors is to compare the different Chinese definitions found for the same English sentence. If some whole-sentence definitions are identical, only those are kept, because they are the most likely to be correct. Conversely, the different English whole-sentence definitions of the same Chinese sentence can be handled in the same way.
There are many ways to judge whether two definitions are identical; a common one is to compute the edit distance between the two strings. For example, if the edit distance between several Chinese whole-sentence definitions of the same English sentence is below a preset threshold, they are judged identical. Several ways to compute edit distance have been published in existing papers. Similarly, for several English definitions of the same Chinese sentence, the edit distances between the English strings are computed and identical definitions are judged against a preset threshold. Note that when judging whether sentences are identical, by edit distance or otherwise, the punctuation marks in the sentences can be removed first. This helps identify truly identical whole-sentence definitions more accurately, so that duplicates are not missed because of differences in punctuation.
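Removing punctuation before comparing sentences can be sketched with the Unicode character categories, which cover both ASCII and full-width Chinese punctuation. Exact equality after normalization is used here for brevity; an edit-distance threshold could be applied to the normalized strings instead. The function names are hypothetical.

```python
import unicodedata

def normalize_sentence(sentence):
    """Drop punctuation (Unicode categories starting with 'P') and collapse whitespace."""
    kept = "".join(
        ch for ch in sentence
        if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())

def same_sentence(a, b):
    """Compare two sentences after punctuation has been removed."""
    return normalize_sentence(a) == normalize_sentence(b)

if __name__ == "__main__":
    print(normalize_sentence("Hello, world!"))
    print(same_sentence("你好，世界！", "你好世界"))
```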
In practice, because the Internet contains a massive number of web pages, the amount of data to process is very large. This step generally may not be computable on a single computer, and a distributed map-reduce computation may be used to gather together the different foreign-language definitions of the same Chinese text.
Similarly, S106 addresses the large amount of noise that may arise while constructing the dictionary, that is, the large number of wrong translations.
The flow of the second method embodiment may be as shown in Fig. 3.
The third method embodiment is introduced next. In addition to all the steps of the first method embodiment, it may further comprise, before S103:
S107: Extract, from the massive number of web pages, paragraphs in which Chinese and foreign-language text alternate, identify among them the paragraphs that are translations of each other, and parse the mutually corresponding sentences out of those paragraphs.
Whether two paragraphs are translations of each other can be judged by maintaining a standard foreign-Chinese dictionary (containing the Chinese translation of each common foreign-language word), replacing the foreign-language part of the bilingual text word by word with the Chinese translations from this dictionary, and then computing the edit distance between the result and the Chinese part. If the distance is below a threshold, the two can be judged to be translations of each other.
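The dictionary-substitution test for S107 can be sketched as follows. The seed dictionary, the whitespace tokenization, the choice to drop unknown words, and the threshold are all assumptions made for illustration; the Levenshtein distance stands in for whichever edit distance an implementation prefers.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_translation_pair(foreign_text, chinese_text, seed_dict, threshold):
    """Replace each foreign word with its seed translation, then compare.

    Words missing from seed_dict are dropped (an assumption), so a sparse
    dictionary inflates the distance.
    """
    rough = "".join(seed_dict.get(w.lower(), "") for w in foreign_text.split())
    return edit_distance(rough, chinese_text) < threshold

if __name__ == "__main__":
    seed = {"i": "我", "love": "爱", "you": "你"}
    print(is_translation_pair("I love you", "我爱你", seed, threshold=2))
    print(is_translation_pair("I love you", "今天天气很好", seed, threshold=2))
```

Word-by-word replacement ignores word-order differences between the two languages, so for longer paragraphs a threshold proportional to length is likely to work better than a small fixed one.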
There are many methods for aligning sentences in parallel text (that is, Chinese and foreign-language paragraphs that are translations of each other), in other words for parsing out the mutually corresponding sentences. For specific implementations, see the following papers:
● Y. Lü, M. Zhou, S. Li, C. Huang, T. Zhao: Automatic Translation Template Acquisition Based on Bilingual Structure Alignment. Computational Linguistics, 2001
● C. J. Lee, J. S. Chang, J. R. Jang: Alignment of bilingual named entities in parallel corpora using statistical models and multiple knowledge sources. TALIP, June 2006
Need to prove, when this method embodiment comprises S107, S107 can with S101, S102 parallel processing before.And, after S107 carries out, can enter S103.Thereby, in S103, for the paragraph that escorts the translation relation that S107 extracts is set up index.And then, in S104, when receiving query requests, according to the translation of the index search query word correspondence of setting up.
Similarly,, solve the problem that in making up the dictionary process, may have a large amount of noises, promptly have the problem of a large amount of wrong translations by the processing of S107.
In addition, after S107, before the S103, can also comprise among this third party's method embodiment:
S108: for the sentence in the bilingual words and phrases tabulation of extracting, according to the common error situation filtering of relevant sentence on the internet page or revise wrong translation in candidate's translation, and the different foreign languages translations of same Chinese sentence correspondence are integrated into together, the more identical translation of same China and foreign countries sentence is merged.
The realization of this step can repeat no more here with reference to the description among the aforementioned S106.
Similarly,, solve the problem that in making up the dictionary process, may have a large amount of noises, promptly have the problem of a large amount of wrong translations by the processing of S108.
The flow of this third method embodiment may be as shown in Fig. 4.
The fourth method embodiment of the present invention is introduced below. This method embodiment is in fact a combination of the aforementioned second and third method embodiments, that is, it includes all the steps of the second method embodiment and the third method embodiment. The flow of this fourth method embodiment may be as shown in Fig. 5.
Here, S108 may be merged with S106.
Since every step of this fourth method embodiment has already been described in the preceding method embodiments, the steps are not repeated here.
As can be seen from the method embodiments provided by the present invention described above, foreign vocabulary conforming to a predetermined pattern, together with the Chinese text before and/or after it, is extracted from massive web pages; Chinese text whose number of occurrences in the extracted Chinese text before and after the foreign vocabulary reaches or exceeds a predetermined number is determined to be the Chinese paraphrase of the foreign vocabulary; and an index is built for the Chinese and the foreign language with the corresponding paraphrase. In this way, the dictionary is constructed automatically: Chinese and foreign vocabulary and phrases in a translation relation are captured from all web pages of the Internet, with its massive amount of information; manual input and editing are avoided; the efficiency of dictionary construction is improved; and as many words as possible are included in the dictionary.
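The extraction and counting steps summarized above can be sketched as follows. This is a minimal sketch under several assumptions that do not come from the patent text: the predetermined pattern is taken to be "foreign vocabulary in brackets", the foreign language is assumed to use Latin letters, at most 8 preceding Chinese characters are considered, and the longest candidate that reaches the threshold is chosen. The names `BRACKET_PATTERN`, `extract_candidates` and `determine_paraphrases` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern: up to 8 Chinese characters directly followed by a
# Latin-script term in half-width or full-width brackets.
BRACKET_PATTERN = re.compile(
    r"([\u4e00-\u9fff]{1,8})[(（]([A-Za-z][A-Za-z .\-]*[A-Za-z])[)）]"
)

def extract_candidates(pages):
    """For each bracketed foreign term, count every suffix of the Chinese
    text in front of it, because the paraphrase boundary is not known."""
    counts = defaultdict(Counter)
    for page in pages:
        for match in BRACKET_PATTERN.finditer(page):
            chinese, foreign = match.group(1), match.group(2).lower()
            for start in range(len(chinese)):
                counts[foreign][chinese[start:]] += 1
    return counts

def determine_paraphrases(counts, min_count=2):
    """Keep candidates seen at least min_count times (the predetermined
    number); the longest such candidate is chosen as an assumed tie-break."""
    paraphrases = {}
    for foreign, counter in counts.items():
        frequent = [text for text, n in counter.items() if n >= min_count]
        if frequent:
            paraphrases[foreign] = max(frequent, key=len)
    return paraphrases

pages = [
    "我喜欢计算机(computer)。",
    "他买了计算机（Computer）",
    "网络(network)很快",
]
paraphrases = determine_paraphrases(extract_candidates(pages), min_count=2)
```

With these three sample pages, only `computer` has a Chinese candidate that occurs at least twice; `network` appears once and is therefore left out.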
The system embodiments of the present invention for constructing a dictionary are introduced below. Fig. 6 shows the first embodiment of this system. As shown in Fig. 6, this system embodiment may include:
Bilingual fragment extraction unit 61, configured to extract, from massive web pages, foreign vocabulary conforming to a predetermined pattern and the Chinese text before and/or after the foreign vocabulary;
Paraphrase determining unit 62, configured to determine, as the Chinese paraphrase of the foreign vocabulary, Chinese text whose number of occurrences in the extracted Chinese text before and after the foreign vocabulary reaches or exceeds a predetermined number;
Index building unit 63, configured to build an index for the Chinese and the foreign language with the corresponding paraphrase.
Preferably, the foreign vocabulary conforming to the predetermined pattern includes:
Foreign vocabulary placed in brackets; or,
A Chinese expression conforming to a predetermined format.
Preferably, the system may further include:
Query unit 64, configured to look up, when a query request is received, the translation corresponding to the query word according to the established index.
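The cooperation of the index building unit 63 and the query unit 64 can be sketched as follows. This is an illustrative sketch only: an in-memory dictionary stands in for the index, the index is assumed to be two-way (foreign to Chinese and Chinese to foreign), and the names `build_index` and `query` are hypothetical.

```python
from collections import defaultdict

def build_index(paraphrases):
    """Build an assumed two-way index from a mapping of
    foreign term -> list of Chinese paraphrases."""
    index = defaultdict(set)
    for foreign, chinese_list in paraphrases.items():
        key = foreign.strip().lower()
        for chinese in chinese_list:
            index[key].add(chinese)   # foreign -> Chinese
            index[chinese].add(key)   # Chinese -> foreign
    return index

def query(index, word):
    """Return the translations stored for the query word, or an empty list."""
    return sorted(index.get(word.strip().lower(), ()))

index = build_index({"computer": ["计算机", "电脑"], "PC": ["电脑"]})
chinese_hits = query(index, "Computer")
foreign_hits = query(index, "电脑")
```

A real deployment would more likely use a persistent inverted index or a search engine; the dictionary here only shows the lookup direction in both languages.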
Preferably, the system may further include:
Vocabulary optimization unit 66, configured to filter out or correct wrong translations among the candidate translations according to the common error patterns of the relevant words or phrases on Internet pages, group together the different foreign-language translations corresponding to the same Chinese word, and merge the identical translations corresponding to the same Chinese or foreign word or phrase.
The second embodiment of the system of the present invention for constructing a dictionary is introduced below. Fig. 7 shows the second embodiment of this system. As shown in Fig. 7, in addition to all the units of the first embodiment, this embodiment may further include:
Bilingual word-and-sentence list extraction unit 65, configured to extract bilingual word-and-sentence lists of Chinese and a foreign language from massive web pages;
Correspondingly, the index building unit 63 is configured to build an index for the Chinese and the foreign language with the corresponding paraphrase.
Preferably, in the system, the connection to the index building unit 63 may also be indirect rather than direct: the upstream unit is first connected to the vocabulary optimization unit 66 and is thereby connected to the index building unit 63 through the vocabulary optimization unit 66. Here, the vocabulary optimization unit 66 is configured to filter out or correct wrong translations among the candidate translations according to the common error patterns of the relevant words or phrases on Internet pages, group together the different foreign-language translations corresponding to the same Chinese word, and merge the identical translations corresponding to the same Chinese or foreign word or phrase.
Preferably, the system may further include:
Sentence pair optimization unit 67. The bilingual word-and-sentence list extraction unit 65 is first connected to the sentence pair optimization unit 67 and is thereby connected to the index building unit 63 through the sentence pair optimization unit 67. The sentence pair optimization unit 67 is configured to filter out or correct wrong translations among the candidate translations according to the common error patterns of the relevant sentences on Internet pages, group together the different foreign-language translations corresponding to the same Chinese sentence, and merge the identical translations of the same Chinese or foreign-language sentence.
The third embodiment of the system of the present invention for constructing a dictionary is introduced below. Fig. 8 shows the third embodiment of this system. As shown in Fig. 8, in addition to all the units of the first embodiment, this embodiment may further include:
Bilingual paragraph extraction unit 68, configured to extract, from massive web pages, paragraphs in which Chinese and a foreign language appear alternately, determine which of these alternating paragraphs are translations of each other, and parse out the mutually corresponding sentences from the paragraphs that are translations of each other.
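The first task of the bilingual paragraph extraction unit 68, finding paragraphs in which Chinese and a foreign language alternate, can be sketched as follows. This is a rough sketch under stated assumptions: the foreign language is assumed to use Latin letters, the dominant language of a paragraph is guessed by a simple character count, and the names `dominant_language` and `candidate_paragraph_pairs` are hypothetical. Deciding whether a candidate pair really is a translation, and aligning its sentences, are left out.

```python
def dominant_language(paragraph):
    """Guess the dominant language of a paragraph by counting characters."""
    han = sum(1 for ch in paragraph if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in paragraph if ch.isascii() and ch.isalpha())
    if han == 0 and latin == 0:
        return None  # e.g. digits or punctuation only
    return "zh" if han >= latin else "foreign"

def candidate_paragraph_pairs(paragraphs):
    """Pair adjacent paragraphs whose dominant languages differ."""
    pairs = []
    i = 0
    while i + 1 < len(paragraphs):
        first, second = paragraphs[i], paragraphs[i + 1]
        languages = {dominant_language(first), dominant_language(second)}
        if languages == {"zh", "foreign"}:
            pairs.append((first, second))
            i += 2  # both paragraphs are consumed by this pair
        else:
            i += 1
    return pairs

paragraphs = [
    "今天天气很好。",
    "The weather is nice today.",
    "导航 首页",
    "我喜欢读书。",
    "I like reading.",
]
candidates = candidate_paragraph_pairs(paragraphs)
```

The pairs returned are only candidates; the judgment of the translation relation and the sentence alignment described for unit 68 would have to follow as separate steps.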
Preferably, the system may further include:
Sentence pair optimization unit 69, configured to, for the sentences in the extracted bilingual word-and-sentence list, filter out or correct wrong translations among the candidate translations according to the common error patterns of the relevant sentences on Internet pages, group together the different foreign-language translations corresponding to the same Chinese sentence, and merge the identical translations of the same Chinese or foreign-language sentence.
Here, the sentence pair optimization unit 69 may be merged with the sentence pair optimization unit 67.
The fourth embodiment of the system of the present invention for constructing a dictionary is introduced below. Fig. 9 shows the fourth embodiment of this system. As can be seen from Fig. 9, this embodiment includes all the units of the first, second and third embodiments. The functions of these units are similar to those in the preceding system embodiments and are not repeated here. It should also be noted that the sentence pair optimization unit 67 and the sentence pair optimization unit 69 may be implemented by the same unit.
The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
The dictionary constructed by the present invention can be made available over a network, for example a private network or the Internet, for access by numerous general-purpose or special-purpose computing system environments or configurations, so that an online translation service can be implemented. The general-purpose or special-purpose computing system environments or configurations referred to here may be, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Although the present invention has been described by way of embodiments, those of ordinary skill in the art will appreciate that the present invention admits many variations and changes without departing from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of the present invention.

Claims (15)

1. A method for constructing a dictionary, characterized by comprising:
Extracting, from massive web pages, foreign vocabulary conforming to a predetermined pattern and the Chinese text before and/or after the foreign vocabulary;
Determining, as the Chinese paraphrase of the foreign vocabulary, identical Chinese text whose number of occurrences in the extracted Chinese text before and after the foreign vocabulary reaches or exceeds a predetermined number;
Building an index for the Chinese and the foreign language with the corresponding paraphrase.
2. The method of claim 1, characterized in that the foreign vocabulary conforming to the predetermined pattern comprises:
Foreign vocabulary placed in brackets; or,
A Chinese expression conforming to a predetermined format.
3. The method of claim 1, characterized in that, after the index is built, the method further comprises:
When a query request is received, looking up the translation corresponding to the query word according to the established index.
4. The method of claim 1, characterized in that, before the index is built for the Chinese and the foreign language with the corresponding paraphrase, the method further comprises:
Extracting bilingual word-and-sentence lists of Chinese and a foreign language from massive web pages.
5. The method of claim 1 or 4, characterized in that the extraction process further comprises:
Filtering out or correcting wrong translations among the candidate translations according to the common error patterns of the relevant words or phrases on Internet pages, grouping together the different foreign-language translations corresponding to the same Chinese word, and merging the identical translations corresponding to the same Chinese or foreign word or phrase.
6. The method of claim 4, characterized in that the process of extracting the bilingual word-and-sentence lists of Chinese and a foreign language further comprises:
For the sentences in the extracted bilingual word-and-sentence list, filtering out or correcting wrong translations among the candidate translations according to the common error patterns of the relevant sentences on Internet pages, grouping together the different foreign-language translations corresponding to the same Chinese sentence, and merging the identical translations of the same Chinese or foreign-language sentence.
7. The method of any one of claims 1, 4, 5 and 6, characterized in that, before the index is built for the Chinese and the foreign language with the corresponding paraphrase, the method further comprises:
Extracting, from massive web pages, paragraphs in which Chinese and a foreign language appear alternately, determining which of these alternating paragraphs are translations of each other, and parsing out the mutually corresponding sentences from the paragraphs that are translations of each other.
8. The method of claim 7, characterized in that the process of extracting the paragraphs in which Chinese and a foreign language appear alternately further comprises:
For the sentences in the extracted bilingual word-and-sentence list, filtering out or correcting wrong translations among the candidate translations according to the common error patterns of the relevant sentences on Internet pages, grouping together the different foreign-language translations corresponding to the same Chinese sentence, and merging the identical translations of the same Chinese or foreign-language sentence.
9. A system for constructing a dictionary, characterized by comprising:
A bilingual fragment extraction unit, configured to extract, from massive web pages, foreign vocabulary conforming to a predetermined pattern and the Chinese text before and/or after the foreign vocabulary;
A paraphrase determining unit, configured to determine, as the Chinese paraphrase of the foreign vocabulary, Chinese text whose number of occurrences in the extracted Chinese text before and after the foreign vocabulary reaches or exceeds a predetermined number;
An index building unit, configured to build an index for the Chinese and the foreign language with the corresponding paraphrase.
10. The system of claim 9, characterized in that the foreign vocabulary conforming to the predetermined pattern comprises:
Foreign vocabulary placed in brackets; or,
A Chinese expression conforming to a predetermined format.
11. The system of claim 9, characterized in that the system further comprises:
A query unit, configured to look up, when a query request is received, the translation corresponding to the query word according to the established index.
12. The system of claim 9, characterized in that the system further comprises:
A bilingual word-and-sentence list extraction unit, configured to extract bilingual word-and-sentence lists of Chinese and a foreign language from massive web pages;
Correspondingly, the index building unit is configured to build an index for the Chinese and the foreign language with the corresponding paraphrase.
13. The system of claim 9 or 12, characterized in that the system further comprises:
A vocabulary optimization unit, configured to filter out or correct wrong translations among the candidate translations according to the common error patterns of the relevant words or phrases on Internet pages, group together the different foreign-language translations corresponding to the same Chinese word, and then merge the identical translations corresponding to the same Chinese or foreign word or phrase.
14. The system of claim 12, characterized in that the system further comprises:
A sentence pair optimization unit, configured to, for the sentences in the extracted bilingual word-and-sentence list, filter out or correct wrong translations among the candidate translations according to the common error patterns of the relevant sentences on Internet pages, group together the different foreign-language translations corresponding to the same Chinese sentence, and merge the identical translations of the same Chinese or foreign-language sentence.
15. The system of any one of claims 9, 12, 13 and 14, characterized in that the system further comprises:
A bilingual paragraph extraction unit, configured to extract, from massive web pages, paragraphs in which Chinese and a foreign language appear alternately, determine which of these alternating paragraphs are translations of each other, and parse out the mutually corresponding sentences from the paragraphs that are translations of each other.
16. The system of claim 15, characterized in that the system further comprises:
A sentence pair optimization unit, configured to, for the sentences in the extracted bilingual word-and-sentence list, filter out or correct wrong translations among the candidate translations according to the common error patterns of the relevant sentences on Internet pages, group together the different foreign-language translations corresponding to the same Chinese sentence, and merge the identical translations of the same Chinese or foreign-language sentence.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008102224266A CN101425087A (en) 2008-09-16 2008-09-16 Method and system for constructing dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008102224266A CN101425087A (en) 2008-09-16 2008-09-16 Method and system for constructing dictionary

Publications (1)

Publication Number Publication Date
CN101425087A true CN101425087A (en) 2009-05-06

Family

ID=40615700

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008102224266A Pending CN101425087A (en) 2008-09-16 2008-09-16 Method and system for constructing dictionary

Country Status (1)

Country Link
CN (1) CN101425087A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102550049A (en) * 2009-09-25 2012-07-04 雅虎公司 Acquisition of out-of-vocabulary translations by dynamically learning extraction rules
CN107451129A (en) * 2017-08-08 2017-12-08 传神语联网网络科技股份有限公司 The judgement of unconventional word or unconventional short sentence and interpretation method and its system
CN107515848A (en) * 2017-10-12 2017-12-26 刘啸旻 The bilingual mark and composition method of books or electronic document
CN112632282A (en) * 2020-12-30 2021-04-09 中科院计算技术研究所大数据研究院 Chinese and English thesis data classification and query method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102550049A (en) * 2009-09-25 2012-07-04 雅虎公司 Acquisition of out-of-vocabulary translations by dynamically learning extraction rules
CN102550049B (en) * 2009-09-25 2016-05-25 雅虎公司 Obtain the translation outside vocabulary by dynamic learning extracting rule
CN107451129A (en) * 2017-08-08 2017-12-08 传神语联网网络科技股份有限公司 The judgement of unconventional word or unconventional short sentence and interpretation method and its system
CN107451129B (en) * 2017-08-08 2020-09-25 传神语联网网络科技股份有限公司 Method and system for judging and translating irregular words or irregular short sentences
CN107515848A (en) * 2017-10-12 2017-12-26 刘啸旻 The bilingual mark and composition method of books or electronic document
CN112632282A (en) * 2020-12-30 2021-04-09 中科院计算技术研究所大数据研究院 Chinese and English thesis data classification and query method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20090506