CN102486770B - Character conversion method and system - Google Patents

Info

Publication number
CN102486770B
Authority
CN
China
Prior art keywords
words
language
word
target language
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010576958.7A
Other languages
Chinese (zh)
Other versions
CN102486770A (en)
Inventor
杨秉哲
吴世弘
谷圳
林倩慧
卢家庆
谢文泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute for Information Industry
Original Assignee
Institute for Information Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute for Information Industry
Priority to CN201010576958.7A
Publication of CN102486770A
Application granted
Publication of CN102486770B
Current legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a character conversion method and system. The system comprises a storage unit, a classification unit, and a conversion unit. The storage unit stores a word comparison table recording the word correspondence between a source language and a target language. The classification unit performs word segmentation on a text paragraph in the source language to obtain a plurality of segmentation results, and compares the segmentation results with the word comparison table to determine whether each source-language word in the paragraph belongs to a first kind or a second kind; a source-language word of the first kind corresponds to a single target-language word, and a source-language word of the second kind corresponds to a plurality of candidate target-language words. The conversion unit converts each source-language word of the first kind into its target-language word according to the word comparison table and, for each source-language word of the second kind, selects one of the candidate target-language words as its target-language word according to the co-occurrence relevance of a plurality of associated words formed by each candidate target-language word together with the preceding and following words.

Description

Character conversion method and system
Technical field
The present invention relates to a character conversion method, and in particular to a character conversion method and system for handling a source-language word that corresponds to multiple target-language words.
Background art
With the arrival of the global-village era, people today often come into contact with information from all over the world. When faced with material written in an unfamiliar language, however, they must often rely on a language conversion tool to convert it into a familiar language.
Most language conversion tools convert source-language words into the target language by looking them up in a comparison table. When the comparison table fails to reflect the semantic and terminological differences between the languages, however, the conversion result is easily distorted. In addition, a single source-language word can often be converted into several target-language words. Some language conversion tools handle this by requiring the user to choose the target-language word manually, because the tool itself cannot choose automatically. Other tools decide which target-language word to convert to according to how frequently each target-language word occurs. According to statistics, however, this approach easily picks the wrong target-language word and cannot produce highly accurate conversion results.
Summary of the invention
In view of this, the invention provides a character conversion method that is particularly suited to automatically selecting a better conversion result for words with a one-to-many correspondence during text conversion.
The invention also provides a character conversion system that can handle terminological differences between languages and thereby improve the accuracy of text conversion.
The invention proposes a character conversion method for converting a text paragraph in a source language into a target language, where the text paragraph comprises multiple source-language words. The method comprises the following steps: providing a word comparison table that records the word correspondence between the source language and the target language; performing word segmentation on the text paragraph to obtain multiple segmentation results; comparing the segmentation results with the word comparison table to determine whether each source-language word belongs to a first kind or a second kind, where a source-language word of the first kind corresponds to only a single target-language word and a source-language word of the second kind corresponds to a plurality of candidate target-language words; converting each source-language word of the first kind in the text paragraph into its corresponding target-language word according to the word correspondence recorded in the word comparison table; and, for each source-language word of the second kind, selecting one of its candidate target-language words as the target-language word to convert to, according to the co-occurrence relevance of a plurality of associated words formed by each candidate target-language word together with at least one preceding or following word in the text paragraph.
The invention also proposes a character conversion system for converting a text paragraph in a source language into a target language, where the text paragraph comprises multiple source-language words. The system comprises: a storage unit that stores a word comparison table recording the word correspondence between the source language and the target language; a classification unit, coupled to the storage unit, that performs word segmentation on the text paragraph to obtain multiple segmentation results and compares the segmentation results with the word comparison table to determine whether each source-language word belongs to a first kind or a second kind, where a source-language word of the first kind corresponds to only a single target-language word and a source-language word of the second kind corresponds to a plurality of candidate target-language words; a conversion unit, coupled to the storage unit and the classification unit, that converts each source-language word of the first kind in the text paragraph into its corresponding target-language word according to the word correspondence recorded in the word comparison table and, for each source-language word of the second kind, selects one of its candidate target-language words as the target-language word to convert to, according to the co-occurrence relevance of a plurality of associated words formed by each candidate target-language word together with at least one preceding or following word in the text paragraph; and an output unit, coupled to the conversion unit, that outputs the text paragraph converted into the target language.
The invention further proposes a character conversion method for converting text between a source language and a target language. The method comprises: obtaining a source-language word from a text paragraph in the source language; providing a word comparison table that records the word correspondence between the source language and the target language, in which the source-language word corresponds to at least one candidate target-language word; and selecting one of the candidate target-language words as the target-language word to convert to, according to the co-occurrence relevance, in each of a plurality of language data sources, of a plurality of associated words formed by each candidate target-language word together with at least one preceding or following word in the text paragraph.
The invention further proposes a character conversion system for converting text between a source language and a target language. The system comprises: an input unit that obtains a source-language word from a text paragraph in the source language; a storage unit, coupled to the input unit, that provides a word comparison table recording the word correspondence between the source language and the target language, in which the source-language word corresponds to at least one candidate target-language word; a conversion unit, coupled to the input unit and the storage unit, that selects one of the candidate target-language words as the target-language word to convert to, according to the co-occurrence relevance, in each of a plurality of language data sources, of a plurality of associated words formed by each candidate target-language word together with at least one preceding or following word in the text paragraph; and an output unit, coupled to the conversion unit, that outputs the text paragraph converted into the target language.
In summary, when the invention converts a text paragraph and a source-language word corresponds to several candidate target-language words, it selects the most suitable target-language word from those candidates according to the co-occurrence relevance of a plurality of associated words formed by each candidate target-language word together with at least one preceding or following word in the text paragraph, and thereby produces a better conversion result.
To make the above features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a block diagram of a character conversion system according to an embodiment of the invention.
Fig. 2 is a flowchart of a character conversion method according to an embodiment of the invention.
Fig. 3 is a flowchart of converting a source-language word of the second kind according to an embodiment of the invention.
Fig. 4 is a flowchart of converting a source-language word of the second kind according to another embodiment of the invention.
Fig. 5 is a block diagram of a character conversion system according to another embodiment of the invention.
Fig. 6 is a block diagram of a character conversion system according to another embodiment of the invention.
Fig. 7 is a flowchart of a character conversion method according to another embodiment of the invention.
Reference numeral:
100: character conversion system;
110: storage unit;
140: classification unit;
150: conversion unit;
160: output unit;
210~250: steps of the character conversion method according to an embodiment of the invention;
310~330: steps of converting a source-language word of the second kind according to an embodiment of the invention;
410~440: steps of converting a source-language word of the second kind according to another embodiment of the invention;
500: character conversion system;
510: input unit;
520: language model building unit;
530: word comparison table updating unit;
600: character conversion system;
610: input unit;
620: storage unit;
630: conversion unit;
640: output unit;
710~730: steps of the character conversion method according to another embodiment of the invention.
Detailed description of the embodiments
Fig. 1 is a block diagram of a character conversion system according to an embodiment of the invention. Referring to Fig. 1, the character conversion system 100 comprises a storage unit 110, a classification unit 140, a conversion unit 150, and an output unit 160. For example, the character conversion system 100 can be implemented in a mobile phone, a personal digital assistant (PDA), an e-book reader, a mobile Internet device (MID), or any of various computers. The character conversion system 100 can also be embedded in a browser, document processing software, or a website service.
The character conversion system 100 converts a text paragraph in a source language into a target language. For example, it may convert a Simplified Chinese paragraph into Traditional Chinese, a Traditional Chinese paragraph into Simplified Chinese, an English paragraph into Chinese, or a Chinese paragraph into English. The invention does not limit the source and target languages. The text paragraph comprises multiple source-language words (terms); a source-language word can be a single character (word) of the source language, or a word or phrase made up of several characters.
The storage unit 110 is, for example, a hard disk drive (HDD), a solid-state drive (SSD), or a flash memory storage device; the type of the storage unit 110 is not limited here. The storage unit 110 stores the word comparison table that is consulted during conversion. This table records the word correspondence between the source language and the target language.
The classification unit 140 is coupled to the storage unit 110. Using the word comparison table in the storage unit 110, the classification unit 140 determines whether each source-language word in the text paragraph belongs to the first kind or the second kind. A source-language word of the first kind corresponds to only a single target-language word; note that a source-language word and its corresponding target-language word do not necessarily have the same number of characters. A source-language word of the second kind can correspond to multiple candidate target-language words.
The conversion unit 150 is coupled to the storage unit 110 and the classification unit 140. Based on the determination made by the classification unit 140, the conversion unit 150 converts source-language words of different kinds into target-language words in different ways, so as to produce the best conversion result.
To further explain how each unit of the character conversion system 100 operates, another embodiment is described below. Fig. 2 is a flowchart of a character conversion method according to an embodiment of the invention; please refer to Fig. 1 and Fig. 2 together.
First, in step 210, the word comparison table stored in the storage unit 110 is provided. This table records the word correspondence between the source language and the target language. Specifically, the word comparison table records a number of source-language words (each a single character or a phrase made up of several characters) and, for each of them, one or more corresponding target-language words (again single characters or phrases made up of several characters). Note that two mutually corresponding words in the table, one in the source language and one in the target language, do not necessarily have the same number of characters. For example, if the source language is Simplified Chinese and the target language is Traditional Chinese, the Simplified Chinese word for "grapefruit" and the Simplified Chinese word for "bus" each correspond in the table to a Traditional Chinese word with a different number of characters.
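As a minimal sketch of such a table, the Python fragment below maps each source-language word to a list of target-language words. The name `WORD_TABLE`, the helper `lookup`, and the choice of a dictionary are assumptions made for illustration; the entries are taken from the blog example used later in this description (博客 "blogger", 网志 "blog", and 面, which has the two candidates 面 "face/surface" and 麵 "noodle").

```python
# Hypothetical word comparison table: source word -> list of target words.
# A one-element list is a one-to-one entry; a longer list is one-to-many.
WORD_TABLE = {
    "博客": ["部落客"],    # 2 characters -> 3 characters
    "网志": ["部落格"],
    "面": ["面", "麵"],    # one source word, two candidate target words
}


def lookup(word, table=WORD_TABLE):
    """Return the target-language words recorded for a source word."""
    return table.get(word, [])
```

Storing every entry as a list keeps the one-to-one and one-to-many cases in the same shape, so later steps only need to check the list length.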
Then, as shown in step 220, the classification unit 140 performs word segmentation on the text paragraph to obtain a number of segmentation results. In this embodiment, the classification unit 140 performs, for example, bigram or more generally n-gram segmentation, so that every run of two (or n) consecutive characters in the paragraph that contains no punctuation mark is cut out as one segmentation result. The invention does not, however, limit the segmentation algorithm used by the classification unit 140.
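The n-gram segmentation of step 220 can be sketched as follows. The function name `ngram_segments` and the punctuation set in the regular expression are assumptions for illustration; the description does not specify how punctuation is detected.

```python
import re


def ngram_segments(paragraph, n=2):
    """Return every run of n consecutive characters that spans no punctuation."""
    # Split on an assumed (incomplete) set of Chinese and ASCII punctuation.
    runs = re.split(r"[，。！？、；：,.!?;:\s]+", paragraph)
    segments = []
    for run in runs:
        for start in range(len(run) - n + 1):
            segments.append(run[start:start + n])
    return segments
```

For instance, `ngram_segments("博客在网志上面，碗汤面")` yields overlapping pairs such as 博客, 客在, and 上面, but no pair that straddles the comma.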
Next, in step 230, the classification unit 140 compares the segmentation results with the word comparison table in the storage unit 110 to determine whether each source-language word in the text paragraph belongs to the first kind or the second kind. Specifically, if the word comparison table contains an entry that partially or completely matches a source-language word in the paragraph, and that entry corresponds to only one target-language word, the source-language word is judged to belong to the first kind. The search for matching entries can follow a long-word-first principle. For example, after bigram or n-gram segmentation yields a plurality of segmentation results, the results are compared with the word comparison table starting from the longest ones; whenever the table contains an entry that matches the segmentation result under comparison, that result is judged to be a word. Once all segmentation results have been compared, the text of the paragraph is decomposed into a plurality of source-language words according to the words so identified. The decomposition first takes the longest words from the paragraph as source-language words, then takes the next-longest words from the remaining text, and so on, until only single characters remain in the paragraph, each of which is taken as a source-language word.
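A simplified sketch of the decomposition and classification follows. It swaps in left-to-right greedy longest matching, which is not identical to the procedure described above (taking the longest words from the whole paragraph first, then the next-longest from what remains). The function names and the return codes 0, 1, and 2 are assumptions for illustration.

```python
def decompose(text, table):
    """Split punctuation-free text into words, preferring the longest table entry."""
    max_len = max((len(word) for word in table), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest possible piece first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in table:
                words.append(piece)
                i += size
                break
    return words


def classify(word, table):
    """Return 1 (one target word), 2 (several candidates), or 0 (not in table)."""
    candidates = table.get(word, [])
    if len(candidates) == 1:
        return 1
    return 2 if candidates else 0
```

With a table containing 博客, 网志, and 面, the text 博客在网志上面 decomposes into 博客, 在, 网志, 上, 面, and only 面 is classified as the second kind.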
Then, in step 240, the conversion unit 150 converts every source-language word of the first kind in the text paragraph into its corresponding target-language word according to the word correspondence recorded in the word comparison table. Furthermore, the conversion unit 150 can apply the long-word-first principle when converting source-language words of the first kind into target-language words.
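Step 240 amounts to a direct table lookup, as the sketch below shows. It assumes the paragraph has already been decomposed into a list of source-language words and that the table maps each word to a list of target words; `convert_first_kind` is a hypothetical name.

```python
def convert_first_kind(words, table):
    """Replace each word that has exactly one target word; leave the rest as-is."""
    converted = []
    for word in words:
        candidates = table.get(word, [])
        # Words with several candidates (second kind) are deferred to a later step.
        converted.append(candidates[0] if len(candidates) == 1 else word)
    return converted
```

Words of the second kind pass through unchanged here, so a later step can resolve them with the surrounding context already converted.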
Finally, as shown in step 250, for each source-language word of the second kind, the conversion unit 150 selects one of its candidate target-language words as the target-language word to convert to, according to the co-occurrence relevance of a plurality of associated words formed by each candidate target-language word together with at least one preceding or following word in the text paragraph. The detailed operation of the conversion unit 150 is explained later with reference to the drawings.
After the conversion unit 150 has converted each source-language word into its corresponding target-language word, using a different approach depending on whether the word belongs to the first kind or the second kind, the output unit 160 outputs the converted text paragraph for the user to view.
In the following embodiment, assume the source language is Simplified Chinese and the target language is Traditional Chinese. Because Simplified Chinese uses fewer characters than Traditional Chinese, one Simplified Chinese character may correspond to several Traditional Chinese characters, so converting a Simplified Chinese paragraph into Traditional Chinese readily runs into a Simplified Chinese word that corresponds to multiple Traditional Chinese words. For example, suppose the text paragraph that the character conversion system 100 is to convert reads, in translation, "This blogger wrote on the blog that his wife cooked a bowl of noodle soup for him to eat."
First, the classification unit 140 performs word segmentation on the text paragraph. The resulting bigram segmentation results are 这名, 名博, 博客, 客在, 在网, 网志, 志上, 上面, 面写, 写着, ..., 碗汤, 汤面, 面给, 给他, 他吃. The classification unit 140 compares these segmentation results with the word comparison table in the storage unit 110 and determines that, among all the Simplified Chinese words in the paragraph, only 面 belongs to the second kind; all the remaining Simplified Chinese words belong to the first kind. According to the word correspondence recorded in the word comparison table, each Simplified Chinese word of the first kind (the words for "this", "blogger", "at", "blog", "on", "write", "he", "wife", "cook", "bowl", "soup", "give", "eat", and so on) corresponds to exactly one Traditional Chinese word; for example, 博客 ("blogger") corresponds to 部落客 and 网志 ("blog") corresponds to 部落格. The conversion unit 150 can therefore convert the Simplified Chinese words of the first kind directly into their corresponding Traditional Chinese words according to this correspondence. The Simplified Chinese word 面, however, corresponds to two candidate Traditional Chinese words, 面 ("face, surface") and 麵 ("noodle"). The conversion unit 150 therefore determines, for each of the candidates 面 and 麵, the co-occurrence relevance of the associated words it forms with at least one preceding or following word in the paragraph, and then selects from the two candidates the Traditional Chinese word to convert to. In this embodiment, the conversion result produced by the conversion unit 150 reads, in translation, "This blogger wrote on the blog that his wife cooked a bowl of noodle soup for him to eat", with 面 kept in 上面 ("on") and converted to 麵 in the word for "noodle soup".
In the above embodiment, the conversion unit 150 first converts all source-language words of the first kind. Then, for each source-language word of the second kind, it selects one of the candidate target-language words as the target-language word to convert to, according to the co-occurrence relevance of the associated words formed by each candidate target-language word together with the preceding and following words in the paragraph.
Fig. 3 is used below to explain the detailed steps by which the conversion unit 150 converts a source-language word of the second kind into a suitable target-language word. In this embodiment, the conversion unit 150 can use a language model to calculate the co-occurrence relevance of the associated words formed by each candidate target-language word and the surrounding words. The language model is, for example, an n-gram language model, a bigram language model, or any other vocabulary frequency table that records how frequently words occur together.
For convenience, the source-language word of the second kind that the conversion unit 150 is about to process is referred to below as the source-language word to be converted. Referring to step 310 of Fig. 3, the conversion unit 150 uses a language model to calculate, for each candidate target-language word of the source-language word to be converted, the co-occurrence relevance of the associated words that the candidate forms with at least one preceding or following word in the paragraph. Specifically, the conversion unit 150 obtains at least one surrounding word according to the position of the source-language word to be converted in the paragraph (for example the preceding character, the following character, the two preceding characters, the two following characters, and so on); each candidate target-language word and these surrounding words form several associated words. The conversion unit 150 then uses the language model to calculate the co-occurrence relevance of these associated words.
For example, assume the source language is Simplified Chinese, the target language is Traditional Chinese, and the language model used by the conversion unit 150 is an n-gram language model. Take the partially converted paragraph "this blogger wrote on the blog, 部落格上(面) ..., his wife cooked a bowl of noodle soup, 湯(面), for him to eat": each bracketed Simplified Chinese character 面 is a source-language word of the second kind that has not yet been converted, and its candidate target-language words are the two Traditional Chinese words 面 and 麵. To convert the first bracketed 面 into a suitable target-language word, the conversion unit 150 takes at least one surrounding word from the text that precedes it, 這名部落客在部落格上, according to the position of that 面 in the paragraph. Taking the candidate target-language word 面 as an example, the associated words it forms with these surrounding words are 上面, 格上面, 落格上面, ..., 名部落客在部落格上面, and 這名部落客在部落格上面. The conversion unit 150 looks up in the language model the number of occurrences of the word 面, denoted F(面), and the number of occurrences of the associated word 上面, denoted F(上面). Note that a count of 0 means the language model contains no corresponding associated word; in that case the conversion unit 150 sets the count to a default value so that the calculated probability does not become 0. The probability P(上面) that the associated word 上面 occurs in the language model is given by a formula (not reproduced in this text).
Next, the conversion unit 150 looks up in the language model the number of occurrences of the associated word 格上面, denoted F(格上面), and calculates the probability P(格上面) that the associated word 格上面 occurs in the language model by a corresponding formula (not reproduced in this text).
By analogy, the conversion unit 150 calculates the probability values P(上面), P(格上面), ..., P(名部落客在部落格上面), and P(這名部落客在部落格上面), and takes the product of these probability values as the co-occurrence relevance of the associated words formed by the candidate target-language word 面 and the surrounding words.
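The probability formulas referred to above are missing from this text. One plausible reading, assuming each probability is a simple ratio of the two counts the description names, is sketched below for the first associated word 上面 ("above") and the candidate 面 ("face"). The denominator used for the longer associated words cannot be recovered from the text, so only the first formula and the final product are shown. The symbols c (a candidate target-language word), w_1, ..., w_k (its associated words), and R(c) (its co-occurrence relevance) are notation introduced here for illustration.

```latex
P(\text{上面}) = \frac{F(\text{上面})}{F(\text{面})},
\qquad
R(c) = \prod_{i=1}^{k} P(w_i)
```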
Similarly, to determine the co-occurrence relevance of the associated words formed by the other candidate target-language word 麵 and the surrounding words, the conversion unit 150 calculates the probability values P(上麵), P(格上麵), ..., P(名部落客在部落格上麵), and P(這名部落客在部落格上麵), and takes the product of these probability values as the co-occurrence relevance corresponding to the candidate target-language word 麵.
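The calculation for one candidate can be sketched as below. Several points are assumptions, not taken from the description: the language model is a plain dictionary of occurrence counts, only preceding characters are used as context, each probability is the associated word's count divided by the candidate's count, and the default value for a zero count is 0.5. In the example, 面 is the candidate glossed "face" and 麵 the one glossed "noodle".

```python
def associated_words(prefix, candidate, max_context=3):
    """Join the candidate to the last 1..max_context characters before it."""
    longest = min(max_context, len(prefix))
    return [prefix[-k:] + candidate for k in range(1, longest + 1)]


def cooccurrence_relevance(prefix, candidate, counts, default=0.5):
    """Product of P(w) = F(w) / F(candidate) over the associated words w."""
    base = counts.get(candidate, 0) or default
    relevance = 1.0
    for word in associated_words(prefix, candidate):
        found = counts.get(word, 0) or default  # default replaces a zero count
        relevance *= found / base
    return relevance
```

With illustrative counts such as `{"面": 100, "麵": 50, "上面": 40, "格上面": 10}` and the prefix 部落格上, the candidate 面 scores higher than 麵, because 上麵 never occurs in the counts and falls back to the default.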
Then, in step 320, the conversion unit 150 selects, from all the candidate target-language words of the source-language word, the candidate with the highest co-occurrence relevance as the target-language word. Continuing the previous example, if the co-occurrence relevance of the candidate 面 is higher than that of the candidate 麵, the conversion unit 150 selects the candidate 面 as the target-language word.
Finally, as shown in step 330, the conversion unit 150 converts the source-language word in the text paragraph into the selected target-language word.
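Steps 320 and 330 reduce to picking the maximum and substituting it, as in this sketch. It assumes the co-occurrence relevance values have already been computed and are passed in as a dictionary keyed by candidate; the function names are hypothetical.

```python
def choose_target(relevance):
    """Step 320: return the candidate with the highest co-occurrence relevance."""
    return max(relevance, key=relevance.get)


def convert_word(words, index, relevance):
    """Step 330: return a copy of words with position index replaced."""
    converted = list(words)
    converted[index] = choose_target(relevance)
    return converted
```

For example, `convert_word(["湯", "面"], 1, {"面": 1e-6, "麵": 3e-4})` returns `["湯", "麵"]`.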
In another embodiment, to speed up processing, the conversion unit 150 can instead use a bigram language model to calculate the co-occurrence relevance of the associated words formed by each candidate target-language word together with at least one preceding or following word in the text paragraph.
Take again the partially converted paragraph "this blogger wrote on the blog, 部落格上(面) ..., his wife cooked a bowl of noodle soup, 湯(面), for him to eat", in which each bracketed Simplified Chinese character 面 is a source-language word of the second kind that has not yet been converted and whose candidate target-language words are the two Traditional Chinese words 面 and 麵. To convert the first bracketed 面 into a suitable target-language word, the conversion unit 150 takes the surrounding words from the preceding text 這名部落客在部落格上. The conversion unit 150 then calculates the probability values P(上面), P(格上), P(落格), P(部落), ..., P(名部), and P(這名) (each calculated in a manner similar to the previous example) and takes their product as the co-occurrence relevance corresponding to the candidate target-language word 面. The conversion unit 150 likewise calculates the probability values P(上麵), P(格上), P(落格), P(部落), ..., P(名部), and P(這名) and takes their product as the co-occurrence relevance corresponding to the candidate target-language word 麵. The conversion unit 150 then decides which candidate to select as the target-language word by comparing the co-occurrence relevance values of the two candidates.
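The bigram variant can be sketched as a product over adjacent character pairs. The sketch assumes the bigram probabilities are already available in a dictionary and that unseen pairs receive a small default probability; both the name `bigram_relevance` and the default value are illustrative. Here 面 is the candidate glossed "face" and 麵 the one glossed "noodle".

```python
def bigram_relevance(prefix, candidate, bigram_prob, default=1e-6):
    """Multiply the probabilities of every adjacent pair in prefix + candidate."""
    text = prefix + candidate
    relevance = 1.0
    for i in range(len(text) - 1):
        # Unseen pairs get an assumed small default so the product is never 0.
        relevance *= bigram_prob.get(text[i:i + 2], default)
    return relevance
```

Because every pair except the last is shared by both candidates, the comparison is decided by the final pair alone (上面 against 上麵 in the example), which is what makes this variant cheaper to evaluate.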
In general, for a source language word of the second kind in the text paragraph, the conversion unit 150 can follow the steps shown in Fig. 3 to select, from the corresponding candidate target language words, the target language word to convert to. However, when the language model contains too little relevant data, the differences between the co-occurrence relevance values of the candidates may be too small, and several candidates may even have identical co-occurrence relevance. Accordingly, in another embodiment, the conversion unit 150 may follow the steps shown in Fig. 4 to decide which of the candidate target language words to choose as the target language word.
Referring to Fig. 4, step 410 is the same as or similar to step 310 of Fig. 3 and is therefore not described again here.
As shown in step 420, the conversion unit 150 selects, from all the candidate target language words corresponding to the source language word, several candidates with higher co-occurrence relevance, namely those whose co-occurrence relevance is greater than a first threshold. The first threshold may be, for example, any statistic of the co-occurrence relevance values of all the candidates, such as the mean or an upper-percentile value. Thus, when several candidates share the same, highest co-occurrence relevance, all of them are selected as candidates with higher co-occurrence relevance. Alternatively, when several candidates have co-occurrence relevance clearly higher than that of the other candidates, and the differences among their own co-occurrence relevance values are small (for example, less than a second threshold), those candidates are taken as the candidates with higher co-occurrence relevance.
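A minimal sketch of the shortlisting in step 420 is given below. It assumes the mean is used as the first threshold and, as a further assumption not stated in the text, lets a score equal to the threshold pass so that fully tied candidates all survive. The function name `shortlist`, the candidate labels and the scores are hypothetical.

```python
from statistics import mean


def shortlist(scores, second_threshold):
    """Return candidates whose scores are high and close to the best score.

    scores: dict mapping candidate -> co-occurrence relevance.
    A candidate is kept when its score reaches the mean of all scores
    (first threshold) and lies within second_threshold of the best score.
    """
    first_threshold = mean(scores.values())
    best = max(scores.values())
    return [
        word
        for word, score in scores.items()
        if score >= first_threshold and best - score < second_threshold
    ]


scores = {"cand_a": 0.0150, "cand_b": 0.0148, "cand_c": 0.0001}
print(shortlist(scores, second_threshold=0.001))
```

Here `cand_a` and `cand_b` are nearly tied and both pass on to step 430, while `cand_c` is dropped.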
Next, in step 430, the conversion unit 150 uses a dictionary that supports the target language and a reference language to translate each word of each higher-relevance candidate target language word into a corresponding reference language word. Based on the dictionary and the corresponding reference language words, it judges the relevance among the corresponding reference language words of each higher-relevance candidate, and selects the candidate whose corresponding reference language words are most closely related as the target language word.
Finally, as shown in step 440, the conversion unit 150 converts the source language word in the text paragraph into the target language word.
For example, suppose the source language is Simplified Chinese, the target language is Traditional Chinese, and the reference language is English. Take the text paragraph "but she still contentedly (划) moved the oars" as an example: the word "划" in parentheses is a source language word of the second kind that has not yet been converted, and its candidate target language words are "劃" and "划". Following the steps shown in Fig. 4, the conversion unit 150 decides whether the paragraph is to be converted into the version containing "劃" or the version containing "划".
In detail, in this embodiment the conversion unit 150 obtains the words within n characters before and after the position of the source language word in the text paragraph, and combines each candidate target language word with those words to form the higher-relevance candidate target language words. Taking n equal to 3 as an example, the higher-relevance candidates are the two strings "contentedly 劃 moved the oars" and "contentedly 划 moved the oars".
The conversion unit 150 uses a dictionary that supports Traditional Chinese and English to translate each word in the higher-relevance candidate containing "劃" into a corresponding reference language word. For example, it translates "劃" into the two corresponding reference language words "draw" and "scratch", translates "槳" into the corresponding reference language word "oar", and so on. Likewise, using the same dictionary, it translates "划" in the candidate containing "划" into the corresponding reference language word "paddle", translates "槳" into "oar", and so on.
In one embodiment, the conversion unit 150 determines the relevance between the corresponding reference language words according to how frequently each of them appears in the definitions given in the dictionary. For example, in the Traditional Chinese and English dictionary, the word "paddle" appears in the definition of "oar", whereas neither "draw" nor "scratch" appears there. That is, "paddle" occurs in the definition of "oar" more frequently than "draw" or "scratch" does, so the conversion unit 150 judges that the relevance between "paddle" and "oar" is higher than the relevance between "draw" or "scratch" and "oar". Accordingly, the conversion unit 150 chooses to convert the source language word "划" in the text paragraph into the target language word "划" rather than the target language word "劃".
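The definition-frequency comparison can be sketched as below. The sketch assumes the two candidates in the example are 劃 (draw, scratch) and 划 (paddle) and that the neighbouring word is 槳 (oar). The toy dictionary, the definition text and the helper names `definition_hits` and `relevance` are all invented for illustration and do not come from any real dictionary.

```python
# Toy Traditional Chinese/English dictionary; entries and the definition text
# are hypothetical.
TRANSLATIONS = {"劃": ["draw", "scratch"], "划": ["paddle"], "槳": ["oar"]}
DEFINITIONS = {"oar": "a long pole with a flat blade used to paddle or row a boat"}


def definition_hits(word, other):
    """Count how often `word` appears in the dictionary definition of `other`."""
    return DEFINITIONS.get(other, "").split().count(word)


def relevance(candidate, context_word):
    """Sum definition hits over every translation pair of the two words."""
    return sum(
        definition_hits(a, b)
        for a in TRANSLATIONS[candidate]
        for b in TRANSLATIONS[context_word]
    )


best = max(["劃", "划"], key=lambda c: relevance(c, "槳"))
print(best)
```

Because only "paddle" occurs in the toy definition of "oar", the candidate 划 receives the higher relevance.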
In another embodiment, the conversion unit 150 can instead use a semantic tree to calculate the semantic distance between the corresponding reference language words and thereby judge the relevance between them, where a shorter semantic distance indicates a higher relevance. Because calculating the semantic distance between two words with a semantic tree is a common technique in the art, it is not described further here.
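One common way to measure distance in a tree is to count the edges on the path between two nodes through their lowest common ancestor. The sketch below does that on a small hypothetical tree; the tree itself and the node names are invented, and the text does not say which distance measure the embodiment uses.

```python
# Hypothetical semantic tree given as child -> parent links; it is not taken
# from any real lexical resource.
PARENT = {
    "paddle": "boating", "oar": "boating", "boating": "activity",
    "draw": "marking", "scratch": "marking", "marking": "activity",
}


def path_to_root(word):
    """Return the list of nodes from `word` up to the root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def semantic_distance(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    depth_in_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in depth_in_a:
            return depth_in_a[node] + j
    raise ValueError("words are not in the same tree")


print(semantic_distance("paddle", "oar"), semantic_distance("draw", "oar"))
```

In this toy tree "paddle" is closer to "oar" than "draw" is, which matches the outcome of the dictionary-based embodiment.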
Fig. 5 is a block diagram of a text conversion system according to another embodiment of the invention. As shown in Fig. 5, the text conversion system 500 comprises a storage unit 110, a classification unit 140, a conversion unit 150, an output unit 160, an input unit 510, a language model building unit 520, and a word comparison table updating unit 530. Because the storage unit 110, the classification unit 140, the conversion unit 150 and the output unit 160 have the same or similar functions as the corresponding units of the text conversion system 100 shown in Fig. 1, they are not described again here.
In this embodiment, the input unit 510 is coupled to the storage unit 110 and receives the text paragraph in the source language.
The language model building unit 520 is coupled to the storage unit 110. The storage unit 110 stores at least one corpus, which may be an existing parallel corpus or a parallel corpus produced by the text conversion system 500 through automatic mining. The language model building unit 520 builds the language model by training on this corpus. For example, to build an n-gram language model, the language model building unit 520 gathers statistics over the language material in the corpus to produce word frequency information, and uses maximum likelihood estimation (MLE) to estimate the probabilities of the n-gram language model, thereby producing the n-gram language model.
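For the bigram case (n = 2), maximum likelihood estimation reduces to dividing the count of a word pair by the count of its first word as a history. The following sketch shows this on a tiny hypothetical corpus; the function name `train_bigram_mle` and the tokens are illustrative only, and no smoothing is applied.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Estimate P(w2 | w1) = count(w1, w2) / count(w1 used as a history)."""
    pair_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            history_counts[w1] += 1
    return {pair: n / history_counts[pair[0]] for pair, n in pair_counts.items()}


# Each sentence is a list of already segmented words (placeholder tokens).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
model = train_bigram_mle(corpus)
print(model[("a", "b")])
```

In this corpus "a" is followed by "b" in two of its three occurrences as a history, so the estimate for P(b | a) is two thirds.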
Precisely because the language model building unit 520 builds the language model from the relevance between a word and the words before and after it, the text conversion system 500 can, when using the language model to handle one-to-many conversions, select the word with the higher co-occurrence relevance and thus produce a correct and suitable conversion result.
The word comparison table updating unit 530 is coupled to the storage unit 110. It can use the word comparison table already in the storage unit 110 to automatically produce, by web mining, a parallel corpus of the source language and the target language, and update the content of the word comparison table according to that parallel corpus.
Specifically, the word comparison table updating unit 530 obtains a source language data set and a target language data set by web mining. The language material in these data sets may be words, example sentences, text paragraphs, article fragments, whole articles, and so on. Then, according to the comparison table already in the storage unit 110, it finds mutually corresponding source language material and target language material in the source language data set and the target language data set respectively, and uses them to produce the parallel corpus. For example, the word comparison table updating unit 530 takes from each of the two data sets an article that may describe a similar event, and selects from the two articles two example sentences that are similar and may be aligned with each other. It then uses these two example sentences to calculate an alignment probability for the two articles and thereby judges whether the two articles form a high-quality aligned pair. If they do, the two aligned example sentences can serve as one entry in the parallel corpus. In this way the word comparison table updating unit 530 produces the parallel corpus, which is stored in the storage unit 110.
In addition, the word comparison table updating unit 530 can expand the word comparison table according to the content of the parallel corpus. In detail, it finds corresponding words in the mutually aligned source language and target language example sentences stored in the parallel corpus (for example, words that belong to the source language and the target language respectively and that differ when the two sentences are compared are regarded as corresponding words). If a pair of corresponding words found in this way does not yet appear in the word comparison table, the word comparison table updating unit 530 adds it to the table, thereby expanding the word comparison table.
In one embodiment, suppose the source language is Simplified Chinese and the target language is Traditional Chinese. If the number of times the Simplified Chinese word "beer on draft" and the Traditional Chinese word "draft beer" correspond to each other in the parallel corpus reaches a predetermined number (for example, 10), the word comparison table updating unit 530 judges that "beer on draft" and "draft beer" are words that convert to each other. The word comparison table updating unit 530 can build an index (for example, an inverted index) for such mutually converting words. It can then update the content of the word comparison table according to the word correspondences and the index, or automatically build a new word comparison table.
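The counting rule and the inverted index can be sketched as follows. The data layout (a dict from source word to a list of target words), the function name `update_table` and the placeholder word pairs are assumptions made for illustration; only the threshold of 10 comes from the example above.

```python
from collections import Counter, defaultdict

MIN_PAIR_COUNT = 10  # the "predetermined number" used in the example


def update_table(table, aligned_pairs, min_count=MIN_PAIR_COUNT):
    """Add (source, target) pairs seen at least min_count times to table.

    table: dict mapping source word -> list of target words (modified in place).
    Returns an inverted index mapping each target word to its source words.
    """
    counts = Counter(aligned_pairs)
    for (source, target), n in counts.items():
        if n >= min_count and target not in table.get(source, []):
            table.setdefault(source, []).append(target)
    inverted = defaultdict(set)
    for source, targets in table.items():
        for target in targets:
            inverted[target].add(source)
    return inverted


table = {}
pairs = [("src_beer", "tgt_beer")] * 10 + [("src_x", "tgt_y")] * 3
index = update_table(table, pairs)
print(table)
```

The pair seen ten times is added to the table, while the pair seen only three times is ignored.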
A word comparison table updated or built by the word comparison table updating unit 530 reflects the differences in terminology between the source language and the target language and can provide word correspondences in which the numbers of characters differ. This helps ensure that the text conversion system 500 produces better conversion results.
In one embodiment of the invention, when the text conversion system 500 is used in a mobile device such as a mobile phone, PDA or e-book reader, the processor speed, memory size and storage space of the mobile device are all more limited. To speed up text conversion, the language model building unit 520 therefore tries to reduce the data volume of the language model after building it, thereby improving the processing efficiency of the text conversion system 500.
For example, after building the language model in the manner described above, the language model building unit 520 may retain only the sentences that contain error-prone one-to-many words and the sentences that contain words with a higher frequency of occurrence.
In addition, for each retained sentence, the language model building unit 520 can extract only the necessary sentence fragment to further reduce the data volume. For example, centered on a higher-frequency word or a one-to-many word, it takes the shorter sentence fragment formed by the n (for example, 3) characters before and after that word and deletes the words that do not belong to the fragment. Suppose the language model contains a Traditional Chinese sentence meaning "he has just come back from the coal mine six hundred li away", in which "li" (里) is a higher-frequency word. The language model building unit 520 can reduce that sentence in the language model to the fragment made up of "li" and the three characters on either side of it (roughly "from six hundred li away ... coal").
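The fragment extraction can be sketched as a simple window around the key word. The function name `trim_fragment` and the placeholder tokens are hypothetical; the window size n = 3 follows the example above.

```python
def trim_fragment(tokens, key_index, n=3):
    """Keep the key token plus up to n tokens on each side of it."""
    start = max(0, key_index - n)
    return tokens[start:key_index + n + 1]


# Placeholder tokens; "KEY" stands for the high-frequency or one-to-many word.
tokens = ["t0", "t1", "t2", "t3", "KEY", "t5", "t6", "t7", "t8", "t9"]
fragment = trim_fragment(tokens, tokens.index("KEY"))
print(fragment)
```

Tokens more than three positions away from the key word are discarded, and a key word near the start or end of a sentence simply yields a shorter fragment.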
Furthermore, the language model building unit 520 can convert the reduced language model into a binary file to increase the processing speed when the language model is used.
Similarly, to reduce the time spent comparing against and searching the word comparison table, the word comparison table updating unit 530 can apply a hash function to the word comparison table, thereby speeding up comparison.
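As a rough illustration of hash-based lookup, the sketch below stores the table in a Python dict, which is itself a hash table. The table contents, the function name `lookup` and the numeric kind codes are assumptions for illustration, not the embodiment's actual data structure.

```python
# Hypothetical table. A Python dict hashes the key on lookup instead of
# scanning the entries one by one.
WORD_TABLE = {"src_one": ["tgt_one"], "src_many": ["tgt_a", "tgt_b"]}


def lookup(word):
    """Return (kind, candidates).

    kind is 1 for a one-to-one word, 2 for a one-to-many word, and 0 when
    the word is not in the table.
    """
    candidates = WORD_TABLE.get(word, [])
    if not candidates:
        return 0, candidates
    return (1 if len(candidates) == 1 else 2), candidates


print(lookup("src_many"))
```

The same lookup also tells the caller whether a word belongs to the first kind or the second kind, since that depends only on how many candidates the table holds for it.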
Fig. 6 is a block diagram of a text conversion system according to another embodiment of the invention. Referring to Fig. 6, the text conversion system 600 comprises an input unit 610, a storage unit 620, a conversion unit 630, and an output unit 640. The text conversion system 600 can be applied to a mobile phone, personal digital assistant, e-book reader, any kind of computer, or a mobile Internet device. Alternatively, the text conversion system 600 can be embedded in a browser, document processing software, or a website service. The text conversion system 600 converts a text paragraph in a source language into a target language; the source language and the target language are not limited here.
In this embodiment, the input unit 610 obtains a source language word from the text paragraph in the source language.
The storage unit 620 is coupled to the input unit 610. The storage unit 620 is, for example, any storage device such as a hard disk, solid-state drive or flash memory, and provides a word comparison table. The word comparison table records the word correspondences between the source language and the target language, and the source language word corresponds to at least one candidate target language word. Because the word comparison table in the storage unit 620 is the same as or similar to the word comparison table in the storage unit 110 of Fig. 1, it is not described again here.
The conversion unit 630 is coupled to the input unit 610, the storage unit 620 and the output unit 640. The conversion unit 630 refers to several language data sources to decide how to convert the source language word in the text paragraph into a target language word. The output unit 640 then outputs the text paragraph that has been converted into the target language.
In another embodiment, the text conversion system 600 further comprises a communication unit (not shown). The communication unit is coupled to the conversion unit 630 and connects to each language data source through a communication network.
The detailed operation of the text conversion system 600 is described below with reference to Fig. 6 and Fig. 7.
First, as shown in step 710, the input unit 610 obtains a source language word from the text paragraph in the source language. In step 720, the word comparison table recorded in the storage unit 620 is provided. The word comparison table records the word correspondences between the source language and the target language, and the source language word corresponds to at least one candidate target language word.
As shown in step 730, the conversion unit 630 chooses one of the candidate target language words as the target language word to convert to, according to the co-occurrence relevance, in each of several language data sources, of the associated words formed by each candidate target language word corresponding to the source language word and at least one preceding or following word in the text paragraph.
For example, the language data sources may be web pages, online articles, language databases, and so on. The conversion unit 630 can use a language model to calculate, for each language data source, the co-occurrence relevance of the associated words formed by each candidate target language word and at least one preceding or following word in the text paragraph. The language model can be an n-gram language model, a bigram language model, or any other word frequency table that records how often words occur together; it is not limited here. Because the co-occurrence relevance is calculated in a manner similar to the previous embodiments, it is not described again here.
In another embodiment, to obtain the co-occurrence relevance of the associated words in the several language data sources, the conversion unit 630 can use a search engine or a query interface to search the language data sources (web pages, online articles, language databases, and so on), count the number or frequency of occurrences of each associated word, and select the candidate whose associated word has the higher count or frequency as the target language word to convert to.
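The count-based selection can be sketched with the search step abstracted behind a callable. No real search engine API is used here: `count_hits` is a hypothetical stand-in for whatever search engine or query interface is available, and the phrases and counts are invented.

```python
def pick_by_hit_count(phrases, count_hits):
    """Return the candidate whose associated phrase occurs most often.

    phrases: dict mapping candidate -> associated phrase containing it.
    count_hits: any callable that returns an occurrence count for a phrase.
    """
    return max(phrases, key=lambda cand: count_hits(phrases[cand]))


# Stand-in for a real search engine or query interface; counts are invented.
FAKE_COUNTS = {"phrase with cand_a": 1200, "phrase with cand_b": 15}

best = pick_by_hit_count(
    {"cand_a": "phrase with cand_a", "cand_b": "phrase with cand_b"},
    lambda phrase: FAKE_COUNTS.get(phrase, 0),
)
print(best)
```

Keeping the counting function as a parameter lets the same selection logic work whether the counts come from web pages, online articles or a language database.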
The conversion unit 630 selects, from all the candidate target language words, the candidate with the highest co-occurrence relevance as the target language word, and replaces the source language word in the text paragraph with the target language word. The output unit 640 then outputs the text paragraph that has been converted into the target language.
As described above, after receiving a text paragraph in the source language, the text conversion system 600 searches a large number of language data sources over the network, such as relevant web pages, online articles and language databases, and thereby decides which of the at least one candidate target language word corresponding to the source language word should be selected as the target language word, so as to produce a better text conversion result.
It should be noted that although the above embodiments are described with Simplified Chinese as the source language and Traditional Chinese as the target language, the invention is not limited to this. In other embodiments, the source language may be Traditional Chinese and the target language Simplified Chinese, or the source language may be Chinese and the target language English. The invention does not limit the kinds of source language and target language.
In summary, when converting a text paragraph from a source language into a target language, the text conversion method and system of the invention can automatically handle differences in terminology between the languages. For one-to-many word correspondences, they can also automatically and correctly select the most suitable word to convert to, according to the co-occurrence relevance of the associated words formed by each candidate target language word and at least one preceding or following word in the text paragraph. The correctness of converting a text paragraph into a different language can thereby be significantly improved.
Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Any person of ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.

Claims (14)

1. A text conversion method for converting a text paragraph in a source language into a target language, wherein the text paragraph comprises a plurality of source language words, characterized in that the method comprises the following steps:
providing a word comparison table, the word comparison table recording word correspondences between the source language and the target language;
performing word segmentation on the text paragraph to obtain a plurality of word segmentation results;
comparing the word segmentation results with the word comparison table to judge whether each of the source language words belongs to a first kind or a second kind, wherein a source language word of the first kind corresponds to only a single target language word, and a source language word of the second kind corresponds to a plurality of candidate target language words;
converting, according to the word correspondences recorded in the word comparison table, each source language word of the first kind in the text paragraph into the corresponding target language word;
for a source language word of the second kind, using a language model to calculate, for each of the candidate target language words, the co-occurrence relevance of a plurality of associated words formed by that candidate target language word and at least one preceding or following word in the text paragraph;
selecting, from the candidate target language words, a plurality of candidate target language words with higher co-occurrence relevance, wherein the co-occurrence relevance of each of the selected candidate target language words is greater than a first threshold; and
using a dictionary that supports the target language and a reference language to translate each word of each of the candidate target language words with higher co-occurrence relevance into a corresponding reference language word, judging, from the dictionary and the corresponding reference language words, the relevance among the corresponding reference language words of each of the candidate target language words with higher co-occurrence relevance, and selecting the candidate target language word whose corresponding reference language words have the highest relevance as the target language word.
2. The text conversion method according to claim 1, characterized in that the step of judging the relevance among the corresponding reference language words comprises:
determining the relevance among the corresponding reference language words according to the frequency with which each corresponding reference language word appears in a plurality of definitions in the dictionary.
3. The text conversion method according to claim 1, characterized in that it further comprises the following step:
building the language model by training on at least one corpus.
4. The text conversion method according to claim 1, characterized in that it further comprises the following steps:
obtaining a source language data set and a target language data set by web mining;
finding mutually corresponding source language material and target language material in the source language data set and the target language data set respectively;
producing a parallel corpus from the source language material and the target language material; and
expanding the content of the word comparison table according to the parallel corpus.
5. A text conversion system for converting a text paragraph in a source language into a target language, wherein the text paragraph comprises a plurality of source language words, characterized in that the system comprises:
a storage unit for storing a word comparison table, the word comparison table recording word correspondences between the source language and the target language;
a classification unit, coupled to the storage unit, which performs word segmentation on the text paragraph to obtain a plurality of word segmentation results and compares the word segmentation results with the word comparison table to judge whether each of the source language words belongs to a first kind or a second kind, wherein a source language word of the first kind corresponds to only a single target language word, and a source language word of the second kind corresponds to a plurality of candidate target language words;
a conversion unit, coupled to the storage unit and the classification unit, which converts, according to the word correspondences recorded in the word comparison table, each source language word of the first kind in the text paragraph into the corresponding target language word; which, for a source language word of the second kind, uses a language model to calculate, for each of the candidate target language words, the co-occurrence relevance of a plurality of associated words formed by that candidate target language word and at least one preceding or following word in the text paragraph; which selects, from the candidate target language words, a plurality of candidate target language words with higher co-occurrence relevance, wherein the co-occurrence relevance of each of the selected candidate target language words is greater than a first threshold; and which uses a dictionary that supports the target language and a reference language to translate each word of each of the candidate target language words with higher co-occurrence relevance into a corresponding reference language word, judges, from the dictionary and the corresponding reference language words, the relevance among the corresponding reference language words of each of the candidate target language words with higher co-occurrence relevance, and selects the candidate target language word whose corresponding reference language words have the highest relevance as the target language word; and
an output unit, coupled to the conversion unit, for outputting the text paragraph that has been converted into the target language.
6. The text conversion system according to claim 5, characterized in that the system further comprises:
an input unit, coupled to the storage unit, for receiving the text paragraph in the source language.
7. The text conversion system according to claim 5, characterized in that the conversion unit is further configured to determine the relevance among the corresponding reference language words according to the frequency with which each corresponding reference language word appears in a plurality of definitions in the dictionary.
8. The text conversion system according to claim 5, characterized in that the storage unit further stores at least one corpus, and the text conversion system further comprises a language model building unit, coupled to the storage unit, for building the language model by training on the at least one corpus.
9. The text conversion system according to claim 5, characterized in that it further comprises:
a bilingual word comparison table updating unit, coupled to the storage unit, which obtains a source language data set and a target language data set by web mining; finds mutually corresponding source language material and target language material in the source language data set and the target language data set respectively; produces a parallel corpus from the source language material and the target language material; and expands the content of the word comparison table according to the parallel corpus.
10. A text conversion method for performing text conversion between a source language and a target language, characterized in that the method comprises:
obtaining a source language word from a text paragraph in the source language;
providing a word comparison table, the word comparison table recording word correspondences between the source language and the target language, wherein the source language word corresponds to at least one candidate target language word;
using a language model to calculate, for each of the at least one candidate target language word, the co-occurrence relevance, in each of a plurality of language data sources, of a plurality of associated words formed by that candidate target language word and at least one preceding or following word in the text paragraph;
selecting, from the candidate target language words, a plurality of candidate target language words with higher co-occurrence relevance, wherein the co-occurrence relevance of each of the selected candidate target language words is greater than a first threshold;
using a dictionary that supports the target language and a reference language to translate each word of each of the candidate target language words with higher co-occurrence relevance into a corresponding reference language word;
judging, from the dictionary and the corresponding reference language words, the relevance among the corresponding reference language words of each of the candidate target language words with higher co-occurrence relevance, and selecting the candidate target language word whose corresponding reference language words have the highest relevance as the target language word; and
replacing the source language word in the text paragraph with the target language word.
11. The text conversion method according to claim 10, characterized in that the language data sources comprise web pages, online articles and language databases.
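The two-stage selection in claim 10 can be sketched in Python as follows. This is only an illustration under stated assumptions: a table of co-occurrence counts stands in for the language model, each character of a candidate string is treated as one constituent word, and relevance among reference language words is approximated by a lookup in a hypothetical set of related word pairs. None of these names or scoring choices come from the patent.

```python
from itertools import combinations


def cooccurrence_score(candidate, context_words, cooc_counts):
    """Sum the co-occurrence counts of the candidate with each context word.

    Assumption: cooc_counts maps (candidate, context_word) to a count gathered
    from the language data sources; it stands in for the language model.
    """
    return sum(cooc_counts.get((candidate, word), 0) for word in context_words)


def reference_relevance(candidate, dictionary, related_pairs):
    """Fraction of reference-word pairs of the candidate that are related.

    Assumption: each character of the candidate is one constituent word, and
    dictionary maps it to a single reference language word.
    """
    refs = [dictionary[ch] for ch in candidate if ch in dictionary]
    pairs = list(combinations(refs, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs
               if (a, b) in related_pairs or (b, a) in related_pairs)
    return hits / len(pairs)


def choose_target_word(source_word, context_words, word_table, cooc_counts,
                       dictionary, related_pairs, first_threshold):
    """Pick a target language word for source_word, or None if none qualifies."""
    candidates = word_table[source_word]
    # Stage 1: keep candidates whose co-occurrence relevance exceeds the
    # first threshold value.
    shortlisted = [c for c in candidates
                   if cooccurrence_score(c, context_words, cooc_counts)
                   > first_threshold]
    if not shortlisted:
        # The claim does not cover this case; returning None is an assumption.
        return None
    # Stage 2: choose the candidate whose reference language words are most
    # strongly related to one another.
    return max(shortlisted,
               key=lambda c: reference_relevance(c, dictionary, related_pairs))
```

The selected word would then replace the source language word in the text paragraph. Ties in the second stage fall to the earlier candidate in the word comparison table, which is a side effect of `max` rather than anything the claim specifies.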
12. A text conversion system for converting text between a source language and a target language, characterized in that the system comprises:
an input unit, which obtains a source language word from a text paragraph in the source language;
a storage unit, coupled to the input unit, which provides a word comparison table, wherein the word comparison table records the word correspondence between the source language and the target language, and the source language word corresponds to at least one candidate target language word;
a conversion unit, coupled to the input unit and the storage unit, which: uses a language model to calculate, for each of the at least one candidate target language word, the co-occurrence relevance in a plurality of language data sources between that candidate and each of a plurality of related words formed from at least one preceding and following word in the text paragraph; selects, from the candidate target language words, a plurality of candidate target language words with higher co-occurrence relevance, wherein the co-occurrence relevance corresponding to those candidates is greater than a first threshold value; uses a dictionary that supports the target language and a reference language to translate each constituent word of each of the higher co-occurrence relevance candidate target language words into a corresponding reference language word, obtaining each corresponding reference language word from the dictionary; determines the relevance among the corresponding reference language words of each of the higher co-occurrence relevance candidate target language words, so as to select the candidate whose corresponding reference language words have the highest relevance as the target language word; and replaces the source language word with the target language word in the text paragraph; and
an output unit, coupled to the conversion unit, for outputting the text paragraph that has been converted into the target language.
13. The text conversion system according to claim 12, characterized in that the language data sources comprise web pages, online articles and language databases.
14. The text conversion system according to claim 12, characterized in that the system further comprises a communication unit, coupled to the conversion unit, for connecting to the language data sources through a communication network.
CN201010576958.7A 2010-12-02 2010-12-02 Character conversion method and system Active CN102486770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010576958.7A CN102486770B (en) 2010-12-02 2010-12-02 Character conversion method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010576958.7A CN102486770B (en) 2010-12-02 2010-12-02 Character conversion method and system

Publications (2)

Publication Number Publication Date
CN102486770A (en) 2012-06-06
CN102486770B (en) 2014-09-17

Family

ID=46152264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010576958.7A Active CN102486770B (en) 2010-12-02 2010-12-02 Character conversion method and system

Country Status (1)

Country Link
CN (1) CN102486770B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794110B (en) * 2014-01-20 2018-11-23 腾讯科技(深圳)有限公司 Machine translation method and device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN101295298A (en) * 2007-04-23 2008-10-29 株式会社船井电机新应用技术研究所 Translation system, translation program, and bilingual data generation method
CN101707873A (en) * 2007-03-26 2010-05-12 谷歌公司 Large language models in machine translation

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US7191115B2 (en) * 2001-06-20 2007-03-13 Microsoft Corporation Statistical method and apparatus for learning translation relationships among words
AU2003222126A1 (en) * 2002-03-28 2003-10-13 University Of Southern California Statistical machine translation

Also Published As

Publication number Publication date
CN102486770A (en) 2012-06-06

Similar Documents

Publication Publication Date Title
TWI434187B (en) Text conversion method and system
US9424246B2 (en) System and method for inputting text into electronic devices
CN103136352B (en) Text retrieval system based on double-deck semantic analysis
CN102479191B (en) Method and device for providing multi-granularity word segmentation result
US9659002B2 (en) System and method for inputting text into electronic devices
CN100437585C (en) Method for carrying out retrieval hint based on inverted list
CN105468900A (en) Intelligent medical record input platform based on knowledge base
CN104008126A (en) Method and device for segmentation on basis of webpage content classification
CN104199965A (en) Semantic information retrieval method
CN102968411B (en) Multi-lingual mechanical translation intelligence auxiliary process method and system
US20110258202A1 (en) Concept extraction using title and emphasized text
KR101573854B1 (en) Method and system for statistical context-sensitive spelling correction using probability estimation based on relational words
CN102135967A (en) Webpage keywords extracting method, device and system
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN111428494A (en) Intelligent error correction method, device and equipment for proper nouns and storage medium
CN104881397B (en) Abbreviation extended method and device
CN108733745B (en) Query expansion method based on medical knowledge
CN101751430A (en) Electronic dictionary fuzzy searching method
CN104951469A (en) Method and device for optimizing corpus
US20230061731A1 (en) Significance-based prediction from unstructured text
CN105404677A (en) Tree structure based retrieval method
CN109885641A (en) A kind of method and system of database Chinese Full Text Retrieval
CN103064846B (en) Retrieval device and search method
KR20150083961A (en) The method for searching integrated multilingual consonant pattern, for generating a character input unit to input consonants and apparatus thereof
Barla et al. From ambiguous words to key-concept extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant