CN1916888A - Method and system of identifying language of double-byte character set character data - Google Patents
- Publication number
- CN1916888A (CN 200510091971)
- Authority
- CN
- China
- Prior art keywords
- language
- byte
- double
- character set
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
A method for identifying the language of double-byte character set (DBCS) character data comprises: determining a separate feature value set for each candidate language; reading the DBCS character data and accumulating a total count of double-byte characters; comparing the data that was read with the feature value set determined for each candidate language; incrementing the count for a language whenever the data matches a feature value in that language's set; and identifying the data as Korean if the count for Korean is greater than or equal to 10% of the total count, or otherwise identifying it as the other oriental language with the largest count.
Description
Technical field
The present invention relates generally to information processing, and more particularly to a method and system for identifying which oriental language double-byte character set (Double Byte Character Set, DBCS) character data is written in.
Background art
With the development of computer networks and communication technology, use of the Internet and related electronic services has become increasingly common, and information is exchanged through these services ever more frequently between people in different places who speak different natural languages.
However, users of different languages in different places use, in their computer systems, the different character sets that their countries (or regions) have specified for computerized information exchange, for example the ASCII character set, DBCS character sets, and the Unicode character set. In East Asia in particular, local DBCS character sets are in general use, including, for example, the Chinese national standards GB2312-80, GBK, and GB18030-2000; BIG5, established by Taiwan's Institute for Information Industry together with five software companies; S_JIS for Japanese; and KSC for Korean. This gives rise to a number of related technical problems in the prior art. For example, before performing operations that depend on the characteristics of a specific natural language, such as selection, display, or printing, the language used in the file to be operated on must be determined. As another example, users browsing the World Wide Web generally want to enter search strings in their native language and want the search engine or web page to search for or display only results in that language. If the web page or search engine cannot identify the native language of the input, it can only treat the input string as English; the search then usually returns no matches and cannot be carried out correctly. Some web pages instead let the user manually specify the desired language for browsing and searching.
In addition, some natural language processing tools, such as spelling checkers and grammar checkers, also need to know the language of the text to be checked before they can operate correctly.
Some feasible methods and techniques for addressing these problems have appeared, for example the methods of Mozilla and Microsoft, both of which detect the local character set of a text from statistics on how frequently the code points of a particular language occur in it. These methods are relatively complex to implement, and the accuracy of their results is not ideal. Moreover, they all rely on dictionaries intended for western languages. Related patent documents include US6157905 (entitled "IDENTIFYING LANGUAGE AND CHARACTER SET OF DATA REPRESENTING TEXT"), US6704698 (entitled "WORD COUNTING NATURAL LANGUAGE DETERMINATION"), and US6539118 (entitled "SYSTEM AND METHOD FOR EVALUATING CHARACTER SETS OF A MESSAGE CONTAINING A PLURALITY OF CHARACTER SETS"). US6157905 uses a large body of data to build a statistical model of character occurrence frequencies, accumulating data in a "training phase" that serves as the basis for later discrimination. During discrimination, it checks how well the results produced from newly arrived data match the statistics it has built. However, that invention makes its judgment by statistical guessing, and useless data (for example, code points that are commonly used in many languages) gets mixed into its simple frequency statistics and interferes with the accuracy of the result. Nor is its method aimed at any specific character sets, and the method as a whole is relatively complex. US6704698 mainly detects which natural language the data expresses and does not address which character set the data belongs to. It bases its judgment on the common words of various European languages and does not cover the languages of Asian countries such as China, Japan, and Korea. The purpose of US6539118 is to examine text data known to use Unicode or another universal format, determine which natural language it expresses, and then determine which local character set can represent the text it contains. That invention does not address determining which local character set and encoding is used by double-byte text data of unknown encoding.
In short, the prior art lacks, and there is a need for, a method and system that efficiently and accurately identifies the language of DBCS character data that may contain one of several oriental languages.
Summary of the invention
The present invention has been made in view of the above problems.
An object of the present invention is to provide a simple and highly accurate method and system for identifying the language of DBCS character data.
Another object of the present invention is to provide a method that can identify the language of DBCS character data and, in turn, determine the local character set to which the data belongs.
The inventors arrived at the present invention by exploiting the distinctive features of the various oriental languages themselves, especially in how they are input and stored in a computer. That is, in the local DBCS character sets each language has some unique feature values that are either not defined or rarely used in the local character sets of the other languages. For example, when a piece of character data contains a feature value unique to Simplified Chinese, it is very unlikely to be Japanese or any other language, because that feature value is essentially undefined or essentially unused in Japanese and the other languages. This is the central idea the present invention uses to identify the language of DBCS character data.
According to one aspect of the present invention, there is provided a method for identifying, in a computer, the language of DBCS character data that may be in one of a set of candidate languages comprising Korean and a plurality of other oriental languages, the method comprising the following steps:
a. determining a feature value set for each candidate language, wherein the feature values so determined are not defined or are rarely used in the local character sets corresponding to the other candidate languages;
b. reading the DBCS character data to be identified and accumulating a total count of double-byte characters;
c. comparing the DBCS character data with the feature value set of each candidate language and, whenever the DBCS character data matches one of the feature values in the feature value set of one of the candidate languages, incrementing the count for that candidate language;
d. if the count for Korean is greater than 10% of the total count, identifying the DBCS character data to be identified as Korean; otherwise, identifying it as the language, among the other oriental languages, that has the largest count.
According to another aspect of the present invention, there is also provided a system for identifying, in a computer, the language of DBCS character data that may be in one of a set of candidate languages comprising Korean and a plurality of other oriental languages, the system comprising:
- a storage unit for storing the feature value set determined for each candidate language, wherein the feature values so determined are not defined or are rarely used in the local character sets corresponding to the other candidate languages;
- a reading unit for reading the DBCS character data to be identified;
- a total count unit for accumulating a total count of the double-byte characters in the DBCS character data that was read;
- a comparing unit for comparing the DBCS character data that was read with the feature value set of each candidate language and outputting a comparison result;
- a plurality of accumulators, one for each candidate language, wherein, when the output comparison result shows that the DBCS character data matches one of the feature values in the feature value set of one of the candidate languages, the accumulator corresponding to that candidate language increments its count by 1;
- a language identification unit for identifying the language of the DBCS character data from the total count in the total count unit and the counts in the accumulators corresponding to the candidate languages, wherein, if the count for Korean is greater than 10% of the total count, the DBCS character data to be identified is identified as Korean, and otherwise it is identified as the language, among the other oriental languages, that has the largest count.
Thus, the method and system according to the present invention can simply and efficiently determine the language of local DBCS character data that carries no additional information identifying the language used, the character set, or the font. Compared with existing methods in the prior art, the method of the present invention is simpler to implement, more accurate, requires fewer steps, and consumes fewer resources.
Other aspects and/or advantages of the present invention will be set forth in part in the description that follows and, in part, will be obvious from the description or may be learned by practice of the invention.
Brief description of the drawings
These and/or other aspects and advantages of the present invention will become apparent and more readily understood from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram showing the structure of a system for identifying the language of DBCS character data according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for identifying the language of DBCS character data according to an embodiment of the present invention; and
Fig. 3 is a flowchart of a method for identifying the language of DBCS character data according to another embodiment of the present invention.
Detailed description of the embodiments
The embodiments are now described with reference to the drawings in order to explain the present invention.
In the following detailed description, it is assumed that all DBCS character data to be identified follows the linguistic rules of the respective oriental language (such as Simplified Chinese, Traditional Chinese, Japanese, or Korean) and is correctly delimited with spaces and punctuation marks where applicable. Abnormal data, such as garbled text entered at random by a user, is not considered.
As mentioned above, in the local DBCS character sets each language has some unique feature values. In particular, Korean puts a single-byte space between words, whereas Simplified Chinese, Traditional Chinese, and Japanese do not. This unique feature of Korean can serve as the key to distinguishing Korean from the other East Asian languages. If DBCS text is separated by spaces and the number of spaces amounts to 10-15% of the number of DBCS characters, the text can be taken to be Korean. Another distinctive feature of Korean is that it uses single-byte punctuation marks, whereas the other languages generally use double-byte punctuation marks. So if the double-byte data in a piece of DBCS character data is delimited entirely by single-byte punctuation, the data is most likely Korean. In some cases text in languages other than Korean may also use single-byte punctuation marks, but only in Korean are single-byte punctuation marks the normal convention.
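The space-ratio heuristic just described can be illustrated with a rough Python sketch. The function name `space_ratio`, the sample sentences, and the use of Python's built-in `cp949` and `gbk` codecs are assumptions made for illustration and do not come from the patent; the lead-byte test (high bit set) is a deliberate simplification rather than a full parser for any one character set.

```python
def space_ratio(data: bytes) -> float:
    """Return (single-byte spaces) / (double-byte characters) for DBCS data."""
    spaces = 0
    double_byte_chars = 0
    i = 0
    while i < len(data):
        if data[i] >= 0x80 and i + 1 < len(data):
            # High bit set: treat this byte and the next as one double-byte character.
            double_byte_chars += 1
            i += 2
        else:
            if data[i] == 0x20:
                spaces += 1
            i += 1
    return spaces / double_byte_chars if double_byte_chars else 0.0


# Hypothetical sample sentences, encoded with Python's built-in codecs.
korean = "나는 학교에 갑니다".encode("cp949")   # words separated by spaces
chinese = "我今天去学校上课".encode("gbk")       # no spaces between words

print(space_ratio(korean))    # 2 spaces / 8 characters = 0.25
print(space_ratio(chinese))   # 0.0
```

In this toy example the Korean sentence lies well above the 10% level while the Chinese sentence contains no spaces at all; real text would of course need a much larger sample.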
The most important punctuation marks in ordinary sentences are the comma and the full stop, and the local values of the DBCS punctuation marks of Simplified Chinese, Traditional Chinese, and Japanese differ from one another and are unique. Thus, if a piece of DBCS character data uses the feature code points of double-byte punctuation marks, then, because those feature code points are seldom used in the other character sets, the language of the data and the character set it uses can be determined from them. (Note that although the double-byte punctuation code points of Korean are the same as those of Simplified Chinese, Korean has the distinguishing feature of separating words with spaces.) For example, if a piece of character data contains the local value of the Simplified Chinese comma or full stop, it can hardly be Japanese or Traditional Chinese content, because that local value is essentially undefined or rarely used in the Japanese and Traditional Chinese local character sets. This is the central idea for identifying Simplified Chinese, Traditional Chinese, and Japanese. It follows that selecting suitable feature values for each language and determining the corresponding feature value sets is extremely important.
The selection of feature values (also called feature code points) for each language and the determination of the sets are illustrated below.
For example, the feature value set determined for Korean (in KSC) comprises the following single-byte code points: 0x20 (space), 0x21 (exclamation mark), 0x2c (comma), 0x2e (full stop), and 0x3f (question mark). In Korean, the space and the other single-byte punctuation marks are used as symbols that separate double-byte characters. They are essentially unused in the local character sets of the other languages.
The feature value set determined for Simplified Chinese (in GB) comprises, for example, the following double-byte code points: 0xa3ac (fullwidth comma), 0xa1a2 (ideographic comma), 0xa1a3 (ideographic full stop), 0xa3a1 (fullwidth exclamation mark), and 0xa3bf (fullwidth question mark). Table 1 below lists how these double-byte values are defined and used in the other languages, where <Uxxxx> denotes the corresponding Unicode code point.
Table 1: Definition and usage of the Simplified Chinese feature values in the other languages
Code point | Traditional Chinese (Windows-950) | Japanese | Korean (Windows-949) |
0xA3AC | <U311B>, phonetic symbol, rarely used | Not defined | <UFF0C>, double-byte symbol, rarely used |
0xA1A2 | <UFE5C>, phonetic symbol, rarely used | Not defined | <U3001>, double-byte symbol, rarely used |
0xA1A3 | <UFE5D>, phonetic symbol, rarely used | Not defined | <U3002>, double-byte symbol, rarely used |
0xA3A1 | <U3110>, phonetic symbol, rarely used | Not defined | <UFF01>, double-byte symbol, rarely used |
0xA3BF | <U02CB>, rarely used | | <UFF1F>, double-byte symbol, rarely used |
The feature value set determined for Traditional Chinese (in BIG5) comprises, for example, the following double-byte code points: 0xa141 (fullwidth comma), 0xa142 (ideographic comma), 0xa144 (fullwidth full stop), 0xa143 (ideographic full stop), 0xa149 (fullwidth exclamation mark), and 0xa148 (fullwidth question mark). Table 2 below lists how these double-byte values are defined and used in the other languages, where <Uxxxx> denotes the corresponding Unicode code point.
Table 2: Definition and usage of the Traditional Chinese feature values in the other languages
Code point | Simplified Chinese (Windows-936) | Japanese | Korean (Windows-949) |
0xA141 | <UE4C7>, rarely used | Not defined | <UC8A5>, extension area, rarely used |
0xA142 | <UE4C8>, duplicate of <U6DD9> | Not defined | <UC8A6>, extension area, rarely used |
0xA144 | <UE4CA>, duplicate of <U6DAB> | Not defined | <UC8A9>, extension area, rarely used |
0xA143 | <UE4C9>, duplicate of <U6E16> | Not defined | <UC8A7>, extension area, rarely used |
0xA149 | <UE4CF>, duplicate of <U6E4E> | Not defined | <UC8AE>, extension area, rarely used |
0xA148 | <UE4CE>, duplicate of <U6E6E> | Not defined | <UC8AD>, extension area, rarely used |
The feature value set determined for Japanese (in S_JIS) comprises, for example, the following double-byte code points: 0x8141 (ideographic comma), 0x8142 (ideographic full stop), 0x8149 (fullwidth exclamation mark), and 0x8148 (fullwidth question mark). Table 3 below lists how these double-byte values are defined and used in the other languages, where <Uxxxx> denotes the corresponding Unicode code point.
Table 3: Definition and usage of the Japanese feature values in the other languages
Code point | Simplified Chinese (Windows-936) | Traditional Chinese (Windows-950) | Korean (Windows-949) |
0x8141 | <U4E04>, rarely used | <UEEB9>, duplicate of <U8D6F> | <UAC02>, extension area, rarely used |
0x8142 | <U4E05>, rarely used | <UEEBA>, duplicate of <U8E4E> | <UAC03>, extension area, rarely used |
0x8149 | <U4E21> | <UEEC1>, duplicate of <U8F40> | <UAC0F>, extension area, rarely used |
0x8148 | <U4E20> | Not defined | <UAC0E>, extension area, rarely used |
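As a quick cross-check of the double-byte feature values listed above, the following sketch encodes the punctuation marks with Python's built-in `gb2312`, `big5`, `shift_jis`, and `euc_kr` codecs and prints the resulting byte values. The helper name `code_points` and the choice of codecs are assumptions for illustration; the patent itself names the character sets only as GB, BIG5, S_JIS, and KSC.

```python
# The punctuation marks used as feature values, by Unicode code point.
FULLWIDTH_COMMA = "\uff0c"
IDEOGRAPHIC_COMMA = "\u3001"
IDEOGRAPHIC_FULL_STOP = "\u3002"
FULLWIDTH_FULL_STOP = "\uff0e"
FULLWIDTH_EXCLAMATION = "\uff01"
FULLWIDTH_QUESTION = "\uff1f"


def code_points(chars, codec):
    """Encode each character with the given codec and return the bytes as hex strings."""
    return [ch.encode(codec).hex() for ch in chars]


gb_marks = [FULLWIDTH_COMMA, IDEOGRAPHIC_COMMA, IDEOGRAPHIC_FULL_STOP,
            FULLWIDTH_EXCLAMATION, FULLWIDTH_QUESTION]
big5_marks = [FULLWIDTH_COMMA, IDEOGRAPHIC_COMMA, FULLWIDTH_FULL_STOP,
              IDEOGRAPHIC_FULL_STOP, FULLWIDTH_EXCLAMATION, FULLWIDTH_QUESTION]
sjis_marks = [IDEOGRAPHIC_COMMA, IDEOGRAPHIC_FULL_STOP,
              FULLWIDTH_EXCLAMATION, FULLWIDTH_QUESTION]

gb = code_points(gb_marks, "gb2312")
big5 = code_points(big5_marks, "big5")
sjis = code_points(sjis_marks, "shift_jis")
kr = code_points(gb_marks, "euc_kr")   # Korean shares the GB punctuation values

print(gb)    # ['a3ac', 'a1a2', 'a1a3', 'a3a1', 'a3bf']
print(big5)  # ['a141', 'a142', 'a144', 'a143', 'a149', 'a148']
print(sjis)  # ['8141', '8142', '8149', '8148']
print(kr == gb)  # True
```

The last line reflects the earlier remark that the Korean double-byte punctuation code points coincide with the Simplified Chinese ones, which is why Korean is recognized by its single-byte spaces and punctuation instead.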
Table 4 below shows more intuitively how the above feature code points are used in each character set, where black indicates frequent use and grey indicates rare use.
Table 4: Usage of the selected feature code points in each character set
The above selection of feature values (feature code points) is merely exemplary, and the present invention is not limited to it in selecting feature values for the various languages. For example, in some special situations the selection of feature code points can be adapted to the particular data environment. If, say, a batch of input data is known to contain certain other special characters, those characters can be added as feature code points.
Nor is the selection of feature values limited to the local character sets given in the examples above; it can be extended to other character sets. For example, EUCJP on Japanese Unix systems uses an encoding different from S_JIS, in which the punctuation marks and the Japanese ideograph portion are encoded the same way as in Simplified Chinese GB and Korean KSC. In that case, the differing use of punctuation marks can still be exploited, together with whether kana are used, to decide whether the text is Chinese, Japanese, or Korean. In addition, the three Chinese encodings GB2312-80, GBK, and GB18030-2000 are all essentially extensions of GB2312, so the method can be used with all of them.
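One way the kana check mentioned above might look is sketched below. It assumes, as an illustration only, that kana occupy lead bytes 0xA4 (hiragana) and 0xA5 (katakana) as they do in EUC-JP; the function name `kana_ratio` and the sample sentences are hypothetical, and the sketch ignores the EUC-JP single-shift sequences (0x8E and 0x8F prefixes).

```python
def kana_ratio(data: bytes) -> float:
    """Fraction of double-byte characters in EUC-style data that are kana.

    Assumption: kana sit in rows 4 and 5 of the code table (lead bytes
    0xA4 and 0xA5), as in EUC-JP. Single-shift sequences are not handled.
    """
    kana = 0
    double_byte_chars = 0
    i = 0
    while i + 1 < len(data):
        if data[i] >= 0xA1:
            # Lead byte of a double-byte character; the next byte is its trail byte.
            double_byte_chars += 1
            if data[i] in (0xA4, 0xA5):
                kana += 1
            i += 2
        else:
            i += 1
    return kana / double_byte_chars if double_byte_chars else 0.0


# Hypothetical samples encoded with Python's built-in codecs.
japanese = "これはペンです。".encode("euc_jp")
chinese = "这是一支笔。".encode("gb2312")

print(kana_ratio(japanese))  # 7 kana out of 8 characters = 0.875
print(kana_ratio(chinese))   # 0.0
```

A high kana ratio would point towards Japanese, while a ratio near zero leaves Chinese and Korean to be separated by the punctuation and space features described earlier.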
A system for identifying the language of DBCS character data according to an embodiment of the present invention is described in detail below with reference to Fig. 1. In Fig. 1, system 100 comprises a storage unit 101, a reading unit 102, a total count unit 103, a comparing unit 104, a plurality of accumulators 105, and a language identification unit 106.
The storage unit 101 stores the feature value set determined for each candidate language, for example, as described above: for Korean, 0x20 (space), 0x21 (exclamation mark), 0x2c (comma), 0x2e (full stop), and 0x3f (question mark); for Simplified Chinese, 0xa3ac (fullwidth comma), 0xa1a2 (ideographic comma), 0xa1a3 (ideographic full stop), 0xa3a1 (fullwidth exclamation mark), and 0xa3bf (fullwidth question mark); for Traditional Chinese, 0xa141 (fullwidth comma), 0xa142 (ideographic comma), 0xa144 (fullwidth full stop), 0xa143 (ideographic full stop), 0xa149 (fullwidth exclamation mark), and 0xa148 (fullwidth question mark); and for Japanese, 0x8141 (ideographic comma), 0x8142 (ideographic full stop), 0x8149 (fullwidth exclamation mark), and 0x8148 (fullwidth question mark).
The reading unit 102 reads the DBCS character data to be identified byte by byte and supplies the data it reads to the total count unit 103 and the comparing unit 104.
The total count unit 103 receives the DBCS character data supplied to it and keeps a running count of the double-byte characters.
The comparing unit 104 compares the character data supplied by the reading unit 102 against each feature value in the feature value set of every language stored in the storage unit 101 and outputs the match results to the accumulators 105. Specifically, the comparing unit 104 examines the single-byte character data supplied by the reading unit 102; when it finds that the data equals the single-byte space value or one of the other single-byte punctuation values in the Korean feature value set, it outputs the match result to the accumulator corresponding to Korean among the accumulators 105, so that this accumulator can keep separate counts for the single-byte space value and for the other single-byte punctuation values. Because the single-byte space and punctuation marks used in Korean serve to separate double-byte characters, comparing against the Korean single-byte space and punctuation involves more than checking whether the single-byte character data itself equals the selected feature value: the data immediately before and after the single-byte character must also be double-byte character data that is meaningful in Korean. Only when all of these conditions are met is a match result output to the accumulator corresponding to Korean. At the same time, the comparing unit 104 also examines each pair of consecutive bytes supplied by the reading unit 102, that is, double-byte character data, to check whether it is a feature value in the feature value sets determined for the languages other than Korean. When the pair equals a feature value in the feature value set of Simplified Chinese, Traditional Chinese, or Japanese, the comparing unit 104 outputs the match result to the accumulator corresponding to that language, so that the accumulator can count the number of times the feature value occurs. Methods for recognizing double-byte characters are mature and common to these character sets. In the simplest case the decision can be made from the high bit of the first byte, and other methods can be combined with it.
Thus, the system 100 of this embodiment can collect all the features of the input DBCS character data in a single pass and obtain the number of times each feature code point occurs.
Finally, the language identification unit 106 identifies the language of the DBCS character data from the total count in the total count unit 103 and the counts in the respective accumulators 105. The language identification unit 106 first checks whether the count of Korean single-byte spaces is greater than or equal to 10% of the total count of double-byte characters; if so, it identifies the DBCS character data as Korean. In most cases, if the DBCS character data is Korean text, the ratio of single-byte spaces to double-byte characters is greater than 20%. If the ratio is less than 10%, the language identification unit 106 finds the largest count among the accumulators 105 that correspond to the languages other than Korean and identifies the DBCS character data as the corresponding language. Preferably, if there is no largest count, the language identification unit 106 then checks whether the count of the other Korean single-byte punctuation marks is greater than 0, and if it is, identifies the DBCS character data as Korean. If all of these checks fail, the DBCS character data essentially contains no double-byte text, or it contains double-byte data that does not follow the writing conventions of natural-language text, for example randomly generated code point data.
Fig. 2 is a flowchart of a method for identifying the language of DBCS character data according to an embodiment of the present invention. Referring to Fig. 2, in step 201 a feature value set is determined for each candidate oriental language that the DBCS character data may be in. The determination of the feature value sets has been described in detail above and is not repeated here.
In step 202, the DBCS character data to be identified is read and a total count of double-byte characters is accumulated. Then, in step 203, the DBCS character data that was read is compared one by one with the feature values in each feature value set, and whenever the DBCS character data matches a feature value in one of the sets, the count for the corresponding candidate language is incremented.
In step 204, it is determined whether the count for Korean is greater than or equal to 10% of the total count of double-byte characters. If so, processing proceeds to step 205, where the DBCS character data that was read is identified as Korean, and processing ends. If it is determined in step 204 that the count for Korean is less than 10% of the total count of double-byte characters, processing proceeds to step 206, where the DBCS character data that was read is identified as the language with the largest count among the other languages, and processing ends.
Fig. 3 shows the flow of a method for identifying the language of DBCS character data according to another embodiment of the present invention. As in steps 201 and 202 of the method of Fig. 2, in step 301 a feature value set is determined for each candidate oriental language that the DBCS character data may be in, and in step 302 the DBCS character data to be identified is read and a total count of double-byte characters is accumulated.
Then, in step 303, the DBCS character data that was read is compared one by one with the feature values in each feature value set, and whenever the DBCS character data matches a feature value in one of the sets, the count for that feature value of the corresponding candidate language is incremented.
In step 304, it is determined whether the count of Korean single-byte spaces is greater than or equal to 10% of the total count of double-byte characters. If so, processing proceeds to step 305, where the DBCS character data that was read is identified as Korean, and processing ends.
If it is determined in step 304 that the count of Korean single-byte spaces is less than 10% of the total count of double-byte characters, processing proceeds to step 306, where it is further determined whether a maximum exists among the counts for the oriental languages other than Korean. If so, processing proceeds to step 307, where the DBCS character data that was read is identified as the language with the largest count among the other languages, and processing ends; otherwise, processing proceeds to step 308.
In step 308, it is determined whether the count of Korean single-byte punctuation marks is greater than 0. If so, processing proceeds to step 305, where the DBCS character data that was read is identified as Korean, and processing ends. Otherwise, identification is considered to have failed; in that case, the DBCS character data may essentially contain no double-byte text, or it may contain double-byte data that does not follow the writing conventions of natural-language text.
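The decision steps 304 to 308 can be sketched as a small Python function that works on counts which have already been accumulated. The function name `identify` and its parameters are hypothetical. Two points are assumptions of this sketch and are not stated in the patent: "a maximum exists" is interpreted as a unique, non-zero largest count, and data with no double-byte characters at all is never reported as Korean by the space test.

```python
from typing import Optional


def identify(total: int, korean_space: int, korean_punct: int,
             other_counts: dict) -> Optional[str]:
    """Apply the decisions of steps 304-308 to already accumulated counts."""
    # Step 304: Korean if single-byte spaces are at least 10% of the double-byte total.
    # (Assumption: with total == 0 this test is skipped.)
    if total > 0 and korean_space * 10 >= total:
        return "korean"
    # Step 306: does a maximum exist among the other languages?
    # (Assumption: it must be non-zero and held by exactly one language.)
    best = max(other_counts.values(), default=0)
    leaders = [lang for lang, count in other_counts.items() if count == best]
    if best > 0 and len(leaders) == 1:
        return leaders[0]          # step 307
    # Step 308: fall back on Korean single-byte punctuation.
    if korean_punct > 0:
        return "korean"
    return None                    # identification failed


empty = {"simplified_chinese": 0, "traditional_chinese": 0, "japanese": 0}
print(identify(8, 2, 0, empty))                                   # korean
print(identify(100, 3, 0, {**empty, "simplified_chinese": 7}))    # simplified_chinese
print(identify(100, 3, 2, empty))                                 # korean
print(identify(100, 0, 0, empty))                                 # None
```

The simpler flow of Fig. 2 corresponds to stopping after the step 306/307 test and omitting the punctuation fallback.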
Although in each of the embodiments above the identification stage that follows counting starts with Korean, the present invention is not limited to this. Identification can instead start by checking the largest count among the other languages, for example when it is known before identification that the DBCS character data is very unlikely to be Korean.
Once the language of the DBCS character data has been identified, the local character set to which the data belongs has essentially been determined as well, because, as noted above, Simplified Chinese generally corresponds to the GB family of character sets, Traditional Chinese to BIG5, Japanese to S_JIS, and Korean to KSC.
To determine exactly which local character set the DBCS character data belongs to once its language has been identified, the present invention can further be combined with other existing methods that determine the character set by checking whether all of the DBCS character data consists of characters defined in the candidate character set. For example, when the data has been identified as belonging to a Chinese character set, because GBK is an extension of GB2312-80 and GB18030-2000 is an extension of GBK, the character set to which the data belongs can be determined from whether the data contains code ranges from the extensions.
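A simple way to approximate this check in Python is to try decoding the data with progressively larger character sets and report the first one that accepts every byte. This is only a sketch of the idea: the function name `narrowest_gb_charset` is hypothetical, and relying on Python's built-in `gb2312`, `gbk`, and `gb18030` codecs is an assumption made here, not something the patent prescribes.

```python
from typing import Optional


def narrowest_gb_charset(data: bytes) -> Optional[str]:
    """Return the first of gb2312, gbk, gb18030 that decodes the data without error."""
    for codec in ("gb2312", "gbk", "gb18030"):
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue   # data uses code ranges outside this character set
        return codec
    return None        # not valid in any of the three


# Hypothetical samples: plain GB2312 text, a character found only in GBK and
# later, and a supplementary-plane character that needs GB18030's 4-byte form.
print(narrowest_gb_charset("中文".encode("gb2312")))
print(narrowest_gb_charset("朱镕基".encode("gbk")))
print(narrowest_gb_charset("\U00020000".encode("gb18030")))
```

Because each character set extends the previous one, the order of the codecs matters: the narrowest set that accepts all of the data is reported.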
Therefore, according to the present invention, not only can the language used in DBCS character data be identified simply and accurately, but the local character set to which the data belongs can also be determined accurately.
Thus, using the present invention brings a number of advantages. On the server side, it enables the server to judge and process input data accurately, to serve multilingual clients, and to provide simple and efficient service, thereby improving server performance. On the client side, it spares the user from selecting a language manually, which is more convenient for the user. In addition, it enables the client to process multilingual data programmatically.
Although the hardware configuration of a system implementing an embodiment of the present invention has been illustrated by way of example with reference to Fig. 1, the system of the present invention is not limited to this hardware configuration. Rather, each functional unit can be further subdivided or combined with other functional units, as long as the functions required of the system as a whole are realized. For example, the total counter 103 can be combined with the plurality of totalizers 105 to form a single accumulating counting unit.
In addition, where the boundary between hardware and function lies when implementing the present invention is entirely a matter of routine design choice for the designer. For example, the reading, comparing, counting, and identification functions can easily be performed by a single piece of hardware (for example, an integrated circuit (IC)) or by multiple cooperating ICs. These functions can be performed with any combination of the three general platform types: hardware, software, and firmware. Similarly, the hardware and/or functional boundaries between the storage unit, reading unit, comparing unit, total counter, plurality of totalizers, and language identification unit in the embodiment shown in Fig. 1 are merely illustrative.
Although several embodiments of the present invention have been shown and described, those skilled in the art should understand that changes may be made to these embodiments without departing from the principles and spirit of the present invention, the scope of which is defined by the claims and their equivalents. Many changes and modifications to the exemplary embodiments fall within the scope of the present invention.
Claims (10)
1. A method for identifying, in a computer system, the language of double-byte character set character data that may be in one of a plurality of candidate languages comprising Korean and a plurality of other oriental languages, the method comprising the following steps:
a. determining a characteristic value set for each candidate language, wherein each determined characteristic value is either not defined or not commonly used in the local character sets corresponding to the other candidate languages;
b. reading the double-byte character set character data to be identified and accumulating a total count of double bytes;
c. comparing the double-byte character set character data with the characteristic value set of each candidate language, and, when an item of the double-byte character set character data matches one of the characteristic values in the characteristic value set of one of the candidate languages, incrementing the count for the corresponding candidate language;
d. if the count corresponding to Korean is greater than or equal to 10% of the total count, identifying the double-byte character set character data to be identified as Korean; otherwise, identifying it as the language with the largest count among the other oriental languages.
2. the method for claim 1, wherein said other multiple oriental language is respectively simplified Chinese character, traditional Chinese and Japanese.
3. method as claimed in claim 1 or 2, wherein the eigenwert in the characteristic value collection of determined Korean comprises byte space value 0x20, and when being complementary with respect to double-byte character set character data to be identified and this byte space value and counting totally greater than tale 10% the time, described double-byte character set character data to be identified is identified as Korean.
4. method as claimed in claim 3, wherein the eigenwert in the characteristic value collection of determined Korean also comprises following byte punctuation mark value: exclamation mark 0x21, comma 0x2c, fullstop 0x2e and question mark 0x3f, when the recognition failures of described steps d, this method also comprises step e, be used for judging when being complementary with respect to double-byte character set character data to be identified and described byte punctuation mark value and the counting of accumulative total whether greater than 0, if then described double-byte character set character data to be identified is identified as Korean.
5. as claim 2 or 4 described methods, also be included in and identify double-byte character set character data, by checking whether all double-byte character set character datas are all by forming the step of determining affiliated local character set with the character of the corresponding local character centralized definition of this kind language for behind which kind of language.
6. method as claimed in claim 5 is GBK with the corresponding local character set of simplified Chinese character wherein, with the corresponding local character set of traditional Chinese be BIG5, with the corresponding local character set of Japanese be S_JIS, with the corresponding local character set of Korean be KSC.
7. The method of claim 2, wherein the characteristic values in the determined characteristic value set of Simplified Chinese comprise the following double-byte punctuation mark values: fullwidth comma 0xa3ac, ideographic comma 0xa1a2, ideographic full stop 0xa1a3, fullwidth exclamation mark 0xa3a1, and fullwidth question mark 0xa3bf.
8. The method of claim 2, wherein the characteristic values in the determined characteristic value set of Traditional Chinese comprise the following double-byte punctuation mark values: fullwidth comma 0xa141, ideographic comma 0xa142, fullwidth full stop 0xa144, ideographic full stop 0xa143, fullwidth exclamation mark 0xa149, and fullwidth question mark 0xa148.
9. The method of claim 2, wherein the characteristic values in the determined characteristic value set of Japanese comprise the following double-byte punctuation mark values: ideographic comma 0x8141, ideographic full stop 0x8142, fullwidth exclamation mark 0x8149, and fullwidth question mark 0x8148.
10. A system for identifying, in a computer system, the language of double-byte character set character data that may be in one of a plurality of candidate languages comprising Korean and a plurality of other oriental languages, the system comprising:
a storage unit, for storing the characteristic value set determined for each candidate language, wherein each determined characteristic value is either not defined or not commonly used in the local character sets corresponding to the other candidate languages;
a reading unit, for reading the double-byte character set character data to be identified;
a total counter, for accumulating the total count of double bytes in the double-byte character set character data that has been read;
a comparing unit, for comparing the double-byte character set character data that has been read with the characteristic value set of each candidate language and outputting a comparison result;
a plurality of totalizers, each corresponding to one candidate language, wherein, when the output comparison result shows that an item of the double-byte character set character data matches one of the characteristic values in the characteristic value set of one of the candidate languages, the totalizer corresponding to that candidate language increments its count by 1;
a language identification unit, for identifying the language of the double-byte character set character data according to the total count of the total counter and the respective counts of the plurality of totalizers corresponding to the candidate languages, wherein, if the count corresponding to Korean is greater than 10% of the total count, the double-byte character set character data to be identified is identified as Korean; otherwise, it is identified as the language with the largest count among the other oriental languages.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510091971 CN1916888A (en) | 2005-08-15 | 2005-08-15 | Method and system of identifying language of double-byte character set character data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510091971 CN1916888A (en) | 2005-08-15 | 2005-08-15 | Method and system of identifying language of double-byte character set character data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1916888A (en) | 2007-02-21 |
Family
ID=37737886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200510091971 Pending CN1916888A (en) | 2005-08-15 | 2005-08-15 | Method and system of identifying language of double-byte character set character data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1916888A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
SG155069A1 (en) * | 2008-02-14 | 2009-09-30 | Victor Company Of Japan | Method of language coding identification and data format therefor |
CN102479187A (en) * | 2010-11-23 | 2012-05-30 | 盛乐信息技术(上海)有限公司 | GBK character query system based on parity check and implementation method thereof |
CN102479187B (en) * | 2010-11-23 | 2016-09-14 | 盛乐信息技术(上海)有限公司 | GBK character inquiry system based on even-odd check and its implementation |
CN102541822A (en) * | 2010-12-21 | 2012-07-04 | 航天信息股份有限公司 | Chinese character processing method and Chinese character processing device during communication |
CN102541822B (en) * | 2010-12-21 | 2014-07-02 | 航天信息股份有限公司 | Chinese character processing method and Chinese character processing device during communication |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20070221 |
C20 | Patent right or utility model deemed to be abandoned or is abandoned | |