CN1786956A

CN1786956A - Method for handling the conversion of variant forms of East Asian ideographs, including those with Unicode four-byte codes, in a search engine

Info

Publication number: CN1786956A
Application number: CN 200510127958
Authority: CN
Inventors: 冯建康; 王宏源; 赵锋
Original assignee: 王宏源
Current assignee: Wang Fei
Priority date: 2005-12-09
Filing date: 2005-12-09
Publication date: 2006-06-14
Anticipated expiration: 2025-12-09
Also published as: CN1786956B

Abstract

The invention discloses a method for handling, in a search engine, the conversion of variant forms of East Asian ideographs, including those with Unicode four-byte codes. Using a table of variant Chinese characters and a layered matching approach, the method enables matched retrieval of variant characters across the Chinese characters used in different East Asian regions, between characters in current common use and ancient scripts, and between ancient scripts of different versions. When searching, entering any one variant form also retrieves information containing the other variant forms. The search engine can thus retrieve the information the user needs more accurately, without the user having to consider conversion between variant forms.

Description

Method for handling the conversion of variant forms of East Asian ideographs, including those with Unicode four-byte codes, in a search engine

Technical field

The present invention relates to a method for handling, in a search engine, the conversion of variant forms of East Asian ideographs, including those with Unicode four-byte codes.

Background technology

Search engines help users find the information they need within massive volumes of data. As informatization advances, humanity accumulates ever more information, and on the Internet in particular the volume grows exponentially every year. Search engines play a critical role in locating what a user needs in this vast body of Internet information. Because of China's five thousand years of cultural accumulation and the distinctive nature of the Chinese language, foreign English-language search engines do not handle Chinese search well. Search engines dedicated to Chinese have therefore appeared, Baidu being one example. The Baidu search engine uses its own Chinese-language processing technology, based on both characters and words, to address the problem of understanding Chinese text, and it largely overcomes the shortcomings of engines based purely on characters or purely on words. Baidu supports the mainstream Chinese character encoding standards, including GB2312 and BIG5, and can convert between encodings, so that search results in Simplified and Traditional Chinese are combined naturally.

However, the rapid accumulation of information is not limited to new material in the common encodings; the paper records China has accumulated over thousands of years are also steadily being digitized. In recent years more and more ancient knowledge has appeared in digital form. For example, much of China's vast body of transmitted texts, and of excavated texts such as bamboo and silk manuscripts, bronze inscriptions, and oracle bone inscriptions, has been converted to electronic text. This material involves not only Simplified and Traditional characters but also many rare ancient characters. One prominent phenomenon is the large number of variant characters, whose forms include ancient-script forms, alternate forms, popular (nonstandard) forms, taboo-avoidance forms, and so on. The same character may also have several different forms depending on the region where it is used. The variant characters discussed here are characters whose written forms differ for various reasons but whose meaning and pronunciation are identical.

For example, the Simplified character 为 has the following written forms: 为 (Simplified), 為 (Traditional, Taiwan), 爲 (Traditional), and two ancient-script forms of the character [glyphs not reproduced]. These different forms of 为 are called variant characters. Likewise, 亚 is written 亚 in Simplified Chinese, 亜 in Japanese, and 亞 in Traditional Chinese (Taiwan); these forms are also variant characters. Further examples are the taboo-avoidance form of 玄烨, the popular form of 碑, and the popular form of 麋 [glyphs not reproduced]. Such characters, identical in meaning but different in form, were all used in large numbers within a certain period or region for one reason or another.

Traditional two-byte encodings can handle at most a little over 20,000 Chinese characters, whereas the total number of characters found in ancient books exceeds 50,000. That figure does not include characters from excavated-text research (oracle bone inscriptions, bronze inscriptions, bamboo and silk manuscripts) that cannot be transcribed into regular script. Over thousands of years of development and evolution, the total number of Chinese characters handed down has come to exceed 100,000 (the "Dictionary of Chinese Character Variants" published in Taiwan collects 106,230 characters). In recent years, thanks to the Unicode unified encoding effort, much work has been done to combine character encoding with computer technology. A large number of rare Chinese characters have been placed in the four-byte code area and assigned unified codes, and Unicode will also take in ancient pictographic scripts such as China's oracle bone and bronze inscriptions, which will greatly increase the number of human characters that computers can handle. The Simplified Chinese edition of Microsoft Office XP also ships with a four-byte font, and the Microsoft platform can currently handle more than 70,000 Unicode characters. On this basis, the "Longyu Hantang Classics Database" system of Beijing Shidai Hantang Technology Co., Ltd. adopts a digitization technology for classical documents built on natural-language full-text retrieval with Unicode four-byte codes, achieving faithful digital processing and full-text retrieval of transmitted and excavated texts that contain large numbers of rare characters.
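To make the "four-byte code" point concrete, the following minimal Python sketch compares the encoded size of a common character with that of a supplementary-plane character. The text does not say which encoding form it has in mind, so the choice of UTF-8 and UTF-16 here is an assumption, and the function name `encoded_lengths` and the sample code point U+20000 (the first code point of CJK Unified Ideographs Extension B) are illustrative only.

```python
from typing import Tuple


def encoded_lengths(ch: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    # "utf-16-be" is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# A character in the Basic Multilingual Plane (U+4E3A).
BMP_CHAR = "为"
# Illustrative supplementary-plane character: U+20000, CJK Extension B.
EXT_B_CHAR = "\U00020000"
```

Here `encoded_lengths(BMP_CHAR)` gives `(3, 2)` and `encoded_lengths(EXT_B_CHAR)` gives `(4, 4)`: the Extension B character needs four bytes in both forms (a surrogate pair in UTF-16). Python strings are indexed by code point, so `len(EXT_B_CHAR)` is still 1, which keeps per-character table lookups simple.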

With current technology, however, only the digitization of ancient texts and the retrieval of individual rare characters and words have been achieved; conversion between the different written forms of the same character has not been fully solved. For example, current search engines such as Baidu and Google implement only Simplified-Traditional conversion and matching, that is, among 为, 為, and 爲, and between 亚 and 亞. They cannot convert or match 亚 and 亞 with the Japanese form 亜, nor can they convert or match among the ancient-script forms of 为 [glyphs not reproduced] or between those forms and 为, 為, and 爲. In other words, a user of a current search engine who enters only Simplified or Traditional characters cannot retrieve the relevant information from other East Asian countries or from ancient texts.

Summary of the invention

In view of the above, the main purpose of the present invention is to provide a method for handling, in a search engine, the conversion of variant forms of East Asian ideographs, including those with Unicode four-byte codes. Based on a table of variant Chinese characters, the method uses layered matching to achieve matching and retrieval among variant characters in a search engine. These variant characters include the different forms of one character that arise from use in different East Asian regions and the different forms of one character in ancient scripts of different versions. As a simple example, 为 (Simplified), 為 (Traditional, Taiwan), 爲 (Traditional), and the two ancient-script forms of the character [glyphs not reproduced] are collectively called the variant character set of 为; the invention achieves mutual matching and retrieval within such a set. With this method, entering any one variant at search time also hits information containing the other variants.

When handling the mapping and conversion between variant characters, the method is implemented as follows:

A. The variant character table is divided into two sub-tables, common and rare, which are stored separately. The common sub-table is the set of variant characters drawn from the different versions of Chinese-character text currently in use across the East Asian region; the rare sub-table is the set of rare variant characters found in transmitted and excavated texts.

B. Mapping rules are established between the two sub-tables and among the different variant characters within each sub-table.

C. The mapping rules are combined into three hit types according to the application. At search time, a hit type is set according to the user's needs, which enables the corresponding mapping rules.

D. According to the hit type and the characters in the input query string, the variant character set produced by the mapping rules is output.

E. The search engine searches with the keyword set obtained from the variant conversion.

Features of the present invention:

1. The search engine can find information in ancient texts from commonly used characters. The method achieves not only Simplified-Traditional conversion but also conversion among the Chinese-character forms currently used across East Asia, between current common characters and ancient scripts, and between ancient-script forms of different versions.

2. Classified rules let users enable the conversion rules that match their needs, filtering out large amounts of unneeded search results.

Description of drawings

Fig. 1 is a schematic diagram of the variant character mapping rules of the present invention.

Fig. 2 is a flowchart of variant character mapping and conversion in a search engine according to the present invention.

Embodiment

The main purpose of the present invention is to provide a method for handling, in a search engine, the conversion of variant forms of East Asian ideographs, including those with Unicode four-byte codes. Based on a table of variant Chinese characters, the method uses layered matching to achieve matched retrieval in a search engine among the Chinese-character forms currently used across East Asia, between current common characters and ancient scripts, and between ancient scripts of different versions.

The specific implementation is as follows:

A. The variant character table is divided into two sub-tables, common and ancient-script, which are stored separately. For example, 为 (Simplified), 為 (Traditional, Taiwan), 亚 (Simplified), 亜 (Japanese), and 亞 (Traditional, Taiwan), all currently in use in various parts of East Asia, belong to the common-character table; characters used extensively in ancient times [glyphs not reproduced] belong to the ancient-script table.
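One way to picture step A is as two lookup tables keyed by character, each pointing at the variant group the character belongs to. The sketch below is only an illustration under stated assumptions: the names `VariantGroup`, `GROUPS`, and `build_sub_tables` are hypothetical, and because the ancient-script glyphs are not available in this text, two arbitrary CJK Extension B code points stand in for them.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple

# Stand-ins only: arbitrary Extension B code points, NOT the real ancient forms.
ANCIENT_A = "\U00020000"
ANCIENT_B = "\U00020001"


@dataclass(frozen=True)
class VariantGroup:
    """All written forms of one character, split by sub-table."""
    common: FrozenSet[str]   # forms in current use across East Asia
    ancient: FrozenSet[str]  # forms found in transmitted or excavated texts


GROUPS = [
    VariantGroup(frozenset("为為爲"), frozenset({ANCIENT_A, ANCIENT_B})),
    VariantGroup(frozenset("亚亜亞"), frozenset()),
]


def build_sub_tables(
    groups: Iterable[VariantGroup],
) -> Tuple[Dict[str, VariantGroup], Dict[str, VariantGroup]]:
    """Store the common and ancient-script sub-tables separately."""
    common_table: Dict[str, VariantGroup] = {}
    ancient_table: Dict[str, VariantGroup] = {}
    for group in groups:
        for ch in group.common:
            common_table[ch] = group
        for ch in group.ancient:
            ancient_table[ch] = group
    return common_table, ancient_table


COMMON_TABLE, ANCIENT_TABLE = build_sub_tables(GROUPS)
```

Keeping the two tables separate means a later step can decide, per query, whether the ancient-script table is consulted at all.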

B. Mapping rules are established between the two sub-tables and among the different variant characters within each sub-table.

The variant conversion mapping rules are defined as follows:

Rule 1: mapping within the common table. For example, 为, 為, and 爲 map to one another, and 亚, 亜, and 亞 map to one another.

Rule 2: mapping within the ancient-script table. For example, the two ancient-script forms of 为 [glyphs not reproduced] map to each other.

Rule 3: mapping from the common table to the ancient-script table. For example, any one of 为, 為, and 爲 can be mapped to the ancient-script forms [glyphs not reproduced].

Rule 4: mapping from the ancient-script table to the common table. For example, any one of the ancient-script forms [glyphs not reproduced] can be mapped to 为, 為, and 爲.
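The four rules share one shape: a character from a source table maps to the forms in a destination table. A minimal sketch of that reading follows, restricted to a single variant group for brevity. The names `RULES` and `apply_rule` are hypothetical, and the ancient-script forms are again represented by arbitrary Extension B stand-ins rather than the real glyphs.

```python
from typing import Dict, FrozenSet, Tuple

COMMON: FrozenSet[str] = frozenset("为為爲")
# Stand-ins only: NOT the real ancient-script forms.
ANCIENT: FrozenSet[str] = frozenset({"\U00020000", "\U00020001"})

# rule number -> (source table, destination table)
RULES: Dict[int, Tuple[FrozenSet[str], FrozenSet[str]]] = {
    1: (COMMON, COMMON),    # within the common table
    2: (ANCIENT, ANCIENT),  # within the ancient-script table
    3: (COMMON, ANCIENT),   # common -> ancient-script
    4: (ANCIENT, COMMON),   # ancient-script -> common
}


def apply_rule(rule: int, ch: str) -> FrozenSet[str]:
    """Return the forms that `ch` maps to under one rule (excluding `ch` itself)."""
    source, destination = RULES[rule]
    if ch not in source:
        return frozenset()  # the rule does not apply to this character
    return destination - {ch}
```

For instance, `apply_rule(1, "为")` yields the other two common forms, while `apply_rule(2, "为")` is empty because 为 is not in the ancient-script table.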

C. The mapping rules are combined into three hit types according to the application.

The mapping rules for the three hit types are as follows:

Common-character hit: includes rule 1.

Ancient-script hit: includes rules 1, 2, and 3.

Full hit: includes rules 1, 2, 3, and 4.
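The three hit types amount to named sets of enabled rules. The sketch below encodes exactly the rule lists given above; the enum name `HitType`, its member names, and the helper `enabled_rules` are hypothetical.

```python
from enum import Enum
from typing import List


class HitType(Enum):
    """Each hit type is the set of rule numbers it enables."""
    COMMON = frozenset({1})
    ANCIENT = frozenset({1, 2, 3})
    FULL = frozenset({1, 2, 3, 4})


def enabled_rules(hit_type: HitType) -> List[int]:
    """Return the rule numbers enabled by a hit type, in ascending order."""
    return sorted(hit_type.value)
```

Under these lists the types are nested: every rule enabled by the common-character hit is also enabled by the ancient-script hit, and the full hit adds only rule 4 (ancient-script to common).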

D. According to the hit type and the input search keywords, variant mapping conversion is performed using the mapping rules, and the converted variant character set is output.

Variant mapping conversion means outputting the mapping result for an input character under the applicable mapping rules. For example, if the input is 为, conversion under rule 3 outputs the ancient-script forms [glyphs not reproduced].
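Step D can be sketched as applying every rule enabled by the chosen hit type to each character of the query and collecting the results. This is a simplified illustration, not the patented implementation: the names `convert_char` and `convert_query` are hypothetical, only one variant group is modelled, the ancient-script forms are arbitrary Extension B stand-ins, and including the input character itself in the output set is a choice made here for convenience.

```python
from typing import Dict, FrozenSet, List, Tuple

COMMON: FrozenSet[str] = frozenset("为為爲")
# Stand-ins only: NOT the real ancient-script forms.
ANCIENT: FrozenSet[str] = frozenset({"\U00020000", "\U00020001"})

# rule number -> (source table, destination table)
RULES: Dict[int, Tuple[FrozenSet[str], FrozenSet[str]]] = {
    1: (COMMON, COMMON),
    2: (ANCIENT, ANCIENT),
    3: (COMMON, ANCIENT),
    4: (ANCIENT, COMMON),
}

# hit type -> enabled rule numbers, as listed in step C
HIT_TYPES: Dict[str, Tuple[int, ...]] = {
    "common": (1,),
    "ancient": (1, 2, 3),
    "full": (1, 2, 3, 4),
}


def convert_char(ch: str, hit_type: str) -> FrozenSet[str]:
    """Return `ch` plus every form reachable through the enabled rules."""
    result = {ch}
    for rule in HIT_TYPES[hit_type]:
        source, destination = RULES[rule]
        if ch in source:
            result |= destination
    return frozenset(result)


def convert_query(query: str, hit_type: str) -> List[FrozenSet[str]]:
    """Convert a query string into one variant set per character."""
    return [convert_char(ch, hit_type) for ch in query]
```

With the "ancient" hit type, a common form such as 为 expands to all common and ancient-script forms, whereas an ancient-script form expands only to the other ancient-script forms, because rule 4 is enabled only under "full".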

E. The search engine searches with the keyword set obtained from the variant conversion. If the input is 为 and the conversion outputs the ancient-script forms [glyphs not reproduced], the search engine searches for information containing 为 and those forms.
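To illustrate step E, the sketch below expands each query character into its variant set and matches documents against the expanded query. A linear regular-expression scan stands in for a real search engine's index, which the text does not describe; the names `VARIANTS`, `query_pattern`, `search`, and the sample `DOCS` are all hypothetical, and only common forms are included.

```python
import re
from typing import Dict, List, Pattern

# Every character maps to the full string of its variant forms.
VARIANTS: Dict[str, str] = {}
for forms in ("为為爲", "亚亜亞"):
    for ch in forms:
        VARIANTS[ch] = forms


def query_pattern(query: str) -> Pattern[str]:
    """Build a regex in which each character matches any of its variants."""
    parts = []
    for ch in query:
        forms = VARIANTS.get(ch, ch)  # characters without variants match themselves
        parts.append("[" + "".join(re.escape(f) for f in forms) + "]")
    return re.compile("".join(parts))


def search(query: str, documents: List[str]) -> List[str]:
    """Return the documents that contain the query in any variant spelling."""
    pattern = query_pattern(query)
    return [doc for doc in documents if pattern.search(doc)]


DOCS = ["亞洲地圖", "東亜の歴史", "亚洲新闻", "欧洲新闻"]
```

Searching `DOCS` for the single character 亚 returns the first three documents, one for each written form, while the two-character query 亞洲 returns the Traditional and Simplified documents that contain that word.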

Advantages and technical effects of the present invention:

The present invention effectively solves the problem of conversion among the Chinese-character forms currently used across East Asia, between current common characters and ancient scripts, and between ancient-script forms of different versions. The search engine can retrieve the information the user needs more accurately, and the user need not consider conversion between the various variant characters.

Claims

1. A method for handling, in a search engine, the conversion of variant forms of East Asian ideographs including those with Unicode four-byte codes, wherein the method, based on a table of variant Chinese characters, uses layered matching to achieve matched retrieval among variant characters in the search engine; the variant characters include the different forms of one character among the various East Asian ideographs and its different forms in ancient scripts of various versions; and when any one variant is entered at search time, information containing the other variants is also hit.

2. The method for handling, in a search engine, the conversion of variant forms of ideographs including those with Unicode four-byte codes according to claim 1, characterized in that: when handling mapping conversion between variant characters, the variant character table is divided into two sub-tables, common and rare, which are stored separately; and mapping rules are established between the two sub-tables and among the different variant characters within each sub-table.

3. The method for handling, in a search engine, the conversion of variant forms of East Asian ideographs including those with Unicode four-byte codes according to claim 1 or 2, characterized in that: the mapping rules are combined into three hit types according to the application, and at search time the user sets a hit type according to the user's own needs, which enables the corresponding mapping rules.

4. The method for handling, in a search engine, the conversion of variant forms of East Asian ideographs including those with Unicode four-byte codes according to claim 1, 2, or 3, characterized in that: according to the hit type and the input search keywords, the converted variant character set is output by the mapping rules between variant characters; and the search engine searches with the keyword set obtained from the variant conversion.