CN1786956A - Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine - Google Patents

Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine

Info

Publication number
CN1786956A
Authority
CN
China
Prior art keywords
chinese character
variant
variant chinese
conversion
ideograph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200510127958
Other languages
Chinese (zh)
Other versions
CN1786956B (en)
Inventor
冯建康
王宏源
赵锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wang Fei
Original Assignee
王宏源
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 王宏源 filed Critical 王宏源
Priority to CN 200510127958 priority Critical patent/CN1786956B/en
Publication of CN1786956A publication Critical patent/CN1786956A/en
Application granted granted Critical
Publication of CN1786956B publication Critical patent/CN1786956B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs, including those encoded with Unicode four-byte codes. Working from a table of variant Chinese characters, the method uses layered matching to match and retrieve variant forms across the Chinese-character scripts used in different parts of East Asia, between characters in current common use and ancient-script forms, and between ancient-script forms from different editions. At search time, entering any one variant form also retrieves information containing the other variant forms. The search engine can therefore retrieve the information the user needs more accurately, and the user does not have to consider conversions between variant forms.

Description

Method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs containing Unicode four-byte codes
Technical field
The present invention relates to a method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs containing Unicode four-byte codes.
Background technology
A search engine helps users find the information they need within a vast body of data. As informatization advances, humanity has accumulated ever more information; on the Internet in particular, the volume of accumulated information grows exponentially every year, and search engines play a critical role in locating what users need in it. Because of China's 5,000 years of accumulated culture and the distinctive nature of the Chinese language, foreign English-language search engines do not handle Chinese search well. Search engines dedicated to Chinese have therefore appeared, Baidu being one example. The Baidu search engine uses its own Chinese-language processing technology, based on both characters and words, to address the problem of understanding Chinese text, and it largely overcomes the shortcomings of search engines based purely on characters or purely on words. Baidu supports the mainstream Chinese character encoding standards, including GB2312 and BIG5, and can convert between encodings, so that search results for simplified and traditional Chinese characters are combined naturally.
The rapid accumulation of information is not limited, however, to new information in the existing common encodings; the paper records that China has accumulated over thousands of years are also steadily being digitized. In recent years, more and more digitized information about ancient knowledge has appeared. For example, much of China's vast body of transmitted texts, and of archaeologically excavated documents such as bamboo and silk manuscripts, bronze inscriptions, and oracle bone inscriptions, has been converted into electronic text. This material involves not only simplified and traditional characters but also many rare ancient characters. One aspect of this is the large number of variant Chinese characters, whose forms include ancient-script forms, alternate forms, nonstandard forms, taboo-avoidance forms, and so on. The same character may also have several different forms depending on the region where it is used. Variant Chinese characters, as the term is used here, are characters whose written forms differ for various reasons but whose meaning and pronunciation are identical.
For example, the character "being" has the following written forms: "being" (simplified), "As" (traditional, Taiwan), "being" (traditional), [glyph image] (an ancient-script form), and [glyph image A20051012795800032] (another ancient-script form); these different forms of "being" are called variant Chinese characters. The character "Asia" is written "Asia" in simplified Chinese, "Ami" in Japanese, and "Asia" in traditional Chinese as used in Taiwan; these forms of "Asia" are likewise variant Chinese characters. The taboo-avoidance form of "profound firelight or sunlight" is [glyph image]; a nonstandard form of "upright stone tablet" is [glyph image A20051012795800034]; and a nonstandard form of "elk" is "Gaol". Such characters, identical in meaning but different in shape, were each used widely during a certain period or within a certain region for one reason or another.
Traditional two-byte encoding can handle at most some 20,000 Chinese characters, whereas the total number of characters found in ancient books exceeds 50,000. This figure does not include the characters used in research on excavated texts, such as oracle bone inscriptions, bronze inscriptions, and bamboo and silk manuscripts, that cannot be transcribed into standard script forms. Over thousands of years of development and evolution, the total number of Chinese characters handed down has come to exceed 100,000 (the "variant Chinese character dictionary" published in Taiwan contains 106,230 characters). In recent years, thanks to the Unicode unified encoding effort, much work has been done on combining character encoding with computer technology. A large number of rare Chinese characters have been placed in the four-byte code area and assigned unified codes, and Unicode is also to include ancient pictographic scripts such as China's oracle bone and bronze inscriptions, which will greatly increase the number of human characters that computers can handle. The simplified Chinese edition of Microsoft Office XP also ships with a four-byte character library, and the Microsoft platform can currently handle more than 70,000 Unicode characters. On this basis, the "vast hall of dragon language ancient books and records database" system of Beijing epoch vast hall Science and Technology Ltd. uses digital construction techniques for ancient books and documents to implement natural-language full-text retrieval based on Unicode four-byte codes, providing faithful digitization and full-text search of transmitted and excavated documents that contain large numbers of rare Chinese characters.
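To make the two-byte versus four-byte distinction concrete, the following minimal sketch assumes that the "four-byte code area" corresponds to Unicode code points above U+FFFF, which UTF-16 stores as a surrogate pair of two 16-bit units. The sample characters and the helper names `utf16_byte_length` and `is_four_byte` are illustrative assumptions and are not taken from the patent.

```python
def utf16_byte_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-16 (big-endian, no BOM)."""
    return len(ch.encode("utf-16-be"))


def is_four_byte(ch: str) -> bool:
    """True if the code point lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


# Illustrative sample characters (assumptions, not from the patent text).
bmp_char = "\u4e3a"      # U+4E3A, a common ideograph in the Basic Multilingual Plane
ext_char = "\U00020000"  # U+20000, first code point of CJK Unified Ideographs Extension B

for ch in (bmp_char, ext_char):
    print(f"U+{ord(ch):04X}: {utf16_byte_length(ch)} bytes in UTF-16, "
          f"four-byte={is_four_byte(ch)}")
```

A variant table that covers rare ancient forms has to be indexed by whole code points, as Python strings are, and not by fixed two-byte units.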
With current technology, however, it is only possible to digitize ancient texts and to search for individual rare characters and words; the problem of converting between different written forms of the same character has not been fully solved. For example, current search engines such as Baidu and Google support mutual conversion and matching only between simplified and traditional forms, that is, among "being", "As", "As" and between "Asia", "Asia". They cannot convert or match among "Asia", "Asia" and the Japanese "Ami", nor among the ancient-script forms [glyph image A20051012795800041], nor between those forms and "being", "As", "being". In other words, with a current search engine, a user who enters only simplified or traditional characters cannot retrieve the relevant information in texts from other East Asian countries or in ancient documents.
Summary of the invention
In view of the above, the main purpose of the present invention is to provide a method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs containing Unicode four-byte codes. Based on a table of variant Chinese characters, the method uses layered matching to match and retrieve variant characters in a search engine. These variant characters include the different forms of the same character that arise from its use in different regions of East Asia, and the different forms of the same character in ancient scripts of different editions. As a simple example, "being" (simplified), "As" (traditional, Taiwan), "being" (traditional), [glyph image A20051012795800043] (an ancient-script form), and [glyph image] (another ancient-script form) are collectively called the variant character set of "being"; the invention provides mutual matching and retrieval among the members of such a set. With this method, entering any one variant at search time also hits information containing the other variants.
When handling the mapping conversion between variant characters, the method is implemented as follows:
A. Divide the variant character table into two sub-tables, common and rare, and store them separately. The common sub-table is the set of variant characters formed by the different versions of Chinese-character text currently in use in the various parts of East Asia; the rare sub-table is the set of rare variant characters found in transmitted and excavated documents.
B. Establish mapping rules between the two sub-tables and between the different variant characters within each sub-table.
C. Combine the mapping rules into three hit types according to the application; at search time, set the hit type according to need, which enables the corresponding mapping conversion rules.
D. According to the hit type and the characters in the input query string, apply the mapping rules between variant characters and output the converted set of variant characters.
E. The search engine searches using the set of keywords produced by the variant character conversion.
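Steps A and B can be pictured with a small data-structure sketch. The patent does not specify a storage layout, so everything here is an assumption: the group identifiers, the sample characters, the placeholder code point standing in for a rare ancient form, and the `build_index` helper are all hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# Step A (sketch): two sub-tables stored separately, keyed by a hypothetical group id.
# The characters are illustrative assumptions, not data from the patent.
COMMON_TABLE: Dict[str, FrozenSet[str]] = {
    "wei": frozenset({"为", "為", "爲"}),
    "ya": frozenset({"亚", "亜", "亞"}),
}
RARE_TABLE: Dict[str, FrozenSet[str]] = {
    # U+20000 is only a placeholder for a four-byte ancient form; it is not a
    # real variant of the characters above.
    "wei": frozenset({"\U00020000"}),
}


def build_index(common: Dict[str, FrozenSet[str]],
                rare: Dict[str, FrozenSet[str]]) -> Dict[str, Tuple[str, str]]:
    """Map every character to (group id, sub-table name) for constant-time lookup."""
    index: Dict[str, Tuple[str, str]] = {}
    for table_name, table in (("common", common), ("rare", rare)):
        for group_id, chars in table.items():
            for ch in chars:
                index[ch] = (group_id, table_name)
    return index


INDEX = build_index(COMMON_TABLE, RARE_TABLE)
print(INDEX["為"])  # group id and sub-table of one sample character
```

Keeping the group id in both sub-tables is one simple way to let the mapping rules of step B cross from one sub-table to the other.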
Features of the present invention:
1. The search engine can find information in ancient texts from commonly used characters. Beyond simplified/traditional conversion, the method supports conversion among the Chinese-character forms currently used across East Asia, between characters in current common use and ancient script, and between ancient-script forms of different editions.
2. Classified rules let users enable the conversion rules that suit their own needs and filter out large amounts of unnecessary search results.
Description of drawings
Fig. 1 is a schematic diagram of the variant character mapping rule relationships of the present invention.
Fig. 2 is a schematic flow chart of the variant character mapping conversion process of the present invention in a search engine.
Embodiment
The main purpose of the present invention is to provide a method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs containing Unicode four-byte codes. Based on a table of variant Chinese characters, the method uses layered matching to provide matched retrieval in a search engine among the Chinese-character forms currently used across East Asia, between characters in current common use and ancient script, and between ancient script of different editions.
The specific implementation is as follows:
A. Divide the variant character table into two sub-tables, common and ancient-script, and store them separately. For example, characters currently in use in the various regions of East Asia, such as "being" (simplified), "As" (traditional, Taiwan), "Asia" (simplified), "Ami" (Japanese), and "Asia" (traditional, Taiwan), belong to the common character table; characters that were widely used in ancient times, such as [glyph image A20051012795800053], belong to the ancient-script table.
B. Establish mapping rules between the two sub-tables and between the different variant characters within each sub-table.
The variant character conversion mapping rules are defined as follows:
Rule 1: mapping within the common table. For example, "being", "As" and "being" can be mapped to one another, and "Asia", "Ami" and "Asia" can be mapped to one another.
Rule 2: mapping within the ancient-script table. For example, [glyph image] and [glyph image] can be mapped to each other.
Rule 3: mapping from the common table to the ancient-script table. For example, any one of "being", "As" and "being" can be mapped to [glyph image] and [glyph image].
Rule 4: mapping from the ancient-script table to the common table. For example, either of [glyph image] and [glyph image A20051012795800059] can be mapped to "being", "As" and "being".
C. Combine the mapping rules into three hit types according to the application.
The mapping rules included in each of the three hit types are as follows:
Common-character hit: Rule 1.
Ancient-script hit: Rules 1, 2 and 3.
Full hit: Rules 1, 2, 3 and 4.
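The four rules and three hit types defined above can be sketched as a small lookup. This is a hedged illustration under stated assumptions, not the patented implementation: the table contents, the placeholder code points for ancient forms, and the names `RULES`, `HitType` and `expand` are all hypothetical.

```python
from enum import Enum
from typing import Dict, Set

# Illustrative tables (assumptions). U+20000 and U+20001 are placeholders for
# four-byte ancient forms and are not real variants of the common characters.
TABLES: Dict[str, Dict[str, Set[str]]] = {
    "common": {"wei": {"为", "為", "爲"}},
    "ancient": {"wei": {"\U00020000", "\U00020001"}},
}

# Rule number -> (source table, target table), following Rules 1 to 4.
RULES = {
    1: ("common", "common"),
    2: ("ancient", "ancient"),
    3: ("common", "ancient"),
    4: ("ancient", "common"),
}


class HitType(Enum):
    COMMON = frozenset({1})
    ANCIENT = frozenset({1, 2, 3})
    FULL = frozenset({1, 2, 3, 4})


def expand(ch: str, hit_type: HitType) -> Set[str]:
    """Return the input character plus every variant reachable under the hit type."""
    result = {ch}
    for table_name, table in TABLES.items():
        for group_id, chars in table.items():
            if ch not in chars:
                continue
            for rule in hit_type.value:
                source, target = RULES[rule]
                if source == table_name:
                    result |= TABLES[target].get(group_id, set())
    return result


print(sorted(expand("为", HitType.COMMON)))
```

Under this reading, an ancient-form input with the ancient-script hit type expands only to other ancient forms, because Rule 4 is enabled only by the full hit type.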
D. According to the hit type and the input search keyword, perform the variant character mapping conversion using the mapping rules between variant characters, and output the converted set of variant characters.
Variant character mapping conversion means outputting the mapping result of the input character according to the applicable mapping rules. For example, if the input is "being", the output after conversion under Rule 3 is [glyph image] and [glyph image A20051012795800062].
E. The search engine searches using the set of keywords produced by the variant character conversion. If the input is "being" and the conversion outputs [glyph image A20051012795800063] and [glyph image A20051012795800064], the search engine searches for information containing "being", [glyph image] or [glyph image].
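One way to picture step E is to turn the converted variant sets into a single pattern and match it against documents. The sketch below uses a regular expression with one character class per query character as a stand-in for a real engine's index lookup, which the patent does not describe. The variant groups, sample documents, and function names are hypothetical.

```python
import re
from typing import List, Set

# Hypothetical variant groups, as step D might return them under a full hit.
# U+20000 is a placeholder for a four-byte ancient form.
VARIANT_GROUPS: List[Set[str]] = [
    {"为", "為", "爲", "\U00020000"},
    {"亚", "亜", "亞"},
]


def variants_of(ch: str) -> Set[str]:
    """Return the variant group containing ch, or just ch if it has no variants."""
    for group in VARIANT_GROUPS:
        if ch in group:
            return group
    return {ch}


def build_pattern(keyword: str) -> "re.Pattern[str]":
    """Build a regex in which each query character matches any of its variants."""
    parts = []
    for ch in keyword:
        alternatives = "".join(re.escape(v) for v in sorted(variants_of(ch)))
        parts.append(f"[{alternatives}]")
    return re.compile("".join(parts))


def search(keyword: str, documents: List[str]) -> List[str]:
    """Return the documents that contain the keyword in any variant spelling."""
    pattern = build_pattern(keyword)
    return [doc for doc in documents if pattern.search(doc)]


docs = ["亞洲", "亜細亜", "欧洲", "東亚"]
print(search("亚", docs))
```

Using a character class per position avoids enumerating every combination of variants for a multi-character keyword; an index-based engine would more likely expand the query terms at lookup time instead.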
Advantages and technical effects of the present invention:
The invention solves the conversion problem among the Chinese-character forms currently used across East Asia, between characters in current common use and ancient script, and between ancient-script forms of different editions. The search engine can thus retrieve the information the user needs more accurately, and the user need not consider conversions between the various variant characters.

Claims (4)

1. A method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs containing Unicode four-byte codes. Based on a table of variant Chinese characters, the method uses layered matching to perform matched retrieval among variant characters in the search engine. The variant characters include the different forms of the same character among the various East Asian ideographs and the different forms in ancient script of various editions. At search time, entering any one of the variant characters also hits information containing the other variant characters.
2. The method for handling, in a search engine, the conversion of variant Chinese characters among ideographs containing Unicode four-byte codes according to claim 1, characterized in that: when handling the mapping conversion between variant characters, the variant character table is divided into two sub-tables, common and rare, which are stored separately; and mapping rules are established between the two sub-tables and between the different variant characters within each sub-table.
3. The method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs containing Unicode four-byte codes according to claim 1 or 2, characterized in that: the mapping rules are combined into three hit types according to the application, and at search time the user sets the hit type according to his or her own needs, which enables the corresponding mapping conversion rules.
4. The method for handling, in a search engine, the conversion of variant Chinese characters among East Asian ideographs containing Unicode four-byte codes according to claim 1, 2 or 3, characterized in that: according to the hit type and the input search keyword, the converted set of variant characters is output by applying the mapping rules between variant characters; and the search engine searches using the set of keywords produced by the variant character conversion.
CN 200510127958 2005-12-09 2005-12-09 Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine Expired - Fee Related CN1786956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510127958 CN1786956B (en) 2005-12-09 2005-12-09 Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510127958 CN1786956B (en) 2005-12-09 2005-12-09 Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine

Publications (2)

Publication Number Publication Date
CN1786956A true CN1786956A (en) 2006-06-14
CN1786956B CN1786956B (en) 2010-08-25

Family

ID=36784417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510127958 Expired - Fee Related CN1786956B (en) 2005-12-09 2005-12-09 Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine

Country Status (1)

Country Link
CN (1) CN1786956B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823874A (en) * 2014-02-27 2014-05-28 北京六间房科技有限公司 Special character search method and system
CN104679871A (en) * 2015-03-06 2015-06-03 北京语言大学 Chinese text searching method and Chinese text searching device
CN105224539A (en) * 2014-05-29 2016-01-06 腾讯科技(深圳)有限公司 The disposal route of pagefile and device
CN108108337A (en) * 2016-11-25 2018-06-01 北大方正集团有限公司 Simplified and traditional mutual shifting method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1038364A (en) * 1988-06-03 1989-12-27 李毅民 Letter complex form of Chinese characters compatible automatic conversion system for Chinese-character information processing
JPH08263478A (en) * 1995-03-24 1996-10-11 Matsushita Electric Ind Co Ltd Single/linked chinese character document converting device
CN1532729A (en) * 2003-03-19 2004-09-29 中国科学院计算机网络信息中心 Method for forming chinese character string in full complex form, full simplified form and other recative irregular form

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823874A (en) * 2014-02-27 2014-05-28 北京六间房科技有限公司 Special character search method and system
CN105224539A (en) * 2014-05-29 2016-01-06 腾讯科技(深圳)有限公司 The disposal route of pagefile and device
CN105224539B (en) * 2014-05-29 2021-05-11 腾讯科技(深圳)有限公司 Page file processing method and device
CN104679871A (en) * 2015-03-06 2015-06-03 北京语言大学 Chinese text searching method and Chinese text searching device
CN104679871B (en) * 2015-03-06 2018-03-30 北京语言大学 A kind of Chinese language text search method and Chinese language text retrieval device
CN108108337A (en) * 2016-11-25 2018-06-01 北大方正集团有限公司 Simplified and traditional mutual shifting method and device

Also Published As

Publication number Publication date
CN1786956B (en) 2010-08-25


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: WANG FEI

Free format text: FORMER OWNER: WANG HONGYUAN

Effective date: 20090515

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20090515

Address after: Beijing City, Chaoyang District Street heading for the small village compound No. 12 room 901. Postal code: 100020

Applicant after: Wang Fei

Address before: Beijing City, Chaoyang District Street heading for the small village compound No. 12 room 901. Postal code: 100020

Applicant before: Wang Hongyuan

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100825

Termination date: 20171209