CN101814073A - Search engine method based on special word form information - Google Patents

Info

Publication number
CN101814073A
Authority
CN
China
Prior art keywords
switch process
conversion
character
chinese
simplified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910046475A
Other languages
Chinese (zh)
Inventor
邓晓涛
谢兵
杨杰
程健章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chuanxian Network Technology Shanghai Co Ltd
Original Assignee
Chuanxian Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chuanxian Network Technology Shanghai Co Ltd
Priority to CN200910046475A
Publication of CN101814073A

Abstract

The invention provides a search engine method based on special word form information. The search engine system comprises a client and a server connected for communication. The steps run on the server comprise text information acquisition, text segmentation, conversion, indexing, and index file library creation; the indexing step builds an inverted index from the output of the conversion step, and the index file library creation step generates index files from the output of the indexing step. The steps run on the client comprise user input, text segmentation, conversion, query, and result return; the conversion step converts the text information segmented in the text segmentation step, and the query step combines the terms output by the conversion step with the query conditions entered by the user, queries the index file library on the server, and outputs the query results. The method can be widely applied to retrieving text that contains variant word forms: a search can be made using another form of a word and still return the results corresponding to that text.

Description

Search engine method based on special word form information
Technical field
The present invention relates to a text information search engine system, and specifically to a search engine method based on special word form information.
Background art
With the growth of the Internet, search engines have become one of the essential tools for retrieving information. On the Internet, information is presented mainly as text, and because written forms are diverse, text with the same meaning can appear in different forms. These variant forms arise mainly from differences in how people habitually describe information, in their input tools, in region, in abbreviation, and so on. Special word forms mainly involve differences in character encoding, in language, and in format. When a search engine processes text, it usually performs word segmentation on the original information and builds an inverted index file directly from the result. The principle is to establish a mapping between each term produced by segmentation and the path or URL (Uniform Resource Locator) of the text in which the term occurs; when a user searches, the engine uses the terms contained in the input phrase to find the corresponding resources and return them. If the user's input contains a variant form of a term, the text indexed under the other form cannot be retrieved.
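The term-to-location mapping described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names, the sample documents, and the choice to intersect the result sets of all query terms are assumptions made for the example. It shows why an exact-match index misses variant forms.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document identifiers (paths or URLs) containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def lookup(index, query_terms):
    """Return the documents that contain every query term, by exact string match."""
    if not query_terms:
        return set()
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings)

# Hypothetical, already segmented documents: the same word in two scripts.
docs = {
    "doc1.txt": ["农业", "发展"],  # simplified characters
    "doc2.txt": ["農業", "發展"],  # traditional characters
}
index = build_inverted_index(docs)

print(lookup(index, ["农业"]))  # only doc1.txt is found
print(lookup(index, ["農業"]))  # only doc2.txt is found
```

Each query finds only the document written in the same script, which is the retrieval gap the background section describes.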
At present, search engines handle a variant form either by treating it as an independent term or by running a repeated search with the variant as an additional term. In everyday use, text has many variant forms, mainly owing to region, user habits, and input tools. The variant forms addressed by the search engine method based on special word form information are simplified and traditional Chinese characters, full-width and half-width characters, Chinese and Arabic numerals, and date formats.
The difference between simplified and traditional Chinese text is mainly regional. In addition, some input tools can produce both scripts, and some users mix the two out of personal preference. On the Internet, Chinese text exists in both simplified and traditional form, so a search that mixes Chinese input across the two scripts may fail to return the desired results (for example, when searching for "agriculture").
Full-width and half-width characters belong to different character sets in computer character encoding (for example, the full-width "ａ" and the half-width "a" have different character codes). Mixing the two encodings is also common on the Internet, largely as a matter of personal style. Because the character sets differ, full-width and half-width characters are indexed as different characters, and at search time the engine retrieves only exactly matching terms, so characters with the same meaning cannot be retrieved.
Although Chinese numerals and Arabic numerals each have their own uses, they mean the same thing when describing certain cardinal numbers, ordinal numbers, and dates (for example, July 1, 1997 written with Chinese numerals and with Arabic numerals). Depending on the occasion, people use either form (for example, "999 roses" with the number written in Chinese numerals or in Arabic numerals). When searching, however, users often type Arabic numerals directly to reduce input, so information written with Chinese numerals cannot be retrieved (for example, a search for "999" does not retrieve the same number written in Chinese numerals).
Date formats also vary widely. Besides the Chinese-numeral dates described above, there are habitual formats (for example, "2007-07-01" and "20070701") that differ only in form and express the same meaning to a human reader. People tend to use a standard date format when publishing text but a plain numeric string when searching, so the same problem arises and the two forms cannot retrieve each other.
To solve this problem, the original information is adjusted during word segmentation: all variant forms are converted to one designated form (for example, all traditional Chinese characters are converted to simplified characters when the index file is generated during segmentation). Likewise, at search time the query is converted to the form that exists in the index before retrieval. Finally, the inverted-file postings for the term are returned, and the search engine system tells the user where the information is located.
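A minimal sketch of this idea, assuming a tiny hypothetical traditional-to-simplified table and illustrative function names (`normalize`, `build_index`, `search`), applies the same normalization when indexing and when querying:

```python
from collections import defaultdict

# Hypothetical two-character excerpt; a real table would cover the full character sets.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({"農": "农", "業": "业"})

def normalize(term):
    """Convert a term to the designated form (assumed here to be simplified Chinese)."""
    return term.translate(TRADITIONAL_TO_SIMPLIFIED)

def build_index(docs):
    """Index every term under its normalized form."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[normalize(term)].add(doc_id)
    return index

def search(index, query_terms):
    """Normalize the query terms the same way, then look them up."""
    postings = [index.get(normalize(term), set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()

docs = {"doc1.txt": ["农业"], "doc2.txt": ["農業"]}
index = build_index(docs)

print(search(index, ["農業"]))  # both documents are found
print(search(index, ["农业"]))  # both documents are found
```

Because both sides pass through the same `normalize` function, the script used in the document and the script used in the query no longer affect the result.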
Summary of the invention
The object of the invention is to address the shortcomings of existing text search engines by proposing a search engine system that searches text content regardless of the form in which the information is written. During word segmentation, a separate processor is designed for each kind of special word form, and this processing logic is embedded in the segmentation process, so that after segmentation different variant forms yield a unified term (for example, the simplified and traditional forms of "agriculture" are both indexed under the same term). The processed terms are then indexed by the search engine system. Once indexing is complete, the search engine segments the query keywords entered by the user, passes them through the same processors to obtain terms, retrieves the results for those terms, and returns the results to the user.
The invention specifically adopts the following technical solution:
A search engine method based on special word form information comprises steps that run on a client and steps that run on a server, wherein:
The steps that run on the server comprise, in order:
A text information acquisition step, for obtaining text information, which may be entered by a user or extracted from the Internet;
A text segmentation step, for performing word segmentation on the text information obtained in the text information acquisition step;
A conversion step, for converting the text information segmented in the text segmentation step;
An indexing step, for building an inverted index from the output of the conversion step and calculating weights;
An index file library creation step, for generating index files from the output of the indexing step;
The steps that run on the client comprise, in order:
A user input step, for accepting the query keywords and query conditions entered by the user;
A text segmentation step, for performing word segmentation on the query keywords obtained in the user input step;
A conversion step, for converting the text information segmented in the text segmentation step;
A query step, for combining the terms output by the conversion step with the query conditions entered by the user, querying the index file library built on the server, and outputting the query results;
A result return step, for returning the query results of the query step.
The conversion steps on the server and on the client each correspondingly comprise several or all of the following converters:
A Chinese simplified/traditional conversion step, for converting between simplified and traditional Chinese characters;
A full-width/half-width character conversion step, for converting between full-width and half-width characters;
A Chinese numeral conversion step, for converting numbers written in Chinese numerals into Arabic numerals;
A date format conversion step, for identifying date formats and converting them to a defined unified format.
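One way to picture a conversion step made up of several converters is a chain of functions applied to each segmented term, with the server and the client configured with the same chain. The sketch below is an assumption-laden illustration: `make_conversion_step` and both toy converters are hypothetical names, and Unicode NFKC normalization is used only as a convenient stand-in for a width converter (it also folds other compatibility characters, which the patent does not mention).

```python
import unicodedata
from typing import Callable, Iterable, List

Converter = Callable[[str], str]

def make_conversion_step(converters: Iterable[Converter]) -> Callable[[List[str]], List[str]]:
    """Build a conversion step that applies each converter, in order, to every term."""
    chain = list(converters)

    def convert(terms: List[str]) -> List[str]:
        converted = []
        for term in terms:
            for converter in chain:
                term = converter(term)
            converted.append(term)
        return converted

    return convert

def traditional_to_simplified(term: str) -> str:
    """Toy converter covering only two characters (hypothetical excerpt)."""
    return term.replace("農", "农").replace("業", "业")

def fold_width(term: str) -> str:
    """Stand-in width converter based on NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", term)

# The same converter list is used on both sides so that terms match.
CONVERTERS = [traditional_to_simplified, fold_width]
server_conversion = make_conversion_step(CONVERTERS)
client_conversion = make_conversion_step(CONVERTERS)

print(server_conversion(["農業", "ＡＢＣ"]))
print(client_conversion(["农业", "ABC"]))
```

Keeping the converter list in one place is a design choice for the sketch: if the two sides were configured differently, indexed terms and query terms could diverge again.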
Further, the Chinese simplified/traditional conversion step includes a simplified/traditional mapping table, which stores the simplified character set, the traditional character set, and the mapping between them. The step specifically comprises:
11) a simplified/traditional encoding judgment step, for judging whether the segmented text information needs simplified/traditional conversion; if so, it is output to step 12); if not, it is output directly;
12) a simplified/traditional conversion step, for performing the simplified/traditional conversion and outputting the result.
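Steps 11) and 12) can be sketched with a small lookup table. The four-entry table, the function names, and the direction of conversion (traditional to simplified) are assumptions for illustration; a real mapping table would cover the full character sets and would have to handle characters whose mapping is not one-to-one.

```python
# Hypothetical excerpt of the simplified/traditional mapping table.
TRADITIONAL_TO_SIMPLIFIED = {"農": "农", "業": "业", "發": "发", "國": "国"}

def needs_script_conversion(term):
    """Step 11: judge whether the term contains any character found in the traditional set."""
    return any(ch in TRADITIONAL_TO_SIMPLIFIED for ch in term)

def convert_script(term):
    """Step 12: replace each traditional character with its simplified counterpart."""
    return "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in term)

def script_conversion_step(term):
    """Run the judgment, then convert or pass the term through unchanged."""
    if needs_script_conversion(term):
        return convert_script(term)
    return term

print(script_conversion_step("農業"))  # converted
print(script_conversion_step("农业"))  # passed through unchanged
```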
Further, the full-width/half-width character conversion step comprises, in order:
21) a full-width/half-width judgment step, for judging whether the segmented text information needs full-width/half-width conversion; if so, it is output to step 22); if not, it is output directly;
22) a full-width/half-width conversion step, for converting the characters between full-width and half-width and outputting the result.
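Steps 21) and 22) can be sketched using the Unicode layout, in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from the ASCII characters U+0021 to U+007E, and the ideographic space U+3000 corresponds to the ordinary space. The function names and the choice of half-width as the target form are assumptions of this sketch.

```python
FULLWIDTH_START, FULLWIDTH_END = 0xFF01, 0xFF5E  # full-width "！" through "～"
OFFSET = 0xFEE0                                  # distance to the ASCII counterparts
IDEOGRAPHIC_SPACE = "\u3000"

def needs_width_conversion(term):
    """Step 21: judge whether the term contains any full-width character."""
    return any(
        ch == IDEOGRAPHIC_SPACE or FULLWIDTH_START <= ord(ch) <= FULLWIDTH_END
        for ch in term
    )

def to_halfwidth(term):
    """Step 22: map each full-width character to its half-width counterpart."""
    out = []
    for ch in term:
        code = ord(ch)
        if ch == IDEOGRAPHIC_SPACE:
            out.append(" ")
        elif FULLWIDTH_START <= code <= FULLWIDTH_END:
            out.append(chr(code - OFFSET))
        else:
            out.append(ch)
    return "".join(out)

def width_conversion_step(term):
    """Run the judgment, then convert or pass the term through unchanged."""
    return to_halfwidth(term) if needs_width_conversion(term) else term

print(width_conversion_step("ａｂｃ１２３"))  # abc123
```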
Further, the Chinese numeral conversion step includes a numeral mapping table, which stores the Chinese numeral character set, the Arabic numerals, and the mapping between Chinese numerals and Arabic numerals. The step specifically comprises:
31) a Chinese numeral conversion judgment step, for judging whether the segmented text information needs Chinese numeral conversion; if so, it is output to step 32); if not, it is output directly;
32) a Chinese numeral conversion step, for performing the conversion between Chinese numerals and Arabic numerals and outputting the result.
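A rough sketch of steps 31) and 32) follows. It assumes that a term is converted only when it consists entirely of numeral characters, handles digit-by-digit strings (as used for years) and values with the units ten, hundred, and thousand, and ignores larger units such as 万 and 亿. The tables and function names are illustrative, not taken from the patent.

```python
# Hypothetical numeral mapping table.
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def needs_numeral_conversion(term):
    """Step 31: judge whether the term consists only of Chinese numeral characters."""
    return bool(term) and all(ch in DIGITS or ch in UNITS for ch in term)

def chinese_to_arabic(term):
    """Step 32: convert a Chinese numeral string to an Arabic numeral string."""
    if not any(ch in UNITS for ch in term):
        # Digit-by-digit form, e.g. a year: each character becomes one digit.
        return "".join(str(DIGITS[ch]) for ch in term)
    total, current = 0, 0
    for ch in term:
        if ch in DIGITS:
            current = DIGITS[ch]
        else:
            # A unit with no preceding digit (as in 十五) counts as one of that unit.
            total += (current or 1) * UNITS[ch]
            current = 0
    return str(total + current)

def numeral_conversion_step(term):
    """Run the judgment, then convert or pass the term through unchanged."""
    return chinese_to_arabic(term) if needs_numeral_conversion(term) else term

print(numeral_conversion_step("九百九十九"))  # 999
print(numeral_conversion_step("一九九七"))    # 1997
```

The all-numeral judgment is deliberately naive: single characters such as 一 also occur as ordinary words, so a production system would need context from the segmenter to decide when a conversion is appropriate.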
Further, the date format conversion step comprises, in order:
41) a date format definition step, for defining the date format;
42) a date format conversion judgment step, for judging whether the segmented text information needs date format conversion; if so, it is output to step 43); if not, it is output directly;
43) a date format conversion step, for converting the input date format to the defined date format and outputting the result.
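Steps 41) to 43) can be sketched with regular expressions. The unified format (an eight-digit YYYYMMDD string), the list of recognized input patterns, and the function names are all assumptions of this sketch; the patent leaves the defined format open.

```python
import re
from datetime import date

# Step 41 (assumed): the unified format is an eight-digit string, YYYYMMDD.
DATE_PATTERNS = [
    re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})$"),      # 2007-07-01
    re.compile(r"^(\d{4})/(\d{1,2})/(\d{1,2})$"),      # 2007/7/1
    re.compile(r"^(\d{4})年(\d{1,2})月(\d{1,2})日$"),  # 2007年7月1日
    re.compile(r"^(\d{4})(\d{2})(\d{2})$"),            # 20070701
]

def parse_date(term):
    """Step 42: return a date if the term matches a known pattern and is a real calendar date."""
    for pattern in DATE_PATTERNS:
        match = pattern.match(term)
        if match:
            year, month, day = (int(group) for group in match.groups())
            try:
                return date(year, month, day)
            except ValueError:
                return None  # looks like a date but is not valid, e.g. month 13
    return None

def date_conversion_step(term):
    """Step 43: emit the unified format, or pass the term through unchanged."""
    parsed = parse_date(term)
    if parsed is None:
        return term
    return f"{parsed.year:04d}{parsed.month:02d}{parsed.day:02d}"

print(date_conversion_step("2007-07-01"))   # 20070701
print(date_conversion_step("2007年7月1日"))  # 20070701
```

Validating the matched numbers with `datetime.date` keeps arbitrary eight-digit strings that are not calendar dates from being treated as dates.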
The invention can be widely applied to retrieving text that contains variant forms: a search can be made using another form of a word and still return the results corresponding to that text. For example, when text is indexed and when the user enters a query, the simplified/traditional converter converts Chinese characters between simplified and traditional form, so the query results do not depend on whether the indexed text or the user's input is written in simplified or traditional characters. When text is indexed and when the user enters a query, the full-width/half-width conversion step converts characters between full-width and half-width, so the query results do not depend on whether the indexed text or the user's input uses full-width or half-width characters. When text is indexed and when the user enters a query, the Chinese numeral converter converts Chinese numerals, so the query results do not depend on whether the indexed text or the user's input uses Chinese or Arabic numerals. When text is indexed and when the user enters a query, the date format conversion step converts date-format text, so the query results do not depend on the date format used in the indexed text or in the user's input.
The invention is further described below with reference to the drawings and an embodiment.
Brief description of the drawings
Fig. 1 is a schematic diagram of an embodiment of the search engine method based on special word form information according to the invention;
Fig. 2 is a schematic diagram of the Chinese simplified/traditional conversion step in the embodiment of the invention;
Fig. 3 is a schematic diagram of the full-width/half-width character conversion step in the embodiment of the invention;
Fig. 4 is a schematic diagram of the Chinese numeral conversion step in the embodiment of the invention;
Fig. 5 is a schematic diagram of the date format conversion step in the embodiment of the invention.
Embodiment
As shown in Fig. 1, a search engine method based on special word form information comprises steps that run on a client and steps that run on a server, wherein:
The steps that run on the server comprise, in order:
A text information acquisition step, for obtaining text information, which may be entered by a user or extracted from the Internet;
A text segmentation step, for performing word segmentation on the text information obtained in the text information acquisition step;
A conversion step, for converting the text information segmented in the text segmentation step;
An indexing step, for building an inverted index from the output of the conversion step and calculating weights;
An index file library creation step, for generating index files from the output of the indexing step;
The steps that run on the client comprise, in order:
A user input step, for accepting the query keywords and query conditions entered by the user;
A text segmentation step, for performing word segmentation on the query keywords obtained in the user input step;
A conversion step, for converting the text information segmented in the text segmentation step;
A query step, for combining the terms output by the conversion step with the query conditions entered by the user, querying the index file library built on the server, and outputting the query results;
A result return step, for returning the query results of the query step.
The conversion steps on the server and on the client each correspondingly comprise several or all of the following converters:
A Chinese simplified/traditional conversion step, for converting between simplified and traditional Chinese characters;
A full-width/half-width character conversion step, for converting between full-width and half-width characters;
A Chinese numeral conversion step, for converting numbers written in Chinese numerals into Arabic numerals;
A date format conversion step, for identifying date formats and converting them to a defined unified format.
As shown in Fig. 2, the Chinese simplified/traditional conversion step includes a simplified/traditional mapping table, which stores the simplified character set, the traditional character set, and the mapping between them. The step specifically comprises:
11) a simplified/traditional encoding judgment step, for judging whether the segmented text information needs simplified/traditional conversion; if so, it is output to step 12); if not, it is output directly;
12) a simplified/traditional conversion step, for performing the simplified/traditional conversion and outputting the result.
As shown in Fig. 3, the full-width/half-width character conversion step comprises, in order:
21) a full-width/half-width judgment step, for judging whether the segmented text information needs full-width/half-width conversion; if so, it is output to step 22); if not, it is output directly;
22) a full-width/half-width conversion step, for converting the characters between full-width and half-width and outputting the result.
As shown in Fig. 4, the Chinese numeral conversion step includes a numeral mapping table, which stores the Chinese numeral character set, the Arabic numerals, and the mapping between Chinese numerals and Arabic numerals. The step specifically comprises:
31) a Chinese numeral conversion judgment step, for judging whether the segmented text information needs Chinese numeral conversion; if so, it is output to step 32); if not, it is output directly;
32) a Chinese numeral conversion step, for performing the conversion between Chinese numerals and Arabic numerals and outputting the result.
As shown in Fig. 5, the date format conversion step comprises, in order:
41) a date format definition step, for defining the date format;
42) a date format conversion judgment step, for judging whether the segmented text information needs date format conversion; if so, it is output to step 43); if not, it is output directly;
43) a date format conversion step, for converting the input date format to the defined date format and outputting the result.

Claims (5)

1. A search engine method based on special word form information, comprising steps that run on a client and steps that run on a server, characterized in that:
The steps that run on the server comprise, in order:
A text information acquisition step, for obtaining text information, which may be entered by a user or extracted from the Internet;
A text segmentation step, for performing word segmentation on the text information obtained in the text information acquisition step;
A conversion step, for converting the text information segmented in the text segmentation step;
An indexing step, for building an inverted index from the output of the conversion step and calculating weights;
An index file library creation step, for generating index files from the output of the indexing step;
The steps that run on the client comprise, in order:
A user input step, for accepting the query keywords and query conditions entered by the user;
A text segmentation step, for performing word segmentation on the query keywords obtained in the user input step;
A conversion step, for converting the text information segmented in the text segmentation step;
A query step, for combining the terms output by the conversion step with the query conditions entered by the user, querying the index file library built on the server, and outputting the query results;
A result return step, for returning the query results of the query step.
Wherein the conversion steps on the server and on the client each correspondingly comprise several or all of the following converters:
A Chinese simplified/traditional conversion step, for converting between simplified and traditional Chinese characters;
A full-width/half-width character conversion step, for converting between full-width and half-width characters;
A Chinese numeral conversion step, for converting numbers written in Chinese numerals into Arabic numerals;
A date format conversion step, for identifying date formats and converting them to a defined unified format.
2. The search engine method based on special word form information according to claim 1, characterized in that: the Chinese simplified/traditional conversion step includes a simplified/traditional mapping table, which stores the simplified character set, the traditional character set, and the mapping between them, and the step specifically comprises:
11) a simplified/traditional encoding judgment step, for judging whether the segmented text information needs simplified/traditional conversion; if so, it is output to step 12); if not, it is output directly;
12) a simplified/traditional conversion step, for performing the simplified/traditional conversion and outputting the result.
3. The search engine method based on special word form information according to claim 2, characterized in that: the full-width/half-width character conversion step comprises, in order:
21) a full-width/half-width judgment step, for judging whether the segmented text information needs full-width/half-width conversion; if so, it is output to step 22); if not, it is output directly;
22) a full-width/half-width conversion step, for converting the characters between full-width and half-width and outputting the result.
4. The search engine method based on special word form information according to claim 3, characterized in that: the Chinese numeral conversion step includes a numeral mapping table, which stores the Chinese numeral character set, the Arabic numerals, and the mapping between Chinese numerals and Arabic numerals, and the step specifically comprises:
31) a Chinese numeral conversion judgment step, for judging whether the segmented text information needs Chinese numeral conversion; if so, it is output to step 32); if not, it is output directly;
32) a Chinese numeral conversion step, for performing the conversion between Chinese numerals and Arabic numerals and outputting the result.
5. The search engine method based on special word form information according to claim 4, characterized in that: the date format conversion step comprises, in order:
41) a date format definition step, for defining the date format;
42) a date format conversion judgment step, for judging whether the segmented text information needs date format conversion; if so, it is output to step 43); if not, it is output directly;
43) a date format conversion step, for converting the input date format to the defined date format and outputting the result.
CN200910046475A 2009-02-23 2009-02-23 Search engine method based on special word form information Pending CN101814073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910046475A CN101814073A (en) 2009-02-23 2009-02-23 Search engine method based on special word form information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910046475A CN101814073A (en) 2009-02-23 2009-02-23 Search engine method based on special word form information

Publications (1)

Publication Number Publication Date
CN101814073A true CN101814073A (en) 2010-08-25

Family

ID=42621330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910046475A Pending CN101814073A (en) 2009-02-23 2009-02-23 Search engine method based on special word form information

Country Status (1)

Country Link
CN (1) CN101814073A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855089A (en) * 2011-06-30 2013-01-02 沈阳晨讯希姆通科技有限公司 Mobile phone soft keyboard and date inputting method thereof
CN103885941A (en) * 2012-12-24 2014-06-25 鸿富锦精密工业(深圳)有限公司 Patent application document conversion system and method
CN104679871A (en) * 2015-03-06 2015-06-03 北京语言大学 Chinese text searching method and Chinese text searching device
CN104978314A (en) * 2014-04-01 2015-10-14 深圳市腾讯计算机系统有限公司 Media content recommendation method and device
CN105404615A (en) * 2015-11-05 2016-03-16 腾讯科技(深圳)有限公司 Word retrieval method and apparatus
CN105989057A (en) * 2015-02-06 2016-10-05 北京中搜网络技术股份有限公司 Conversion method of numeral type search string based on string operation
CN106503130A (en) * 2016-10-20 2017-03-15 深圳铂睿智恒科技有限公司 The application searches method of application market, system and application market
CN112199576A (en) * 2020-10-20 2021-01-08 山东浪潮商用系统有限公司 Method and system for realizing Chinese pinyin search

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855089A (en) * 2011-06-30 2013-01-02 沈阳晨讯希姆通科技有限公司 Mobile phone soft keyboard and date inputting method thereof
CN103885941A (en) * 2012-12-24 2014-06-25 鸿富锦精密工业(深圳)有限公司 Patent application document conversion system and method
CN104978314A (en) * 2014-04-01 2015-10-14 深圳市腾讯计算机系统有限公司 Media content recommendation method and device
US10248715B2 (en) 2014-04-01 2019-04-02 Tencent Technology (Shenzhen) Company Limited Media content recommendation method and apparatus
CN104978314B (en) * 2014-04-01 2019-05-14 深圳市腾讯计算机系统有限公司 Media content recommendations method and device
CN105989057A (en) * 2015-02-06 2016-10-05 北京中搜网络技术股份有限公司 Conversion method of numeral type search string based on string operation
CN104679871A (en) * 2015-03-06 2015-06-03 北京语言大学 Chinese text searching method and Chinese text searching device
CN104679871B (en) * 2015-03-06 2018-03-30 北京语言大学 A kind of Chinese language text search method and Chinese language text retrieval device
CN105404615A (en) * 2015-11-05 2016-03-16 腾讯科技(深圳)有限公司 Word retrieval method and apparatus
CN105404615B (en) * 2015-11-05 2020-02-11 腾讯科技(深圳)有限公司 Word retrieval method and device
CN106503130A (en) * 2016-10-20 2017-03-15 深圳铂睿智恒科技有限公司 The application searches method of application market, system and application market
CN112199576A (en) * 2020-10-20 2021-01-08 山东浪潮商用系统有限公司 Method and system for realizing Chinese pinyin search

Similar Documents

Publication Publication Date Title
CN101814073A (en) Search engine method based on special word form information
CN101647020B (en) Searching structured geographical data
US9069857B2 (en) Per-document index for semantic searching
CN101467125B (en) Processing of query terms
JP5389186B2 (en) System and method for matching entities
JP5138046B2 (en) Search system, search method and program
CN102663016A (en) System and method for implementing input information extension on input candidate box on electronic device
CN101452453A (en) Input method web site navigation method and input method system
US10078672B2 (en) Search device, search method, and computer program product
CN102855252B (en) A kind of need-based data retrieval method and device
CN111428494A (en) Intelligent error correction method, device and equipment for proper nouns and storage medium
CN101794307A (en) Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
CN102200974A (en) Unified information retrieval intelligent agent system and method for search engine
CN101706790A (en) Clustering method of WEB objects in search engine
CN101751434A (en) Meta search engine ranking method and Meta search engine
CN103885985A (en) Real-time microblog search method and device
CN102314461A (en) Navigation prompt method and system
CN114064851A (en) Multi-machine retrieval method and system for government office documents
CN102024026B (en) Method and system for processing query terms
CN201421609Y (en) Search engine system based on abnormal character form information
CN101963991A (en) Accurate searching method of picture
CN1786956B (en) Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine
CN102508920B (en) Information retrieval method based on Boosting sorting algorithm
CN102567121B (en) Realize the method and apparatus of converged communication
CN103886093A (en) Method for processing synonyms of electronic commerce search engine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: TRANSMISSION LINE NETWORK TECHNOLOGY (SHANGHAI) CO

Free format text: FORMER OWNER: WEIXU NETWORK TECHNOLOGY (SHANGHAI) CO., LTD.

Effective date: 20140409

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 200003 HUANGPU, SHANGHAI TO: 200241 MINHANG, SHANGHAI

TA01 Transfer of patent application right

Effective date of registration: 20140409

Address after: 200241 Shanghai City, Dongchuan Road, No. 555, floor floor, room f, F, F, F, F, No. 02, Minhang District

Applicant after: WEIXU NETWORK TECHNOLOGY (SHANGHAI) CO., LTD.

Address before: 200003 gate 1305, 6 South Suzhou Road, Shanghai

Applicant before: Weixu Network Technology (Shanghai) Co., Ltd.

C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100825