Search engine method based on special word form information
Technical field
The present invention relates to a text information search engine system, and specifically to a search engine method based on special word form information.
Background
With the growth of the Internet, search engines have become an indispensable tool for retrieving information. Most information on the Internet is presented as text, and because written forms are diverse, text with the same meaning can appear in different forms. These variant forms arise mainly from differences in how people habitually describe information, in input tools, in region, and in abbreviation conventions. Special word forms fall mainly into three kinds: differences in character encoding, differences in language, and differences in format. When a search engine processes text, it usually performs word segmentation on the original information and builds an inverted index file directly from the result. The principle is to map each term produced by segmentation to the file paths or URLs (Uniform Resource Locators) of the texts that contain it. When a user searches, the engine uses the terms in the input phrase to find the corresponding resources and returns them. If the user's input contains one form of a term, information that contains only a variant form of that term cannot be retrieved.
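The mapping from terms to document locations described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the function name `build_inverted_index`, the example URLs, and the sample terms (including the Chinese-numeral spelling of 999) are assumptions chosen for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of locations (paths or URLs) that contain it."""
    index = defaultdict(set)
    for location, terms in docs.items():
        for term in terms:
            index[term].add(location)
    return index

# Hypothetical documents, already segmented into terms and keyed by URL.
docs = {
    "http://example.com/a": ["999", "roses"],
    "http://example.com/b": ["九百九十九", "roses"],
}
index = build_inverted_index(docs)

# Both documents share the term "roses" ...
print(sorted(index.get("roses", set())))
# ... but a query for "999" misses the document that uses Chinese numerals.
print(sorted(index.get("999", set())))
```

Because the lookup is an exact match on the term, the second document is invisible to a query for "999", which is the problem the rest of this description addresses.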
At present, search engines handle variant forms either by treating each variant as an independent term or by running a repeated search with the variant as an extra term. In everyday use, text has many variant forms, which arise mainly from region, user habits, and input tools. The variant forms addressed by the search engine method based on special word form information are simplified and traditional Chinese characters, full-width and half-width characters, Chinese and Arabic numerals, and date formats.
The difference between simplified and traditional Chinese characters is mainly regional. In addition, some input tools can enter both scripts, and some users mix the two out of personal preference. Chinese text on the Internet therefore exists in both simplified and traditional form, which creates a problem: a Chinese query entered in one script may not return the desired results (for example, a search for "agriculture" typed in one script does not match pages written in the other).
Full-width and half-width characters are encoded differently in computer character sets (for example, the full-width "ａ" and the half-width "a" have different character codes). Mixed use of the two encodings is also common on the Internet and mainly reflects individual user habits. Because the codes differ, full-width and half-width characters are indexed as different characters, and at query time the search engine retrieves only exactly matching terms, so a character with the same meaning but a different width cannot be retrieved.
Although Chinese numerals and Arabic numerals each have their own uses, they mean the same thing when describing cardinal numbers, ordinal numbers, and dates (for example, July 1, 1997 written with Chinese numerals and the same date written with Arabic numerals). Depending on the occasion, people use either form (for example, "999 roses" written with Chinese numerals or with Arabic numerals). When searching, however, users often type Arabic numerals directly to reduce typing, so information written with Chinese numerals cannot be retrieved (for example, a search for "999" does not return text in which the number is written in Chinese numerals).
Dates also take many forms. Besides the Chinese-numeral dates described above, there are habitual formats such as "2007-07-01" and "20070701". These differ only in form and express the same meaning to a human reader. People tend to use a standard date format when publishing text but a plain digit string when searching, so the same problem arises: the two forms cannot retrieve each other.
To solve this problem, the original information is adjusted during word segmentation: all variant forms are converted into one designated form (for example, all traditional Chinese characters are converted to simplified characters when the index file is generated during segmentation). Likewise, at query time the query is converted into the form used in the index. Finally, the inverted list for each term is returned, and the search engine system tells the user where the information is located.
Summary of the invention
The object of the present invention is to address the deficiencies of existing text search engines by providing a search engine system that searches text content regardless of the variant form in which the information is expressed. During word segmentation, a separate processor is designed for each kind of special word form information, and this processing logic is embedded in the segmentation process, so that different variant forms yield a unified term after segmentation (for example, the simplified and traditional forms of "agriculture" are both indexed under the same term). The processed terms are then indexed by the search engine system. After indexing, the search engine segments the user's query keywords, passes them through the same processors to obtain terms, retrieves results for those terms, and returns the results to the user.
Specifically, the present invention adopts the following technical solution:
A search engine method based on special word form information comprises steps that run on a client and steps that run on a server, wherein:
The steps that run on the server comprise, in order:
a text information acquisition step, for obtaining text information, which may be entered by a user or extracted from the Internet;
a text segmentation step, for performing word segmentation on the text information obtained in the text information acquisition step;
a conversion step, for converting the text information segmented in the text segmentation step;
an indexing step, for building an inverted index from the output of the conversion step and calculating weights;
an index file library building step, for generating index files from the output of the indexing step.
The steps that run on the client comprise, in order:
a user input step, for accepting the query keywords and query conditions entered by the user;
a text segmentation step, for performing word segmentation on the query keywords obtained in the user input step;
a conversion step, for converting the text information segmented in the text segmentation step;
a query step, for combining the terms output by the conversion step with the query conditions entered by the user, querying the index file library built on the server, and outputting the query results;
a result return step, for returning the query results of the query step.
The conversion steps on the server and on the client each comprise, correspondingly, several or all of the following converters:
a Chinese simplified/traditional conversion step, for converting between simplified and traditional Chinese characters;
a full-width/half-width character conversion step, for converting between full-width and half-width characters;
a Chinese numeral conversion step, for converting numbers written in Chinese numerals into Arabic numerals;
a date format conversion step, for identifying date formats and converting them into a defined unified format.
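One way to picture "several or all" converters is as an ordered chain applied to every term that segmentation produces. The sketch below is an assumption about how such a chain could be wired together, not a structure stated in this description: the `Converter` type, `apply_converters`, and the two deliberately tiny stand-in converters are hypothetical.

```python
from typing import Callable, Iterable, List

# Hypothetical interface: a converter takes one term and returns its unified form.
Converter = Callable[[str], str]

def apply_converters(terms: Iterable[str], converters: List[Converter]) -> List[str]:
    """Pass every term through each enabled converter, in order."""
    unified = []
    for term in terms:
        for convert in converters:
            term = convert(term)
        unified.append(term)
    return unified

def to_half_width_digits(term: str) -> str:
    """Stand-in converter: full-width digits U+FF10..U+FF19 become ASCII digits."""
    return term.translate({0xFF10 + i: ord("0") + i for i in range(10)})

def strip_date_hyphens(term: str) -> str:
    """Stand-in converter: 'YYYY-MM-DD' becomes 'YYYYMMDD'; other terms pass through."""
    if len(term) == 10 and term[4] == "-" and term[7] == "-":
        return term.replace("-", "")
    return term

enabled = [to_half_width_digits, strip_date_hyphens]
print(apply_converters(["２００７-07-01", "search"], enabled))
```

The same `enabled` list would be used on the server when indexing and on the client when querying, so both sides arrive at identical terms.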
Further, the Chinese simplified/traditional conversion step includes a simplified/traditional mapping table, which stores a simplified character set, a traditional character set, and the mapping between them. The step specifically comprises:
11) a simplified/traditional determination step, for determining whether the segmented text information needs simplified/traditional conversion; if so, the text is passed to step 12); if not, it is output directly;
12) a simplified/traditional conversion step, for performing the simplified/traditional conversion and outputting the result.
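Steps 11) and 12) can be sketched with a small lookup table. The table below is a four-character excerpt chosen for illustration, and the direction (traditional to simplified) and all function names are assumptions; a practical table would cover the full character sets and handle characters whose mapping is not one-to-one.

```python
# Illustrative excerpt only; a real mapping table would be far larger.
TRAD_TO_SIMP = {"農": "农", "業": "业", "國": "国", "學": "学"}

def needs_simp_trad_conversion(term: str) -> bool:
    """Step 11): does the term contain any character listed in the mapping table?"""
    return any(ch in TRAD_TO_SIMP for ch in term)

def convert_simp_trad(term: str) -> str:
    """Step 12): replace each traditional character with its simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in term)

def simp_trad_step(term: str) -> str:
    """Convert only when the determination step says so; otherwise output directly."""
    return convert_simp_trad(term) if needs_simp_trad_conversion(term) else term

print(simp_trad_step("農業"))  # traditional input
print(simp_trad_step("农业"))  # simplified input passes through unchanged
```

Both calls produce the same term, so a document written in either script is indexed under one entry.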
Further, the full-width/half-width character conversion step comprises, in order:
21) a full-width/half-width determination step, for determining whether the segmented text information needs full-width/half-width conversion; if so, the text is passed to step 22); if not, it is output directly;
22) a full-width/half-width conversion step, for converting characters between full-width and half-width and outputting the result.
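In Unicode, the full-width forms of the printable ASCII characters occupy U+FF01 to U+FF5E, each exactly 0xFEE0 above its ASCII counterpart, and the ideographic space is U+3000. Assuming the unified form is half-width (the description does not fix the direction), steps 21) and 22) might look like this; the function names are hypothetical.

```python
FULL_WIDTH_START, FULL_WIDTH_END = 0xFF01, 0xFF5E  # full-width '!' through '~'
OFFSET = 0xFEE0                                    # distance to the ASCII code point
IDEOGRAPHIC_SPACE = 0x3000                         # full-width space

def needs_width_conversion(term: str) -> bool:
    """Step 21): does the term contain any full-width character?"""
    return any(
        FULL_WIDTH_START <= ord(ch) <= FULL_WIDTH_END or ord(ch) == IDEOGRAPHIC_SPACE
        for ch in term
    )

def to_half_width(term: str) -> str:
    """Step 22): fold full-width characters to their half-width equivalents."""
    out = []
    for ch in term:
        cp = ord(ch)
        if cp == IDEOGRAPHIC_SPACE:
            out.append(" ")
        elif FULL_WIDTH_START <= cp <= FULL_WIDTH_END:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)
    return "".join(out)

def width_step(term: str) -> str:
    return to_half_width(term) if needs_width_conversion(term) else term

print(width_step("ＡＢＣ１２３"))  # prints ABC123
```

Python's standard `unicodedata.normalize("NFKC", term)` also folds these characters, but it applies other compatibility mappings as well, so the explicit range check is shown here to keep the behavior narrow.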
Further, the Chinese numeral conversion step includes a numeral mapping table, which stores a Chinese numeral character set, the Arabic numerals, and the mapping between Chinese numerals and Arabic numerals. The step specifically comprises:
31) a Chinese numeral determination step, for determining whether the segmented text information needs Chinese numeral conversion; if so, the text is passed to step 32); if not, it is output directly;
32) a Chinese numeral conversion step, for converting between Chinese numerals and Arabic numerals and outputting the result.
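A numeral mapping table and steps 31) and 32) could be sketched as below. The handling shown is an assumption made for illustration: terms without unit characters are read digit by digit (as years usually are), terms with units are read positionally, and values of ten thousand or more are not covered. All names are hypothetical.

```python
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def needs_numeral_conversion(term: str) -> bool:
    """Step 31): is the term made up entirely of Chinese numeral characters?"""
    return bool(term) and all(ch in DIGITS or ch in UNITS for ch in term)

def chinese_to_arabic(term: str) -> str:
    """Step 32): convert a Chinese numeral string to Arabic digits."""
    if not any(ch in UNITS for ch in term):
        # No unit characters: read digit by digit, e.g. a year.
        return "".join(str(DIGITS[ch]) for ch in term)
    total, current = 0, 0
    for ch in term:
        if ch in DIGITS:
            current = DIGITS[ch]
        else:
            # A bare unit such as a leading "十" counts as 1 x unit.
            total += (current or 1) * UNITS[ch]
            current = 0
    return str(total + current)

def numeral_step(term: str) -> str:
    return chinese_to_arabic(term) if needs_numeral_conversion(term) else term

print(numeral_step("九百九十九"))  # prints 999
print(numeral_step("一九九七"))    # prints 1997
```

With this step applied on both sides, a query for "999" and a document that spells the number in Chinese numerals meet at the same term.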
Further, the date format conversion step comprises, in order:
41) a date format definition step, for defining the unified date format;
42) a date format determination step, for determining whether the segmented text information needs date format conversion; if so, the text is passed to step 43); if not, it is output directly;
43) a date format conversion step, for converting the input date into the defined date format and outputting the result.
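Steps 41) to 43) can be illustrated with regular expressions. The unified format `YYYYMMDD` and the three recognized input patterns are assumptions picked to match the "2007-07-01" and "20070701" examples given earlier; the description itself leaves the defined format open. Range checks on month and day are omitted for brevity.

```python
import re

# Step 41): the defined unified format (assumed here to be YYYYMMDD).
UNIFIED_FORMAT = "{year:04d}{month:02d}{day:02d}"

# Input formats this sketch recognizes; a real system would define its own list.
DATE_PATTERNS = [
    re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})$"),
    re.compile(r"^(\d{4})/(\d{1,2})/(\d{1,2})$"),
    re.compile(r"^(\d{4})年(\d{1,2})月(\d{1,2})日$"),
]

def match_date(term: str):
    """Step 42): return (year, month, day) if the term is a known date format."""
    for pattern in DATE_PATTERNS:
        m = pattern.match(term)
        if m:
            return tuple(int(g) for g in m.groups())
    return None

def date_step(term: str) -> str:
    """Step 43): rewrite a recognized date in the unified format; else pass through."""
    parts = match_date(term)
    if parts is None:
        return term
    year, month, day = parts
    return UNIFIED_FORMAT.format(year=year, month=month, day=day)

print(date_step("2007-07-01"))   # prints 20070701
print(date_step("2007年7月1日"))  # prints 20070701
```

A term that is already in the unified form, such as "20070701", matches none of the patterns and is output unchanged.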
The present invention can be widely applied to retrieving text information that contains variant forms: a search entered in one form of a word also returns results that contain its other forms. For example, when text is indexed and when the user enters a query, the simplified/traditional converter converts Chinese characters between simplified and traditional form, so the query results do not depend on whether the indexed information or the user's input is written in simplified or traditional characters. Likewise, the full-width/half-width character conversion step converts characters between full-width and half-width, so the query results do not depend on which width the information or the user's input uses. The Chinese numeral converter converts Chinese numerals, so the query results do not depend on whether the information or the user's input uses Chinese or Arabic numerals. The date format conversion step converts date text, so the query results do not depend on the date format used in the information or in the user's input.
The present invention is further described below with reference to the drawings and an embodiment.
Brief description of the drawings
Fig. 1 is a schematic diagram of an embodiment of the search engine method based on special word form information according to the present invention;
Fig. 2 is a schematic diagram of the Chinese simplified/traditional conversion step in the embodiment of the present invention;
Fig. 3 is a schematic diagram of the full-width/half-width character conversion step in the embodiment of the present invention;
Fig. 4 is a schematic diagram of the Chinese numeral conversion step in the embodiment of the present invention;
Fig. 5 is a schematic diagram of the date format conversion step in the embodiment of the present invention.
Detailed description of the embodiment
As shown in Fig. 1, a search engine method based on special word form information comprises steps that run on a client and steps that run on a server, wherein:
The steps that run on the server comprise, in order:
a text information acquisition step, for obtaining text information, which may be entered by a user or extracted from the Internet;
a text segmentation step, for performing word segmentation on the text information obtained in the text information acquisition step;
a conversion step, for converting the text information segmented in the text segmentation step;
an indexing step, for building an inverted index from the output of the conversion step and calculating weights;
an index file library building step, for generating index files from the output of the indexing step.
The steps that run on the client comprise, in order:
a user input step, for accepting the query keywords and query conditions entered by the user;
a text segmentation step, for performing word segmentation on the query keywords obtained in the user input step;
a conversion step, for converting the text information segmented in the text segmentation step;
a query step, for combining the terms output by the conversion step with the query conditions entered by the user, querying the index file library built on the server, and outputting the query results;
a result return step, for returning the query results of the query step.
The conversion steps on the server and on the client each comprise, correspondingly, several or all of the following converters:
a Chinese simplified/traditional conversion step, for converting between simplified and traditional Chinese characters;
a full-width/half-width character conversion step, for converting between full-width and half-width characters;
a Chinese numeral conversion step, for converting numbers written in Chinese numerals into Arabic numerals;
a date format conversion step, for identifying date formats and converting them into a defined unified format.
As shown in Fig. 2, the Chinese simplified/traditional conversion step includes a simplified/traditional mapping table, which stores a simplified character set, a traditional character set, and the mapping between them. The step specifically comprises:
11) a simplified/traditional determination step, for determining whether the segmented text information needs simplified/traditional conversion; if so, the text is passed to step 12); if not, it is output directly;
12) a simplified/traditional conversion step, for performing the simplified/traditional conversion and outputting the result.
As shown in Fig. 3, the full-width/half-width character conversion step comprises, in order:
21) a full-width/half-width determination step, for determining whether the segmented text information needs full-width/half-width conversion; if so, the text is passed to step 22); if not, it is output directly;
22) a full-width/half-width conversion step, for converting characters between full-width and half-width and outputting the result.
As shown in Fig. 4, the Chinese numeral conversion step includes a numeral mapping table, which stores a Chinese numeral character set, the Arabic numerals, and the mapping between Chinese numerals and Arabic numerals. The step specifically comprises:
31) a Chinese numeral determination step, for determining whether the segmented text information needs Chinese numeral conversion; if so, the text is passed to step 32); if not, it is output directly;
32) a Chinese numeral conversion step, for converting between Chinese numerals and Arabic numerals and outputting the result.
As shown in Fig. 5, the date format conversion step comprises, in order:
41) a date format definition step, for defining the unified date format;
42) a date format determination step, for determining whether the segmented text information needs date format conversion; if so, the text is passed to step 43); if not, it is output directly;
43) a date format conversion step, for converting the input date into the defined date format and outputting the result.
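To tie the embodiment together, the following sketch runs the server-side and client-side sequences end to end. It is a simplified illustration under several assumptions: whitespace splitting stands in for a real word segmenter, a single full-width folding function stands in for the whole conversion step, weight calculation is omitted, and the only query condition supported is that all terms must match. Function names and URLs are hypothetical.

```python
from collections import defaultdict

def segment(text):
    """Stand-in for the text segmentation step: split on whitespace."""
    return text.split()

def convert(term):
    """Stand-in for the conversion step: fold full-width ASCII forms to half-width."""
    return "".join(
        chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in term
    )

def build_index(docs):
    """Server side: segment, convert, then build the inverted index."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in segment(text):
            index[convert(term)].add(url)
    return index

def search(index, query):
    """Client side: segment, convert, then require every term to match."""
    terms = [convert(t) for t in segment(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    "http://example.com/1": "ＡＢＣ search engine",
    "http://example.com/2": "ABC index",
}
index = build_index(docs)
print(sorted(search(index, "ＡＢＣ")))      # both documents
print(sorted(search(index, "ABC search")))  # only the first document
```

Because `convert` is shared by `build_index` and `search`, the full-width and half-width spellings of the same word land on one index entry, whichever form the document or the query uses.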