Embodiment
To make ranking results match users' needs more closely, the embodiments of the invention provide a search-engine-based method and device for ranking search results. Each is briefly outlined below.
In the search result ranking method provided by the embodiments of the invention, after certain settings have been made in advance, the user has entered a search term, and the network resources to be ranked have been determined, the following main steps are carried out (see Figure 1):
S1. Perform word segmentation on the search term entered by the user (this step may also be carried out before the network resources to be ranked are determined).
S2. Look up each segment obtained from the word segmentation in the keyword index to determine the keyword weight of the search term in each network resource to be ranked (network resources include, but are not limited to, web page resources and download resources; this is not repeated below).
S3. Determine the total weight of the search term in each network resource to be ranked.
S4. Rank the network resources to be ranked according to total weight and present them to the user.
Before the user enters a search term, the following setup steps are carried out in advance:
Customizing the keyword dictionary: the customized keyword dictionary uses (word, attribute) pairs as its basic structure. It contains each valid word with its attribute and each invalid word with its attribute. The set of invalid words and the set of valid words are mutually exclusive, and an invalid word literally contains the characters of some valid word. The attribute of a word is represented as a character-type number, each bit of which represents one attribute of the word.
Extracting keywords: segment the subject information of each network resource according to the keyword dictionary, using the maximum matching principle; then filter the resulting segments by their attributes to extract the keywords of the subject information of each network resource. The subject information may be, for example, the title of a web page, information extracted from the content of the web page, or the information describing a download resource.
Building the keyword index: segment each keyword of the subject information of each network resource using a basic word-segmentation dictionary, and build a keyword index from each segment of a keyword to the network resources.
Building the resource index: segment the subject information of the network resources according to the basic word-segmentation dictionary, and build a resource index from each segment of the subject information to the network resources.
Configuring weights: configure a segment weight for each segment of a keyword according to the ratio of the segment's word length to the keyword's word length. Alternatively, in addition to configuring the segment weights in this way, configure a static weight for each network resource according to information about the resource (including, but not limited to, its number of views, citation status, number of downloads, and/or file format; this is not repeated below). The configured weights may be recorded in the resource index and the keyword index described above.
Once the weights are configured, in S2 each segment of the search term can be looked up in the keyword index to determine its segment weight in the keywords of the subject information of each network resource to be ranked. The segment weights found for the subject information of the same resource are added together to give the keyword weight of the search term in that resource.
In S3, the keyword weight of the search term in the current network resource may be taken directly as the total weight of that resource. Alternatively, the total weight may combine the static weight configured from the resource's information with the keyword weight, or combine the keyword weight with other relevant weights.
After the user enters a search term, the network resources to be ranked are determined as follows: each segment of the search term is looked up in the resource index to determine the set of network resources associated with that segment, and the intersection of these sets is taken as the network resources to be ranked.
The embodiments of the invention also provide a search-engine-based device for ranking search results. As shown in Figure 2, it comprises a segmentation unit, a keyword weight determining unit, a total weight determining unit, a ranking unit, and a display unit.
The segmentation unit performs word segmentation on the search term entered by the user.
The keyword weight determining unit looks up each segment obtained from the word segmentation in the keyword index to determine the keyword weight of the search term in each network resource to be ranked.
The total weight determining unit determines the total weight of the search term in each network resource to be ranked.
The ranking unit ranks the network resources to be ranked according to total weight.
The display unit presents the ranking results to the user.
Further, to supply the information required by the above units, the device also comprises, as shown in Figure 3: a customization unit, an extraction unit, a keyword index building unit, a resource index building unit, a determining unit, and a configuration unit.
The customization unit customizes the keyword dictionary using (word, attribute) pairs as its basic structure. The customized keyword dictionary contains each valid word with its attribute and each invalid word with its attribute.
The extraction unit segments the subject information of each network resource according to the keyword dictionary, using the maximum matching principle, and filters the resulting segments by their attributes to extract the keywords of the subject information of each network resource.
The keyword index building unit segments each keyword of the subject information of each network resource according to the basic word-segmentation dictionary and builds a keyword index from each segment of a keyword to the network resources, for use by the keyword weight determining unit.
The resource index building unit segments the subject information of the network resources according to the basic word-segmentation dictionary and builds a resource index from each segment of the subject information to the network resources.
The determining unit looks up each segment of the search term in the resource index to determine the set of network resources associated with that segment, and takes the intersection of these sets as the network resources to be ranked.
The configuration unit configures a segment weight for each segment of a keyword according to the ratio of the segment's word length to the keyword's word length. Alternatively, in addition to configuring the segment weights in this way, it configures a static weight for each network resource according to information about the resource.
After the configuration unit has configured the weights, the keyword weight determining unit can look up each segment of the search term in the keyword index to determine its segment weight in the keywords of the subject information of each network resource to be ranked. It adds together the segment weights found for the subject information of the same resource to give the keyword weight of the search term in that resource.
The total weight determining unit may take the keyword weight of the search term in the current network resource directly as the total weight of that resource. Alternatively, it may combine the static weight configured from the resource's information with the keyword weight, or combine the keyword weight with other relevant weights.
So far, the method for the embodiment of the invention and the general introduction of device are finished.Below by 1 embodiment the present invention is described in further detail.
Embodiment 1, present embodiment comprise the step that step is set, determines to treat the sorting network resource, step, the ordered steps of calculating weight, and rendering step.Step wherein is set to be comprised: the customization substep of keyword dictionary, the extraction substep of keyword, set up keyword index substep, set up the substep of resource index, and weight configuration substep.
101. Customizing the keyword dictionary.
A keyword is a term that identifies the subject information of a network resource (a web page resource or a download resource). For example, users of a search engine often enter phrases such as a software name plus "download" or a movie name plus "high definition"; the software name and the movie name can then be defined as the keywords of these phrases.
To extract the keywords of the subject information of network resources effectively, a keyword dictionary must first be built. Statistics on users' everyday search habits show that, in film and television search engines, music search engines, and general-purpose search engines, users often enter film and television titles, song titles, singers' names, and similar terms as search terms. The keyword dictionary can therefore be built from information such as currently popular films, TV series, songs, singers, and actors. The basic structure of the dictionary is (word, attribute), where the attribute describes the validity and category of the word, for example whether it is valid and whether it is a movie name, a song title, a software name, and so on.
This embodiment describes the attribute in the following way (without being limited to it): the attribute is described bit by bit using a one-byte character-type number with 8 bits in total. Each bit represents one attribute of the word: 1 means the word has the attribute, and 0 means it does not. For example, "Hero" can be both a movie name and a TV series name, so its attribute can be expressed as 11100000. The meaning of each bit is shown in Table 1:
| Bit | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| Attribute | Validity | Film | TV series | Song title | Singer | Director | Actor | Software name |
Table 1
The most significant bit (bit 7 in Table 1) is defined as follows: it records whether a word in the keyword dictionary is valid. The set of invalid words and the set of valid words are mutually exclusive. A word A in the invalid set may literally contain a word B in the valid set. For example, if a movie name such as "East" is a valid word, then longer words that merely contain it, such as "East Gate", are invalid words. The preferred principle for choosing invalid words is: an invalid word literally contains some valid word but does not itself belong to the valid set, and it is not a movie name, song title, or other term that could serve as a keyword.
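The bit layout of Table 1 can be sketched as a set of flags. This is a minimal illustration, not the implementation of the embodiment: the class name `WordAttr`, its member names, and the helper `parse_attr` are assumptions introduced here; only the bit positions and the "Hero" value 11100000 come from the text.

```python
from enum import IntFlag


class WordAttr(IntFlag):
    """Bit positions follow Table 1; the member names are illustrative."""
    VALID = 1 << 7
    FILM = 1 << 6
    TV_SERIES = 1 << 5
    SONG_TITLE = 1 << 4
    SINGER = 1 << 3
    DIRECTOR = 1 << 2
    ACTOR = 1 << 1
    SOFTWARE = 1 << 0


def parse_attr(bits: str) -> WordAttr:
    """Parse an 8-digit bit string such as '11100000' (spaces are ignored)."""
    cleaned = bits.replace(" ", "")
    if len(cleaned) != 8 or set(cleaned) - {"0", "1"}:
        raise ValueError(f"expected 8 binary digits, got {bits!r}")
    return WordAttr(int(cleaned, 2))


# "Hero" is valid and is both a movie name and a TV series name.
hero = parse_attr("11100000")
is_valid = bool(hero & WordAttr.VALID)
is_film = bool(hero & WordAttr.FILM)
is_song = bool(hero & WordAttr.SONG_TITLE)
```

Testing a single bit with `&` is all that the later filtering steps need, so a plain integer would work equally well; the enum only makes the bit names explicit.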
102. Extracting keywords.
For each network resource in the search engine database, the corresponding keywords must be extracted from its subject information.
First, the subject information of the network resource is segmented according to the keyword dictionary, using the maximum matching principle, and the resulting segments are filtered by their attributes. Words whose attribute is invalid are removed and words whose attribute is valid are retained; the retained words are the keywords of the subject information of the network resource.
For example, suppose the keyword dictionary contains the following words:
East 1100 0000
East 0000 0000
High Road to China 1010 0000
Northeast 0000 0000
……
The keywords extracted from the following web page titles are then:
Behind-the-scenes footage of the film East ------ East
High Road to China, high-definition version ----- High Road to China
The road in the Northeast ----- (none)
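The extraction step can be sketched as forward maximum matching followed by a validity filter. This is a simplified illustration under stated assumptions: it matches character by character over lowercase English strings, the dictionary entries and the invalid entry "east gate" are hypothetical stand-ins modeled on the example above, and the function names are not from the original.

```python
VALID_BIT = 0b1000_0000  # bit 7 of the attribute byte (Table 1)

# Hypothetical dictionary: word -> attribute byte.
dictionary = {
    "east": 0b1100_0000,                # valid, film
    "east gate": 0b0000_0000,           # invalid, contains "east"
    "high road to china": 0b1010_0000,  # valid, TV series
    "northeast": 0b0000_0000,           # invalid, contains "east"
}


def max_match(text: str, words: dict[str, int]) -> list[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    longest = max((len(w) for w in words), default=0)
    found, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in words:
                found.append(candidate)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return found


def extract_keywords(text: str, words: dict[str, int]) -> list[str]:
    """Keep only the matched words whose validity bit is set."""
    return [w for w in max_match(text, words) if words[w] & VALID_BIT]


k1 = extract_keywords("film east titbits", dictionary)
k2 = extract_keywords("high road to china hd version", dictionary)
k3 = extract_keywords("the path in northeast", dictionary)
```

In the third title the longer invalid entry "northeast" is matched first and then discarded, so the valid word "east" inside it is never extracted. This is the purpose of keeping invalid words in the dictionary.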
For a vertical search engine, such as a film and television search engine, the final keywords can be further filtered according to other attributes of the extracted keywords. For example, the keywords extracted from the web page title "Dragon Tiger Gate, starring Zhen Zidan" are "Dragon Tiger Gate" and "Zhen Zidan"; however, "Zhen Zidan" is not a film or television term but a person's name, so it should be filtered out. The filtering rule can be chosen according to the specific search category of the search engine.
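The category filter for a vertical search engine can be sketched as a bit-mask test on the attribute byte. The attribute values assigned to the two words below are assumptions made for illustration (the text gives only the bit layout of Table 1), and `filter_by_category` is a hypothetical helper name.

```python
# Bit positions from Table 1.
VALID = 0b1000_0000
FILM = 0b0100_0000
TV_SERIES = 0b0010_0000
ACTOR = 0b0000_0010

# Hypothetical attribute bytes for the two extracted keywords.
dictionary = {
    "Dragon Tiger Gate": VALID | FILM,
    "Zhen Zidan": VALID | ACTOR,
}


def filter_by_category(keywords, words, category_mask):
    """Keep keywords whose attribute byte shares at least one bit with category_mask."""
    return [w for w in keywords if words.get(w, 0) & category_mask]


# A film and television search engine keeps film and TV series titles only.
film_keywords = filter_by_category(
    ["Dragon Tiger Gate", "Zhen Zidan"], dictionary, FILM | TV_SERIES
)
```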
103. Building the keyword index.
Each keyword of the subject information of each network resource is segmented using a basic word-segmentation dictionary (other dictionaries may also be used), and a keyword index is built from each segment of a keyword to the network resources.
For example, suppose the subject information of a group of network resources is as follows:
Doc1: Ineffable Secret complete series Chinese subtitle;
Doc2: Ineffable Secret complete series;
Doc3: Iron Triangle DVD Chinese subtitle;
Doc4: Iron Triangle complete series;
Doc5: Iron Triangle (starring Ren Dahua);
Doc6: Secret complete series;
Their keywords are, respectively:
Doc1: Ineffable Secret;
Doc2: Ineffable Secret;
Doc3: Iron Triangle;
Doc4: Iron Triangle;
Doc5: Iron Triangle;
Doc6: Secret.
Segmenting each keyword yields the following segments: "cannot", "say", "[particle]", "secret", "iron triangle".
The keyword index is then built as follows:
"cannot" is associated with Doc1 and Doc2; "say" is associated with Doc1 and Doc2; "[particle]" is associated with Doc1 and Doc2; "secret" is associated with Doc1, Doc2, and Doc6; "iron triangle" is associated with Doc3, Doc4, and Doc5.
104. Building the resource index (this step and the building of the keyword index may be performed in either order).
The subject information of the network resources is segmented according to the basic word-segmentation dictionary (other dictionaries may also be used), and a resource index is built from each segment of the subject information to the network resources.
For example, suppose the subject information of a group of network resources is as follows:
Doc1: Ineffable Secret complete series Chinese subtitle;
Doc2: Ineffable Secret complete series;
Doc3: Iron Triangle DVD Chinese subtitle;
Doc4: Iron Triangle complete series;
Doc5: Iron Triangle (starring Ren Dahua);
Doc6: Secret complete series;
After word segmentation, the resource index is built as follows:
"cannot" is associated with Doc1, Doc2; "say" is associated with Doc1, Doc2; "[particle]" is associated with Doc1, Doc2; "secret" is associated with Doc1, Doc2, Doc6; "complete series" is associated with Doc1, Doc2, Doc4, Doc6; "Chinese" is associated with Doc1, Doc3; "subtitle" is associated with Doc1, Doc3; "iron triangle" is associated with Doc3, Doc4, Doc5; "DVD" is associated with Doc3; "starring" is associated with Doc5; "Ren Dahua" is associated with Doc5.
105. Configuring weights.
Weight configuration has two parts: configuring the static weight of each network resource, and configuring the weight of each segment of each keyword.
The static weight of a web page resource is determined by information such as the page's number of views, its source, and how it is cited by other pages. The static weight of a download resource is determined by information such as its number of downloads, file size, and file format. For example, for a download resource docid1, the static weight W1 of the resource can be determined from the number of times docid1 has been downloaded, the size of docid1, and similar information.
The weight of each segment of a keyword is configured as follows. First, the keyword is segmented according to the basic word-segmentation dictionary (other dictionaries may also be used). For example, the keyword "Ineffable Secret" is split into four words: "cannot", "say", "[particle]", and "secret". Next, assume that the weight of each keyword is weight = 1. Let the weight of word1 "cannot" be W11, that of word2 "say" be W21, that of word3 "[particle]" be W31, and that of word4 "secret" be W41. Then W11 = W41 = 1/3 and W21 = W31 = 1/6, so that the four weights sum to 1. That is, each segment weight is the ratio of the segment's word length to the keyword's word length: in the original-language keyword, "cannot" and "secret" each account for two of the six characters, and the other two segments for one character each.
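The length-ratio rule can be checked with a few lines of arithmetic. In this sketch the character lengths (2, 1, 1, 2) are an assumption about the original-language keyword, chosen to be consistent with the stated weight of 1/3 for "cannot" and "secret"; the function name `segment_weights` is illustrative. Under the ratio rule these lengths give 1/3 for each two-character segment and 1/6 for each one-character segment.

```python
from fractions import Fraction


def segment_weights(segments, keyword_weight=Fraction(1)):
    """segments: list of (segment, length). Weight = keyword_weight * length / total length."""
    total = sum(length for _, length in segments)
    return {seg: keyword_weight * Fraction(length, total) for seg, length in segments}


# Assumed character lengths of the four segments of the example keyword.
weights = segment_weights(
    [("cannot", 2), ("say", 1), ("[particle]", 1), ("secret", 2)]
)
```

Because every weight is a share of the keyword's length, the segment weights of one keyword always sum to the keyword weight.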
The configured static weights and the weights of the segments of each keyword can be added to the resource index and the keyword index described above. As shown in Figure 4, in a specific implementation the static weights of all network resources are recorded together, indexed by the docid of each resource. Word1, Word2, ..., Wordn each record the segment weight of that word in the keywords of the subject information of the network resources, indexed by the docid of the resource to whose subject information the keyword belongs.
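One possible in-memory layout for the records described above is a table of static weights keyed by docid plus, for each word, a mapping from docid to segment weight. The sketch below is an assumption about how Figure 4 might be modeled, not a description of it; the static weight values and segment lengths are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Keyword of each resource as (segment, length) pairs; lengths are assumed.
keywords = {
    "Doc1": [("cannot", 2), ("say", 1), ("[particle]", 1), ("secret", 2)],
    "Doc2": [("cannot", 2), ("say", 1), ("[particle]", 1), ("secret", 2)],
    "Doc3": [("iron triangle", 3)],
    "Doc6": [("secret", 2)],
}

# Static weights recorded together and indexed by docid (hypothetical values).
static_weight = {"Doc1": 0.4, "Doc2": 0.7, "Doc3": 0.5, "Doc6": 0.2}


def build_weighted_keyword_index(keywords_by_doc):
    """Map each segment to {docid: segment weight}, weight = length / keyword length."""
    index = defaultdict(dict)
    for doc_id, segments in keywords_by_doc.items():
        total = sum(length for _, length in segments)
        for segment, length in segments:
            index[segment][doc_id] = Fraction(length, total)
    return dict(index)


keyword_index = build_weighted_keyword_index(keywords)
```

Here "secret" carries weight 1/3 for Doc1 and Doc2, where it is one part of a longer keyword, and weight 1 for Doc6, where it is the whole keyword.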
106. Determining the network resources to be ranked.
As shown in Figure 5, when the user enters a word as a search term, the search term is first segmented using the basic word-segmentation dictionary, giving the segment sequence word1, word2, ..., wordn. Then, in the resource index shown in Figure 4, the intersection of the docid sequences corresponding to the segments wordk (k = 1, 2, ..., n) is found, for example docid2, docid4, docid5, and so on. The network resources corresponding to the docids in this intersection are taken as the network resources to be ranked.
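The intersection step can be sketched with Python sets. The index entries below are adapted from the resource index example in step 104, and the function name `candidates` is illustrative; segmenting the search term itself is assumed to have been done already.

```python
# Part of the resource index from step 104: segment -> set of docids.
resource_index = {
    "secret": {"Doc1", "Doc2", "Doc6"},
    "complete series": {"Doc1", "Doc2", "Doc4", "Doc6"},
    "iron triangle": {"Doc3", "Doc4", "Doc5"},
    "subtitle": {"Doc1", "Doc3"},
}


def candidates(query_segments, index):
    """Return the docids that appear under every segment of the search term."""
    doc_sets = [index.get(segment, set()) for segment in query_segments]
    return set.intersection(*doc_sets) if doc_sets else set()


c1 = candidates(["secret", "complete series"], resource_index)
c2 = candidates(["iron triangle", "subtitle"], resource_index)
c3 = candidates(["secret", "soundtrack"], resource_index)
```

A segment that is absent from the index contributes an empty set, so the whole intersection is empty, as in the third query.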
107. Calculating weights.
The total weight of each network resource to be ranked is calculated; docid2 is used as an example below.
As shown in Figure 6, the segment weights of word1, word2, ..., wordn in the subject information of the network resource corresponding to docid2 are looked up in the keyword index (see Figure 4). The segment weights W12, W22, ..., Wn2 are retrieved and added together to give the keyword weight of the search term in the subject information of that resource, i.e. Wk(docid2) = W12 + W22 + ... + Wn2. If the docids associated with some wordk do not include docid2, the corresponding weight is Wk2 = 0; that is, the word is not a segment of any keyword of the subject information of the network resource corresponding to docid2.
The static weight Ws(docid) of the network resource corresponding to docid2 is then retrieved from the resource index shown in Figure 4.
Finally, the total weight W(docid) of the network resource corresponding to docid2 is calculated. The proportions that Ws(docid) and Wk(docid) contribute to W(docid) can be chosen according to the specific situation. For example, if Ws(docid) contributes q1 and Wk(docid) contributes q2, then W(docid) = q1*Ws(docid) + q2*Wk(docid).
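The two calculations above, summing the segment weights and then combining the result with the static weight, can be sketched as follows. The index contents, the static weights, and the values chosen for q1 and q2 are all hypothetical; exact fractions are used only so that the arithmetic is easy to verify.

```python
from fractions import Fraction

# segment -> {docid: segment weight}; hypothetical contents.
keyword_index = {
    "cannot": {"docid2": Fraction(1, 3)},
    "say": {"docid2": Fraction(1, 6)},
    "[particle]": {"docid2": Fraction(1, 6)},
    "secret": {"docid2": Fraction(1, 3), "docid6": Fraction(1)},
}
static_weight = {"docid2": Fraction(1, 2), "docid6": Fraction(1, 5)}  # hypothetical Ws


def keyword_weight(query_segments, doc_id, index):
    """Wk(docid): sum of segment weights; a segment absent for doc_id contributes 0."""
    return sum(index.get(s, {}).get(doc_id, 0) for s in query_segments)


def total_weight(doc_id, query_segments, index, static, q1, q2):
    """W(docid) = q1 * Ws(docid) + q2 * Wk(docid)."""
    return q1 * static[doc_id] + q2 * keyword_weight(query_segments, doc_id, index)


query = ["cannot", "secret", "trailer"]  # "trailer" is not in the keyword index
wk = keyword_weight(query, "docid2", keyword_index)
w = total_weight("docid2", query, keyword_index, static_weight,
                 q1=Fraction(1, 4), q2=Fraction(3, 4))
```

With these values Wk(docid2) = 1/3 + 1/3 + 0 = 2/3 and W(docid2) = 1/4 * 1/2 + 3/4 * 2/3 = 5/8.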
108. Ranking.
After the total weight of each network resource to be ranked has been calculated, the resources are ranked in descending order of total weight.
Ranking search results in this way yields more satisfactory results. For example, suppose a user searches for "secret trailer" and the search results include web page title 1, "Secret trailer", and web page title 2, "Ineffable Secret trailer". The weight of "Secret trailer" will then be greater than that of "Ineffable Secret trailer". This is because the keyword of title 1 is "Secret" and the keyword of title 2 is "Ineffable Secret", while "trailer" is not a valid keyword. When the keywords are segmented, "Ineffable Secret" is split into four words: "cannot", "say", "[particle]", and "secret". In the keyword index, the weight of "secret" in the keyword of title 1 is therefore weight, whereas its weight in the keyword of title 2 is weight/3.
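The "secret trailer" example can be traced in a short sketch. It assumes that the total weight is simply the keyword weight (one of the options described for S3) and that weight = 1; the identifiers `title1`, `title2`, and `rank` are illustrative.

```python
from fractions import Fraction

# segment -> {title: segment weight}, with weight = 1 for each keyword.
keyword_index = {
    "secret": {"title1": Fraction(1), "title2": Fraction(1, 3)},
    "cannot": {"title2": Fraction(1, 3)},
    "say": {"title2": Fraction(1, 6)},
    "[particle]": {"title2": Fraction(1, 6)},
}


def rank(query_segments, doc_ids, index):
    """Score each resource by its keyword weight and sort in descending order."""
    def wk(doc_id):
        return sum(index.get(s, {}).get(doc_id, 0) for s in query_segments)

    scored = [(doc_id, wk(doc_id)) for doc_id in doc_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# "trailer" is not a keyword segment, so it contributes nothing to either title.
ranking = rank(["secret", "trailer"], ["title2", "title1"], keyword_index)
```

Title 1 scores 1 and title 2 scores 1/3, so the page whose keyword matches the search term exactly is placed first.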
109. Presenting the ranking results to the user.
The network resource with the highest total weight is placed first, so that the ranking results match the user's needs more closely.
As Embodiment 1 shows, q1 and q2 are adjustable. In special cases, the way keywords are extracted can make the results too uniform. When the user enters a single word that happens to be a movie name, for example "East", many results may have "East" as their keyword. An entire results page may then consist of items about the film "East", which may differ from what the user actually wants. To handle such cases, q2 can be lowered and q1 raised.
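The effect of adjusting q1 and q2 can be illustrated with made-up numbers. In this sketch resources A and B match the keyword exactly but have low static weights, while resource C matches only partially but has a high static weight; every value is hypothetical.

```python
# docid -> (Ws, Wk); all values are hypothetical.
docs = {
    "A": (0.2, 1.0),
    "B": (0.3, 1.0),
    "C": (0.9, 0.5),
}


def order(resources, q1, q2):
    """Sort docids by W = q1 * Ws + q2 * Wk, highest first."""
    return sorted(
        resources,
        key=lambda d: q1 * resources[d][0] + q2 * resources[d][1],
        reverse=True,
    )


keyword_heavy = order(docs, q1=0.2, q2=0.8)  # exact keyword matches dominate
static_heavy = order(docs, q1=0.8, q2=0.2)   # static weight dominates
```

With q2 high, the exact keyword matches A and B occupy the top positions; lowering q2 and raising q1 lets the resource with the higher static weight, C, move ahead, which diversifies the first page of results.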
In summary, in the embodiments of the invention, word segmentation is performed on the search term entered by the user; each resulting segment is looked up in the keyword index to determine the keyword weight of the search term in each network resource to be ranked; and the total weight of the search term in each network resource to be ranked is determined. Because the total weight takes into account factors such as how well the search term matches the keywords, ranking the network resources by total weight and presenting them to the user matches the user's needs more closely.
Further, the embodiments of the invention provide specific implementations of the setup step, the step of determining the network resources to be ranked, the weight calculation step, the ranking step, and the presentation step. The setup step comprises the following sub-steps: customizing the keyword dictionary, extracting keywords, building the keyword index, building the resource index, and configuring weights. These implementations provide better support for the invention.
Further, q1 and q2 in Embodiment 1 are adjustable, so they can be tuned to the specific situation to satisfy users' various needs.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.