CN102073684B - Method and device for excavating search log and page search method and device - Google Patents

Method and device for excavating search log and page search method and device

Info

Publication number
CN102073684B
Authority
CN
China
Prior art keywords
query
word
ageing
threshold value
attribute
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010600713.3A
Other languages
Chinese (zh)
Other versions
CN102073684A (en)
Inventor
辜斯缪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd

Abstract

The invention provides a search log mining method, a timeliness requirement identification method, and corresponding devices. Mining the search log yields a timeliness probability for each type that queries fall into, and this probability reflects the timeliness requirement of a query. The page search method can therefore identify whether a query entered by a user has a timeliness requirement and, if it does, optimize the corresponding search results by raising the ranking weight of the time attribute in the search results. The user can then find the required pages in the search results quickly and accurately, and the user's timeliness requirement for the search results is met.

Description

Search log mining method, timeliness requirement identification method, and corresponding devices
[Technical Field]
The invention belongs to the field of Internet technology and specifically relates to a search log mining method, a timeliness requirement identification method, and corresponding devices.
[Background Art]
With the development of Internet technology and the continuous growth of information, people rely more and more on information from the network, and search engines have become an important tool for obtaining it. After a user enters a search term (query), a search engine typically includes the pages that contain the search term in the search results returned to the user.
However, existing search technology cannot identify the timeliness requirement of a query entered by a user. For example, a user may want information about an event that has just happened, but the search engine does not understand this timeliness requirement. The search results it returns are based only on past search history and are sorted according to preset weights for each attribute, so the user may be unable to find the required pages quickly and accurately. For instance, a user who wants information about an explosion that has just occurred in Hebei enters the query "Hebei explosion". Because the event has only just happened, few network resources about it exist yet, and the pages about the recent explosion may be buried among a large number of pages about historical events related to explosions in Hebei. The user therefore cannot find the required pages quickly and accurately in the search results.
[Summary of the Invention]
The invention provides a search log mining method, a timeliness requirement identification method, and corresponding devices, so that the timeliness requirement of a user's query can be identified, providing a basis for meeting the user's timeliness requirement for search results.
The specific technical solution is as follows:
A search log mining method, comprising: performing step A1 and step C1, respectively, on the search terms (queries) captured from a search log:
A1. Perform word segmentation on the captured queries, then perform step B1;
B1. Annotate each word obtained from word segmentation according to its attribute and, according to the annotation results, take a combination of the words in the same query, a combination of the attributes of the words, or a combination of words and attributes of words as a generalized type, then go to step D1, wherein the distribution probability of the generalized type in the search log exceeds a preset type distribution probability threshold;
C1. From the captured queries, screen out the queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages published within a most recent third time period exceeds a preset third ratio threshold, to form a timely query set, with the other queries forming a non-timely query set; or screen out the queries for which the pages published within a most recent fourth time period account for a proportion of the corresponding search results that exceeds a preset fourth ratio threshold, to form a timely query set, with the other queries forming a non-timely query set; or screen out the queries for which the click rate of the corresponding search results exceeds a preset click-rate burst threshold, to form a timely query set, with the other queries forming a non-timely query set; then perform step D1;
D1. Count the distribution of each type obtained in step B1 across the timely query set and the non-timely query set screened in step C1, use the statistics to calculate the timeliness probability corresponding to each type, and store the correspondence between each type and its timeliness probability in a timeliness probability table.
In step B1, the attribute of each word is identified as follows: a part-of-speech statistics table is established in advance according to the distribution probability of words over different attributes; each word obtained from word segmentation is looked up in the part-of-speech statistics table, and the attribute with the highest distribution probability for that word is determined.
Specifically, the capture strategy used to capture queries from the search log comprises one or any combination of the following strategies:
Capture strategy 1: capture the queries for which, among the pages the user clicked in the corresponding search results, the pages published within a most recent first time period account for a proportion of all pages clicked by the user that exceeds a preset first ratio threshold;
Capture strategy 2: capture the queries for which the pages published within a most recent second time period account for a proportion of the corresponding search results that exceeds a preset second ratio threshold;
Capture strategy 3: capture all queries for which the pages the user clicked in the corresponding search results include a page published within a most recent period of time.
If capture strategy 1 is used, then either the third time period is as long as the first time period and the third ratio threshold is greater than the first ratio threshold; or the third time period is shorter than the first time period and the third ratio threshold equals the first ratio threshold; or the third time period is shorter than the first time period and the third ratio threshold is greater than the first ratio threshold;
If capture strategy 2 is used, then either the fourth time period is as long as the second time period and the fourth ratio threshold is greater than the second ratio threshold; or the fourth time period is shorter than the second time period and the fourth ratio threshold equals the second ratio threshold; or the fourth time period is shorter than the second time period and the fourth ratio threshold is greater than the second ratio threshold.
A timeliness requirement identification method, comprising:
A2. Perform word segmentation on the search term (query) entered by a user;
B2. Annotate each word obtained from word segmentation according to its attribute and, according to the annotation results, take a combination of the words in the same query, a combination of the attributes of the words, or a combination of words and attributes of words as a generalized type, wherein the distribution probability of the generalized type in the search log exceeds a preset type distribution probability threshold;
C2. Look up the timeliness probability table formed by the above search log mining method to determine the timeliness probability corresponding to each type generalized in step B2;
D2. If the maximum of the timeliness probabilities determined in step C2 exceeds a preset timeliness probability threshold, determine that the query has a timeliness requirement.
In step B2, the attribute of each word is identified as follows: a part-of-speech statistics table is established in advance according to the distribution probability of words over different attributes; each word obtained from word segmentation is looked up in the part-of-speech statistics table, and the attribute with the highest distribution probability for that word is determined.
Further, after step D2, the method also comprises:
E2. Raise the ranking weight of the time attribute in the search results corresponding to the query.
Step E2 is specifically: raising the ranking weight of the time attribute in the search results corresponding to the query to a set weight value; or
raising the ranking weight of the time attribute in the search results corresponding to the query by a set step size.
A search log mining device, comprising: a capture unit, a first word segmentation unit, a first type determination unit, a screening unit, and a probability calculation unit;
The capture unit is configured to capture search terms (queries) from a search log;
The first word segmentation unit is configured to perform word segmentation on the queries captured by the capture unit;
The first type determination unit comprises: a first annotation subunit, configured to annotate each word obtained from the word segmentation according to its attribute; and a first generalization subunit, configured to take, according to the annotation results of the first annotation subunit, a combination of the words in the same query, a combination of the attributes of the words, or a combination of words and attributes of words as a generalized type, wherein the distribution probability of the generalized type in the search log exceeds a preset type distribution probability threshold;
The screening unit is configured to, from the queries captured by the capture unit, screen out the queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages published within a most recent third time period exceeds a preset third ratio threshold, to form a timely query set, with the other queries forming a non-timely query set; or screen out the queries for which the pages published within a most recent fourth time period account for a proportion of the corresponding search results that exceeds a preset fourth ratio threshold, to form a timely query set, with the other queries forming a non-timely query set; or screen out the queries for which the click rate of the corresponding search results exceeds a preset click-rate burst threshold, to form a timely query set, with the other queries forming a non-timely query set;
The probability calculation unit is configured to count the distribution of the types generalized by the first type determination unit across the timely query set and the non-timely query set screened by the screening unit, use the statistics to calculate the timeliness probability corresponding to each type, and store the correspondence between each type and its timeliness probability in a timeliness probability table.
Further, the first type determination unit also comprises a first attribute identification subunit, configured to look up each word obtained from the word segmentation in a part-of-speech statistics table and determine the attribute with the highest distribution probability for each word, wherein the part-of-speech statistics table is established in advance according to the distribution probability of words over different attributes.
Specifically, the capture strategy used by the capture unit comprises one or any combination of the following strategies:
Capture strategy 1: capture the queries for which, among the pages the user clicked in the corresponding search results, the pages published within a most recent first time period account for a proportion of all pages clicked by the user that exceeds a preset first ratio threshold;
Capture strategy 2: capture the queries for which the pages published within a most recent second time period account for a proportion of the corresponding search results that exceeds a preset second ratio threshold;
Capture strategy 3: capture all queries for which the pages the user clicked in the corresponding search results include a page published within a most recent period of time.
If the capture unit uses capture strategy 1, then either the third time period is as long as the first time period and the third ratio threshold is greater than the first ratio threshold; or the third time period is shorter than the first time period and the third ratio threshold equals the first ratio threshold; or the third time period is shorter than the first time period and the third ratio threshold is greater than the first ratio threshold;
If the capture unit uses capture strategy 2, then either the fourth time period is as long as the second time period and the fourth ratio threshold is greater than the second ratio threshold; or the fourth time period is shorter than the second time period and the fourth ratio threshold equals the second ratio threshold; or the fourth time period is shorter than the second time period and the fourth ratio threshold is greater than the second ratio threshold.
A timeliness requirement identification device, comprising: a second word segmentation unit, a second type determination unit, a lookup unit, and a timeliness determination unit;
The second word segmentation unit is configured to perform word segmentation on the search term (query) entered by a user;
The second type determination unit comprises: a second annotation subunit, configured to annotate each word obtained from the word segmentation according to its attribute; and a second generalization subunit, configured to take, according to the annotation results of the second annotation subunit, a combination of the words in the same query, a combination of the attributes of the words, or a combination of words and attributes of words as a generalized type, wherein the distribution probability of the generalized type in the search log exceeds a preset type distribution probability threshold;
The lookup unit is configured to look up the timeliness probability table formed by the above mining device and determine the timeliness probability corresponding to each type generalized by the second type determination unit;
The timeliness determination unit is configured to determine that the query has a timeliness requirement when the maximum of the timeliness probabilities determined by the lookup unit exceeds a preset timeliness probability threshold.
Further, the second type determination unit also comprises a second attribute identification subunit, configured to look up each word obtained from the word segmentation in a part-of-speech statistics table and determine the attribute with the highest distribution probability for each word, wherein the part-of-speech statistics table is established in advance according to the distribution probability of words over different attributes.
Preferably, the timeliness requirement identification device may further comprise:
A search optimization unit, configured to raise the ranking weight of the time attribute in the search results corresponding to the query when the timeliness determination unit determines that the query has a timeliness requirement.
The search optimization unit specifically raises the ranking weight of the time attribute in the search results corresponding to the query to a set weight value; or
raises the ranking weight of the time attribute in the search results corresponding to the query by a set step size.
As can be seen from the above technical solutions, the search log mining method and device and the timeliness requirement identification method and device provided by the invention can compute the timeliness probability of each type corresponding to a query. This timeliness probability reflects the timeliness requirement of the query, so that when a query entered by a user is identified as having a timeliness requirement, there is a basis for optimizing the corresponding search results and meeting the user's timeliness requirement for them. Raising the ranking weight of the time attribute in the search results lets the user find the required pages quickly and accurately.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the search log mining method provided by Embodiment 1 of the invention;
Fig. 2 is a flowchart of the page search method provided by Embodiment 2 of the invention;
Fig. 3 is a structural diagram of the search log mining device provided by Embodiment 3 of the invention;
Fig. 4 is a structural diagram of the page search device provided by Embodiment 4 of the invention.
[Detailed Description]
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
The search log mining method is described first. Mining the search log produces a timeliness probability table for query types, which makes it easier to identify the timeliness of queries. The method is described below through Embodiment 1.
Embodiment 1
Fig. 1 is a flowchart of the search log mining method provided by the invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: perform word segmentation on the queries captured from the search log.
When capturing queries from the search log, the capture strategy may be one or any combination of the following strategies:
Capture strategy 1: capture the queries for which, among the pages the user clicked in the corresponding search results, the pages published within a most recent first time period account for a proportion of all pages clicked by the user that exceeds a preset first ratio threshold. For example, suppose the most recent first time period is the last 2 days and the preset first ratio threshold is 50%. If, in the search results of a certain query, the pages published within the last 2 days account for 70% of all pages clicked by the user, the query can be captured. As another example, if all the pages the user clicked in the search results of a certain query were published within the last 2 days, that is, the ratio is 100%, the query can be captured.
Capture strategy 2: capture the queries for which the pages published within a most recent second time period account for a proportion of the corresponding search results that exceeds a preset second ratio threshold. For example, suppose the second time period is the last 2 days and the second ratio threshold is 60%. If the pages published within the last 2 days account for 65% of the search results corresponding to a certain query, the query is captured.
Capture strategy 3: capture all queries for which the pages the user clicked in the corresponding search results include a page published within a most recent period of time. Under this strategy, a query is captured as long as the pages the user clicked in its search results include a page published within the most recent period (for example, the last 2 days).
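The three capture strategies can be sketched as simple predicates over the search results of one query. This is a minimal illustration, not the patented implementation: the `Page` record, the function names, and the idea of evaluating all three strategies at once are assumptions, and the default values (2 days, 50%, 60%) are only the example figures quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    published: datetime  # publication time of the page
    clicked: bool        # whether the user clicked this result


def recent_click_ratio(pages, now, window):
    """Share of the clicked pages that were published within `window` before `now`."""
    clicked = [p for p in pages if p.clicked]
    if not clicked:
        return 0.0
    recent = [p for p in clicked if now - p.published <= window]
    return len(recent) / len(clicked)


def recent_result_ratio(pages, now, window):
    """Share of all result pages that were published within `window` before `now`."""
    if not pages:
        return 0.0
    recent = [p for p in pages if now - p.published <= window]
    return len(recent) / len(pages)


def capture_decisions(pages, now, window=timedelta(days=2),
                      first_ratio=0.5, second_ratio=0.6):
    """Evaluate each capture strategy for the search results of one query."""
    return {
        "strategy_1": recent_click_ratio(pages, now, window) > first_ratio,
        "strategy_2": recent_result_ratio(pages, now, window) > second_ratio,
        "strategy_3": any(p.clicked and now - p.published <= window for p in pages),
    }
```

A caller would capture the query if the strategy (or combination of strategies) it has configured evaluates to true.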
In this step, word segmentation of each captured query yields at least one word (term). For example, segmenting the query "Hebei explosion" yields two words: "Hebei" and "explosion". Segmenting "Hebei XX company goes bankrupt" yields four words: "Hebei", "XX", "company", and "bankrupt".
The word segmentation method used may include, but is not limited to, string-matching segmentation, semantic segmentation, and statistical segmentation. Because word segmentation is prior art, it is not described in detail here.
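As one illustration of the string-matching family, the sketch below implements forward maximum matching against a small dictionary. The patent does not say which segmenter it uses, so the algorithm choice, the function name, and the vocabulary are assumptions; the Chinese strings are a presumed rendering of the "Hebei explosion" example.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionary: "Hebei", "explosion", "company", "bankrupt".
VOCABULARY = {"河北", "爆炸", "公司", "破产"}
segmented = forward_max_match("河北爆炸", VOCABULARY)
```

Characters that match no dictionary entry are emitted one at a time, which keeps the loop from stalling on unknown text.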
Step 102: use the combinations formed from the words obtained by word segmentation and/or the attributes of those words, together with the distribution probability of each combination, to generalize types (patterns).
This step can be divided into two sub-steps:
1) Annotate each word obtained from word segmentation according to its attribute.
In this sub-step, each word first receives a basic annotation according to its attribute, such as noun, verb, or adjective. Further, each word can receive an advanced annotation at a finer granularity, such as person name, place name, time, or organization name.
Attribute identification for each word is based on distribution probability statistics collected in advance: a part-of-speech statistics table is established beforehand according to the distribution probability of words over different attributes. After word segmentation of a query, each resulting word is looked up in the part-of-speech statistics table to determine the attribute with the highest distribution probability for that word. Usually, attribute identification also takes the context of each word into account. For example, for the three nouns "Hebei", "XX", and "company", a sequence that begins with "Hebei" and ends with "company" most probably forms a single noun, so "Hebei XX company" can be annotated as one noun and, at a finer granularity, as an organization name. Attribute identification of words is an existing basic algorithm and is not described further here.
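The table lookup described above can be sketched as follows. The table contents, the attribute labels, and the `unknown` fallback are hypothetical; the context-based merging of adjacent words is deliberately left out.

```python
# Hypothetical part-of-speech statistics table:
# word -> {attribute: distribution probability}.
POS_TABLE = {
    "Hebei": {"place_name": 0.95, "organization_name": 0.05},
    "explosion": {"noun": 0.6, "verb": 0.4},
}


def most_likely_attribute(word, table, default="unknown"):
    """Return the attribute with the highest distribution probability for `word`."""
    distribution = table.get(word)
    if not distribution:
        return default
    return max(distribution, key=distribution.get)


def annotate(words, table):
    """Pair each segmented word with its most likely attribute."""
    return [(word, most_likely_attribute(word, table)) for word in words]
```

For instance, `annotate(["Hebei", "explosion"], POS_TABLE)` pairs "Hebei" with `place_name` and "explosion" with `noun` under these made-up probabilities.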
2) According to the annotation of each word in the query, take a combination of the words in the same query, a combination of the attributes of the words, or a combination of words and attributes of words as a generalized type, wherein the distribution probability of the generalized type in the search log exceeds a preset type distribution probability threshold.
For example, if the distribution probability with which "place name+[explosion]" (a combination of an attribute of a word and a word) occurs in the search log exceeds the preset type distribution probability threshold, "place name+[explosion]" can be set as a type. If the distribution probability with which "[Hebei]+[explosion]" (a combination of words) occurs in the search log exceeds the preset type distribution probability threshold, "[Hebei]+[explosion]" can be set as a type. If the distribution probability with which "place name+verb" (a combination of attributes of words) occurs in the search log exceeds the preset type distribution probability threshold, "place name+verb" can be set as a type. Here, [] marks a literal word.
More precisely, a generalized type can also include the position of a word within the combination, or the position of the word that an attribute corresponds to. For example, "place name+[explosion] (ending)" can be a type, where "(ending)" is the position information of the word "[explosion]".
All the determined types can be stored in a type table.
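The generalization step can be sketched by enumerating, for each annotated query, every combination in which each position is either the literal word or its attribute, and keeping the combinations that are frequent enough. Treating the distribution probability as the share of captured queries that match a combination is an assumption, as are the function names; position information is omitted.

```python
from collections import Counter
from itertools import product


def candidate_types(annotated):
    """All combinations where each position is the literal word ([word]) or its attribute."""
    if not annotated:
        return set()
    options = [(f"[{word}]", attribute) for word, attribute in annotated]
    return {"+".join(choice) for choice in product(*options)}


def generalize_types(annotated_queries, threshold):
    """Keep the combinations whose share of the captured queries exceeds `threshold`."""
    counts = Counter()
    for annotated in annotated_queries:
        counts.update(candidate_types(annotated))
    total = len(annotated_queries)
    return {t: n / total for t, n in counts.items() if n / total > threshold}
```

The returned mapping plays the role of the type table, with each type's distribution probability kept alongside it.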
Step 103: screen the queries captured from the search log to obtain a timely query set and a non-timely query set.
The screening strategy used in this step may include, but is not limited to, one or any combination of the following strategies:
Screening strategy 1: screen out the queries for which, among the pages the user clicked in the corresponding search results, the pages published within a most recent third time period account for a proportion of all pages clicked by the user that exceeds a preset third ratio threshold, to form the timely query set; the other queries form the non-timely query set. If capture strategy 1 was used for capturing, then either the third time period is as long as the first time period and the third ratio threshold is greater than the first ratio threshold; or the third time period is shorter than the first time period and the third ratio threshold equals the first ratio threshold; or the third time period is shorter than the first time period and the third ratio threshold is greater than the first ratio threshold.
For example, suppose the queries captured are those for which the pages published within the last 2 days account for more than 50% of all pages the user clicked in the corresponding search results. When screening queries in this step, the queries for which the pages published within the last 2 days account for more than 80% of all pages the user clicked in the corresponding search results can be screened out to form the timely query set, and the other queries form the non-timely query set.
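The relationship between the capture parameters and the screening parameters amounts to requiring that screening be strictly tighter than capturing. The sketch below checks that condition and then splits queries by a precomputed ratio; the function names and the dictionary of per-query ratios are assumptions made for illustration.

```python
from datetime import timedelta


def screening_is_stricter(capture_window, capture_ratio, screen_window, screen_ratio):
    """True if the screening parameters satisfy one of the three permitted relationships."""
    same_window_higher_ratio = screen_window == capture_window and screen_ratio > capture_ratio
    shorter_window_same_ratio = screen_window < capture_window and screen_ratio == capture_ratio
    shorter_window_higher_ratio = screen_window < capture_window and screen_ratio > capture_ratio
    return same_window_higher_ratio or shorter_window_same_ratio or shorter_window_higher_ratio


def split_by_ratio(recent_ratio_by_query, screen_ratio):
    """Split queries into (timely, non-timely) sets by their recent-page ratio."""
    timely = {q for q, ratio in recent_ratio_by_query.items() if ratio > screen_ratio}
    non_timely = set(recent_ratio_by_query) - timely
    return timely, non_timely
```

With the example figures above, capturing at 2 days and 50% and screening at 2 days and 80% passes the check, whereas screening with identical parameters would not, because it would place every captured query in the timely set.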
Screening strategy 2: screen out the queries for which the pages published within a most recent fourth time period account for a proportion of the corresponding search results that exceeds a preset fourth ratio threshold, to form the timely query set; the other queries form the non-timely query set. If capture strategy 2 was used for capturing, then either the fourth time period is as long as the second time period and the fourth ratio threshold is greater than the second ratio threshold; or the fourth time period is shorter than the second time period and the fourth ratio threshold equals the second ratio threshold; or the fourth time period is shorter than the second time period and the fourth ratio threshold is greater than the second ratio threshold.
For example, suppose the queries captured are those for which the pages published within the last 2 days account for more than 60% of the corresponding search results. When screening queries in this step, the queries for which the pages published within the last 2 days account for more than 80% of the corresponding search results can be screened out to form the timely query set, and the other queries form the non-timely query set.
Screening strategy 3: screen out the queries for which the click rate of the corresponding search results exceeds a preset click-rate burst threshold, to form the timely query set; the other queries form the non-timely query set. For example, if the click rate of the search results of a certain query exceeds the preset click-rate burst threshold, the event corresponding to the query may be a sudden event with a degree of timeliness, and the query should be included in the timely query set.
It should be noted that step 101 and step 103 have no fixed order; they belong to two different execution branches. The queries captured from the search log are sent to step 101 and to step 103 separately for processing, and the two steps can be performed one after the other in either order or simultaneously.
Step 104: count the distribution of each type obtained in step 102 across the timely query set and the non-timely query set screened in step 103, use the statistics to calculate the timeliness probability corresponding to each type, and store the results as a timeliness probability table.
Count the number of times each type in the type table occurs in the timely query set and in the non-timely query set screened in step 103, and perform a variance calculation on these occurrence counts to obtain the timeliness probability corresponding to each type.
Suppose that after this step the timeliness probability corresponding to the type "place name+[explosion]" is determined to be 30%, the timeliness probability corresponding to the type "place name+verb" is 5%, and the timeliness probability corresponding to the type "[Hebei]+[explosion]" is 50%.
The timeliness probabilities corresponding to the types can be stored as a timeliness probability table, as shown in Table 1, for lookup when identifying the timeliness of a query entered by a user.
Table 1
Type                      Timeliness probability
place name+[explosion]    30%
place name+verb           5%
[Hebei]+[explosion]       50%
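The counting in step 104 can be sketched as below. The text mentions a variance calculation over the occurrence counts without giving its formula, so this sketch substitutes a plain ratio (occurrences in the timely set divided by occurrences in both sets); that substitution, the function name, and the input layout are assumptions rather than the method of the patent.

```python
from collections import Counter


def timeliness_probabilities(types_by_query, timely_queries, non_timely_queries):
    """Estimate, for each type, the share of its occurrences that fall in the timely set."""
    timely_counts, non_timely_counts = Counter(), Counter()
    for query, types in types_by_query.items():
        if query in timely_queries:
            timely_counts.update(types)
        elif query in non_timely_queries:
            non_timely_counts.update(types)
    table = {}
    for t in set(timely_counts) | set(non_timely_counts):
        total = timely_counts[t] + non_timely_counts[t]
        table[t] = timely_counts[t] / total  # assumed ratio estimate, not the patent's formula
    return table
```

The resulting dictionary has the same shape as Table 1: one timeliness probability per type.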
Based on the timeliness probability table established by the above process, the timeliness of a query entered by a user can be identified. The page search process is described below through Embodiment 2.
Embodiment 2
Fig. 2 is a flowchart of the page search method provided by the invention. As shown in Fig. 2, the method may comprise the following steps:
Step 201: the query to user's input carries out word segmentation processing.
Step 202: the combination that utilizes the attribute of each word of obtaining after word segmentation processing and/or each word to form, and distribution probability of each combination, conclude the type that this query is corresponding.
Step 201 is identical to grabbing the processing mode of query in to the processing mode of user being inputted to query in step 202 and step 101 to step 102, and at this, it is no longer repeated.
Step 203: search ageing probability tables, the ageing probability corresponding to type of summarizing in determining step 202.
Step 204: if the mxm. in the ageing probability of determining surpasses default ageing probability threshold value, determine that this query possesses ageing demand.
Can preset in embodiments of the present invention an ageing probability threshold value, if in the corresponding type of query of user's input, ageing probability corresponding to any type surpasses this ageing probability threshold value, the query that this user's input is described has ageing demand, need to provide the ageing higher page for this user as far as possible, the page of issuing in the recent period, realizes by step 205.
If the mxm. in the ageing probability of determining does not surpass default ageing probability threshold value, determine that this query does not possess ageing demand, without Search Results is done to special processing, process ends.
Step 205: improve this query the weight order of time attribute in corresponding Search Results.
In this step, time attribute can be brought up to some setting weights by the weight order in result for retrieval, or improve the step-length of some settings, thereby in Search Results, embody aging characteristic as far as possible, by the newer page priority ordering in Search Results of issuing time.
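The two ways of raising the weight in step 205 can be sketched as below. The function name `raise_time_weight` and its parameters are hypothetical, and the choice never to lower a weight that already exceeds the set value is an assumption made here rather than something stated in the text.

```python
from typing import Optional


def raise_time_weight(
    current: float,
    *,
    set_value: Optional[float] = None,
    step: Optional[float] = None,
) -> float:
    """Return the new ranking weight of the time attribute.

    Exactly one of set_value (raise to a set weight value) or
    step (raise by a set step size) must be given.
    """
    if (set_value is None) == (step is None):
        raise ValueError("give exactly one of set_value or step")
    if set_value is not None:
        # Assumption: raising "to" a value never lowers the weight.
        return max(current, set_value)
    return current + step


raised_to_value = raise_time_weight(0.25, set_value=0.75)
raised_by_step = raise_time_weight(0.25, step=0.5)
```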
For example, suppose the query input by the user is "Hebei blast". After the word segmentation of step 201, "Hebei" and "blast" are obtained. In step 202, "Hebei" can be labeled as a noun and further labeled as a place name, and "blast" is labeled as a verb. Suppose the types induced in step 202 are:
Type 1: place name+[blast];
Type 2: place name+verb;
Type 3: [Hebei]+[blast].
These three types are determined to be the types corresponding to the query "Hebei blast"; that is, more than one type may be induced in step 202 for a single query input by a user.
The timeliness probability table established through the flow shown in Fig. 1 is looked up, and the timeliness probabilities corresponding to type 1, type 2, and type 3 are determined to be 30%, 5%, and 50%, respectively. Suppose the preset timeliness probability threshold is 40%. The maximum of the determined timeliness probabilities, 50%, exceeds the threshold, which indicates that the query "Hebei blast" has a timeliness demand and may concern a recent event. Therefore, when the search results corresponding to the query "Hebei blast" are returned, the ranking weight of the time attribute in the search results is raised so that pages with more recent publication times are ranked first as far as possible, and the user can quickly and accurately obtain recently published pages about the Hebei explosion event.
The above is a detailed description of the methods provided by the present invention. The devices provided by the present invention are described in detail below through Embodiment 3 and Embodiment 4.
Embodiment 3
Fig. 3 is a structural diagram of the search log mining device provided by an embodiment of the present invention. As shown in Fig. 3, the mining device may comprise: a capture unit 300, a first word segmentation unit 310, a first type determining unit 320, a screening unit 330, and a probability calculation unit 340.
The capture unit 300 is configured to capture queries from the search log.
The first word segmentation unit 310 is configured to perform word segmentation on the queries captured by the capture unit 300.
The word segmentation methods that the first word segmentation unit 310 may adopt include, but are not limited to: segmentation based on string matching, segmentation based on word sense, and statistics-based segmentation.
The first type determining unit 320 is configured to induce types using the combinations formed by the words obtained after word segmentation by the first word segmentation unit 310 and/or the attributes of those words, together with the distribution probability of each combination.
The screening unit 330 is configured to screen the queries captured by the capture unit 300 to obtain a timeliness query set and a non-timeliness query set.
The probability calculation unit 340 is configured to count the distribution of the types induced by the first type determining unit 320 across the timeliness query set and the non-timeliness query set selected by the screening unit 330, use the statistics to calculate the timeliness probability corresponding to each type, and store the correspondence between each type and its timeliness probability in a timeliness probability table.
The first type determining unit 320 specifically comprises a first labeling subunit 321 and a first induction subunit 322.
The first labeling subunit 321 is configured to label each word according to the attribute of each word obtained after word segmentation.
When labeling, each word can first be given a basic label according to its attribute, such as noun, verb, or adjective. Further, each word can be given an advanced label at a finer granularity, such as person name, place name, time, or organization name.
The first induction subunit 322 is configured to, according to the labeling results of the first labeling subunit 321, take a combination of words in the same query, a combination of attributes of words, or a combination of words and attributes of words as an induced type, where the distribution probability of the induced type in the search log exceeds a preset type distribution probability threshold.
In addition to the above combinations of words, combinations of attributes of words, or combinations of words and attributes of words, an induced type may further include position information of the words or position information of the attributes of the words.
Further, the first type determining unit 320 may also comprise a first attribute recognition subunit 323, configured to look up a part-of-speech statistics table using each word obtained after word segmentation and determine the attribute with the highest distribution probability for each word, where the part-of-speech statistics table is established in advance according to the distribution probabilities of words over different attributes.
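The induction of types from words and attributes can be sketched as follows. This is an illustration under stated assumptions: each position of a segmented query contributes either the literal word (written in brackets) or its attribute, position is encoded by order, and the "distribution probability" of a type is taken to be the share of logged queries that match it. The function names and sample log are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Set, Tuple

SegmentedQuery = Tuple[List[str], List[str]]  # (words, attributes)


def candidate_types(words: List[str], attrs: List[str]) -> List[str]:
    """All combinations of literal words and/or attributes, in word order."""
    options = [(f"[{w}]", a) for w, a in zip(words, attrs)]
    return ["+".join(choice) for choice in product(*options)]


def induce_types(log: List[SegmentedQuery], min_share: float) -> Set[str]:
    """Keep candidate types whose share of logged queries exceeds min_share."""
    counts: Counter = Counter()
    for words, attrs in log:
        counts.update(candidate_types(words, attrs))
    total = len(log)
    return {t for t, n in counts.items() if n / total > min_share}


# Hypothetical segmented and labeled log entries.
sample_log = [
    (["Hebei", "blast"], ["place name", "verb"]),
    (["Shanxi", "blast"], ["place name", "verb"]),
    (["Beijing", "travel"], ["place name", "verb"]),
]
hebei_candidates = candidate_types(["Hebei", "blast"], ["place name", "verb"])
induced = induce_types(sample_log, min_share=0.5)
```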
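The attribute lookup performed by the attribute recognition subunit can be sketched as a dictionary lookup that returns the attribute with the highest distribution probability. The table contents, the function name, and the fallback label for unknown words are assumptions for illustration.

```python
from typing import Dict

# Hypothetical part-of-speech statistics table:
# word -> {attribute: distribution probability}
pos_statistics = {
    "Hebei": {"place name": 0.9, "person name": 0.1},
    "blast": {"verb": 0.7, "noun": 0.3},
}


def most_likely_attribute(
    word: str, table: Dict[str, Dict[str, float]], default: str = "unknown"
) -> str:
    """Return the attribute with the highest distribution probability."""
    distribution = table.get(word)
    if not distribution:
        return default  # assumption: unseen words get a fallback label
    return max(distribution, key=distribution.get)


hebei_attribute = most_likely_attribute("Hebei", pos_statistics)
blast_attribute = most_likely_attribute("blast", pos_statistics)
```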
Specifically, the capture strategy adopted by the capture unit 300 may comprise one or any combination of the following strategies:
Capture strategy 1: capture queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages whose publication time falls within the most recent first time period exceeds a preset first proportion threshold.
Capture strategy 2: capture queries for which the proportion of pages in the corresponding search results whose publication time falls within the most recent second time period exceeds a preset second proportion threshold.
Capture strategy 3: capture all queries for which the corresponding search results contain a user-clicked page whose publication time falls within a recent period of time.
The screening strategy adopted by the screening unit 330 may comprise one or any combination of the following strategies:
Screening strategy 1: select queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages whose publication time falls within the most recent third time period exceeds a preset third proportion threshold, to form the timeliness query set; the other queries form the non-timeliness query set. If the capture unit 300 adopts capture strategy 1, then either the third time period is equal in duration to the first time period and the third proportion threshold is greater than the first proportion threshold; or the third time period is shorter than the first time period and the third proportion threshold equals the first proportion threshold; or the third time period is shorter than the first time period and the third proportion threshold is greater than the first proportion threshold.
Screening strategy 2: select queries for which the proportion of pages in the corresponding search results whose publication time falls within the most recent fourth time period exceeds a preset fourth proportion threshold, to form the timeliness query set; the other queries form the non-timeliness query set. If the capture unit 300 adopts capture strategy 2, then either the fourth time period is equal in duration to the second time period and the fourth proportion threshold is greater than the second proportion threshold; or the fourth time period is shorter than the second time period and the fourth proportion threshold equals the second proportion threshold; or the fourth time period is shorter than the second time period and the fourth proportion threshold is greater than the second proportion threshold.
Screening strategy 3: select queries whose corresponding search results have a click rate exceeding a preset click-rate burst threshold, to form the timeliness query set; the other queries form the non-timeliness query set.
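Capture strategy 1 can be sketched as a ratio test over the publication times of clicked pages. The function names, the use of `datetime` values, and the sample dates are assumptions for illustration; the text does not specify how publication times are represented.

```python
from datetime import datetime, timedelta
from typing import List


def recent_click_ratio(
    clicked_publish_times: List[datetime], now: datetime, window: timedelta
) -> float:
    """Share of clicked pages published within the most recent window."""
    if not clicked_publish_times:
        return 0.0
    recent = sum(
        1 for t in clicked_publish_times if timedelta(0) <= now - t <= window
    )
    return recent / len(clicked_publish_times)


def should_capture(
    clicked_publish_times: List[datetime],
    now: datetime,
    window: timedelta,
    ratio_threshold: float,
) -> bool:
    """Capture the query if the recent-click share exceeds the threshold."""
    return recent_click_ratio(clicked_publish_times, now, window) > ratio_threshold


# Hypothetical click data for one query.
now = datetime(2010, 12, 22)
clicks = [datetime(2010, 12, 21), datetime(2010, 12, 20), datetime(2010, 6, 1)]
ratio = recent_click_ratio(clicks, now, timedelta(days=7))
captured = should_capture(clicks, now, timedelta(days=7), ratio_threshold=0.5)
```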
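The relationship that screening strategies 1 and 2 require between the screening parameters and the corresponding capture parameters amounts to the screening being strictly tighter in at least one dimension and looser in neither. A minimal check, with hypothetical parameter names and durations expressed in days, might look like this:

```python
def screening_is_stricter(
    capture_days: float,
    capture_ratio: float,
    screen_days: float,
    screen_ratio: float,
) -> bool:
    """Check the three permitted parameter relationships."""
    same_window_higher_ratio = (
        screen_days == capture_days and screen_ratio > capture_ratio
    )
    shorter_window_same_ratio = (
        screen_days < capture_days and screen_ratio == capture_ratio
    )
    shorter_window_higher_ratio = (
        screen_days < capture_days and screen_ratio > capture_ratio
    )
    return (
        same_window_higher_ratio
        or shorter_window_same_ratio
        or shorter_window_higher_ratio
    )
```

For example, capturing with a 7-day window and a 50% threshold and then screening with a 3-day window and the same threshold satisfies the second relationship.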
Embodiment 4
Fig. 4 is a structural diagram of the page search device provided by Embodiment 4 of the present invention. As shown in Fig. 4, the page search device may comprise: a second word segmentation unit 400, a second type determining unit 410, a lookup unit 420, and a timeliness determining unit 430.
The second word segmentation unit 400 is configured to perform word segmentation on the query input by the user.
As with the first word segmentation unit 310 in Embodiment 3, the word segmentation methods that the second word segmentation unit 400 may adopt include, but are not limited to: segmentation based on string matching, segmentation based on word sense, and statistics-based segmentation.
The second type determining unit 410 is configured to induce the types corresponding to the query using the combinations formed by the words obtained after word segmentation by the second word segmentation unit 400 and/or the attributes of those words, together with the distribution probability of each combination.
The lookup unit 420 is configured to look up the timeliness probability table formed by the mining device described in Embodiment 3 and determine the timeliness probability corresponding to each type induced by the second type determining unit 410.
The timeliness determining unit 430 is configured to determine that the query has a timeliness demand when the maximum of the timeliness probabilities determined by the lookup unit 420 exceeds a preset timeliness probability threshold, and otherwise to determine that the query has no timeliness demand.
The second type determining unit 410 may specifically comprise a second labeling subunit 411 and a second induction subunit 412.
The second labeling subunit 411 is configured to label each word according to the attribute of each word obtained after word segmentation by the second word segmentation unit 400.
The second induction subunit 412 is configured to, according to the labeling results of the second labeling subunit 411, take a combination of words in the same query, a combination of attributes of words, or a combination of words and attributes of words as an induced type, where the distribution probability of the induced type in the search log exceeds a preset type distribution probability threshold.
Based on this structure, the second type determining unit 410 may further comprise a second attribute recognition subunit 413, configured to look up a part-of-speech statistics table using each word obtained after word segmentation by the second word segmentation unit 400 and determine the attribute with the highest distribution probability for each word, where the part-of-speech statistics table is established in advance according to the distribution probabilities of words over different attributes.
The second labeling subunit 411, the second induction subunit 412, and the second attribute recognition subunit 413 operate in the same way as the first labeling subunit 321, the first induction subunit 322, and the first attribute recognition subunit 323 in Embodiment 3, respectively, so the details are not repeated here.
Further, the page search device can also optimize the search results when it determines that the query has a timeliness demand. In this case the page search device may also comprise:
A search optimization unit 440, configured to raise the ranking weight of the time attribute in the search results corresponding to the query when the timeliness determining unit 430 determines that the query has a timeliness demand.
If the timeliness determining unit 430 determines that the query has no timeliness demand, the search results need not be optimized.
Here, the search optimization unit 440 can send to the search engine an instruction to raise the ranking weight of the time attribute in the search results corresponding to the query. The search engine returns search results according to this instruction, so that pages with more recent publication times are ranked first as far as possible and the user can quickly and accurately obtain recently published pages about the related event.
Specifically, the weight can be raised in either of two ways: the ranking weight of the time attribute in the search results corresponding to the query is raised to a set weight value, or the ranking weight of the time attribute in the search results corresponding to the query is raised by a set step size.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A method for mining a search log, characterized in that step A1 and step C1 are respectively performed on the search words (queries) captured from the search log:
A1: performing word segmentation on the captured queries, and performing step B1;
B1: labeling each word according to the attribute of each word obtained after word segmentation, and, according to the labeling results, taking a combination of words in the same query, a combination of attributes of words, or a combination of words and attributes of words as an induced type, then proceeding to step D1, wherein the distribution probability of the induced type in the search log exceeds a preset type distribution probability threshold; the attribute of each word is identified as follows: a part-of-speech statistics table is established in advance according to the distribution probabilities of words over different attributes, and the part-of-speech statistics table is looked up using each word obtained after word segmentation to determine the attribute with the highest distribution probability for each word;
C1: from the captured queries, selecting queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages whose publication time falls within the most recent third time period exceeds a preset third proportion threshold, to form a timeliness query set, the other queries forming a non-timeliness query set; or selecting queries for which the proportion of pages in the corresponding search results whose publication time falls within the most recent fourth time period exceeds a preset fourth proportion threshold, to form a timeliness query set, the other queries forming a non-timeliness query set; or selecting queries whose corresponding search results have a click rate exceeding a preset click-rate burst threshold, to form a timeliness query set, the other queries forming a non-timeliness query set; and performing step D1;
D1: counting the distribution of each type obtained in step B1 across the timeliness query set and the non-timeliness query set selected in step C1, calculating the timeliness probability corresponding to each type using the statistics, and storing the correspondence between each type and its timeliness probability in a timeliness probability table.
2. The method according to claim 1, characterized in that the capture strategy adopted for capturing queries from the search log comprises one or any combination of the following strategies:
Capture strategy 1: capturing queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages whose publication time falls within the most recent first time period exceeds a preset first proportion threshold;
Capture strategy 2: capturing queries for which the proportion of pages in the corresponding search results whose publication time falls within the most recent second time period exceeds a preset second proportion threshold;
Capture strategy 3: capturing all queries for which the corresponding search results contain a user-clicked page whose publication time falls within a recent period of time.
3. The method according to claim 2, characterized in that, if capture strategy 1 is adopted, the third time period is equal in duration to the first time period and the third proportion threshold is greater than the first proportion threshold; or the third time period is shorter than the first time period and the third proportion threshold equals the first proportion threshold; or the third time period is shorter than the first time period and the third proportion threshold is greater than the first proportion threshold;
if capture strategy 2 is adopted, the fourth time period is equal in duration to the second time period and the fourth proportion threshold is greater than the second proportion threshold; or the fourth time period is shorter than the second time period and the fourth proportion threshold equals the second proportion threshold; or the fourth time period is shorter than the second time period and the fourth proportion threshold is greater than the second proportion threshold.
4. A method for identifying a timeliness demand, characterized in that the method comprises:
A2: performing word segmentation on the search word (query) input by a user;
B2: labeling each word according to the attribute of each word obtained after word segmentation, and, according to the labeling results, taking a combination of words in the same query, a combination of attributes of words, or a combination of words and attributes of words as an induced type, wherein the distribution probability of the induced type in the search log exceeds a preset type distribution probability threshold; the attribute of each word is identified as follows: a part-of-speech statistics table is established in advance according to the distribution probabilities of words over different attributes, and the part-of-speech statistics table is looked up using each word obtained after word segmentation to determine the attribute with the highest distribution probability for each word;
C2: looking up the timeliness probability table formed by the method according to claim 1, and determining the timeliness probability corresponding to each type induced in step B2;
D2: if the maximum of the timeliness probabilities determined in step C2 exceeds a preset timeliness probability threshold, determining that the query has a timeliness demand.
5. The method according to claim 4, characterized in that, after step D2, the method further comprises:
E2: raising the ranking weight of the time attribute in the search results corresponding to the query.
6. The method according to claim 5, characterized in that step E2 is specifically: raising the ranking weight of the time attribute in the search results corresponding to the query to a set weight value; or,
raising the ranking weight of the time attribute in the search results corresponding to the query by a set step size.
7. A device for mining a search log, characterized in that the mining device comprises: a capture unit, a first word segmentation unit, a first type determining unit, a screening unit, and a probability calculation unit;
the capture unit is configured to capture search words (queries) from the search log;
the first word segmentation unit is configured to perform word segmentation on the queries captured by the capture unit;
the first type determining unit comprises: a first labeling subunit configured to label each word according to the attribute of each word obtained after the word segmentation, and a first induction subunit configured to, according to the labeling results of the first labeling subunit, take a combination of words in the same query, a combination of attributes of words, or a combination of words and attributes of words as an induced type, wherein the distribution probability of the induced type in the search log exceeds a preset type distribution probability threshold;
the first type determining unit further comprises: a first attribute recognition subunit configured to look up a part-of-speech statistics table using each word obtained after the word segmentation and determine the attribute with the highest distribution probability for each word, wherein the part-of-speech statistics table is established in advance according to the distribution probabilities of words over different attributes;
the screening unit is configured to, from the queries captured by the capture unit, select queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages whose publication time falls within the most recent third time period exceeds a preset third proportion threshold, to form a timeliness query set, the other queries forming a non-timeliness query set; or select queries for which the proportion of pages in the corresponding search results whose publication time falls within the most recent fourth time period exceeds a preset fourth proportion threshold, to form a timeliness query set, the other queries forming a non-timeliness query set; or select queries whose corresponding search results have a click rate exceeding a preset click-rate burst threshold, to form a timeliness query set, the other queries forming a non-timeliness query set;
the probability calculation unit is configured to count the distribution of the types induced by the first type determining unit across the timeliness query set and the non-timeliness query set selected by the screening unit, calculate the timeliness probability corresponding to each type using the statistics, and store the correspondence between each type and its timeliness probability in a timeliness probability table.
8. The mining device according to claim 7, characterized in that the capture strategy adopted by the capture unit comprises one or any combination of the following strategies:
Capture strategy 1: capturing queries for which, among the pages the user clicked in the corresponding search results, the proportion of pages whose publication time falls within the most recent first time period exceeds a preset first proportion threshold;
Capture strategy 2: capturing queries for which the proportion of pages in the corresponding search results whose publication time falls within the most recent second time period exceeds a preset second proportion threshold;
Capture strategy 3: capturing all queries for which the corresponding search results contain a user-clicked page whose publication time falls within a recent period of time.
9. The mining device according to claim 8, characterized in that,
if the capture unit adopts capture strategy 1, the third time period is equal in duration to the first time period and the third proportion threshold is greater than the first proportion threshold; or the third time period is shorter than the first time period and the third proportion threshold equals the first proportion threshold; or the third time period is shorter than the first time period and the third proportion threshold is greater than the first proportion threshold;
if the capture unit adopts capture strategy 2, the fourth time period is equal in duration to the second time period and the fourth proportion threshold is greater than the second proportion threshold; or the fourth time period is shorter than the second time period and the fourth proportion threshold equals the second proportion threshold; or the fourth time period is shorter than the second time period and the fourth proportion threshold is greater than the second proportion threshold.
10. A device for timeliness demand identification, characterized in that the device comprises: a second word segmentation unit, a second type determining unit, a lookup unit, and a timeliness determining unit;
the second word segmentation unit is configured to perform word segmentation on the search word (query) input by a user;
the second type determining unit comprises: a second labeling subunit configured to label each word according to the attribute of each word obtained after the word segmentation, and a second induction subunit configured to, according to the labeling results of the second labeling subunit, take a combination of words in the same query, a combination of attributes of words, or a combination of words and attributes of words as an induced type, wherein the distribution probability of the induced type in the search log exceeds a preset type distribution probability threshold;
the second type determining unit further comprises: a second attribute recognition subunit configured to look up a part-of-speech statistics table using each word obtained after the word segmentation and determine the attribute with the highest distribution probability for each word, wherein the part-of-speech statistics table is established in advance according to the distribution probabilities of words over different attributes;
the lookup unit is configured to look up the timeliness probability table formed by the mining device according to claim 7 and determine the timeliness probability corresponding to each type induced by the second type determining unit;
the timeliness determining unit is configured to determine that the query has a timeliness demand when the maximum of the timeliness probabilities determined by the lookup unit exceeds a preset timeliness probability threshold.
11. The device for timeliness demand identification according to claim 10, characterized in that the device further comprises:
a search optimization unit configured to raise the ranking weight of the time attribute in the search results corresponding to the query when the timeliness determining unit determines that the query has a timeliness demand.
12. The device for timeliness demand identification according to claim 11, characterized in that the search optimization unit specifically raises the ranking weight of the time attribute in the search results corresponding to the query to a set weight value; or,
raises the ranking weight of the time attribute in the search results corresponding to the query by a set step size.
CN201010600713.3A 2010-12-22 2010-12-22 Method and device for excavating search log and page search method and device Active CN102073684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010600713.3A CN102073684B (en) 2010-12-22 2010-12-22 Method and device for excavating search log and page search method and device

Publications (2)

Publication Number Publication Date
CN102073684A CN102073684A (en) 2011-05-25
CN102073684B true CN102073684B (en) 2014-08-13

Also Published As

Publication number Publication date
CN102073684A (en) 2011-05-25

Similar Documents

Publication Publication Date Title
CN102073684B (en) Method and device for excavating search log and page search method and device
CN103955505B (en) Microblog-based real-time event method and system
CN107145445A (en) Automatic analysis method and system for error logs of software automated testing
CN104077407B (en) Intelligent data search system and method
CN102591880A (en) Information providing method and device
CN101819573A (en) Self-adaptive network public opinion identification method
CN103927398A (en) Microblog hype group discovering method based on maximum frequent item set mining
CN104699737A (en) Method and system for managing a search
CN1822000A (en) Method for automatic detecting news event
CN103873601A (en) Addressing class query word mining method and system
CN103412940B (en) Method for detecting fraudulent phone calls
CN106649578A (en) Public opinion analysis method and system based on social network platform
CN103455758A (en) Method and device for identifying malicious website
CN101101599A (en) Method for extracting advertisement main information from web page
CN103136219A (en) Timeliness-based requirement mining method and device
CN102156746A (en) Method for evaluating performance of search engine
CN103136212B (en) Method and device for mining a class of new words
CN105159884A (en) Method and device for establishing industry dictionary and industry identification method and device
CN102364475A (en) System and method for sequencing search results based on identity recognition
CN102654875B (en) Method and device for automatically processing inner link of web text
CN102737045A (en) Method and device for relevancy computation
CN103092838B (en) Method and device for obtaining English words
CN103136256B (en) Method and system for information retrieval in a network
CN117408249A (en) User-defined word segmentation optimization method and system based on distributed search
CN103955192B (en) Curve-form data sampling method for sewage works

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant