CN102999520A - Method and device for identifying search request - Google Patents

Publication number
CN102999520A
Authority
CN
China
Prior art keywords
query
search results
gram
demand type
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102733272A
Other languages
Chinese (zh)
Other versions
CN102999520B (en
Inventor
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110273327.2A priority Critical patent/CN102999520B/en
Publication of CN102999520A publication Critical patent/CN102999520A/en
Application granted granted Critical
Publication of CN102999520B publication Critical patent/CN102999520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a method and device for identifying a search request. The method comprises: S1, obtaining a query to be identified; S2, obtaining search results for the query to be identified, determining each n-gram in the search result text, and determining the weight of each n-gram based on its occurrence in the search result text, to obtain a core word vector of the query to be identified; and S3, calculating the similarity between the core word vector of the query to be identified and the predetermined core word vector of each request type, and determining the request type of the query to be identified according to the similarity results. The invention can improve the accuracy of identifying the search request.

Description

A method and apparatus for search need identification
[technical field]
The present invention relates to the field of computer technology, and in particular to a method and apparatus for search need identification.
[background technology]
With the rapid development and maturation of the Internet worldwide, the information resources on the network grow ever richer and the volume of information data expands at high speed. Obtaining information through a search engine has become the main way people obtain information today. Providing users with more convenient and more accurate query services is the current and future direction of search engine technology.
In search engine technology, identifying the user's search need is an important part of improving search accuracy and effectiveness, and its effect is especially significant in structured search. Existing approaches to search need identification usually calculate the similarity between the query and the core word vector of each demand type, and determine the demand type of the query from the similarity results. For example, the top N demand types by similarity are identified as the demand types of the query; alternatively, the demand level of the query for each demand type is determined from the similarity value. However, a query is itself short and carries little usable information. Relying only on the query and directly calculating the similarity between the query and the core word vectors of the demand types may produce a large deviation in semantic similarity, which lowers the accuracy of search need identification.
[summary of the invention]
The invention provides a method and apparatus for search need identification, so as to improve the accuracy of search need identification.
The specific technical solution is as follows:
A method of search need identification, the method comprising:
S1: obtaining a query to be identified;
S2: obtaining search results for the query to be identified, determining each n-gram (n-word phrase) in the search result text, and determining the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of the query to be identified;
S3: calculating the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determining the demand type of the query to be identified according to the similarity results.
According to a preferred embodiment of the present invention, obtaining the search results of the query to be identified in step S2 comprises: obtaining the top N1 search results for the query to be identified, where N1 is a preset positive integer.
According to a preferred embodiment of the present invention, determining the weight of each n-gram based on its occurrence in the search result text in step S2 specifically comprises:
assigning a weight to each n-gram according to its term frequency (TF) in the search result text and its n value; or
assigning a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram.
According to a preferred embodiment of the present invention, the search result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
According to a preferred embodiment of the present invention, determining the core word vector of a demand type comprises:
S31: determining the seed query set of the demand type;
S32: searching with each seed query in the seed query set, extracting core words from the search result text, and determining the weight of each core word based on its occurrence in the search result text, to obtain the core word vector of the demand type.
According to a preferred embodiment of the present invention, the seed query set of a demand type is determined by:
manual configuration; or
manual annotation in the search logs; or
obtaining, from the search logs of the vertical search for the demand type, the queries whose search count is higher than a preset first threshold, to form the seed query set of the demand type; or
obtaining, from the web search logs, the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and forming the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold.
According to a preferred embodiment of the present invention, step S32 specifically comprises:
searching with each seed query in the seed query set of the demand type, determining each n-gram in the search result text, and determining the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of the demand type; or
searching with each seed query in the seed query set of the demand type, performing word segmentation on the search result text and removing stop words, counting the TF of each remaining word, selecting the words whose TF is higher than a preset term-frequency threshold, and determining a weight for each selected word based on its term frequency, to obtain the core word vector of the demand type; or
searching with each seed query in the seed query set of the demand type, performing word segmentation on the search result text and removing stop words, counting the TF and IDF of each remaining word, selecting the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determining a weight for each selected word based on its TF-IDF value, to obtain the core word vector of the demand type; or
searching with each seed query in the seed query set of the demand type, performing word segmentation on the search result text and removing stop words, assigning a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and selecting the words whose weight is higher than a preset weight threshold, to obtain the core word vector of the demand type.
According to a preferred embodiment of the present invention, determining the weight of each n-gram based on its occurrence in the search result text comprises:
assigning a weight to each n-gram according to its TF in the search result text and its n value; or
assigning a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram.
According to a preferred embodiment of the present invention, determining the demand type of the query to be identified according to the similarity results in step S3 comprises:
determining the top N2 demand types by similarity value, or the demand types whose similarity value exceeds a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or
according to a preset correspondence between similarity values and similarity levels, determining the similarity level corresponding to the similarity value calculated in step S3 as the demand level of the query to be identified for the corresponding demand type.
An apparatus for search need identification, the apparatus comprising:
an identifying object acquiring unit, configured to obtain a query to be identified;
a primary vector determining unit, configured to obtain search results for the query to be identified, determine each n-gram (n-word phrase) in the search result text, and determine the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of the query to be identified;
a demand type determining unit, configured to calculate the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determine the demand type of the query to be identified according to the similarity results.
According to a preferred embodiment of the present invention, when obtaining the search results of the query to be identified, the primary vector determining unit specifically obtains the top N1 search results for the query to be identified, where N1 is a preset positive integer.
According to a preferred embodiment of the present invention, when determining the weight of each n-gram, the primary vector determining unit assigns a weight to each n-gram according to its term frequency (TF) in the search result text and its n value; or
assigns a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram.
According to a preferred embodiment of the present invention, the search result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
According to a preferred embodiment of the present invention, the apparatus further comprises a secondary vector determining unit;
the secondary vector determining unit specifically comprises:
a seed query determining subunit, configured to determine the seed query set of a demand type;
a core word vector forming subunit, configured to obtain the search results of each seed query in the seed query set, extract core words from the search result text, and determine the weight of each core word based on its occurrence in the search result text, to obtain the core word vector of the demand type.
According to a preferred embodiment of the present invention, the seed query determining subunit obtains the seed query set of the demand type configured manually; or
obtains the seed query set of the demand type annotated manually in the search logs; or
obtains, from the search logs of the vertical search for the demand type, the queries whose search count is higher than a preset first threshold, to form the seed query set of the demand type; or
obtains, from the web search logs, the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and forms the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold.
According to a preferred embodiment of the present invention, the core word vector forming subunit obtains the search results of each seed query in the seed query set of the demand type, determines each n-gram in the search result text, and determines the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of the demand type; or
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation on the search result text and removes stop words, counts the TF of each remaining word, selects the words whose TF is higher than a preset term-frequency threshold, and determines a weight for each selected word based on its term frequency, to obtain the core word vector of the demand type; or
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation on the search result text and removes stop words, counts the TF and IDF of each remaining word, selects the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determines a weight for each selected word based on its TF-IDF value, to obtain the core word vector of the demand type; or
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation on the search result text and removes stop words, assigns a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and selects the words whose weight is higher than a preset weight threshold, to obtain the core word vector of the demand type.
According to a preferred embodiment of the present invention, when determining the weight of each n-gram based on its occurrence in the search result text, the core word vector forming subunit specifically assigns a weight to each n-gram according to its TF in the search result text and its n value; or
assigns a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram.
According to a preferred embodiment of the present invention, the demand type determining unit determines the top N2 demand types by similarity value, or the demand types whose similarity value exceeds a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or
according to a preset correspondence between similarity values and similarity levels, determines the similarity level corresponding to the calculated similarity value as the demand level of the query to be identified for the corresponding demand type.
As can be seen from the above technical solution, the invention takes the n-grams of the search result text of the query to be identified, determines the weight of each n-gram based on its occurrence in the search result text, and thereby obtains the core word vector of the query to be identified. This core word vector is then used to calculate the similarity to the core word vector of each demand type, so as to identify the demand type of the query to be identified. The invention thus uses information richer than the query to be identified itself: the n-grams of the search result text express the semantics of the query more fully, which improves the accuracy of search need identification.
[description of drawings]
Fig. 1 is a flowchart of the method provided by Embodiment one of the present invention;
Fig. 2 is a schematic diagram of a web page containing sentences that include the query to be identified, provided by Embodiment one of the present invention;
Fig. 3 is a structural diagram of the apparatus provided by Embodiment two of the present invention;
Fig. 4 is an example diagram, provided by an embodiment of the present invention, of search need identification applied to general search ranking;
Fig. 5 is an example diagram, provided by an embodiment of the present invention, of search need identification applied to vertical search.
[embodiment]
To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
Embodiment one
Fig. 1 is a flowchart of the method provided by Embodiment one of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: obtain the query to be identified.
Step 102: obtain the search results of the query to be identified, determine each n-gram (n-word phrase) in the text of the search results, and determine the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of the query to be identified.
Because the search results obtained for a query are usually strongly correlated with that query, this step uses the search results obtained by searching with the query to be identified to extract the core word vector.
In addition, when a search engine searches for the query to be identified, it ranks the search results by relevance to the query. Therefore, to improve efficiency and reduce computation, the top N1 search results can be selected and the n-grams determined from the text of these top N1 results, where N1 is a preset positive integer.
A search result page may contain a large amount of information, much of which may have little semantic relevance to the query to be identified. Therefore, the search result text used to determine the n-grams can be: the web page title, or the sentences in the web page that contain the query to be identified.
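The text selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the result layout (dictionaries with hypothetical `title` and `body` keys), the sentence-splitting rule, and the function name `collect_result_text` are all assumptions. The sketch collects both titles and query-bearing sentences, although the description allows either to be used on its own.

```python
import re

def collect_result_text(results, query, top_n1):
    """Gather titles and query-bearing sentences from the top_n1 results.

    `results` is assumed to be ordered by relevance, as a search engine
    would return it; each item is a dict with hypothetical keys
    "title" and "body".
    """
    texts = []
    for result in results[:top_n1]:
        texts.append(result["title"])
        # Split the body on common Latin and CJK sentence terminators.
        for sentence in re.split(r"[.!?。！？\n]+", result["body"]):
            sentence = sentence.strip()
            if sentence and query in sentence:
                texts.append(sentence)
    return texts

# Made-up results, used only to exercise the function.
results = [
    {"title": "home cooking recipes",
     "body": "Home food. We love home cooking! Weather is nice."},
    {"title": "easy meals", "body": "No match here."},
    {"title": "ignored page", "body": "home cooking again."},
]
texts = collect_result_text(results, "home cooking", top_n1=2)
```

Only the first two results are read because `top_n1` is 2, which mirrors the efficiency argument for limiting processing to the top N1 results.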
Taking the sentences in a web page that contain the query to be identified as an example, suppose the query to be identified is "home cooking". After searching with this query, suppose one of the returned search results is as shown in Fig. 2. The sentences in the web page that contain the query to be identified are:
Home cooking _ ways of making home cooking _ home cooking menu _ complete home cooking menu collection _
Home cooking is essential in our lives
There are many ways of making home cooking, such as northeastern home cooking, Guo Lin home cooking, and so on; how to cook the simplest home cooking menu
The cuisines outstanding person provides you with a rich and simple complete collection of home cooking menus
The n-grams are then determined from the above four sentences.
An n-gram is a combination of n words of minimum granularity appearing in order, where n is one or more preset positive integers. Take "Home cooking is essential in our lives" as an example. In its original word order the sentence is segmented as: home cooking / is / we / life / in / essential. If n is 1, 2, 3 or 4, the n-grams obtained are:
1-gram: home cooking; is; we; life; in; essential
2-gram: home cooking is; is we; we life; life in; in essential
3-gram: home cooking is we; is we life; we life in; life in essential
4-gram: home cooking is we life; is we life in; we life in essential
One word of the sentence is a stop word and is filtered out in the process of determining the n-grams.
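The n-gram determination described above can be sketched in a few lines. The sketch assumes the sentence has already been segmented into words by some tokenizer; the English glosses in `tokens`, the `<particle>` placeholder standing in for the filtered stop word, and the function name `extract_ngrams` are illustrative assumptions.

```python
from typing import Iterable

def extract_ngrams(tokens: list[str], n_values: Iterable[int],
                   stop_words: set[str]) -> list[tuple[str, ...]]:
    """Return every run of n consecutive words, for each n in n_values.

    Stop words are dropped first, so an n-gram never contains one.
    """
    kept = [t for t in tokens if t not in stop_words]
    ngrams = []
    for n in n_values:
        # A window of width n slides over the kept words in order.
        for i in range(len(kept) - n + 1):
            ngrams.append(tuple(kept[i:i + n]))
    return ngrams

# Word-by-word glosses of the example sentence; "<particle>" is a
# placeholder for the stop word that gets filtered out.
tokens = ["home cooking", "is", "we", "life", "in", "essential", "<particle>"]
grams = extract_ngrams(tokens, [1, 2, 3, 4], stop_words={"<particle>"})
```

With six remaining words, the sketch yields 6 one-grams, 5 two-grams, 4 three-grams and 3 four-grams.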
The weight of each n-gram can be determined in ways that include, but are not limited to, the following two:
Mode one: assign each n-gram a weight according to its term frequency (TF) in the search result text and its n value. Generally, the higher the TF of an n-gram in the search result text, the more important the n-gram is; and the larger the n value, the more information the n-gram carries and the higher its weight should be. This mode can therefore use TF*n as the weight of an n-gram.
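As a small illustration of the TF*n weighting in mode one, the sketch below counts how often each n-gram occurs and multiplies the count by the n-gram's length. It assumes n-grams are represented as tuples of words; the function name and the sample data are made up for the example.

```python
from collections import Counter

def tf_times_n_weights(ngrams):
    """Weight each n-gram by (number of occurrences) * (its length n)."""
    counts = Counter(ngrams)
    return {gram: tf * len(gram) for gram, tf in counts.items()}

# Made-up n-grams from a hypothetical search result text.
sample = [
    ("home cooking",),
    ("home cooking",),
    ("home cooking", "menu"),
    ("menu",),
]
weights = tf_times_n_weights(sample)
```

Here the 1-gram that occurs twice and the 2-gram that occurs once receive the same weight, which shows how the n factor compensates for the lower frequency of longer phrases.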
Mode two: assign each n-gram a weight according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram. This mode is based on information theory, and the formula can be as shown in formula (1).
$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w,q)+1)}{\log(\mathrm{sf}(w)+1)+\log(\mathrm{sf}(q)+1)} \times \log(\mathrm{idf}(w)+1) \qquad (1)$$
Here w is the n-gram, q is the query to be identified, Centrality(w) is the weight of the n-gram, Co(w, q) is the number of sentences in which the n-gram co-occurs with the query to be identified, sf(w) is the number of sentences in the search result text in which the n-gram occurs, sf(q) is the number of sentences in the search result text in which the query to be identified occurs, and idf(w) is the inverse document frequency of the n-gram.
It should be noted that formula (1) is only an example provided by the embodiment of the present invention. Simple modifications and equivalent substitutions based on this formula are not enumerated one by one, and all fall within the scope defined by the present invention.
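Formula (1) can be evaluated with a short sketch like the one below. Several points are assumptions rather than statements of the patent: the logarithm base (natural log here), the zero-denominator guard, detecting occurrence by plain substring matching, and the helper names `sentence_counts` and `centrality`. The IDF value is passed in because the text does not say which corpus it is computed over.

```python
import math

def sentence_counts(sentences, w, q):
    """Return (Co(w, q), sf(w), sf(q)) using substring matching."""
    sf_w = sum(w in s for s in sentences)
    sf_q = sum(q in s for s in sentences)
    co_wq = sum(w in s and q in s for s in sentences)
    return co_wq, sf_w, sf_q

def centrality(co_wq, sf_w, sf_q, idf_w):
    """Evaluate formula (1) with natural logarithms (an assumption)."""
    denominator = math.log(sf_w + 1) + math.log(sf_q + 1)
    if denominator == 0:
        # Neither w nor q occurs in any sentence; treat the weight as 0.
        return 0.0
    return math.log(co_wq + 1) / denominator * math.log(idf_w + 1)

# Made-up sentences standing in for a search result text.
sentences = [
    "home cooking menu",
    "home cooking is essential",
    "menu of the day",
]
co, sf_w, sf_q = sentence_counts(sentences, w="menu", q="home cooking")
weight = centrality(co, sf_w, sf_q, idf_w=2.0)
```

An n-gram that never shares a sentence with the query gets weight 0, since log(0 + 1) = 0 in the numerator.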
Step 103: calculate the similarity between the core word vector of the query to be identified and the core word vector of each demand type, and determine the demand type of the query to be identified according to the similarity results.
In the present invention the core word vector of each demand type is determined in advance. The core word vector of a demand type can be determined as follows: determine the seed query set of the demand type; search with each seed query in the seed query set, extract core words from the text of the search results, and determine the weight of each core word based on its occurrence in the search result text, to obtain the core word vector of the demand type.
The seed queries that make up the seed query set of a demand type embody the need of the corresponding preset type. A seed query set can be configured manually, or annotated manually in the search logs. More preferably, seed queries can also be mined from the search logs. For example, queries whose search count is higher than a preset first threshold are obtained from the search logs of the vertical search for the demand type and used as seed queries of the demand type. Alternatively, queries for which a website of the demand type was clicked, or a title containing a feature word of the demand type was clicked, are obtained from the web search logs, and those whose search count is higher than a preset second threshold are used as seed queries of the demand type; and so on.
For example, the seed queries in the seed query set of the game class may include: "download of standalone version mobile phone trivial games", "precious prompt fast lp608 mobile phone games are downloaded", "World of Warcraft's download", "World of Warcraft", etc.
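Mining seed queries from a vertical search log by search count can be sketched as below. The log format (a flat list with one entry per search), the sample entries, and the name `mine_seed_queries` are assumptions for illustration; "higher than" the threshold is read as a strict comparison.

```python
from collections import Counter

def mine_seed_queries(logged_queries, first_threshold):
    """Keep queries searched more than first_threshold times."""
    counts = Counter(logged_queries)
    return {query for query, count in counts.items() if count > first_threshold}

# Made-up log of a game vertical search, one entry per search.
log = (
    ["World of Warcraft"] * 3
    + ["World of Warcraft download"] * 2
    + ["weather"]
)
seeds = mine_seed_queries(log, first_threshold=1)
```

The click-based variant described above would filter the log by clicked website or clicked title before counting, then apply the second threshold in the same way.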
After searching with each seed query in each seed query set, core words can be extracted in the following ways:
First way: determine each n-gram in the text of the search results and determine the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of the demand type.
When a search engine ranks the search results for a seed query, it usually ranks them by relevance to the seed query. Therefore, to improve efficiency and reduce computation, the top N3 search results can be selected and the n-grams determined from the text of these top N3 results, where N3 is a preset positive integer.
A search result page may contain a large amount of information, much of which may have little semantic relevance to the seed query. Therefore, the search result text used to determine the n-grams can be: the web page title, or the sentences in the web page that contain the seed query. The same applies to the ways below and is not repeated.
The weight of each n-gram can be determined in ways that include, but are not limited to, the following two:
Mode 1: assign each n-gram a weight according to its TF in the search result text and its n value. Generally, the higher the TF of an n-gram in the search result text, the more important the n-gram is; and the larger the n value, the more information the n-gram carries and the higher its weight should be. This mode can therefore use TF*n as the weight of an n-gram.
Mode 2: assign each n-gram a weight according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram. This mode is based on information theory, and the formula can be as shown in formula (2).
$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w,q)+1)}{\log(\mathrm{sf}(w)+1)+\log(\mathrm{sf}(q)+1)} \times \log(\mathrm{idf}(w)+1) \qquad (2)$$
Here w is the n-gram, q is the corresponding seed query, Centrality(w) is the weight of the n-gram, Co(w, q) is the number of sentences in which the n-gram co-occurs with the seed query, sf(w) is the number of sentences in the search result text in which the n-gram occurs, sf(q) is the number of sentences in the search result text in which the seed query occurs, and idf(w) is the inverse document frequency of the n-gram.
It should be noted that formula (2) is only an example provided by the embodiment of the present invention. Simple modifications and equivalent substitutions based on this formula are not enumerated one by one, and all fall within the scope defined by the present invention.
Second way: perform word segmentation on the text of the search results and remove stop words, count the term frequency of each remaining word, select the words whose term frequency is higher than a preset term-frequency threshold, and determine a weight for each selected word based on its term frequency, to obtain the core word vector of the demand type.
The higher the term frequency of a word, the larger its weight.
Third way: perform word segmentation on the text of the search results and remove stop words, count the TF and IDF of each remaining word, select the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determine a weight for each selected word based on its TF-IDF value, to obtain the core word vector of the demand type.
The larger the TF-IDF value of a word, the larger its weight.
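The third way can be sketched as follows. The text does not fix how TF and IDF are computed, so the choices here are assumptions: TF is the total count of a word across the result texts, IDF is log(N / df) over those same texts, and the TF-IDF value is used directly as the weight. The function name and the sample documents are made up.

```python
import math
from collections import Counter

def tfidf_core_words(docs, stop_words, threshold):
    """Return {word: TF-IDF} for words scoring above the threshold.

    `docs` is a list of already-segmented result texts (lists of words).
    """
    cleaned = [[w for w in doc if w not in stop_words] for doc in docs]
    tf = Counter(w for doc in cleaned for w in doc)
    df = Counter(w for doc in cleaned for w in set(doc))
    n_docs = len(cleaned)
    vector = {}
    for word, count in tf.items():
        score = count * math.log(n_docs / df[word])
        if score > threshold:
            vector[word] = score
    return vector

# Made-up segmented result texts for a hypothetical seed query.
docs = [
    ["game", "download", "the"],
    ["game", "novel"],
    ["game", "download"],
]
core = tfidf_core_words(docs, stop_words={"the"}, threshold=0.5)
```

Under this particular IDF choice a word that appears in every result text scores 0, so an IDF taken from a larger background corpus may suit the purpose better; the second way is the same sketch with the IDF factor removed.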
Fourth way: perform word segmentation on the text of the search results and remove stop words; assign a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word; and select the words whose weight is higher than a preset weight threshold, to obtain the core word vector of the demand type.
The weight is calculated as shown in formula (3).
$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w,q)+1)}{\log(\mathrm{sf}(w)+1)+\log(\mathrm{sf}(q)+1)} \times \log(\mathrm{idf}(w)+1) \qquad (3)$$
Here w is a word remaining after stop word removal, q is the corresponding seed query, Centrality(w) is the weight of word w, Co(w, q) is the number of sentences in which word w co-occurs with the seed query, sf(w) is the number of sentences in the search result text in which word w occurs, sf(q) is the number of sentences in the search result text in which the seed query occurs, and idf(w) is the inverse document frequency of word w.
When calculating the similarity between the core word vector of the query to be identified and the core word vector of a demand type, cosine similarity can be used. Table 1 gives, as an example, the similarity between several queries to be identified and each demand type.
Table 1
| Query to be identified | Similarity to game class | Similarity to software class | Similarity to novel class |
|---|---|---|---|
| Network game repair sieve legend | 0.0026 | 0 | 0.4431 |
| The novel of DNF | 0.0050 | 0.0001 | 0.3467 |
| Story of a play or opera task in the DNF | 0.3616 | 0.0128 | 0 |
| Swordsman's love standalone version 3 attack strategys | 0.1631 | 0 | 0.0063 |
| Swordsman's love is read non-cigarette of step in full | 0 | 0 | 0.1205 |
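The cosine similarity between two core word vectors can be computed as in the sketch below. It assumes each vector is stored sparsely as a dictionary mapping a core word or n-gram to its weight; that representation, the zero-vector guard, and the sample vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between two sparse weight vectors."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up vectors: a query vector and two demand type vectors.
query_vector = {"novel": 2.0, "read": 1.0}
novel_type = {"novel": 3.0, "read": 1.5, "chapter": 0.5}
software_type = {"install": 1.0, "download": 2.0}
sim_novel = cosine_similarity(query_vector, novel_type)
sim_software = cosine_similarity(query_vector, software_type)
```

Vectors with no words in common score exactly 0, which is consistent with the zero entries in Table 1.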
After determining similarity, the similarity value can be come front N2 demand type, perhaps the similarity value surpasses the demand type that the demand type of presetting similarity threshold is identified as query to be identified, and wherein N2 is default positive integer.The situation that example is as shown in table 1 supposes that N2 is 1, then can identify " novel of DNF " and be novel class demand, and " swordsman's love standalone version 3 attack strategys " are the game class demand.
Alternatively, based on a preset correspondence between similarity values and similarity levels, the demand level of the query to be identified for each demand type can be determined from the similarity between the core word vector of the query to be identified and the core word vector of that demand type. For example, it may be preset that a similarity above 0.3 is a strong demand level, a similarity between 0.1 and 0.3 is a weak demand level, and a similarity below 0.1 is a no-demand level. Then, in Table 1, "novel of DNF" has a strong demand for the novel class and no demand for the game class or the software class, while "swordsman's love standalone version 3 attack strategys" has a weak demand for the game class and no demand for the software class or the novel class.
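The mapping from similarity values to demand levels can be sketched as follows, using the example thresholds 0.3 and 0.1 from the text. How values exactly equal to a threshold are treated is not stated in the description; assigning them to the higher level is an assumption here, as are the function name and the level labels.

```python
def demand_level(similarity, strong=0.3, weak=0.1):
    """Map a similarity value to a demand level label."""
    if similarity >= strong:
        return "strong"
    if similarity >= weak:
        return "weak"
    return "none"
```

Applied to the Table 1 row for "novel of DNF", this gives a strong demand for the novel class and no demand for the other two classes.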
The above is a detailed description of the search demand identification method provided by the present invention. The search demand identification device provided by the present invention is described in detail below through embodiment two.
Embodiment two,
Fig. 3 is a structural diagram of the device provided by embodiment two of the present invention. As shown in Fig. 3, the device may comprise: an identifying object acquiring unit 300, a primary vector determining unit 310, and a demand type determining unit 320.
The identifying object acquiring unit 300 obtains the query to be identified.
The primary vector determining unit 310 obtains the search results of the query to be identified, determines each n-gram of the search result text, and determines the weight of each n-gram based on the occurrence of that n-gram in the search result text, thereby obtaining the core word vector of the query to be identified.
Because the search results of a query are usually highly relevant to the query, the primary vector determining unit 310 can submit the query to be identified to a search engine, obtain the search results returned by the search engine, and use them to extract the core word vector of the query to be identified.
When a search engine searches for the query to be identified, it ranks the search results by their relevance to the query. Therefore, to improve efficiency and reduce the amount of computation, the primary vector determining unit 310, when obtaining the search results of the query to be identified, specifically obtains the top N1 search results, where N1 is a preset positive integer.
The primary vector determining unit 310 can determine the weight of each n-gram in either of the following two ways:
The first way: assign a weight to the n-gram according to its TF in the search result text and its n value. In general, the higher the TF of an n-gram in the search result text, the more important the n-gram is; and the larger the n value, the more information the n-gram carries and the higher its weight should be. Therefore, in this way, TF*n can be used as the weight of the n-gram.
The second way: assign a weight to the n-gram according to the number of sentences of the search result text in which the n-gram occurs, the number of sentences in which it co-occurs with the query to be identified, the number of sentences of the search result text in which the query to be identified occurs, and the IDF of the n-gram. This way is based on information theory; the formula can be as shown in formula (1) of embodiment one and is not repeated here.
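The first way (weight = TF*n) can be sketched as below. The example tokenizes on whitespace, which is an assumption made to keep it short; for Chinese text a word segmenter or character n-grams would be needed instead. The function name and the `max_n` parameter are also hypothetical.

```python
from collections import Counter

def ngram_weights(texts, max_n=3):
    """Return {n-gram tuple: TF * n} over all texts, for n = 1..max_n."""
    counts = Counter()
    for text in texts:
        tokens = text.split()  # whitespace tokenization is an assumption
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # The weight grows with both the term frequency and the n-gram length.
    return {gram: tf * len(gram) for gram, tf in counts.items()}
```

The resulting dictionary can serve directly as a sparse core word vector keyed by n-gram.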
Because the pages of the search results may contain a large amount of information, much of which has little semantic relevance to the query to be identified, the search result text mentioned above may comprise: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
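Restricting the search result text to titles or to sentences that contain the query can be sketched as follows. The result records with `title` and `body` keys, the two flags, and the punctuation-based sentence splitting are all assumptions made for illustration.

```python
import re

def search_result_text(results, query, use_titles=True, use_sentences=True):
    """Collect page titles and/or body sentences that contain the query."""
    texts = []
    for result in results:
        if use_titles:
            texts.append(result["title"])
        if use_sentences:
            # Split on common Chinese and Western sentence-ending punctuation.
            for sentence in re.split(r"[。！？.!?]+", result["body"]):
                sentence = sentence.strip()
                if sentence and query in sentence:
                    texts.append(sentence)
    return texts
```

Sentences that do not mention the query are dropped, which is how the less relevant page content gets filtered out.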
The demand type determining unit 320 calculates the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determines the demand type of the query to be identified according to the similarity calculation results.
Because the core word vector of each demand type needs to be determined in advance, the device may further comprise a secondary vector determining unit 330.
The secondary vector determining unit 330 may specifically comprise a seed query determining subelement 331 and a core word vector forming subelement 332.
The seed query determining subelement 331 determines the seed query set of a demand type. Specifically, the set can be obtained in any of the following ways:
The first way: obtain a seed query set of the demand type that has been configured manually.
The second way: obtain a seed query set of the demand type that has been manually labeled in the search logs.
The third way: from the search logs of the vertical search of the demand type, obtain the queries whose number of searches is higher than a preset first threshold to form the seed query set of this demand type.
The fourth way: from the search logs of web search, obtain the queries for which a website of this demand type was clicked, or for which a title containing a feature word of this demand type was clicked, and form the seed query set of this demand type from those of the obtained queries whose number of searches is higher than a preset second threshold.
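The fourth way can be sketched as below. The log format (tuples of query, clicked host, clicked title), the function and parameter names, and the use of the number of qualifying log records as the "number of searches" are assumptions for this example; a real search log would have its own schema.

```python
from collections import Counter

def seed_queries_from_clicks(click_log, type_sites, feature_words, min_count):
    """Select seed queries for one demand type from a web search click log.

    click_log: iterable of (query, clicked_host, clicked_title) tuples.
    """
    counts = Counter()
    for query, clicked_host, clicked_title in click_log:
        on_type_site = clicked_host in type_sites
        has_feature_word = any(w in clicked_title for w in feature_words)
        if on_type_site or has_feature_word:
            counts[query] += 1
    # Keep only queries seen more often than the preset second threshold.
    return {q for q, c in counts.items() if c > min_count}
```

The third way is the simpler special case in which every record of the vertical search log qualifies and only the count threshold is applied.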
The core word vector forming subelement 332 obtains the search results of each seed query in the seed query set, extracts core words from the search result text, and determines the weight of each core word based on its occurrence in the search result text, thereby obtaining the core word vector of this demand type. That is, the core word vector forming subelement 332 submits each seed query to a search engine and obtains the search results returned by the search engine.
Specifically, the core word vector forming subelement 332 can obtain the core word vector of this demand type in any of the following four ways:
Mode one: obtain the search results of each seed query in the seed query set of this demand type, determine each n-gram in the search result text, and determine the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of this demand type.
When determining the weight of each n-gram based on its occurrence in the search result text, a weight can be assigned to each n-gram according to its TF in the search result text and its n value; or a weight can be assigned according to the number of sentences of the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences of the search result text in which the seed query occurs, and the IDF of the n-gram. For the latter, formula (2) of embodiment one can be used, which is not repeated here.
Mode two: obtain the search results of each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, count the TF of each remaining word, select the words whose TF is higher than a preset word frequency threshold, and determine a weight for each selected word based on its word frequency, thereby obtaining the core word vector of this demand type. The higher the word frequency of a word, the larger its weight.
Mode three: obtain the search results of each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, count the TF and IDF of each remaining word, select the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determine a weight for each selected word based on its TF-IDF value, thereby obtaining the core word vector of this demand type. The larger the TF-IDF value of a word, the larger its weight.
Mode four: obtain the search results of each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, assign a weight to each remaining word according to the number of sentences of the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences of the search result text in which the seed query occurs, and the IDF of the word; then select the words whose weight is higher than a preset weight threshold to obtain the core word vector of this demand type. Formula (3) of embodiment one can be used to assign the weights, which is not repeated here.
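Mode three can be sketched as follows. The example assumes whitespace tokenization in place of a real word segmenter, takes the IDF values as a precomputed dictionary, and treats words missing from that dictionary as having an IDF of 0; these choices and all names are assumptions made for illustration.

```python
from collections import Counter

def tfidf_core_words(texts, stop_words, idf, threshold):
    """Return {word: TF * IDF} for words whose TF-IDF exceeds the threshold."""
    tf = Counter()
    for text in texts:
        # Whitespace split stands in for word segmentation (an assumption).
        tf.update(t for t in text.split() if t not in stop_words)
    weights = {word: count * idf.get(word, 0.0) for word, count in tf.items()}
    return {word: w for word, w in weights.items() if w > threshold}
```

Mode two follows the same shape with the IDF factor fixed at 1, so that only the word frequency is thresholded.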
After the similarities are determined, the demand type determining unit 320 can determine the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or, according to a preset correspondence between similarity values and similarity levels, determine the similarity level corresponding to a calculated similarity value as the demand level of the query to be identified for the corresponding demand type.
After the demand type has been identified using the method or device provided by the embodiments of the present invention, it can be used in, but is not limited to, the following application scenarios:
1) Ranking in large search. After the user enters a query, the method and device of the embodiments of the present invention can identify the demand type of the query, so that pages corresponding to the demand type of the query are ranked higher in the search results of the large search.
For example, when the user enters the query "home cooking high definition", the large search can identify that the query has a video class demand, and the results page of the large search can include video information related to the TV series "home cooking". This video information can be provided by the video vertical search and inserted into the search results of the large search. In this way, the video class pages can be ranked at the front of the search results of the large search, as shown in Fig. 4, which greatly improves user satisfaction and the search experience.
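Promoting pages of the identified demand type can be sketched as a stable reordering. The representation of a result as a (url, result_type) pair and the function name are assumptions; the description does not say how the reordering is implemented.

```python
def promote_demand_type(results, demand_type):
    """Move results of the given type to the front, keeping relative order.

    results: list of (url, result_type) pairs in their original ranking.
    """
    # sorted() is stable, and False sorts before True, so matching results
    # come first while the original order is preserved within each group.
    return sorted(results, key=lambda r: r[1] != demand_type)
```

Because the sort is stable, the original relevance ranking is kept both among the promoted pages and among the rest.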
2) Vertical search. After the user enters a query, the method and device of the embodiments of the present invention can identify the demand type of the query and distribute the query to the most suitable content resource or application provider for processing, so that results matching the user's demand are returned accurately and efficiently.
For example, when the user enters "from Baidu Building to Wudaokou", the query can be identified as having a map class demand and is provided to the map vertical search, which calculates the bus routes and then directly displays a bus travel map and related bus information from Baidu Building to Wudaokou, as shown in Fig. 5.
3) Information recommendation. After the user enters a query, the method and device of the embodiments of the present invention can identify the demand type of the query and recommend information to the user based on this demand type, such as advertisement recommendation, recommendation on a knowledge question-and-answer platform, query recommendation, and so on.
For example, if the user enters the query "cheap MP3 player" and its demand type is identified as the shopping class, advertisements related to MP3 players can be recommended in the search results, so that the advertisements closely match the user's actual demand.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (18)

1. A method for search demand identification, characterized in that the method comprises:
S1, obtaining a query to be identified;
S2, obtaining search results of the query to be identified, determining each n-gram of the search result text, and determining the weight of each n-gram based on the occurrence of that n-gram in the search result text, to obtain a core word vector of the query to be identified;
S3, calculating the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determining the demand type of the query to be identified according to the similarity calculation results.
2. The method according to claim 1, characterized in that obtaining the search results of the query to be identified in step S2 is: obtaining the top N1 search results among the search results of the query to be identified, where N1 is a preset positive integer.
3. The method according to claim 1, characterized in that determining the weight of each n-gram based on the occurrence of that n-gram in the search result text in step S2 specifically comprises:
assigning a weight to the n-gram according to its term frequency (TF) in the search result text and its n value; or,
assigning a weight to the n-gram according to the number of sentences of the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences of the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram.
4. The method according to claim 1, 2 or 3, characterized in that the search result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
5. The method according to claim 1, characterized in that determining the core word vector of a demand type comprises:
S31, determining a seed query set of this demand type;
S32, searching with each seed query in the seed query set, extracting core words from the search result text, and determining the weight of each core word based on its occurrence in the search result text, to obtain the core word vector of this demand type.
6. The method according to claim 5, characterized in that the seed query set of the demand type is determined by:
manual configuration; or,
manual labeling in the search logs; or,
obtaining, from the search logs of the vertical search of this demand type, the queries whose number of searches is higher than a preset first threshold to form the seed query set of this demand type; or,
obtaining, from the search logs of web search, the queries for which a website of this demand type was clicked or for which a title containing a feature word of this demand type was clicked, and forming the seed query set of this demand type from those of the obtained queries whose number of searches is higher than a preset second threshold.
7. The method according to claim 5, characterized in that step S32 specifically comprises:
searching with each seed query in the seed query set of this demand type, determining each n-gram in the search result text, and determining the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of this demand type; or,
searching with each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, counting the TF of each remaining word, selecting the words whose TF is higher than a preset word frequency threshold, and determining a weight for each selected word based on its word frequency, to obtain the core word vector of this demand type; or,
searching with each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, counting the TF and IDF of each remaining word, selecting the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determining a weight for each selected word based on its TF-IDF value, to obtain the core word vector of this demand type; or,
searching with each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, assigning a weight to each remaining word according to the number of sentences of the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences of the search result text in which the seed query occurs, and the IDF of the word, and selecting the words whose weight is higher than a preset weight threshold, to obtain the core word vector of this demand type.
8. The method according to claim 7, characterized in that determining the weight of each n-gram based on its occurrence in the search result text comprises:
assigning a weight to each n-gram according to its TF in the search result text and its n value; or,
assigning a weight to the n-gram according to the number of sentences of the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences of the search result text in which the seed query occurs, and the IDF of the n-gram.
9. The method according to claim 1, characterized in that determining the demand type of the query to be identified according to the similarity calculation results in step S3 comprises:
determining the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or,
according to a preset correspondence between similarity values and similarity levels, determining the similarity level corresponding to a similarity value calculated in step S3 as the demand level of the query to be identified for the corresponding demand type.
10. A device for search demand identification, characterized in that the device comprises:
an identifying object acquiring unit, configured to obtain a query to be identified;
a primary vector determining unit, configured to obtain search results of the query to be identified, determine each n-gram of the search result text, and determine the weight of each n-gram based on the occurrence of that n-gram in the search result text, to obtain a core word vector of the query to be identified;
a demand type determining unit, configured to calculate the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determine the demand type of the query to be identified according to the similarity calculation results.
11. The device according to claim 10, characterized in that, when obtaining the search results of the query to be identified, the primary vector determining unit specifically obtains the top N1 search results among the search results of the query to be identified, where N1 is a preset positive integer.
12. The device according to claim 10, characterized in that, when determining the weight of each n-gram, the primary vector determining unit assigns a weight to the n-gram according to its term frequency (TF) in the search result text and its n value; or,
assigns a weight to the n-gram according to the number of sentences of the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences of the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram.
13. The device according to claim 10, 11 or 12, characterized in that the search result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
14. The device according to claim 10, characterized in that the device further comprises a secondary vector determining unit;
the secondary vector determining unit specifically comprises:
a seed query determining subelement, configured to determine the seed query set of a demand type;
a core word vector forming subelement, configured to obtain the search results of each seed query in the seed query set, extract core words from the search result text, and determine the weight of each core word based on its occurrence in the search result text, to obtain the core word vector of this demand type.
15. The device according to claim 14, characterized in that the seed query determining subelement obtains a seed query set of the demand type that has been configured manually; or,
obtains a seed query set of the demand type that has been manually labeled in the search logs; or,
obtains, from the search logs of the vertical search of the demand type, the queries whose number of searches is higher than a preset first threshold to form the seed query set of this demand type; or,
obtains, from the search logs of web search, the queries for which a website of this demand type was clicked or for which a title containing a feature word of this demand type was clicked, and forms the seed query set of this demand type from those of the obtained queries whose number of searches is higher than a preset second threshold.
16. The device according to claim 14, characterized in that the core word vector forming subelement obtains the search results of each seed query in the seed query set of this demand type, determines each n-gram in the search result text, and determines the weight of each n-gram based on its occurrence in the search result text, to obtain the core word vector of this demand type; or,
obtains the search results of each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, counts the TF of each remaining word, selects the words whose TF is higher than a preset word frequency threshold, and determines a weight for each selected word based on its word frequency, to obtain the core word vector of this demand type; or,
obtains the search results of each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, counts the TF and IDF of each remaining word, selects the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determines a weight for each selected word based on its TF-IDF value, to obtain the core word vector of this demand type; or,
obtains the search results of each seed query in the seed query set of this demand type; after segmenting the search result text into words and removing stop words, assigns a weight to each remaining word according to the number of sentences of the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences of the search result text in which the seed query occurs, and the IDF of the word, and selects the words whose weight is higher than a preset weight threshold, to obtain the core word vector of this demand type.
17. The device according to claim 16, characterized in that, when determining the weight of each n-gram based on its occurrence in the search result text, the core word vector forming subelement specifically assigns a weight to each n-gram according to its TF in the search result text and its n value; or,
assigns a weight to the n-gram according to the number of sentences of the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences of the search result text in which the seed query occurs, and the IDF of the n-gram.
18. The device according to claim 10, characterized in that the demand type determining unit determines the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or,
according to a preset correspondence between similarity values and similarity levels, determines the similarity level corresponding to a calculated similarity value as the demand level of the query to be identified for the corresponding demand type.
CN201110273327.2A 2011-09-15 2011-09-15 A kind of method and apparatus of search need identification Active CN102999520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110273327.2A CN102999520B (en) 2011-09-15 2011-09-15 A kind of method and apparatus of search need identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110273327.2A CN102999520B (en) 2011-09-15 2011-09-15 A kind of method and apparatus of search need identification

Publications (2)

Publication Number Publication Date
CN102999520A true CN102999520A (en) 2013-03-27
CN102999520B CN102999520B (en) 2016-04-27

Family

ID=47928094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110273327.2A Active CN102999520B (en) 2011-09-15 2011-09-15 A kind of method and apparatus of search need identification

Country Status (1)

Country Link
CN (1) CN102999520B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794251A (en) * 2015-05-19 2015-07-22 苏州工讯科技有限公司 Search result utility analysis-based industrial product vertical search engine arranging technology
CN106951422A (en) * 2016-01-07 2017-07-14 腾讯科技(深圳)有限公司 The method and apparatus of webpage training, the method and apparatus of search intention identification
CN107092621A (en) * 2016-11-24 2017-08-25 北京小度信息科技有限公司 Information search method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101820592A (en) * 2009-02-27 2010-09-01 华为技术有限公司 Method and device for mobile search
CN102096717A (en) * 2011-02-15 2011-06-15 百度在线网络技术(北京)有限公司 Search method and search engine
WO2011079415A1 (en) * 2009-12-30 2011-07-07 Google Inc. Generating related input suggestions

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101820592A (en) * 2009-02-27 2010-09-01 华为技术有限公司 Method and device for mobile search
WO2011079415A1 (en) * 2009-12-30 2011-07-07 Google Inc. Generating related input suggestions
CN102096717A (en) * 2011-02-15 2011-06-15 百度在线网络技术(北京)有限公司 Search method and search engine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
桑艳艳 等: "拟合用户偏好的个性化搜索", 《情报科学》 *

Also Published As

Publication number Publication date
CN102999520B (en) 2016-04-27

Similar Documents

Publication Publication Date Title
CN100557612C (en) A kind of search result ordering method and device based on search engine
CN102693279B (en) Method, device and system for fast calculating comment similarity
CN105893444A (en) Sentiment classification method and apparatus
CN102200975B (en) Vertical search engine system using semantic analysis
CN105426539A (en) Dictionary-based lucene Chinese word segmentation method
CN102999521B (en) A kind of method and device identifying search need
CN102360383A (en) Method for extracting text-oriented field term and term relationship
CN103020066A (en) Method and device for recognizing search demand
US8515731B1 (en) Synonym verification
CN103336766A (en) Short text garbage identification and modeling method and device
CN103885937A (en) Method for judging repetition of enterprise Chinese names on basis of core word similarity
CN103294693A (en) Searching method, server and system
CN103186556A (en) Method for obtaining and searching structural semantic knowledge and corresponding device
CN103324745A (en) Text garbage identifying method and system based on Bayesian model
CN102760142A (en) Method and device for extracting subject label in search result aiming at searching query
CN102033919A (en) Method and system for extracting text key words
CN103577478A (en) Web page pushing method and system
CN103793434A (en) Content-based image search method and device
KR101254362B1 (en) Method and system for providing keyword ranking using common affix
CN103020067A (en) Method and device for determining webpage type
CN102236692A (en) Information processing device, information processing method, and program
CN103914533A (en) Promotion search result display method and device
CN104268230A (en) Method for detecting objective points of Chinese micro-blogs based on heterogeneous graph random walk
CN103745380A (en) Advertisement delivery method and apparatus
CN106933878B (en) Information processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant