CN103020066A - Method and device for recognizing search demand - Google Patents

Method and device for recognizing search demand

Info

Publication number
CN103020066A
CN103020066A CN 201110282840 CN201110282840A CN103020066A CN 103020066 A CN103020066 A CN 103020066A CN 201110282840 CN201110282840 CN 201110282840 CN 201110282840 A CN201110282840 A CN 201110282840A CN 103020066 A CN103020066 A CN 103020066A
Authority
CN
China
Prior art keywords
query
search
identified
sorter
search results
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110282840
Other languages
Chinese (zh)
Other versions
CN103020066B (en)
Inventor
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110282840.8A
Publication of CN103020066A
Application granted
Publication of CN103020066B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for recognizing a search demand. The method comprises the following steps: after a query to be recognized is received, obtaining search results of the query; performing demand classification on the search results with a classifier, based on preset search result text features; and merging the demand classification results of the search results and determining the demand type of the query according to the merged result. The method does not depend on whether the query contains preset keywords, so demand recognition can be performed for any query. In addition, because the timeliness of a user's search demand is usually reflected in the search results, the demand type recognized by the method reflects that timeliness, which improves the accuracy of search demand recognition.

Description

A method and device for recognizing search demand
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a method and device for recognizing search demand.
[Background]
With the rapid development and maturing of the Internet worldwide, information resources on the network keep growing and the volume of information data is expanding rapidly. Obtaining information through search engines has become the main way people obtain information today. Providing users with more convenient and more accurate query services is the current and future direction of search engine technology.
In search engine technology, recognizing the user's search demand is an important part of improving search accuracy and effectiveness, and its effect is especially significant in structured search (that is, vertical search). Existing search demand recognition usually relies on simply matching preset keywords. For example, for the video demand, keywords such as "watch online", "download online", "on demand", and "watch in HD" are preset; if a search request (query) contains one of these keywords, such as the query "home cooking watch in HD", the query is recognized as having a video demand. This approach has the following defects:
Defect one: if the query contains no preset keyword, its demand type cannot be recognized. For example, if the query is only "home cooking", it is difficult to determine the demand of the query directly from the query itself.
Defect two: the timeliness of a query's demand cannot be reflected. The demand of some queries changes over time. Take the query "home cooking": before the TV series "Home Cooking" was broadcast, the main demand of this query was the menu class and the cuisine class; while the series was on air, the main demand probably shifted to the video class, with the menu class and the cuisine class becoming secondary demands; and after the series finished its run and public attention declined, the main demand returned to the menu class and the cuisine class. Existing search demand recognition methods clearly cannot reflect such changes.
Both defects ultimately lower the accuracy of search demand recognition, so that the search results for the query fail to satisfy the search demand accurately and the user has to spend more time and resources to find the needed content.
[Summary of the Invention]
The present invention provides a method and device for recognizing search demand, so as to overcome the defects that a demand cannot be recognized when the query contains no preset keyword and that the timeliness of the query's demand cannot be reflected, thereby improving the accuracy of search demand recognition.
The specific technical solution is as follows:
A method for recognizing search demand, the method comprising:
S1: after receiving a query to be recognized, obtaining search results of the query to be recognized;
S2: using a classifier to perform demand classification on each search result based on preset search result text features;
S3: fusing the demand classification results of the search results, and determining the demand type of the query to be recognized according to the fusion result.
According to a preferred embodiment of the present invention, step S1 specifically comprises:
after receiving the query to be recognized, submitting the query to a search engine and obtaining the top N search results returned by the search engine; or
after receiving the query to be recognized, expanding the query by submitting combinations of the query and expansion words to the search engine, and obtaining the top N search results among those returned for these combinations, wherein the expansion words are the demand words preset for each demand class;
wherein N is a preset positive integer.
According to a preferred embodiment of the present invention, more than one classifier is used in step S2, and each classifier uses different search result text features.
According to a preferred embodiment of the present invention, the classifier comprises: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
According to a preferred embodiment of the present invention, the classifier built for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be recognized appears in the web page title, and the number of times it appears;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be recognized in the search log to the total number of clicks on all web page titles corresponding to that query.
According to a preferred embodiment of the present invention, the classifier built for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be recognized appears; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
According to a preferred embodiment of the present invention, the classifier built for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be recognized in the search log to the total number of clicks on all URLs corresponding to that query.
According to a preferred embodiment of the present invention, building the core word vector of a demand type comprises:
A1: obtaining seed queries of the demand type;
A2: searching with each seed query of the demand type and obtaining the top N1 search results for each, where N1 is a preset positive integer;
A3: performing word segmentation on the text of the obtained search results to obtain all n-grams;
A4: determining the weight of each n-gram from its term frequency-inverse document frequency (tf*idf) value, and taking the N2 n-grams with the highest weights as the core word vector of the demand type, where N2 is a preset positive integer.
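Steps A3 and A4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the separate background document set used for the idf term, and the exact weighting (raw count times the log of the inverse document frequency) are all assumptions, since the text only says that the weights come from tf*idf. The search results are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_core_word_vector(result_docs, background_docs, n_values=(1, 2), top_n2=10):
    """Weight the n-grams of one demand type by tf*idf and keep the top N2.

    result_docs: segmented texts of the top N1 results of the seed queries.
    background_docs: segmented texts used only for the idf term (an assumption).
    """
    def grams(tokens):
        return [g for n in n_values for g in ngrams(tokens, n)]

    # tf: how often each n-gram occurs in the demand type's search results
    tf = Counter(g for doc in result_docs for g in grams(doc))

    # df: in how many documents (results plus background) each n-gram occurs
    all_docs = list(result_docs) + list(background_docs)
    df = Counter(g for doc in all_docs for g in set(grams(doc)))

    weights = {g: count * math.log(len(all_docs) / df[g]) for g, count in tf.items()}
    # Highest weight first; ties are broken alphabetically for a stable order
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n2]
```

With this weighting, an n-gram that is frequent in the demand type's results but rare elsewhere receives a high weight, which is the usual motivation for tf*idf.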
According to a preferred embodiment of the present invention, step A1 comprises:
obtaining seed queries of the demand type configured manually; or
obtaining seed queries of the demand type that were manually labeled in the search log; or
obtaining, from the search log of the vertical search for the demand type, queries whose search count exceeds a preset first threshold as seed queries of the demand type; or
obtaining, from the search log of web search, queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and taking those of the obtained queries whose search count exceeds a preset second threshold as seed queries of the demand type.
According to a preferred embodiment of the present invention, the classifier is a maximum entropy classifier or a support vector machine classifier.
According to a preferred embodiment of the present invention, if there is one classifier, step S3 is: determining the demand type of the query to be recognized according to the number of search results in each class of the demand classification results;
if there are multiple classifiers, step S3 uses a boosting-based fusion method or a linearly weighted multi-classifier fusion method.
A device for recognizing search demand, the device comprising:
a result acquisition unit, configured to obtain the search results of a query to be recognized after the query is received;
a classifier, configured to perform demand classification, based on preset search result text features, on each search result obtained by the result acquisition unit;
a demand fusion unit, configured to fuse the demand classification results of the search results and determine the demand type of the query to be recognized according to the fusion result.
According to a preferred embodiment of the present invention, after receiving the query to be recognized, the result acquisition unit submits the query to a search engine and obtains the top N search results returned by the search engine; or
after receiving the query to be recognized, the result acquisition unit expands the query by submitting combinations of the query and expansion words to the search engine, and obtains the top N search results among those returned for these combinations, wherein the expansion words are the demand words preset for each demand class;
wherein N is a preset positive integer.
According to a preferred embodiment of the present invention, the device uses more than one classifier, and each classifier uses different search result text features.
According to a preferred embodiment of the present invention, the classifier comprises: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
According to a preferred embodiment of the present invention, the classifier built for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be recognized appears in the web page title, and the number of times it appears;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be recognized in the search log to the total number of clicks on all web page titles corresponding to that query.
According to a preferred embodiment of the present invention, the classifier built for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be recognized appears; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
According to a preferred embodiment of the present invention, the classifier built for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be recognized in the search log to the total number of clicks on all URLs corresponding to that query.
According to a preferred embodiment of the present invention, the device further comprises a vector building unit for building the core word vector of a demand type;
the vector building unit specifically comprises:
a seed query acquisition subunit, configured to obtain seed queries of the demand type;
a search result acquisition subunit, configured to search with each seed query of the demand type and obtain the top N1 search results for each, where N1 is a preset positive integer;
a phrase acquisition subunit, configured to perform word segmentation on the text of the search results obtained by the search result acquisition subunit to obtain all n-grams;
a vector building subunit, configured to determine the weight of each n-gram from its term frequency-inverse document frequency (tf*idf) value and take the N2 n-grams with the highest weights as the core word vector of the demand type, where N2 is a preset positive integer.
According to a preferred embodiment of the present invention, the seed query acquisition subunit obtains seed queries of the demand type configured manually; or
obtains seed queries of the demand type that were manually labeled in the search log; or
obtains, from the search log of the vertical search for the demand type, queries whose search count exceeds a preset first threshold as seed queries of the demand type; or
obtains, from the search log of web search, queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and takes those of the obtained queries whose search count exceeds a preset second threshold as seed queries of the demand type.
According to a preferred embodiment of the present invention, the classifier is a maximum entropy classifier or a support vector machine classifier.
According to a preferred embodiment of the present invention, if there is one classifier, the demand fusion unit determines the demand type of the query to be recognized according to the number of search results in each class of the demand classification results;
if there are multiple classifiers, the demand fusion unit uses a boosting-based fusion method or a linearly weighted multi-classifier fusion method.
As can be seen from the above technical solution, after obtaining the search results of the query to be recognized, the present invention performs demand classification on the search results and then fuses the demand classification results to determine the demand type of the query. This approach does not depend at all on whether the query contains preset keywords, so demand recognition can be performed for any query. In addition, because the timeliness of a user's search demand is usually reflected in the search results, the demand type recognized in this way fully reflects that timeliness, which improves the accuracy of search demand recognition.
[Description of the Drawings]
Fig. 1 is a flowchart of the method for recognizing search demand provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for building the core word vector of a demand type provided by Embodiment 2 of the present invention;
Fig. 3 is a structural diagram of the device for recognizing search demand provided by Embodiment 3 of the present invention;
Fig. 4 is an example in which the search demand recognition provided by an embodiment of the present invention is applied to general web search ranking;
Fig. 5 is an example in which the search demand recognition provided by an embodiment of the present invention is applied to vertical search.
[Detailed Description]
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a flowchart of the method for recognizing search demand provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: after receiving a query to be recognized, obtain the search results of the query.
After the query to be recognized is received, it is submitted to a search engine for retrieval, and the top N search results are obtained.
When submitting the query to the search engine, it is possible to submit only the query itself and obtain its search results from the search engine. Preferably, the query is expanded: combinations of the query and expansion words are submitted to the search engine, and the search results corresponding to these combinations are obtained from the search engine, where the expansion words are the preset demand words of the demand classes. Because only a small number of preset demand words is needed, generally a few dozen, they can be configured manually.
For example, the preset demand words of the video class include: video, TV series, film, watch in HD, and so on. The preset demand words of the menu class include: menu, recipe, cuisine, and so on. For the query to be recognized "home cooking", the following combinations of the query and expansion words are obtained:
"home cooking video", "home cooking TV series", "home cooking film", "home cooking watch in HD online", "home cooking menu", "home cooking recipe", "home cooking cuisine", and so on. After these combinations are submitted to the search engine, the search engine returns combined search results, from which the top N are taken; alternatively, the top-ranked search results can be taken from the results returned for each combination, so that N search results are obtained in total.
The purpose of searching with the expanded query is to overcome the problem that the demands of the top N search results of some queries are too concentrated, which makes demand recognition inaccurate. For example, the query "Zhang Ziyi" has many demands. When "Zhang Ziyi" is searched on its own, few picture-class results may appear among the top N search results, so it is hard to determine that this query has a strong picture demand. If the query is expanded to "Zhang Ziyi photos", more picture-related results appear among the top N search results, which greatly helps the accuracy of the subsequent recognition of the query's search demand.
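The query expansion described above can be sketched in Python as follows. This is only an illustration: the function name `expand_query` and the dictionary layout are hypothetical, the demand words are English renderings of the ones listed in the example, and joining the query and the demand word with a space is an assumption (the original Chinese queries are written without spaces).

```python
def expand_query(query, demand_words):
    """Combine the query with every preset demand word of every demand class."""
    return [f"{query} {word}" for words in demand_words.values() for word in words]


# Preset demand words per demand class (normally a few dozen, configured manually)
DEMAND_WORDS = {
    "video": ["video", "TV series", "film", "watch in HD"],
    "menu": ["menu", "recipe", "cuisine"],
}

# Each combination would then be submitted to the search engine
combinations = expand_query("home cooking", DEMAND_WORDS)
```

Keeping the demand words grouped by class makes it possible to take a few top results per combination, as the text suggests, instead of relying on a single merged ranking.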
Step 102: use a classifier to perform demand classification on each search result based on preset search result text features.
When classifying the search results in this step, more than one classifier can be used, each using different search result text features. A classifier can be built for at least one of the web page title, web page summary, and URL of the search results. This embodiment takes building three classifiers as an example, referred to here as the title classifier, the summary classifier, and the URL classifier. The classifier features used by these three classifiers are described below.
1) The title classifier can use at least one of the following three search result text features as a classifier feature:
First: whether the query to be recognized appears in the web page title, and the number of times it appears.
This feature measures the relevance between the web page title of a search result and the query to be recognized. If the query appears in the web page title, the search result is more relevant to the query and contributes more to recognizing its search demand. For example, if the web page title of a search result is "The most common home cooking menus - home cooking methods, Cuisine World home cooking" and the query to be recognized is "home cooking", the query appears in this title 3 times, which indicates that this search result contributes substantially to recognizing the demand of the query.
Second: the overlap between the n-grams determined from the web page title and the core word vector of each demand type.
An n-gram is a sequence of n consecutive minimum-granularity words, where n is one or more preset positive integers. Taking the web page title "The most common home cooking menus - home cooking methods, Cuisine World home cooking" as an example, if n is chosen as 1 and 2, the n-grams determined from this title are as follows (glossed word by word in the word order of the original Chinese title):
1-gram: most, common, of, home cooking, menu, home cooking, of, method, cuisine, world, home cooking
2-gram: most common, common of, of home cooking, home cooking menu, menu home cooking, home cooking of, of method, method cuisine, cuisine world, world home cooking
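Extracting the 1-grams and 2-grams of a segmented title can be sketched in Python as follows. The token list is an assumed word-by-word English gloss of the segmented example title (11 words), and the helper name `ngrams` is hypothetical; word segmentation itself is taken as given.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed English gloss of the segmented example title
title_tokens = ["most", "common", "of", "home cooking", "menu", "home cooking",
                "of", "method", "cuisine", "world", "home cooking"]

unigrams = ngrams(title_tokens, 1)  # 11 entries, one per word
bigrams = ngrams(title_tokens, 2)   # 10 entries, one per adjacent word pair
```

A title of m words always yields m - n + 1 n-grams, which is why the example has 11 1-grams and 10 2-grams.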
The core word vector of each demand type can be configured manually or mined automatically; for the automatic mining method, see Embodiment 2.
Suppose that, after the flow shown in Embodiment 2 is executed for the menu-class demand, the resulting core word vector of the menu-class demand is as follows, comprising core words and their weights:
[Figure BDA0000093135840000091: core word vector of the menu-class demand (core words and weights)]
The overlap between the n-grams determined from the web page title and the core word vector of each demand type can be measured as an overlap count or an overlap rate.
Continuing the example above, the overlap counts between the n-grams and the core word vector of the menu class are shown in Table 1.
Table 1
n-gram               Overlap count
home cooking         3
menu                 1
home cooking menu    1
cuisine              1
The overlap rate can be calculated as the ratio of the sum of the weights of the n-grams that overlap with the core word vector of the demand type to the sum of all weights in that core word vector. Continuing the example above, the overlap rate between the n-grams and the core word vector of the menu class is:
(0.82+0.98+1.00+0.95)/(0.82+1.00+1.00+1.00+0.92+0.56+0.98+0.87+1.00+0.95+1.00) = 3.75/10.10 ≈ 0.37
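The overlap count and overlap rate can be sketched in Python as follows. Because the figure with the actual core word vector is not available, the vector below is hypothetical: it reuses the eleven weights from the calculation above, but which weight belongs to which core word is an assumption, and the seven `placeholder_*` entries stand in for core words that do not occur in the title.

```python
from collections import Counter

# Hypothetical menu-class core word vector (word -> weight); the pairing of
# words and weights is assumed, only the weight values come from the example
CORE_VECTOR = {
    "home cooking": 0.82, "menu": 0.98, "home cooking menu": 1.00, "cuisine": 0.95,
    "placeholder_1": 1.00, "placeholder_2": 1.00, "placeholder_3": 0.92,
    "placeholder_4": 0.56, "placeholder_5": 0.87, "placeholder_6": 1.00,
    "placeholder_7": 1.00,
}


def overlap_counts(title_ngrams, core_vector):
    """Count how often each core word occurs among the title's n-grams."""
    return dict(Counter(g for g in title_ngrams if g in core_vector))


def overlap_rate(title_ngrams, core_vector):
    """Weight of the overlapping core words divided by the total weight."""
    total = sum(core_vector.values())
    matched = set(title_ngrams) & set(core_vector)
    return sum(core_vector[g] for g in matched) / total if total else 0.0


# A subset of the title's 1-grams and 2-grams, enough to reproduce Table 1
title_ngrams = ["most", "common", "home cooking", "menu", "home cooking",
                "cuisine", "home cooking", "home cooking menu"]
counts = overlap_counts(title_ngrams, CORE_VECTOR)
rate = overlap_rate(title_ngrams, CORE_VECTOR)  # 3.75 / 10.10
```

Note that the rate counts each overlapping core word once, regardless of how many times it occurs in the title, which matches the four terms in the numerator of the worked example.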
Third: the ratio of the number of times the web page title was clicked for the query to be recognized in the search log to the total number of clicks on all web page titles corresponding to the query.
After a user searches for a query, if the web page title of a search result is attractive enough, the user tends to click that result. Therefore, the more often the web page title of a search result is clicked by users, the better that title satisfies the user's demand.
For example, for the query to be recognized "home cooking", if the web page title "The most common home cooking menus - home cooking methods, Cuisine World home cooking" among its search results was clicked 120 times and all web page titles corresponding to the query were clicked 300 times in total, the ratio is 120/300 = 0.4.
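The click ratio can be sketched in Python as follows. The function name `click_share` and the click table are hypothetical; in particular, the split of the remaining 180 clicks between "title B" and "title C" is made up for illustration, and only the 120 and 300 come from the example. The same helper would apply to the URL click feature of the URL classifier.

```python
def click_share(clicks_by_item, item):
    """Share of all clicks for one query that went to the given title or URL."""
    total = sum(clicks_by_item.values())
    return clicks_by_item.get(item, 0) / total if total else 0.0


# Clicks per web page title for the query "home cooking" (illustrative values)
clicks = {"title A": 120, "title B": 100, "title C": 80}
share = click_share(clicks, "title A")  # 120 / 300
```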
2) The summary classifier can use at least one of the following two search result text features as a classifier feature:
First: the number or proportion of sentences in the web page summary in which the query to be recognized appears.
This feature measures how well the web page summary satisfies the user's demand. The more sentences in the web page summary contain the query to be recognized, the better the search result satisfies the demand of the query.
Suppose the web page summary of a search result is: "Home cooking is indispensable in our lives, there are many ways to make home cooking, such as Northeastern home cooking, Guo Lin home cooking and so on, how is the home cooking menu made most simply? Meishijie provides you with a rich and simple collection of home cooking menus, so that you can learn quickly."
This web page summary can be split into 7 sentences, 6 of which contain "home cooking", so the proportion of sentences in which the query to be recognized appears is 6/7 ≈ 0.86.
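The sentence count and proportion can be sketched in Python as follows. The splitting rule is an assumption: the text does not say how the summary is cut, so this sketch splits on commas and sentence-final punctuation (both ASCII and full-width), which reproduces the 7 sentences of the example. The summary string is an assumed English rendering of the example summary, and the function name is hypothetical.

```python
import re


def sentence_stats(summary, query):
    """Return (sentences containing the query, total sentences, proportion)."""
    parts = re.split(r"[,.?!;，。？！；]", summary)
    sentences = [p.strip() for p in parts if p.strip()]
    hits = sum(1 for s in sentences if query.lower() in s.lower())
    proportion = hits / len(sentences) if sentences else 0.0
    return hits, len(sentences), proportion


summary = (
    "Home cooking is indispensable in our lives, there are many ways to make "
    "home cooking, such as Northeastern home cooking, Guo Lin home cooking and "
    "so on, how is the home cooking menu made most simply? Meishijie provides "
    "you with a rich and simple collection of home cooking menus, so that you "
    "can learn quickly."
)
hits, total, proportion = sentence_stats(summary, "home cooking")
```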
Second: the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
This feature is analogous to the second feature of the title classifier and is not described again here.
3) The URL classifier can use at least one of the following three search result text features as a classifier feature:
First: the rank of the corresponding search result.
When a search engine ranks the search results for a query, the weight of the URL is usually one of the ranking criteria; therefore, the larger the weight of a URL and the stronger the relevance between the query and the text in the page at that URL, the higher the URL is ranked. The rank of the search result corresponding to a URL can therefore be used to measure how well the URL satisfies the demand of the query to be recognized, calculated as follows:
$$\mathrm{rank\_score} = \log\frac{N+1}{n}$$
where rank_score is the degree to which the URL satisfies the demand of the query to be recognized, N is the number of search results selected above, and n is the rank of the current search result.
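The rank feature can be sketched in Python as follows, assuming the formula reads rank_score = log((N + 1) / n). That reading is a reconstruction of a formula that did not survive extraction cleanly, and the choice of the natural logarithm is also an assumption, since the base is not stated.

```python
import math


def rank_score(num_results, rank):
    """Demand satisfaction of a URL from its 1-based rank among N results."""
    if not 1 <= rank <= num_results:
        raise ValueError("rank must be between 1 and num_results")
    return math.log((num_results + 1) / rank)


# With N = 10 selected results, higher-ranked results score higher
scores = [rank_score(10, n) for n in range(1, 11)]
```

Under this reading the score is largest for the first result and stays positive even for the last of the N results, because (N + 1) / n is always greater than 1.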
The second: the page type that network address is corresponding.
We can be in advance method by machine learning obtain page type corresponding to network address, such as video class, picture category, novel class, menu class etc., if the page type of network address is consistent with the demand type, illustrate that then this network address need satisfaction degree to user query on this demand type is high.The value of this feature can be 0 or 1, and is if unanimously must be divided into 1, inconsistent as to be divided into 0.
The third: in the search daily record during the corresponding query to be identified of this network address clicked number of times account for the ratio of the clicked total degree of corresponding all network address of this query to be identified.
Behind query of user search, if certain bar Search Results is high to user's whole satisfaction, then the user will tend to click this Search Results.Therefore certain network address by the user click more, illustrate that then the ability that this network address meets consumers' demand is stronger.
For example, in the search daily record, in the Search Results of query " home cooking " correspondence, the clicked number of times of network address " www.meishij.net/chufang/diy/jiangchangcaipu/ " is 100 times, and the clicked total degree of network address corresponding to this query is 300 times, and then to account for the ratio of the clicked total degree of the corresponding network address of this query to be identified be 100/300=0.33 to the clicked number of times of this network address.
After the sorter feature that obtains respectively title sorter, summary sorter, network address sorter, make up respectively training set, this training set can be set up by the method for craft or machine learning, thereby trains 3 sorters.These training sets can be made up after feature extraction by the sample of each demand type, comprise the eigenwert of each sorter feature that each demand type is corresponding in the training set that finally obtains.
Wherein each sorter can adopt but be not limited to maximum entropy classifiers, support vector machine (SVM) sorter etc.After feature extraction, respectively input header sorter, summary sorter, network address sorter just can have been classified to each Search Results respectively again with the Search Results of the query to be identified that gets access to.Because behind the characteristic of division of having determined sorter, the mode that adopts the sorters such as maximum entropy classifiers, svm classifier device that text is classified is existing mature technology, does not repeat them here.
The below has provided Search Results and carried out sorted classification example by title sorter, summary sorter, network address sorter, and is as shown in table 2.
Table 2
[Table 2 is reproduced as images in the original publication: Figure BDA0000093135840000121 and Figure BDA0000093135840000131.]
Step 103: Fuse the demand classification results of the search results, and determine the demand type of the query to be identified according to the fusion result.
If the present invention uses only one classifier, then when that classifier's demand classification results for the search results are fused, the demand type of the query to be identified is determined from the number of search results contained in each class. For example, voting may be used: the class containing the most search results is taken as the demand type of the query to be identified. If, among 100 search results, 70 are classified into the menu class and 30 into the novel class, the query to be identified is determined to belong to the menu class. Alternatively, the class probability of each class may be computed, and every class whose class probability exceeds a set threshold is taken as a demand type of the query to be identified, where the class probability is the ratio of the number of search results in that class to the total number of search results.
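Both single-classifier fusion rules described above (voting, and thresholding on class probability) can be sketched in a few lines. The function names and the threshold values are assumptions chosen for illustration; the input is simply the list of class labels assigned to the individual search results, and it is assumed to be non-empty.

```python
from collections import Counter

def vote(labels):
    """Majority vote: the class holding the most search results wins."""
    return Counter(labels).most_common(1)[0][0]

def classes_above_threshold(labels, threshold):
    """Map each class to its class probability (share of results),
    keeping only classes whose probability exceeds the threshold."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items() if n / total > threshold}

# The example from the text: 70 results in the menu class, 30 in the novel class.
labels = ["menu"] * 70 + ["novel"] * 30
winner = vote(labels)
strong = classes_above_threshold(labels, threshold=0.5)
both = classes_above_threshold(labels, threshold=0.2)
```

Voting always yields exactly one demand type, whereas the threshold rule can yield several (or none), which suits queries that carry more than one demand.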
When the classification results of multiple classifiers are fused in this step, an existing multi-classifier fusion method may be used, such as a boosting-based fusion method or linearly weighted multi-classifier fusion. Only linearly weighted fusion is briefly described here as an example: the probability $c^{*}$ on demand type $c_k$ is computed according to the following formula:
$$c^{*} = \alpha P_{title}(c_k \mid q) + \beta P_{text}(c_k \mid q) + (1 - \alpha - \beta) P_{url}(c_k \mid q)$$
where $P_{title}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the title classifier, $P_{text}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the summary classifier, and $P_{url}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the URL classifier. $\alpha$ and $\beta$ are weighting coefficients, which can be obtained through experiments or by training with a preset algorithm so as to achieve the best classification performance.
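The linear weighting can be illustrated with a short sketch. The probability tables and the weights 0.5 and 0.3 below are made-up values for illustration only; in the text the weights are obtained by experiment or training. The function name `fuse` is likewise an assumption.

```python
def fuse(p_title, p_text, p_url, alpha, beta):
    """Linearly combine per-class probabilities from three classifiers.

    Each argument p_* maps a demand type to that classifier's class
    probability; a class missing from a table is treated as probability 0.
    """
    gamma = 1.0 - alpha - beta  # weight of the URL-based probabilities
    classes = set(p_title) | set(p_text) | set(p_url)
    return {
        c: alpha * p_title.get(c, 0.0)
        + beta * p_text.get(c, 0.0)
        + gamma * p_url.get(c, 0.0)
        for c in classes
    }

# Illustrative inputs only (not taken from the patent).
scores = fuse(
    {"video": 0.6, "menu": 0.4},
    {"video": 0.5, "menu": 0.5},
    {"video": 0.8, "menu": 0.2},
    alpha=0.5,
    beta=0.3,
)
best = max(scores, key=scores.get)
```

Because the three weights sum to 1 and each input table sums to 1, the fused scores also sum to 1 and can be compared against a threshold in the same way as a single classifier's class probabilities.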
Finally, with the flow of this embodiment, the demand type of each query to be identified can be determined. Some examples are given in Table 3.
Table 3
query | Video | Menu | Picture | Restaurant
home cooking | Strong demand | Secondary demand | No demand | Weak demand
how to make home cooking | Weak demand | Strong demand | No demand | No demand
Jewel in the Palace | Strong demand | No demand | No demand | Secondary demand
Jewel in the Palace HD watch online | Strong demand | No demand | No demand | No demand
Embodiment two
Fig. 2 is a flowchart of the method for building the core word vector of a demand type according to Embodiment two of the present invention. As shown in Fig. 2, the method comprises the following steps:
Step 201: Obtain the seed queries of the demand type.
First, a set of seed queries is preset for each demand type. These seed queries embody the demand of the corresponding demand type. The seed query sets may be configured manually, or labeled manually in the search log. More preferably, seed queries may also be mined from the search log. For example, from the search log of the vertical search for the demand type, queries whose number of searches is higher than a preset first threshold are taken as seed queries of the demand type. Alternatively, from the web search log for the demand type, the queries for which a website of the demand type was clicked, or for which a title containing a feature word of the demand type was clicked, are obtained, and those among them whose number of searches is higher than a preset second threshold are taken as seed queries of the demand type; and so on.
Step 202: Perform a search for each seed query of the demand type, and obtain the top N1 search results for each, where N1 is a preset positive integer.
Step 203: Perform word segmentation on the text of the obtained search results to obtain all n-grams.
The text of a search result here may include, but is not limited to, the web page title, the web page summary, and the like.
Step 204: Determine the weight of each n-gram according to its term frequency (tf) × inverse document frequency (idf) value, sort all n-grams by weight, and take the top N2 n-grams as the core word vector of the demand type, where N2 is a preset positive integer.
The core word vector of the demand type finally obtained contains the n-grams and the weight of each n-gram.
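Steps 203 and 204 can be sketched as below. Several details are assumptions, because the text does not fix them: whitespace splitting stands in for word segmentation, n-grams are limited to length 1 and 2, tf is the total count of an n-gram over all collected result texts, and idf is a smoothed log ratio computed over those same texts. The sample texts are invented.

```python
import math
from collections import Counter

def ngrams(tokens, max_n=2):
    """All contiguous n-grams of length 1..max_n, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def core_word_vector(texts, top_n2, max_n=2):
    """Return the top_n2 (n-gram, weight) pairs ranked by tf*idf."""
    # Whitespace split is a stand-in for real word segmentation (assumption).
    docs = [ngrams(text.split(), max_n) for text in texts]
    tf = Counter(g for doc in docs for g in doc)       # total occurrences
    df = Counter(g for doc in docs for g in set(doc))  # texts containing g
    n_docs = len(docs)
    weights = {
        g: tf[g] * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)  # smoothed idf (assumption)
        for g in tf
    }
    # Sort by descending weight; ties are broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n2]

# Invented titles standing in for the top N1 results of the seed queries.
texts = ["braised pork recipe", "braised pork steps", "tofu recipe"]
vector = core_word_vector(texts, top_n2=3)
```

The returned list keeps both the n-grams and their weights, matching the statement that the core word vector contains each n-gram together with its weight.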
The above is a detailed description of the method provided by the present invention. The device provided by the present invention is described in detail below through Embodiment three.
Embodiment three
Fig. 3 is a structural diagram of the device for recognizing a search demand according to Embodiment three of the present invention. As shown in Fig. 3, the device comprises a result acquiring unit 300, a classifier 310, and a demand fusion unit 320.
After receiving a query to be identified, the result acquiring unit 300 obtains the search results of the query to be identified.
Specifically, after receiving the query to be identified, the result acquiring unit 300 provides the query to a search engine for retrieval and obtains the top N search results from the search engine. Alternatively, after receiving the query to be identified, it expands the query, provides the combinations of the query to be identified and the expansion words to the search engine, and obtains from the search engine the top N search results among the search results corresponding to the combinations of the query to be identified and the expansion words, where the expansion words are the preset demand words of each demand class. N is a preset positive integer.
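The two acquisition modes can be sketched as follows. The `search` callable is hypothetical (any function mapping a query string to a ranked list of results), `fake_search` exists only to make the sketch runnable, and two choices are assumptions rather than statements of the patent: the query and the expansion word are joined with a space, and the top N results are taken per combined query.

```python
def collect_results(query, search, n, demand_words=()):
    """Return search results for a query, optionally with query expansion.

    Without demand words: the top n results of the query itself.
    With demand words: the top n results of each "query + demand word"
    combination, concatenated in the order of the demand words.
    """
    if not demand_words:
        return list(search(query)[:n])
    results = []
    for word in demand_words:
        results.extend(search(f"{query} {word}")[:n])
    return results

def fake_search(q):
    """Stand-in search engine returning five ranked dummy results."""
    return [f"{q} #{rank}" for rank in range(1, 6)]

plain = collect_results("home cooking", fake_search, n=2)
expanded = collect_results("home cooking", fake_search, n=2, demand_words=["video", "menu"])
```

Expansion gives every preset demand class a chance to surface results, at the cost of one search request per demand word.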
The classifier 310 performs demand classification on each search result obtained by the result acquiring unit 300, based on preset search result text features.
The demand fusion unit 320 fuses the demand classification results of the search results and determines the demand type of the query to be identified according to the fusion result.
The device may use more than one classifier, each using different search result text features. Specifically, the classifier 310 may comprise a classifier 311 built for web page titles, a classifier 312 built for web page summaries, or a classifier 313 built for URLs.
The classifier 311 built for web page titles may use at least one of the following search result text features as a classifier feature:
whether the query to be identified occurs in the web page title, and the number of times it occurs;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
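The three title features listed above can be computed along these lines. The text does not define how overlap is measured, so summing the core-word weights of the title n-grams that appear in one demand type's core word vector is an assumption, as are the function name, the feature names, and the sample values.

```python
def title_features(query, title, title_ngrams, core_vector, title_clicks, total_clicks):
    """Build a feature dict for one web page title and one demand type.

    core_vector maps n-gram -> weight for that demand type.
    title_clicks / total_clicks come from the search log for this query.
    """
    count = title.count(query)  # non-overlapping occurrences of the query
    # Assumed overlap measure: total weight of core n-grams found in the title.
    overlap = sum(core_vector.get(g, 0.0) for g in set(title_ngrams))
    ratio = title_clicks / total_clicks if total_clicks else 0.0
    return {
        "contains_query": int(count > 0),
        "query_count": count,
        "core_overlap": overlap,
        "click_ratio": ratio,
    }

title = "home cooking recipes: easy home cooking"
features = title_features(
    query="home cooking",
    title=title,
    title_ngrams=title.replace(":", "").split(),  # unigrams only, for brevity
    core_vector={"recipes": 2.0, "steps": 1.5},   # invented weights
    title_clicks=100,
    total_clicks=300,
)
```

In a full system one such feature dict would be produced per demand type and passed to the trained classifier; the summary and URL classifiers would use analogous extractors for their own feature lists.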
The classifier 312 built for web page summaries may use at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified occurs; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
The classifier 313 built for URLs may use at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
Because both the classifier 311 built for web page titles and the classifier 312 built for web page summaries use the core word vectors of the demand types, the device may further comprise a vector building unit 330 for building the core word vector of a demand type. The vector building unit 330 specifically comprises a seed query acquiring subunit 331, a search result acquiring subunit 332, a phrase acquiring subunit 333, and a vector building subunit 334.
The seed query acquiring subunit 331 obtains the seed queries of the demand type. Specifically, it may obtain seed queries of the demand type configured manually; or obtain seed queries of the demand type labeled manually in the search log; or, from the search log of the vertical search for the demand type, take queries whose number of searches is higher than a preset first threshold as seed queries of the demand type; or, from the web search log for the demand type, obtain the queries for which a website of the demand type was clicked or for which a title containing a feature word of the demand type was clicked, and take those among them whose number of searches is higher than a preset second threshold as seed queries of the demand type.
The search result acquiring subunit 332 performs a search for each seed query of the demand type and obtains the top N1 search results for each, where N1 is a preset positive integer.
The phrase acquiring subunit 333 performs word segmentation on the text of the search results obtained by the search result acquiring subunit 332 to obtain all n-grams.
The vector building subunit 334 determines the weight of each n-gram according to its tf*idf value and takes the N2 n-grams with the highest weights as the core word vector of the demand type, where N2 is a preset positive integer.
As a preferred embodiment, the classifier 310 may be, but is not limited to, a maximum entropy classifier or a support vector machine classifier.
If there is one classifier, the demand fusion unit 320 determines the demand type of the query to be identified according to the number of search results contained in each class in the demand classification results.
If there are multiple classifiers, then when fusing the classification results of the multiple classifiers the demand fusion unit 320 may use an existing multi-classifier fusion method, for example a boosting-based fusion method or linearly weighted multi-classifier fusion, which is not described further here.
After a demand type has been recognized with the method or device provided by the embodiments of the present invention, it can be used in, but is not limited to, the following application scenarios:
1) Ranking in general web search. After a user enters a query, the method and device of the embodiments of the present invention can recognize the demand type of the query, and the pages corresponding to that demand type are ranked higher in the search results of the general web search.
For example, when a user enters the query "home cooking high definition", the general web search can recognize that the query has a video-class demand. The result page of the general web search can then contain video information related to the TV series "Home Cooking"; this video information can be provided by the video vertical search and inserted into the search results of the general web search. In this way the video-class pages are ranked at the front of the search results, as shown in Fig. 4, which greatly improves user satisfaction and the search experience.
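One simple way to move pages of the recognized demand type forward is a stable sort on a match flag, sketched below. The result records and their `type` field are assumptions made for illustration; the patent does not specify how pages are labeled or how far they are promoted.

```python
def promote(results, demand_type):
    """Stable re-ranking: results whose page type matches the recognized
    demand type move to the front; relative order is otherwise preserved."""
    return sorted(results, key=lambda r: r["type"] != demand_type)

# Invented result list in its original ranking order.
results = [
    {"url": "a", "type": "menu"},
    {"url": "b", "type": "video"},
    {"url": "c", "type": "novel"},
    {"url": "d", "type": "video"},
]
reranked = promote(results, "video")
order = [r["url"] for r in reranked]
```

Because Python's `sorted` is stable, matching pages keep their original relative order, as do the non-matching pages behind them.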
2) Vertical search. After a user enters a query, the method and device of the embodiments of the present invention can recognize the demand type of the query and dispatch the query to the most suitable content resource or application provider for processing, so that accurately matching results are returned to the user efficiently.
For example, when a user enters "from Baidu Building to Wudaokou", the query can be recognized as having a map-class demand and is provided to the map vertical search, which calculates the bus routes and then directly displays the bus travel map and related bus information from Baidu Building to Wudaokou, as shown in Fig. 5.
3) Information recommendation. After a user enters a query, the method and device of the embodiments of the present invention can recognize the demand type of the query, and information is recommended to the user based on that demand type, such as advertisement recommendation, recommendation on a knowledge question-and-answer platform, query recommendation, and so on.
For example, if a user enters the query "cheap MP3 player" and its demand type is recognized as the shopping class, advertisements related to MP3 players can be recommended in the search results, so that the advertisements closely match the user's actual demand.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (22)

1. A method for recognizing a search demand, characterized in that the method comprises:
S1: after a query to be identified is received, obtaining search results of the query to be identified;
S2: performing demand classification on each search result with a classifier, based on preset search result text features;
S3: fusing the demand classification results of the search results, and determining the demand type of the query to be identified according to the fusion result.
2. The method according to claim 1, characterized in that step S1 specifically comprises:
after the query to be identified is received, providing the query to be identified to a search engine for searching, and obtaining the top N search results from the search engine; or,
after the query to be identified is received, expanding the query to be identified, providing the combinations of the query to be identified and expansion words to the search engine for searching, and obtaining from the search engine the top N search results among the search results corresponding to the combinations of the query to be identified and the expansion words, the expansion words being preset demand words of each demand class;
wherein N is a preset positive integer.
3. The method according to claim 1, characterized in that more than one classifier is used in step S2, each classifier using different search result text features.
4. The method according to claim 1, characterized in that the classifier comprises: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
5. The method according to claim 4, characterized in that the classifier built for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be identified occurs in the web page title, and the number of times the query to be identified occurs;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
6. The method according to claim 4, characterized in that the classifier built for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified occurs; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
7. The method according to claim 4, characterized in that the classifier built for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
8. The method according to claim 5 or 6, characterized in that building the core word vector of the demand type comprises:
A1: obtaining seed queries of the demand type;
A2: performing a search for each seed query of the demand type, and obtaining the top N1 search results for each, N1 being a preset positive integer;
A3: performing word segmentation on the text of the obtained search results to obtain all n-grams;
A4: determining the weight of each n-gram according to its term frequency (tf) × inverse document frequency (idf) value, and taking the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
9. The method according to claim 8, characterized in that step A1 comprises:
obtaining seed queries of the demand type configured manually; or,
obtaining seed queries of the demand type labeled manually in the search log; or,
from the search log of the vertical search for the demand type, taking queries whose number of searches is higher than a preset first threshold as seed queries of the demand type; or,
from the web search log for the demand type, obtaining the queries for which a website of the demand type was clicked or for which a title containing a feature word of the demand type was clicked, and taking those among the obtained queries whose number of searches is higher than a preset second threshold as seed queries of the demand type.
10. The method according to any one of claims 1 to 7, characterized in that the classifier is a maximum entropy classifier or a support vector machine classifier.
11. The method according to any one of claims 1 to 7, characterized in that, if there is one classifier, step S3 is: determining the demand type of the query to be identified according to the number of search results contained in each class in the demand classification results;
if there are multiple classifiers, a boosting-based fusion method or linearly weighted multi-classifier fusion is used in step S3.
12. A device for recognizing a search demand, characterized in that the device comprises:
a result acquiring unit, configured to obtain search results of a query to be identified after the query to be identified is received;
a classifier, configured to perform demand classification on each search result obtained by the result acquiring unit, based on preset search result text features;
a demand fusion unit, configured to fuse the demand classification results of the search results and determine the demand type of the query to be identified according to the fusion result.
13. The device according to claim 12, characterized in that the result acquiring unit, after receiving the query to be identified, provides the query to be identified to a search engine for searching and obtains the top N search results from the search engine; or,
after receiving the query to be identified, expands the query to be identified, provides the combinations of the query to be identified and expansion words to the search engine for searching, and obtains from the search engine the top N search results among the search results corresponding to the combinations of the query to be identified and the expansion words, the expansion words being preset demand words of each demand class;
wherein N is a preset positive integer.
14. The device according to claim 12, characterized in that the device uses more than one classifier, each classifier using different search result text features.
15. The device according to claim 12, characterized in that the classifier comprises: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
16. The device according to claim 15, characterized in that the classifier built for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be identified occurs in the web page title, and the number of times the query to be identified occurs;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
17. The device according to claim 15, characterized in that the classifier built for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified occurs; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
18. The device according to claim 15, characterized in that the classifier built for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
19. The device according to claim 16 or 17, characterized in that the device further comprises a vector building unit for building the core word vector of a demand type;
the vector building unit specifically comprises:
a seed query acquiring subunit, configured to obtain seed queries of the demand type;
a search result acquiring subunit, configured to perform a search for each seed query of the demand type and obtain the top N1 search results for each, N1 being a preset positive integer;
a phrase acquiring subunit, configured to perform word segmentation on the text of the search results obtained by the search result acquiring subunit to obtain all n-grams;
a vector building subunit, configured to determine the weight of each n-gram according to its term frequency (tf) × inverse document frequency (idf) value and take the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
20. The device according to claim 19, characterized in that the seed query acquiring subunit obtains seed queries of the demand type configured manually; or,
obtains seed queries of the demand type labeled manually in the search log; or,
from the search log of the vertical search for the demand type, takes queries whose number of searches is higher than a preset first threshold as seed queries of the demand type; or,
from the web search log for the demand type, obtains the queries for which a website of the demand type was clicked or for which a title containing a feature word of the demand type was clicked, and takes those among the obtained queries whose number of searches is higher than a preset second threshold as seed queries of the demand type.
21. The device according to any one of claims 12 to 18, characterized in that the classifier is a maximum entropy classifier or a support vector machine classifier.
22. The device according to any one of claims 12 to 18, characterized in that, if there is one classifier, the demand fusion unit determines the demand type of the query to be identified according to the number of search results contained in each class in the demand classification results;
if there are multiple classifiers, the demand fusion unit uses a boosting-based fusion method or linearly weighted multi-classifier fusion.
CN201110282840.8A 2011-09-21 2011-09-21 Method and device for recognizing search demand Active CN103020066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110282840.8A CN103020066B (en) 2011-09-21 2011-09-21 A kind of method and apparatus identifying search need

Publications (2)

Publication Number Publication Date
CN103020066A (en) 2013-04-03
CN103020066B (en) 2016-09-07

Family

ID=47968682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110282840.8A Active CN103020066B (en) 2011-09-21 2011-09-21 Method and device for recognizing search demand

Country Status (1)

Country Link
CN (1) CN103020066B (en)

Also Published As

Publication number Publication date
CN103020066B (en) 2016-09-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant