CN103020066B - Method and apparatus for identifying search needs - Google Patents

Info

Publication number
CN103020066B
Authority
CN
China
Prior art keywords
query
identified
search results
search
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110282840.8A
Other languages
Chinese (zh)
Other versions
CN103020066A (en)
Inventor
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110282840.8A priority Critical patent/CN103020066B/en
Publication of CN103020066A publication Critical patent/CN103020066A/en
Application granted granted Critical
Publication of CN103020066B publication Critical patent/CN103020066B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and apparatus for identifying search needs. The method includes: after receiving a query to be identified, obtaining the search results of that query; using a classifier to perform demand classification on each search result, based on preset search-result text features; and fusing the demand classification results of the search results, then determining the demand type of the query to be identified according to the fusion result. This approach does not depend at all on whether the query to be identified contains preset keywords, so demand identification can be performed for any query. In addition, because the time sensitivity of users' search needs is generally reflected in the search results, the demand type identified in this way fully reflects that time sensitivity, which improves the accuracy of search need identification.

Description

A method and apparatus for identifying search needs
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a method and apparatus for identifying search needs.
[Background]
With the rapid development and maturation of the Internet worldwide, the information resources on the network are increasingly abundant and the volume of data is expanding rapidly. Obtaining information through search engines has become the main way people acquire information, and providing users with more convenient and more accurate query services is the current and future direction of search engine technology.
In search engine technology, identifying the user's search need is an important part of improving search accuracy and effectiveness, and its effect is especially pronounced in structured search (i.e. vertical search). Existing search need identification mostly just matches preset keywords. For example, some keywords are preset for the video demand: "watch online", "download online", "on demand", "watch in HD", and so on. If a search request (query) contains one of these keywords, for example the query "home cooking watch in HD", the query can be identified as having a video demand. This approach, however, has the following defects:
Defect one: if the query contains no preset keyword, its demand type cannot be identified. For example, if the query is just "home cooking", it is difficult to determine the demand of the query directly from the query itself.
Defect two: the time sensitivity of a query's demand cannot be reflected. The demand of some queries changes over time. Take the query "home cooking": before the TV series "Home Cooking" was broadcast, the main demands of this query were the menu class and the cuisine class; while the series was airing, the main demand may have shifted to the video class, with the menu class and cuisine class becoming secondary demands; and after the series ended its run and public attention declined, the main demand of the query reverted to the menu class and cuisine class. Existing search need identification methods clearly cannot reflect such changes.
Both defects lower the accuracy of search need identification. As a result, the search results for a query cannot accurately meet the search need, and users must spend more time and resources to find the content they want.
[Summary of the Invention]
The invention provides a method and apparatus for identifying search needs. It overcomes two defects: that a demand cannot be identified when the query contains no preset keyword, and that the time sensitivity of a query's demand cannot be reflected. It thereby improves the accuracy of search need identification.
The specific technical solution is as follows:
A method for identifying search needs, the method comprising:
S1. after receiving a query to be identified, obtaining the search results of the query to be identified;
S2. using a classifier to perform demand classification on each search result, based on preset search-result text features;
S3. fusing the demand classification results of the search results, and determining the demand type of the query to be identified according to the fusion result.
According to a preferred embodiment of the present invention, step S1 specifically comprises:
after receiving the query to be identified, submitting the query to a search engine and obtaining from the search engine the top N search results; or,
after receiving the query to be identified, expanding the query, submitting the combinations of the query with expansion words to the search engine, and obtaining from the search engine the top N search results among those returned for the combinations, the expansion words being the preset demand words of each demand class;
where N is a preset positive integer.
According to a preferred embodiment of the present invention, step S2 uses more than one classifier, and each classifier uses different search-result text features.
According to a preferred embodiment of the present invention, the classifiers include: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
According to a preferred embodiment of the present invention, the classifier built for web page titles uses at least one of the following search-result text features as a classifier feature:
whether the query to be identified appears in the web page title, and the number of times it appears;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and,
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to that query.
According to a preferred embodiment of the present invention, the classifier built for web page summaries uses at least one of the following search-result text features as a classifier feature:
the number or proportion of sentences in the web page summary that contain the query to be identified; and,
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
According to a preferred embodiment of the present invention, the classifier built for URLs uses at least one of the following search-result text features as a classifier feature:
the ranking value of the search result corresponding to the URL;
the page type corresponding to the URL; and,
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to that query.
According to a preferred embodiment of the present invention, building the core word vector of a demand type comprises:
A1. obtaining the seed queries of the demand type;
A2. searching with each seed query of the demand type and obtaining the top N1 search results for each, N1 being a preset positive integer;
A3. performing word segmentation on the text of the obtained search results to obtain all n-grams;
A4. determining the weight of each n-gram according to the value of term frequency (tf) multiplied by inverse document frequency (idf), and taking the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
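As an illustration only, steps A3 and A4 might be sketched in Python as follows. The text does not say which corpus the idf is computed over, so this sketch assumes document frequencies taken from a separate background collection; the names `core_word_vector`, `background_df` and `background_size`, and the smoothing term `1 +`, are hypothetical and not part of the patent text.

```python
import math
from collections import Counter


def core_word_vector(result_ngrams, background_df, background_size, top_k):
    """Weight n-grams by tf * idf and keep the top_k as the core word vector.

    result_ngrams   -- one list of n-grams per retrieved search result (step A3)
    background_df   -- assumed: n-gram -> number of background documents containing it
    background_size -- assumed: number of documents in the background collection
    top_k           -- N2 in the text
    """
    # Term frequency over all search results retrieved for the seed queries.
    tf = Counter(gram for doc in result_ngrams for gram in doc)
    weights = {
        gram: count * math.log(background_size / (1 + background_df.get(gram, 0)))
        for gram, count in tf.items()
    }
    # Keep the top_k n-grams with the highest tf * idf weight.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_k])


vector = core_word_vector(
    [["menu", "the"], ["menu", "stir fry", "the"]],
    background_df={"the": 999, "menu": 9, "stir fry": 4},
    background_size=1000,
    top_k=2,
)
print(list(vector))
```

With these toy counts the very common word "the" receives a weight of zero and is dropped, while "menu" and "stir fry" are kept.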
According to a preferred embodiment of the present invention, step A1 comprises:
obtaining seed queries of the demand type that have been configured manually; or,
obtaining seed queries of the demand type that have been annotated manually in the search log; or,
from the search log of the vertical search for the demand type, obtaining the queries whose search count is higher than a preset first threshold as seed queries of the demand type; or,
from the search log of web search for the demand type, obtaining the queries for which a website of the demand type was clicked, or for which a title containing feature words of the demand type was clicked, and taking those of the obtained queries whose search count is higher than a preset second threshold as seed queries of the demand type.
According to a preferred embodiment of the present invention, the classifier is a maximum entropy classifier or a support vector machine classifier.
According to a preferred embodiment of the present invention, if there is one classifier, step S3 is: determining the demand type of the query to be identified according to the number of search results in each class of the demand classification result;
if there are multiple classifiers, step S3 uses a boosting-based fusion method or a linearly weighted multi-classifier fusion method.
An apparatus for identifying search needs, the apparatus comprising:
a result acquisition unit, configured to obtain the search results of a query to be identified after receiving the query;
a classifier, configured to perform demand classification, based on preset search-result text features, on each search result obtained by the result acquisition unit;
a demand fusion unit, configured to fuse the demand classification results of the search results and determine the demand type of the query to be identified according to the fusion result.
According to a preferred embodiment of the present invention, after receiving the query to be identified, the result acquisition unit submits the query to a search engine and obtains from the search engine the top N search results; or,
after receiving the query to be identified, it expands the query, submits the combinations of the query with expansion words to the search engine, and obtains from the search engine the top N search results among those returned for the combinations, the expansion words being the preset demand words of each demand class;
where N is a preset positive integer.
According to a preferred embodiment of the present invention, the apparatus uses more than one classifier, and each classifier uses different search-result text features.
According to a preferred embodiment of the present invention, the classifiers include: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
According to a preferred embodiment of the present invention, the classifier built for web page titles uses at least one of the following search-result text features as a classifier feature:
whether the query to be identified appears in the web page title, and the number of times it appears;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and,
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to that query.
According to a preferred embodiment of the present invention, the classifier built for web page summaries uses at least one of the following search-result text features as a classifier feature:
the number or proportion of sentences in the web page summary that contain the query to be identified; and,
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
According to a preferred embodiment of the present invention, the classifier built for URLs uses at least one of the following search-result text features as a classifier feature:
the ranking value of the search result corresponding to the URL;
the page type corresponding to the URL; and,
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to that query.
According to a preferred embodiment of the present invention, the apparatus further comprises a vector building unit for building the core word vector of a demand type;
the vector building unit specifically comprises:
a seed query acquisition subunit, configured to obtain the seed queries of the demand type;
a search result acquisition subunit, configured to search with each seed query of the demand type and obtain the top N1 search results for each, N1 being a preset positive integer;
a phrase acquisition subunit, configured to perform word segmentation on the text of the search results obtained by the search result acquisition subunit to obtain all n-grams;
a vector building subunit, configured to determine the weight of each n-gram according to the value of term frequency (tf) multiplied by inverse document frequency (idf), and to take the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
According to a preferred embodiment of the present invention, the seed query acquisition subunit obtains seed queries of the demand type that have been configured manually; or,
obtains seed queries of the demand type that have been annotated manually in the search log; or,
from the search log of the vertical search for the demand type, obtains the queries whose search count is higher than a preset first threshold as seed queries of the demand type; or,
from the search log of web search for the demand type, obtains the queries for which a website of the demand type was clicked, or for which a title containing feature words of the demand type was clicked, and takes those of the obtained queries whose search count is higher than a preset second threshold as seed queries of the demand type.
According to a preferred embodiment of the present invention, the classifier is a maximum entropy classifier or a support vector machine classifier.
According to a preferred embodiment of the present invention, if there is one classifier, the demand fusion unit determines the demand type of the query to be identified according to the number of search results in each class of the demand classification result;
if there are multiple classifiers, the demand fusion unit uses a boosting-based fusion method or a linearly weighted multi-classifier fusion method.
As can be seen from the above technical solution, after obtaining the search results of the query to be identified, the present invention performs demand classification on the search results and then fuses the demand classification results to determine the demand type of the query. This approach does not depend at all on whether the query to be identified contains preset keywords, so demand identification can be performed for any query. In addition, because the time sensitivity of users' search needs is generally reflected in the search results, the demand type identified in this way fully reflects that time sensitivity, which improves the accuracy of search need identification.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the method for identifying search needs provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for building the core word vector of a demand type provided by Embodiment 2 of the present invention;
Fig. 3 is a structural diagram of the apparatus for identifying search needs provided by Embodiment 3 of the present invention;
Fig. 4 is an example, provided by an embodiment of the present invention, of applying search need identification to general web search ranking;
Fig. 5 is an example, provided by an embodiment of the present invention, of applying search need identification to vertical search.
[Detailed Description]
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment 1
Fig. 1 is a flowchart of the method for identifying search needs provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may include the following steps:
Step 101: after receiving a query to be identified, obtain the search results of the query to be identified.
After the query to be identified is received, it is submitted to a search engine for retrieval, and the top N search results are obtained.
When the query to be identified is submitted to the search engine, the query alone may be submitted and its search results obtained from the search engine. Preferably, the query may be expanded: the combinations of the query with expansion words are submitted to the search engine, and the search results corresponding to these combinations are obtained from the search engine, where the expansion words are the preset demand words of each demand class. Because only a small number of preset demand words are needed, usually a few dozen, they can be configured manually.
For example, the preset demand words of the video class include: video, TV series, film, watch in HD, and so on. The preset demand words of the menu class include: menu, recipe, cuisine, and so on. For the query to be identified "home cooking", the following combinations of the query with expansion words are then obtained:
"home cooking video", "home cooking TV series", "home cooking film", "home cooking watch online in HD", "home cooking menu", "home cooking recipe", "home cooking cuisine", and so on. After these combinations are submitted to the search engine, the search engine returns the combined search results, from which the top N search results are obtained. Alternatively, the top-ranked results may be taken from the search results returned for each combination, so that N search results are obtained in total.
The purpose of searching with the expanded query is to overcome the inaccurate demand identification that arises when the demands of the top N search results of certain queries are too concentrated. For example, the query "Zhang Ziyi" has many demands. When "Zhang Ziyi" is searched on its own, picture-class results may appear rarely among the top N search results, making it difficult to determine that this query has a strong picture demand. If the query is expanded to "Zhang Ziyi photo", results related to the picture demand appear more often among the top N search results, which greatly helps the accuracy of the subsequent identification of the query's search need.
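As an illustration only, the query expansion of step 101 might be sketched in Python as follows. The sketch implements the second variant described above (taking top-ranked results from each combination in turn until N are collected); the `search` callable is a hypothetical stand-in for a search engine returning a ranked list of results, and all function names are assumptions rather than part of the patent text.

```python
from itertools import zip_longest


def expand_query(query, demand_words):
    """Combine the query with every preset demand word of every demand class."""
    return [f"{query} {word}" for words in demand_words.values() for word in words]


def collect_top_n(query, search, n, demand_words):
    """Take results rank by rank from each expanded query until n are collected."""
    ranked_lists = [search(q) for q in expand_query(query, demand_words)]
    picked, seen = [], set()
    # zip_longest walks rank 1 of every list, then rank 2, and so on.
    for tier in zip_longest(*ranked_lists):
        for result in tier:
            if result is not None and result not in seen:
                seen.add(result)
                picked.append(result)
                if len(picked) == n:
                    return picked
    return picked


def fake_search(q):
    # Hypothetical search engine: returns two ranked results per query.
    return [f"{q}#1", f"{q}#2"]


words = {"video": ["video"], "menu": ["menu"]}
print(collect_top_n("home cooking", fake_search, 3, words))
```

Interleaving by rank keeps every demand class represented among the N results, which is the stated purpose of the expansion.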
Step 102: use a classifier to perform demand classification on each search result, based on preset search-result text features.
When demand classification is performed on the search results in this step, more than one classifier may be used, with each classifier using different search-result text features. In this embodiment, a classifier may be built for at least one of the web page title, the web page summary and the URL of a search result. Building all three classifiers is taken as the example here; they are called the title classifier, the summary classifier and the URL classifier. The classifier features used by these three classifiers are described below.
1) The title classifier may use at least one of the following three search-result text features as a classifier feature:
First: whether the query to be identified appears in the web page title, and the number of times it appears.
This feature measures the relevance between the web page title of a search result and the query to be identified. If the query appears in the web page title, the search result is more relevant to the query and contributes more to identifying its search need. For example, suppose the web page title of a search result is "Most common home cooking menus - ways of home cooking - cuisine world home cooking" and the query to be identified is "home cooking". The query appears in this title, and appears 3 times, so this search result contributes considerably to identifying the demand of the query.
Second: the overlap between the n-grams determined from the web page title and the core word vector of each demand type.
An n-gram is a sequence of n minimum-granularity words appearing in order, where n is one or more preset positive integers. Taking the web page title "Most common home cooking menus - ways of home cooking - cuisine world home cooking" as an example, if n is chosen as 1 and 2, the n-grams determined from this title are:
1-gram: most, common, home cooking, menu, home cooking, way, cuisine, world, home cooking
2-gram: most common, common, home cooking, home cooking menu, menu home cooking, home cooking, way, way cuisine, cuisine world, world home cooking
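As an illustration only, n-gram extraction from an already segmented title might be sketched in Python as follows. Word segmentation itself is assumed to have been done elsewhere; the function name `extract_ngrams` and the sample tokens are hypothetical.

```python
def extract_ngrams(tokens, sizes=(1, 2)):
    """Return, for each n in sizes, the n-grams of consecutive tokens in order."""
    return {
        n: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in sizes
    }


grams = extract_ngrams(["most", "common", "home cooking", "menu"])
print(grams[1])
print(grams[2])
```

Repeated n-grams are kept rather than de-duplicated, so that overlap counts can later reflect how often a word appears in the title.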
The core word vector of each demand type may be built by manual configuration or by automatic mining; for the automatic mining approach, see Embodiment 2.
Suppose that, after the flow of Embodiment 2 is executed for the menu class demand, the core word vector obtained for the menu class is as follows, the core word vector consisting of core words and their weights:
When determining the overlap between the n-grams determined from the web page title and the core word vector of each demand type, the overlap may be measured as an overlap count or as an overlap ratio.
Continuing the example above, the overlap counts between the n-grams and the core word vector of the menu class are shown in Table 1.
Table 1
n-gram | Overlap count
home cooking | 3
menu | 1
home cooking menu | 1
cuisine | 1
The overlap ratio may be calculated as the ratio of the sum of the weights of the n-grams that overlap with the core word vector of a demand type to the total sum of the weights in that core word vector. Continuing the example above, the overlap ratio between the n-grams and the core word vector of the menu class is:
(0.82 + 0.98 + 1.00 + 0.95) / (0.82 + 1.00 + 1.00 + 1.00 + 0.92 + 0.56 + 0.98 + 0.87 + 1.00 + 0.95 + 1.00) = 3.75 / 10.10 ≈ 0.37
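As an illustration only, the overlap count and overlap ratio might be computed in Python as follows. The core word vector in the usage example is a small hypothetical one with made-up weights, not the menu class vector referred to in the text; the function name `overlap_features` is likewise an assumption.

```python
def overlap_features(ngrams, core_vector):
    """Return (overlap counts per core word, weighted overlap ratio)."""
    counts = {}
    for gram in ngrams:
        if gram in core_vector:
            counts[gram] = counts.get(gram, 0) + 1
    total_weight = sum(core_vector.values())
    if total_weight == 0:
        return counts, 0.0
    # Each overlapping core word contributes its weight once, however often it occurs.
    overlap_weight = sum(core_vector[gram] for gram in counts)
    return counts, overlap_weight / total_weight


# Hypothetical core word vector: core word -> weight.
core = {"menu": 1.0, "home cooking": 0.8, "stir fry": 0.2}
counts, ratio = overlap_features(["home cooking", "menu", "home cooking", "way"], core)
print(counts, round(ratio, 2))
```

In the worked example above, each overlapping n-gram's weight likewise enters the numerator once, even though "home cooking" overlaps three times.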
Third: the ratio of the number of times this web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to that query.
After a user searches for a query, if the web page title of a search result is more attractive, the user tends to click on that search result. Therefore, the more often the web page title of a search result is clicked by users, the better that title meets the users' demand.
For example, for the query to be identified "home cooking", among its search results the web page title "Most common home cooking menus - ways of home cooking - cuisine world home cooking" was clicked 120 times, and all web page titles corresponding to the query were clicked 300 times in total, so the ratio is 120/300 = 0.4.
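As an illustration only, the click ratio might be computed in Python as follows, assuming the clicks for one query have already been aggregated from the search log into a mapping from title to click count. The function name `click_share` and the data layout are hypothetical.

```python
def click_share(clicks_by_item, item):
    """Share of all clicks for a query that went to one title (or URL)."""
    total = sum(clicks_by_item.values())
    if total == 0:
        return 0.0
    return clicks_by_item.get(item, 0) / total


# Hypothetical aggregated log for the query "home cooking".
clicks = {"title A": 120, "title B": 100, "title C": 80}
print(click_share(clicks, "title A"))
```

The same function applies unchanged to the URL-based click feature, with URLs as keys instead of titles.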
2) The summary classifier may use at least one of the following two search-result text features as a classifier feature:
First: the number or proportion of sentences in the web page summary that contain the query to be identified.
This feature measures how well the web page summary meets the user's demand. The more sentences in the web page summary contain the query to be identified, the better the search result satisfies the demand of that query.
Suppose the web page summary of a search result is: "Home cooking is an essential part of our lives, there are many ways of home cooking, such as Northeastern home cooking, Guo Lin home cooking and so on, what is the simplest way to make home cooking menus? Cuisines Are Outstanding provides you with a rich collection of simple home cooking menus, so that you can learn quickly."
This web page summary can be split into 7 sentences, 6 of which contain "home cooking", so the proportion of sentences in the summary that contain the query to be identified is 6/7 ≈ 0.86.
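As an illustration only, the sentence proportion might be computed in Python as follows. The text does not specify how a summary is split into sentences; this sketch assumes splitting on common Western and Chinese punctuation marks, including commas, and the function name `query_sentence_ratio` is hypothetical.

```python
import re


def query_sentence_ratio(summary, query):
    """Proportion of sentences in the summary that contain the query."""
    # Assumed splitting rule: break on commas and sentence-final punctuation.
    parts = re.split(r"[,.;!?，。；！？]+", summary)
    sentences = [part.strip() for part in parts if part.strip()]
    if not sentences:
        return 0.0
    hits = sum(1 for sentence in sentences if query in sentence)
    return hits / len(sentences)


text = "home cooking is essential, home cooking has many styles. Learn quickly!"
print(round(query_sentence_ratio(text, "home cooking"), 2))
```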
Second: the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
This feature is analogous to the second feature described for the title classifier and is not repeated here.
3) The URL classifier may use at least one of the following three search-result text features as a classifier feature:
First: the ranking value of the search result corresponding to the URL.
When a search engine ranks the search results of a query, it usually uses the weight of the URL as one of its criteria: the greater the weight of the URL and the stronger the relevance between the query and the text of the corresponding web page, the higher the URL is ranked. The ranking value of the search result corresponding to a URL can therefore be used to measure how well the URL satisfies the demand of the query to be identified, computed as follows:
$$\mathrm{rank\_score} = \log\frac{N+1}{n}$$
where rank_score is the degree to which the URL satisfies the demand of the query to be identified, N is the number of search results selected above, and n is the ranking value of the current search result.
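As an illustration only, the ranking feature might be computed in Python as follows. The sketch assumes the formula is read as log((N + 1) / n) and uses the natural logarithm, since the text does not state the base; the function name `rank_score` follows the symbol in the text, while the argument names are hypothetical.

```python
import math


def rank_score(n_results, rank):
    """Demand-satisfaction score of the result at position `rank` among `n_results`."""
    if not 1 <= rank <= n_results:
        raise ValueError("rank must be between 1 and n_results")
    # Assumed reading of the formula: log((N + 1) / n), natural logarithm.
    return math.log((n_results + 1) / rank)


print(rank_score(10, 1), rank_score(10, 10))
```

Under this reading the score decreases as the rank gets worse but stays positive even for the last of the N results.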
Second: the page type corresponding to the URL.
The page type corresponding to a URL, such as video class, picture class, novel class or menu class, can be obtained in advance by machine learning. If the page type of the URL is consistent with a demand type, the URL satisfies the user's query well for that demand type. The value of this feature may be 0 or 1: the score is 1 if the types are consistent and 0 if they are not.
Third: the ratio of the number of times this URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to that query.
After a user searches for a query, if a search result satisfies the user well overall, the user tends to click on that search result. Therefore, the more often a URL is clicked by users, the better that URL meets the users' demand.
For example, in the search log, among the search results corresponding to the query "home cooking", the URL "www.meishij.net/chufang/diy/jiangchangcaipu/" was clicked 100 times, and the URLs corresponding to this query were clicked 300 times in total, so the ratio of the clicks on this URL to the total clicks on the URLs corresponding to the query to be identified is 100/300 ≈ 0.33.
After the classifier features of the title classifier, the summary classifier and the URL classifier have been determined, a training set is built for each of them, either manually or by machine learning, and the 3 classifiers are trained. The training sets can be built by extracting features from samples of each demand type; the resulting training set contains the feature values of each classifier feature for each demand type.
Each classifier may be, but is not limited to, a maximum entropy classifier, a support vector machine (SVM) classifier, and so on. After feature extraction, the search results obtained for the query to be identified are input into the title classifier, the summary classifier and the URL classifier respectively, and each search result is classified. Once the classification features of a classifier have been determined, classifying text with a maximum entropy classifier, an SVM classifier or a similar classifier is existing mature technology and is not described further here.
An example of the classification of search results by the title classifier, the summary classifier and the URL classifier is shown in Table 2.
Table 2
Step 103: fuse the demand classification results of the search results, and determine the demand type of the query to be identified according to the fusion result.
If only one classifier is used, then when that classifier's demand classification results for the search results are fused, the demand type of the query to be identified is determined from the number of search results in each class. For example, voting may be used: the class containing the most search results is taken as the demand type of the query to be identified. If, among 100 search results, 70 are classified as menu class and 30 as novel class, the query to be identified is determined to be menu class. Alternatively, a class probability may be computed for each class, and the classes whose class probability exceeds a set threshold are taken as the demand types of the query to be identified, where the class probability of a class is the ratio of the number of search results in that class to the total number of search results.
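As an illustration only, the two single-classifier fusion options described above (voting and class-probability thresholding) might be sketched in Python as follows. The function name `identify_demand` and the convention of returning a list of demand types are assumptions.

```python
from collections import Counter


def identify_demand(labels, threshold=None):
    """Fuse per-result class labels into the demand type(s) of the query.

    With threshold=None the majority class wins (voting); otherwise every class
    whose share of the search results exceeds the threshold is returned.
    """
    if not labels:
        return []
    counts = Counter(labels)
    if threshold is None:
        return [counts.most_common(1)[0][0]]
    total = len(labels)
    return [label for label, count in counts.items() if count / total > threshold]


labels = ["menu"] * 70 + ["novel"] * 30
print(identify_demand(labels))
print(identify_demand(labels, threshold=0.25))
```

Thresholding allows a query to carry several demand types at once, whereas voting always yields exactly one.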
When the classification results of multiple classifiers are fused in this step, an existing multi-classifier fusion method may be used, such as a boosting-based fusion method or a linearly weighted multi-classifier fusion method. Only the linearly weighted method is briefly described here as an example: the probability $c^*$ of the search results on demand type $c_k$ is calculated according to the following formula:
$$c^* = \alpha\,P_{title}(c_k \mid q) + \beta\,P_{text}(c_k \mid q) + (1-\alpha-\beta)\,P_{url}(c_k \mid q)$$
where $P_{title}(c_k \mid q)$ is the class probability of the query to be identified $q$ on demand type $c_k$ based on the title classifier, $P_{text}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the summary classifier, and $P_{url}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the URL classifier. $\alpha$ and $\beta$ are weight coefficients, which may be obtained through experiments or through training with a preset algorithm so as to achieve the best classification effect.
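The linearly weighted fusion can be illustrated with a short sketch. The function name `fuse_linear`, the dictionary-based input format, and the example weights α = 0.5, β = 0.3 are assumptions chosen for illustration; the text leaves the weights to be determined by experiment or training.

```python
def fuse_linear(p_title, p_text, p_url, alpha, beta):
    """Compute c* = alpha*P_title + beta*P_text + (1 - alpha - beta)*P_url per demand type.

    Each argument p_* maps a demand type c_k to that classifier's class
    probability for the query; a missing demand type is treated as 0.
    """
    gamma = 1.0 - alpha - beta
    demand_types = set(p_title) | set(p_text) | set(p_url)
    return {
        c: alpha * p_title.get(c, 0.0)
        + beta * p_text.get(c, 0.0)
        + gamma * p_url.get(c, 0.0)
        for c in demand_types
    }


# Hypothetical class probabilities from the three classifiers.
scores = fuse_linear(
    {"video": 0.8, "menu": 0.2},
    {"video": 0.6, "menu": 0.4},
    {"video": 0.5, "menu": 0.5},
    alpha=0.5,
    beta=0.3,
)
best = max(scores, key=scores.get)  # demand type with the highest fused score
```

The fused scores can then be thresholded or ranked to label the query, in the same way as the single-classifier class probabilities.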
Finally, using the flow shown in this embodiment, the demand type of each query to be identified can be determined. Some examples are given in Table 3.
Table 3
| query | Video | Menu | Picture | Restaurant |
|---|---|---|---|---|
| Home cooking | Strong demand | Secondary demand | No demand | Weak demand |
| The way of home cooking | Weak demand | Strong demand | No demand | No demand |
| Jewel in the Palace | Strong demand | No demand | No demand | Secondary demand |
| Jewel in the Palace high-definition online viewing | Strong demand | No demand | No demand | No demand |
Embodiment 2
Fig. 2 is a flowchart of the method for building the core word vector of a demand type according to Embodiment 2 of the present invention. As shown in Fig. 2, the method comprises the following steps:
Step 201: Obtain the seed queries of the demand type.
A set of seed queries is preset for each demand type; these seed queries embody the demand of the corresponding demand type. The seed query sets may be configured manually or annotated manually in the search log. Preferably, seed queries may also be mined from the search log. For example, from the search log of the vertical search of the demand type, queries whose search count exceeds a preset first threshold are taken as seed queries of the demand type. Alternatively, from the search log of web search, the queries that led to clicks on websites of the demand type, or to clicks on titles containing feature words of the demand type, are obtained, and those among them whose search count exceeds a preset second threshold are taken as seed queries of the demand type.
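Mining seed queries from a vertical-search log by search count can be sketched as below. The log format (a plain list of query strings), the function name `mine_seed_queries`, and the normalization step are assumptions made for illustration.

```python
from collections import Counter


def mine_seed_queries(vertical_log_queries, first_threshold):
    """Return queries searched more than `first_threshold` times in the log.

    vertical_log_queries: query strings taken from the search log of the
    vertical search of one demand type (hypothetical input format).
    """
    # Light normalization so that trivially different spellings are counted together.
    counts = Counter(q.strip().lower() for q in vertical_log_queries)
    return sorted(q for q, n in counts.items() if n > first_threshold)


log = ["braised pork", "Braised pork ", "braised pork", "egg soup", "egg soup", "noodles"]
seeds = mine_seed_queries(log, first_threshold=1)
```

The same counting step would apply to the web-search variant, after first filtering the log to queries whose clicks landed on sites or titles of the demand type.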
Step 202: Perform a search for each seed query of the demand type and obtain the top N1 search results for each, where N1 is a preset positive integer.
Step 203: Perform word segmentation on the text of the obtained search results to obtain all n-grams.
Here, the text of a search result may include, but is not limited to, the web page title and the web page summary.
Step 204: Determine the weight of each n-gram according to its term frequency (tf) * inverse document frequency (idf) value, sort all n-grams by weight, and take the top N2 n-grams as the core word vector of the demand type, where N2 is a preset positive integer.
The resulting core word vector of the demand type contains the n-grams and their weights.
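Steps 203 and 204 can be sketched as follows. Several details are assumptions rather than part of the disclosure: whitespace tokenization stands in for Chinese word segmentation, idf is computed over a separate background collection (the text does not say which collection is used), and the smoothed idf formula is one common variant. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Yield every n-gram (space-joined) with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_core_word_vector(result_texts, background_texts, top_n2, max_n=2):
    """Rank n-grams from the seed-query search results by tf*idf; keep the top N2."""
    # Term frequency over the titles/summaries retrieved for the seed queries.
    tf = Counter()
    for text in result_texts:
        tf.update(ngrams(text.lower().split(), max_n))
    # Document frequency over the background collection (an assumption).
    df = Counter()
    for text in background_texts:
        df.update(set(ngrams(text.lower().split(), max_n)))
    n_docs = len(background_texts)
    weights = {
        gram: count * (math.log((1 + n_docs) / (1 + df[gram])) + 1.0)
        for gram, count in tf.items()
    }
    # Sort by descending weight; ties are broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_n2])


vector = build_core_word_vector(
    ["braised pork recipe", "easy pork recipe steps"],
    ["pork prices rise", "watch drama online", "weather today"],
    top_n2=2,
)
```

The returned dictionary maps each retained n-gram to its weight, matching the description of the core word vector as n-grams together with their weights.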
The above is a detailed description of the method provided by the present invention. The apparatus provided by the present invention is described in detail below through Embodiment 3.
Embodiment 3
Fig. 3 is a structural diagram of the apparatus for identifying a search demand according to Embodiment 3 of the present invention. As shown in Fig. 3, the apparatus comprises: a result acquisition unit 300, a classifier 310, and a demand fusion unit 320.
After receiving a query to be identified, the result acquisition unit 300 obtains the search results of the query to be identified.
Specifically, after receiving the query to be identified, the result acquisition unit 300 provides the query to a search engine for retrieval and obtains the top N search results from the search engine. Alternatively, after receiving the query to be identified, it expands the query, provides each combination of the query and an expansion word to the search engine, and obtains from the search engine the top N search results for each such combination, where the expansion words are the preset demand words of each demand class and N is a preset positive integer.
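The two retrieval modes of the result acquisition unit can be sketched as below. The `search` callable is a hypothetical stand-in for the search engine, and joining the query and the demand word with a space is an assumption (a Chinese query would more likely be concatenated directly).

```python
def expand_query(query, demand_words):
    """Combine the query with every preset demand word of every demand class."""
    return [f"{query} {word}" for words in demand_words.values() for word in words]


def top_n_results(search, query, n, demand_words=None):
    """Return the top-N results for the query, or for each expanded query.

    search:       hypothetical callable mapping a query string to a ranked
                  iterable of results.
    demand_words: optional mapping of demand class to its preset demand words;
                  when given, the expanded queries are searched instead.
    """
    queries = [query] if demand_words is None else expand_query(query, demand_words)
    return {q: list(search(q))[:n] for q in queries}


def fake_search(q):
    # Stub search engine used only to make the sketch runnable.
    return [f"{q}#{rank}" for rank in range(5)]


plain = top_n_results(fake_search, "home cooking", n=2)
expanded = top_n_results(
    fake_search, "home cooking", n=2,
    demand_words={"video": ["watch online"], "menu": ["recipe"]},
)
```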
The classifier 310 performs demand classification on each search result obtained by the result acquisition unit, based on preset search result text features.
The demand fusion unit 320 fuses the demand classification results of the search results and determines the demand type of the query to be identified according to the fusion result.
In the apparatus, more than one classifier may be used, with each classifier using different search result text features. Specifically, the classifier 310 may include: a classifier 311 built for web page titles, a classifier 312 built for web page summaries, or a classifier 313 built for URLs.
The classifier 311 built for web page titles may use at least one of the following search result text features as a classifier feature:
whether the query to be identified appears in the web page title, and the number of times it appears;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
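The title-classifier features listed above can be computed along the following lines. This is a sketch under stated assumptions: whitespace tokenization replaces word segmentation, "overlap" is taken to be the summed weight of the shared n-grams (the text does not fix a definition), and the click counts are assumed to have been read from the search log already. All names are hypothetical.

```python
def title_features(title, query, core_vectors, title_clicks, total_clicks, max_n=2):
    """Build a feature dictionary for one search result's web page title.

    core_vectors: mapping of demand type -> {n-gram: weight} (core word vectors).
    title_clicks: clicks on this title for the query, from the search log.
    total_clicks: clicks on all titles for the query, from the search log.
    """
    title_l, query_l = title.lower(), query.lower()
    occurrences = title_l.count(query_l) if query_l else 0
    tokens = title_l.split()
    grams = {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }
    features = {
        "query_in_title": 1.0 if occurrences else 0.0,
        "query_count": float(occurrences),
        "click_ratio": title_clicks / total_clicks if total_clicks else 0.0,
    }
    for demand_type, vector in core_vectors.items():
        # One overlap feature per demand type: summed weight of shared n-grams.
        features[f"overlap_{demand_type}"] = sum(
            weight for gram, weight in vector.items() if gram in grams
        )
    return features


features = title_features(
    "Home cooking recipe video",
    "home cooking",
    {"menu": {"recipe": 2.0, "pork recipe": 1.0}, "video": {"video": 1.5}},
    title_clicks=30,
    total_clicks=120,
)
```

The resulting dictionary can be converted to a fixed-order vector and passed to whichever classifier is used, such as a maximum entropy or SVM classifier.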
The classifier built for web page summaries may use at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified appears; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
The classifier built for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
Because both the classifier 311 built for web page titles and the classifier 312 built for web page summaries use the core word vectors of the demand types, the apparatus may further include a vector building unit 330 for building the core word vector of a demand type, which specifically includes: a seed query acquisition subunit 331, a search result acquisition subunit 332, a phrase acquisition subunit 333, and a vector building subunit 334.
The seed query acquisition subunit 331 obtains the seed queries of the demand type. Specifically, it may obtain seed queries of the demand type that were configured manually; or obtain seed queries of the demand type that were annotated manually in the search log; or obtain, from the search log of the vertical search of the demand type, queries whose search count exceeds a preset first threshold as seed queries of the demand type; or obtain, from the search log of web search, the queries that led to clicks on websites of the demand type or to clicks on titles containing feature words of the demand type, and take those among them whose search count exceeds a preset second threshold as seed queries of the demand type.
The search result acquisition subunit 332 performs a search for each seed query of the demand type and obtains the top N1 search results for each, where N1 is a preset positive integer.
The phrase acquisition subunit 333 performs word segmentation on the text of the search results obtained by the search result acquisition subunit 332 to obtain all n-grams.
The vector building subunit 334 determines the weight of each n-gram according to its tf*idf value and takes the N2 n-grams with the highest weights as the core word vector of the demand type, where N2 is a preset positive integer.
In a preferred embodiment, the classifier 310 may be, but is not limited to, a maximum entropy classifier or a support vector machine classifier.
If there is one classifier, the demand fusion unit 320 determines the demand type of the query to be identified according to the number of search results in each category of the demand classification results.
If there are multiple classifiers, the demand fusion unit 320 may fuse their classification results using an existing multi-classifier fusion method, such as a boosting-based fusion method or a linearly weighted multi-classifier fusion method, which is not described again here.
After the demand type has been identified using the method or apparatus provided by the embodiments of the present invention, it may be used in, but is not limited to, the following application scenarios:
1) Ranking in general web search. After a user inputs a query, the demand type of the query can be identified by the method and apparatus of the embodiments of the present invention, and pages matching the demand type of the query are moved up in the general search results.
For example, when a user inputs the query "home cooking high definition", general search can identify that the query has a video-class demand. The results page of the general search may then contain video information related to the TV drama "Home Cooking". This video information may be provided by the video vertical search and inserted into the general search results, so that the video-class pages are ranked at the front of the search results, as shown in Fig. 4, which greatly improves user satisfaction and the search experience.
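Promoting pages of the identified demand type can be sketched with a stable sort, which moves matching results to the front while preserving the original relative order within each group. The result format (dictionaries with `url` and `type` keys) and the function name are assumptions for illustration.

```python
def promote_demand_type(results, demand_type):
    """Move results whose type matches the identified demand type to the front.

    sorted() is stable, so the engine's original order is kept within the
    promoted group and within the remaining group.
    """
    return sorted(results, key=lambda r: r["type"] != demand_type)


results = [
    {"url": "a.example/recipe", "type": "menu"},
    {"url": "b.example/episode-1", "type": "video"},
    {"url": "c.example/restaurant", "type": "restaurant"},
    {"url": "d.example/episode-2", "type": "video"},
]
reranked = promote_demand_type(results, "video")
```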
2) Vertical search. After a user inputs a query, the demand type of the query can be identified by the method and apparatus of the embodiments of the present invention, and the query is distributed to the most suitable content resource or application provider for processing, so that matching results are returned to the user accurately and efficiently.
For example, when a user inputs "from Baidu Building to Wudaokou", the query can be identified as having a map-class demand. The query is provided to the map vertical search, which calculates the bus route and then directly displays the bus travel map from Baidu Building to Wudaokou together with the relevant bus information, as shown in Fig. 5.
3) Information recommendation. After a user inputs a query, the demand type of the query can be identified by the method and apparatus of the embodiments of the present invention, and information is recommended to the user based on this demand type, for example advertisement recommendation, knowledge question-and-answer platform recommendation, or query recommendation.
For example, if a user inputs the query "cheap MP3 player" and its demand type is identified as the shopping class, advertisements related to MP3 players can be recommended in the search results. Such advertisements match the user's actual demand closely.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (20)

1. A method for identifying a search demand, characterized in that the method comprises:
S1: after receiving a query to be identified, obtaining the search results of the query to be identified;
S2: performing demand classification on each search result using a classifier, based on preset search result text features;
S3: fusing the demand classification results of the search results, and determining the demand type of the query to be identified according to the fusion result, wherein, if there is one classifier, the demand type of the query to be identified is determined according to the number of search results in each category of the demand classification results; and if there are multiple classifiers, a boosting-based fusion method or a linearly weighted multi-classifier fusion method is used.
2. The method according to claim 1, characterized in that step S1 specifically comprises:
after receiving the query to be identified, providing the query to be identified to a search engine for searching, and obtaining the top N search results from the search engine; or
after receiving the query to be identified, expanding the query to be identified, providing each combination of the query to be identified and an expansion word to the search engine for searching, and obtaining from the search engine the top N search results corresponding to each combination of the query to be identified and an expansion word, wherein the expansion words are the preset demand words of each demand class;
wherein N is a preset positive integer.
3. The method according to claim 1, characterized in that more than one classifier is used in step S2, and each classifier uses different search result text features.
4. The method according to claim 1, characterized in that the classifier comprises: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
5. The method according to claim 4, characterized in that the classifier built for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be identified appears in the web page title, and the number of times it appears;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
6. The method according to claim 4, characterized in that the classifier built for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified appears; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
7. The method according to claim 4, characterized in that the classifier built for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
8. The method according to claim 5 or 6, characterized in that building the core word vector of the demand type comprises:
A1: obtaining the seed queries of the demand type;
A2: performing a search for each seed query of the demand type and obtaining the top N1 search results for each, wherein N1 is a preset positive integer;
A3: performing word segmentation on the text of the obtained search results to obtain all n-grams;
A4: determining the weight of each n-gram according to its term frequency (tf) * inverse document frequency (idf) value, and taking the N2 n-grams with the highest weights as the core word vector of the demand type, wherein N2 is a preset positive integer.
9. The method according to claim 8, characterized in that step A1 comprises:
obtaining seed queries of the demand type that were configured manually; or
obtaining seed queries of the demand type that were annotated manually in the search log; or
obtaining, from the search log of the vertical search of the demand type, queries whose search count exceeds a preset first threshold as seed queries of the demand type; or
obtaining, from the search log of web search, the queries that led to clicks on websites of the demand type or to clicks on titles containing feature words of the demand type, and taking those among the obtained queries whose search count exceeds a preset second threshold as seed queries of the demand type.
10. The method according to any one of claims 1 to 7, characterized in that the classifier is a maximum entropy classifier or a support vector machine classifier.
11. An apparatus for identifying a search demand, characterized in that the apparatus comprises:
a result acquisition unit, configured to obtain the search results of a query to be identified after receiving the query to be identified;
a classifier, configured to perform demand classification on each search result obtained by the result acquisition unit, based on preset search result text features;
a demand fusion unit, configured to fuse the demand classification results of the search results and determine the demand type of the query to be identified according to the fusion result, wherein, if there is one classifier, the demand type of the query to be identified is determined according to the number of search results in each category of the demand classification results; and if there are multiple classifiers, a boosting-based fusion method or a linearly weighted multi-classifier fusion method is used.
12. The apparatus according to claim 11, characterized in that, after receiving the query to be identified, the result acquisition unit provides the query to be identified to a search engine for searching and obtains the top N search results from the search engine; or
after receiving the query to be identified, expands the query to be identified, provides each combination of the query to be identified and an expansion word to the search engine for searching, and obtains from the search engine the top N search results corresponding to each combination of the query to be identified and an expansion word, wherein the expansion words are the preset demand words of each demand class;
wherein N is a preset positive integer.
13. The apparatus according to claim 11, characterized in that the apparatus uses more than one classifier, and each classifier uses different search result text features.
14. The apparatus according to claim 11, characterized in that the classifier comprises: a classifier built for web page titles, a classifier built for web page summaries, or a classifier built for URLs.
15. The apparatus according to claim 14, characterized in that the classifier built for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be identified appears in the web page title, and the number of times it appears;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
16. The apparatus according to claim 14, characterized in that the classifier built for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified appears; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
17. The apparatus according to claim 14, characterized in that the classifier built for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
18. The apparatus according to claim 15 or 16, characterized in that the apparatus further comprises a vector building unit for building the core word vector of the demand type;
the vector building unit specifically comprises:
a seed query acquisition subunit, configured to obtain the seed queries of the demand type;
a search result acquisition subunit, configured to perform a search for each seed query of the demand type and obtain the top N1 search results for each, wherein N1 is a preset positive integer;
a phrase acquisition subunit, configured to perform word segmentation on the text of the search results obtained by the search result acquisition subunit to obtain all n-grams;
a vector building subunit, configured to determine the weight of each n-gram according to its term frequency (tf) * inverse document frequency (idf) value, and to take the N2 n-grams with the highest weights as the core word vector of the demand type, wherein N2 is a preset positive integer.
19. The apparatus according to claim 18, characterized in that the seed query acquisition subunit obtains seed queries of the demand type that were configured manually; or
obtains seed queries of the demand type that were annotated manually in the search log; or
obtains, from the search log of the vertical search of the demand type, queries whose search count exceeds a preset first threshold as seed queries of the demand type; or
obtains, from the search log of web search, the queries that led to clicks on websites of the demand type or to clicks on titles containing feature words of the demand type, and takes those among the obtained queries whose search count exceeds a preset second threshold as seed queries of the demand type.
20. The apparatus according to any one of claims 11 to 17, characterized in that the classifier is a maximum entropy classifier or a support vector machine classifier.
CN201110282840.8A 2011-09-21 2011-09-21 A kind of method and apparatus identifying search need Active CN103020066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110282840.8A CN103020066B (en) 2011-09-21 2011-09-21 A kind of method and apparatus identifying search need

Publications (2)

Publication Number Publication Date
CN103020066A CN103020066A (en) 2013-04-03
CN103020066B true CN103020066B (en) 2016-09-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant