CN103020066B - Method and apparatus for identifying search demand - Google Patents
- Publication number: CN103020066B
- Application number: CN201110282840.8A
- Authority: CN (China)
- Prior art keywords: query, identified, search results, search, classifier
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Classification: Information Retrieval, Db Structures And Fs Structures Therefor
Abstract
The invention provides a method and apparatus for identifying search demand. The method comprises: after receiving a query to be identified, obtaining the search results of the query; using a classifier to perform demand classification on each search result based on preset search result text features; and fusing the demand classification results of the search results and determining the demand type of the query according to the fusion result. This approach is entirely unaffected by whether the query to be identified contains a predetermined keyword, so demand identification can be performed for any query. In addition, because the timeliness of a user's search demand is generally reflected in the search results, the demand type identified in this way fully reflects the timeliness of the search demand, which improves the accuracy of search demand identification.
Description
[technical field]
The present invention relates to the field of computer technology, and in particular to a method and apparatus for identifying search demand.
[background art]
With the rapid development and maturing of the Internet worldwide, the information resources on the network are becoming ever richer and the volume of information data is expanding rapidly. Obtaining information through a search engine has become the main way people obtain information. Providing users with more convenient and more accurate query services is the current and future direction of development of search engine technology.
In search engine technology, identifying the user's search demand is an important part of improving search accuracy and effectiveness, and its effect is especially notable in structured search (i.e. vertical search). Existing search demand identification mostly just matches preset keywords. For example, some keywords are preset for the video demand: "watch online", "download online", "on demand", "HD viewing" and so on. If a search request (query) contains one of these keywords, for example the query "home cooking HD viewing", the query can be identified as having a video demand. However, this approach has the following defects:
Defect one: if the query does not contain a preset keyword, the demand type of the query cannot be identified. For example, if the query is only "home cooking", it is difficult to determine the demand of the query directly from the query itself.
Defect two: the timeliness of the query demand cannot be reflected. The demand of some queries changes over time. Take the query "home cooking": before the TV series "Home Cooking" is broadcast, the main demands of this query are the recipe category and the food category. While the TV series is being broadcast, the main demand of the query may change to the video category, and the recipe and food categories may become secondary demands. After the TV series ends its popular run and people's attention to it declines, the main demands of the query become the recipe and food categories again. Existing search demand identification methods obviously cannot reflect this change.
Both defects lower the accuracy of search demand identification, so that the search results for the query cannot accurately satisfy the search demand, and the user has to spend more time and resources finding the needed content.
[summary of the invention]
The invention provides a method and apparatus for identifying search demand, which overcome the defects that the demand cannot be identified when the query contains no preset keyword and that the timeliness of the query demand cannot be reflected, thereby improving the accuracy of search demand identification.
The specific technical solution is as follows:
A method for identifying search demand, the method comprising:
S1: after receiving a query to be identified, obtaining the search results of the query to be identified;
S2: using a classifier to perform demand classification on each search result based on preset search result text features;
S3: fusing the demand classification results of the search results, and determining the demand type of the query to be identified according to the fusion result.
According to a preferred embodiment of the present invention, step S1 specifically comprises:
After receiving the query to be identified, submitting the query to be identified to a search engine for searching, and obtaining from the search engine the top N search results; or,
After receiving the query to be identified, expanding the query to be identified, submitting the combinations of the query to be identified and expansion words to a search engine for searching, and obtaining from the search engine the top N search results among the search results corresponding to the combinations of the query to be identified and the expansion words, the expansion words being the preset demand words of each demand category;
Wherein N is a preset positive integer.
According to a preferred embodiment of the present invention, in step S2 more than one classifier is used, and each classifier uses different search result text features.
According to a preferred embodiment of the present invention, the classifier comprises: a classifier built for page titles, a classifier built for page abstracts, or a classifier built for URLs.
According to a preferred embodiment of the present invention, the classifier built for page titles uses at least one of the following search result text features as classifier features:
Whether the query to be identified appears in the page title, and the number of times it appears;
The overlap between the n-grams determined from the page title and the core word vector of each demand type; and,
The ratio of the number of clicks on the page title for the query to be identified in the search log to the total number of clicks on all page titles corresponding to the query to be identified.
According to a preferred embodiment of the present invention, the classifier built for page abstracts uses at least one of the following search result text features as classifier features:
The number or proportion of sentences in the page abstract in which the query to be identified appears; and,
The overlap between the n-grams contained in the page abstract and the core word vector of each demand type.
According to a preferred embodiment of the present invention, the classifier built for URLs uses at least one of the following search result text features as classifier features:
The rank of the search result corresponding to the URL;
The page type corresponding to the URL; and,
The ratio of the number of clicks on the URL for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
According to a preferred embodiment of the present invention, building the core word vector of a demand type comprises:
A1: obtaining seed queries of the demand type;
A2: searching with each seed query of the demand type and obtaining the top N1 search results for each, N1 being a preset positive integer;
A3: performing word segmentation on the text of the obtained search results to obtain all n-grams;
A4: determining the weight of each n-gram from the product of its term frequency (tf) and inverse document frequency (idf), and taking the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
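Steps A3 and A4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `core_word_vector`, the separate background collection used for document frequencies, and the smoothed idf formula are all assumptions, since the text only states that the weight is tf*idf.

```python
import math
from collections import Counter

def core_word_vector(demand_docs, background_docs, n2):
    """Return the n2 highest-weighted n-grams for one demand type.

    demand_docs: n-gram lists, one per search result of the seed queries (A2/A3).
    background_docs: n-gram lists used only to estimate document frequency
                     (hypothetical; the source does not say where idf comes from).
    """
    # Term frequency: total occurrences across the demand type's result texts.
    tf = Counter(gram for doc in demand_docs for gram in doc)
    # Document frequency: number of background documents containing the n-gram.
    df = Counter(gram for doc in background_docs for gram in set(doc))
    total = len(background_docs)
    weights = {
        # Smoothed idf (an assumption) avoids division by zero for unseen n-grams.
        gram: count * (math.log((1 + total) / (1 + df[gram])) + 1)
        for gram, count in tf.items()
    }
    # A4: keep the n2 n-grams with the largest weights.
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n2]

# Toy data, invented for illustration only.
recipe_docs = [["recipe", "home cooking"], ["recipe", "food"]]
background = recipe_docs + [["video", "movie"], ["video", "food"], ["news"]]
top = core_word_vector(recipe_docs, background, n2=2)
print([gram for gram, _ in top])
```

In this toy run "recipe" ranks first because it occurs in both result texts, and "home cooking" outranks "food" because it is rarer in the background collection.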
According to a preferred embodiment of the present invention, step A1 comprises:
Obtaining seed queries of the demand type that have been configured manually; or,
Obtaining seed queries of the demand type that have been labeled manually in the search log; or,
From the search log of the vertical search of the demand type, obtaining queries whose search count is higher than a preset first threshold as seed queries of the demand type; or,
From the search log of web search, obtaining queries for which a website of the demand type was clicked or a title containing feature words of the demand type was clicked, and taking those of the obtained queries whose search count is higher than a preset second threshold as seed queries of the demand type.
According to a preferred embodiment of the present invention, the classifier is a maximum entropy classifier or a support vector machine classifier.
According to a preferred embodiment of the present invention, if there is one classifier, step S3 is: determining the demand type of the query to be identified according to the number of search results contained in each class of the demand classification result;
If there are multiple classifiers, step S3 uses a boosting-based fusion method or a linearly weighted multi-classifier fusion method.
An apparatus for identifying search demand, the apparatus comprising:
A result obtaining unit, configured to obtain the search results of a query to be identified after receiving the query to be identified;
A classifier, configured to perform demand classification, based on preset search result text features, on each search result obtained by the result obtaining unit;
A demand fusion unit, configured to fuse the demand classification results of the search results and determine the demand type of the query to be identified according to the fusion result.
According to a preferred embodiment of the present invention, after receiving the query to be identified, the result obtaining unit submits the query to be identified to a search engine for searching and obtains from the search engine the top N search results; or,
After receiving the query to be identified, the result obtaining unit expands the query to be identified, submits the combinations of the query to be identified and expansion words to a search engine for searching, and obtains from the search engine the top N search results among the search results corresponding to the combinations of the query to be identified and the expansion words, the expansion words being the preset demand words of each demand category;
Wherein N is a preset positive integer.
According to a preferred embodiment of the present invention, the apparatus uses more than one classifier, and each classifier uses different search result text features.
According to a preferred embodiment of the present invention, the classifier comprises: a classifier built for page titles, a classifier built for page abstracts, or a classifier built for URLs.
According to a preferred embodiment of the present invention, the classifier built for page titles uses at least one of the following search result text features as classifier features:
Whether the query to be identified appears in the page title, and the number of times it appears;
The overlap between the n-grams determined from the page title and the core word vector of each demand type; and,
The ratio of the number of clicks on the page title for the query to be identified in the search log to the total number of clicks on all page titles corresponding to the query to be identified.
According to a preferred embodiment of the present invention, the classifier built for page abstracts uses at least one of the following search result text features as classifier features:
The number or proportion of sentences in the page abstract in which the query to be identified appears; and,
The overlap between the n-grams contained in the page abstract and the core word vector of each demand type.
According to a preferred embodiment of the present invention, the classifier built for URLs uses at least one of the following search result text features as classifier features:
The rank of the search result corresponding to the URL;
The page type corresponding to the URL; and,
The ratio of the number of clicks on the URL for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
According to a preferred embodiment of the present invention, the apparatus further comprises a vector building unit for building the core word vector of a demand type;
The vector building unit specifically comprises:
A seed query obtaining subunit, configured to obtain seed queries of the demand type;
A search result obtaining subunit, configured to search with each seed query of the demand type and obtain the top N1 search results for each, N1 being a preset positive integer;
A phrase obtaining subunit, configured to perform word segmentation on the text of the search results obtained by the search result obtaining subunit to obtain all n-grams;
A vector building subunit, configured to determine the weight of each n-gram from the product of its term frequency (tf) and inverse document frequency (idf), and to take the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
According to a preferred embodiment of the present invention, the seed query obtaining subunit obtains seed queries of the demand type that have been configured manually; or,
Obtains seed queries of the demand type that have been labeled manually in the search log; or,
From the search log of the vertical search of the demand type, obtains queries whose search count is higher than a preset first threshold as seed queries of the demand type; or,
From the search log of web search, obtains queries for which a website of the demand type was clicked or a title containing feature words of the demand type was clicked, and takes those of the obtained queries whose search count is higher than a preset second threshold as seed queries of the demand type.
According to a preferred embodiment of the present invention, the classifier is a maximum entropy classifier or a support vector machine classifier.
According to a preferred embodiment of the present invention, if there is one classifier, the demand fusion unit determines the demand type of the query to be identified according to the number of search results contained in each class of the demand classification result;
If there are multiple classifiers, the demand fusion unit uses a boosting-based fusion method or a linearly weighted multi-classifier fusion method.
As can be seen from the above technical solutions, after obtaining the search results of the query to be identified, the present invention performs demand classification on the search results and then fuses the demand classification results to determine the demand type of the query to be identified. This approach is entirely unaffected by whether the query to be identified contains a preset keyword, so demand identification can be performed for any query to be identified. In addition, because the timeliness of a user's search demand is generally reflected in the search results, the demand type identified in this way fully reflects the timeliness of the search demand, which improves the accuracy of search demand identification.
[description of the drawings]
Fig. 1 is a flowchart of the method for identifying search demand provided by embodiment one of the present invention;
Fig. 2 is a flowchart of the method for building the core word vector of a demand type provided by embodiment two of the present invention;
Fig. 3 is a structural diagram of the apparatus for identifying search demand provided by embodiment three of the present invention;
Fig. 4 is an example of applying the search demand identification provided by the embodiments of the present invention to general web search ranking;
Fig. 5 is an example of applying the search demand identification provided by the embodiments of the present invention to vertical search.
[detailed description of the embodiments]
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Embodiment one,
Fig. 1 is a flowchart of the method for identifying search demand provided by embodiment one of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: after receiving a query to be identified, obtain the search results of the query to be identified.
After the query to be identified is received, it is submitted to a search engine for retrieval, and the top N search results are obtained.
When the query to be identified is submitted to the search engine, it is possible to submit only the query itself and obtain the search results of the query from the search engine.
Preferably, the query to be identified can be expanded: the combinations of the query to be identified and expansion words are submitted to the search engine, and the search results corresponding to these combinations are obtained from the search engine, the expansion words being the preset demand words of the demand categories. Since only a small number of such preset demand words are needed, usually a few dozen, they can be configured manually.
For example, the preset demand words of the video demand include: video, TV series, movie, HD viewing, etc. The preset demand words of the recipe demand include: recipe, cookbook, food, etc. For the query to be identified "home cooking", the following combinations of the query and expansion words are then obtained: "home cooking video", "home cooking TV series", "home cooking movie", "home cooking HD online viewing", "home cooking recipe", "home cooking cookbook", "home cooking food", etc. After these combinations are submitted to the search engine, the search engine searches and returns the combined search results, from which the top N search results are taken. Alternatively, the top-ranked search results can be taken from the results returned for each combination, so that N search results are obtained in total.
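The expansion step can be sketched in Python as below. It is only an illustration: the function name `expand_query` and the dictionary layout of the demand words are assumptions, and the demand words are the ones listed in the example above.

```python
def expand_query(query, demand_words):
    """Combine the query with every preset demand word of every demand category."""
    return [f"{query} {word}" for words in demand_words.values() for word in words]

# Demand words from the example in the text; the dictionary layout is an assumption.
DEMAND_WORDS = {
    "video": ["video", "TV series", "movie", "HD viewing"],
    "recipe": ["recipe", "cookbook", "food"],
}

combinations = expand_query("home cooking", DEMAND_WORDS)
print(combinations[0])   # home cooking video
```

Each combination would then be submitted to the search engine in place of, or in addition to, the bare query.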
The purpose of searching with the expanded query to obtain search results is to overcome the problem that the demands of the top N search results of some queries are too concentrated, which makes demand identification inaccurate. For example, the query "Zhang Ziyi" has many demands. When "Zhang Ziyi" is searched on its own, picture search results may appear only rarely among the top N search results, making it difficult to determine that the query has a strong picture demand. If the query is expanded to "Zhang Ziyi photo", however, results related to the picture demand appear more often among the top N search results, which greatly helps the accuracy of the subsequent identification of the query's search demand.
Step 102: use a classifier to perform demand classification on each search result based on preset search result text features.
When demand classification is performed on the search results in this step, more than one classifier can be used, each using different search result text features. In this embodiment a classifier can be built for at least one of the page title, the page abstract and the URL of the search results. Here, three classifiers are built as an example, called the title classifier, the abstract classifier and the URL classifier. The classifier features used by these three classifiers are described below.
1) The title classifier can use at least one of the following three search result text features as classifier features:
The first: whether the query to be identified appears in the page title, and the number of times it appears.
This feature measures the relevance of the page title of the search result to the query to be identified. If the query to be identified appears in the page title, the search result is more relevant to the query and can contribute more to identifying the search demand of the query. For example, the page title of a search result is "Most common home cooking recipes - home cooking methods - Food World home cooking" and the query to be identified is "home cooking". The query appears in this page title, and appears 3 times, which indicates that this search result contributes substantially to identifying the demand of the query.
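A minimal Python sketch of this first title feature is given below. The function name `query_in_title_features` is hypothetical, the title is an English rendering of the example, and plain case-insensitive substring counting is an assumption about how occurrences are counted.

```python
def query_in_title_features(title, query):
    """Return (appears, count): whether the query occurs in the title and how often."""
    count = title.lower().count(query.lower())
    return count > 0, count

title = "Most common home cooking recipes - home cooking methods - Food World home cooking"
print(query_in_title_features(title, "home cooking"))   # (True, 3)
```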
The second: the overlap between the n-grams determined from the page title and the core word vector of each demand type.
An n-gram is a combination of n minimum-granularity words appearing in order, where n is one or more preset positive integers. Taking the page title "Most common home cooking recipes - home cooking methods - Food World home cooking" as an example, if n is chosen as 1 and 2, the n-grams determined from this page title are:
1-gram: most, common, home cooking, recipe, home cooking, method, food, world, home cooking
2-gram: most common, common home cooking, home cooking recipe, recipe home cooking, home cooking method, method food, food world, world home cooking
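The n-gram construction can be sketched as follows, assuming the title has already been segmented into words. The helper name `ngrams` and the English token list are illustrative assumptions; the original example operates on segmented Chinese text.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Segmented title from the example (English rendering).
tokens = ["most", "common", "home cooking", "recipe", "home cooking",
          "method", "food", "world", "home cooking"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(bigrams[:3])   # ['most common', 'common home cooking', 'home cooking recipe']
```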
The core word vector of each demand type can be built by manual configuration or by automatic mining. For automatic mining, see embodiment two.
Suppose the flow shown in embodiment two has been performed for the recipe demand. The resulting core word vector of the recipe demand consists of core words and their respective weights.
When determining the overlap between the n-grams from the page title and the core word vector of each demand type, the overlap can be measured as an overlap count or an overlap rate.
Continuing the example above, the overlap counts between the n-grams and the core word vector of the recipe category are shown in Table 1.
Table 1
n-gram | Overlap count |
home cooking | 3 |
recipe | 1 |
home cooking recipe | 1 |
food | 1 |
The overlap rate can be calculated as the ratio of the sum of the weights of the n-grams that overlap with the core word vector of the demand type to the total sum of the weights of the core word vector of that demand type. Continuing the example above, the overlap rate between the n-grams and the core word vector of the recipe category is:
(0.82+0.98+1.00+0.95)/(0.82+1.00+1.00+1.00+0.92+0.56+0.98+0.87+1.00+0.95+1.00) = 0.37
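The overlap count and overlap rate can be reproduced with the short Python sketch below. The 11 weights are the ones in the calculation above, but which weight belongs to which core word is an assumption, and the `other_*` entries are placeholders for core words the text does not name. The function names are hypothetical.

```python
def overlap_count(ngram_list, core_vector):
    """Count, for each core word, how often it occurs among the title's n-grams."""
    return {w: ngram_list.count(w) for w in core_vector if w in ngram_list}

def overlap_rate(ngram_list, core_vector):
    """Weight of the overlapping core words divided by the total weight of the vector."""
    present = set(ngram_list)
    overlapping = sum(wt for w, wt in core_vector.items() if w in present)
    return overlapping / sum(core_vector.values())

# Assumed word-to-weight assignment; "other_*" are placeholder core words.
CORE_VECTOR = {
    "home cooking": 0.82, "other_1": 1.00, "other_2": 1.00, "other_3": 1.00,
    "other_4": 0.92, "other_5": 0.56, "recipe": 0.98, "other_6": 0.87,
    "home cooking recipe": 1.00, "food": 0.95, "other_7": 1.00,
}
# 1-grams and 2-grams of the example title (English rendering).
TITLE_NGRAMS = ["most", "common", "home cooking", "recipe", "home cooking", "method",
                "food", "world", "home cooking", "most common", "common home cooking",
                "home cooking recipe", "recipe home cooking", "home cooking method",
                "method food", "food world", "world home cooking"]

print(overlap_count(TITLE_NGRAMS, CORE_VECTOR))
print(round(overlap_rate(TITLE_NGRAMS, CORE_VECTOR), 2))   # 0.37
```

The first print reproduces the counts of Table 1, and the second reproduces the overlap rate of 3.75/10.10.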
The third: the ratio of the number of clicks on this page title for the query to be identified in the search log to the total number of clicks on all page titles corresponding to the query to be identified.
After a user searches for a query, if the page title of a search result is attractive, the user will tend to click on that search result. Therefore, the more often the page title of a search result is clicked by users, the stronger the ability of that page title to satisfy the user's demand.
For example, for the query to be identified "home cooking", the page title "Most common home cooking recipes - home cooking methods - Food World home cooking" among the corresponding search results was clicked 120 times, and all page titles corresponding to the query were clicked 300 times in total, so the ratio is 120/300 = 0.4.
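The click ratio can be computed from aggregated search-log counts as in the sketch below. The function name `click_share` and the dictionary layout are assumptions. Only the count of 120 and the total of 300 come from the example; the other two titles are placeholders. The same helper would apply to per-URL click counts.

```python
def click_share(click_counts, item):
    """Share of the query's total clicks that went to one title (or URL)."""
    total = sum(click_counts.values())
    return click_counts.get(item, 0) / total if total else 0.0

TITLE = "Most common home cooking recipes - home cooking methods - Food World home cooking"
# Clicks per title for the query "home cooking"; placeholder titles make up the rest.
clicks = {TITLE: 120, "placeholder title A": 100, "placeholder title B": 80}

share = click_share(clicks, TITLE)
print(share)   # 0.4
```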
2) The abstract classifier can use at least one of the following two search result text features as classifier features:
The first: the number or proportion of sentences in the page abstract in which the query to be identified appears.
This feature measures how well the page abstract satisfies the user's demand. The more sentences in the page abstract contain the query to be identified, the better the search result satisfies the demand of the query.
Suppose the page abstract of a search result is: "Home cooking is indispensable in our lives, there are many ways to make home cooking, such as Northeastern home cooking, Guo Lin home cooking and so on, what is the simplest way to make a home cooking recipe? Meishijie offers you a rich collection of simple home cooking recipes, so you can learn quickly."
This page abstract can be cut into 7 sentences, of which 6 contain "home cooking", so the proportion of sentences in the page abstract in which the query to be identified appears is 6/7 = 0.86.
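The sentence proportion can be sketched in Python as follows. The function name `query_sentence_ratio` is hypothetical, the abstract is an English rendering of the example, and splitting on commas as well as sentence-final punctuation is an assumption chosen to mirror the 7-way split described above.

```python
import re

def query_sentence_ratio(abstract, query):
    """Return (sentences containing the query, total sentences, ratio)."""
    # Assumed segmentation rule: split on commas and sentence-final punctuation.
    sentences = [s.strip() for s in re.split(r"[,.?!;]", abstract) if s.strip()]
    hits = sum(query.lower() in s.lower() for s in sentences)
    ratio = hits / len(sentences) if sentences else 0.0
    return hits, len(sentences), ratio

abstract = ("Home cooking is indispensable in our lives, there are many ways to make "
            "home cooking, such as Northeastern home cooking, Guo Lin home cooking and "
            "so on, what is the simplest way to make a home cooking recipe? Meishijie "
            "offers you a rich collection of simple home cooking recipes, so you can "
            "learn quickly.")
hits, total, ratio = query_sentence_ratio(abstract, "home cooking")
print(hits, total, round(ratio, 2))   # 6 7 0.86
```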
The second: the overlap between the n-grams contained in the page abstract and the core word vector of each demand type.
For this feature, refer to the second feature described for the title classifier; it is not described again here.
3) The URL classifier can use at least one of the following three search result text features as classifier features:
The first: the rank of the corresponding search result.
When ranking the search results of a query, a search engine usually uses the weight of the URL as one of the criteria. Therefore, the greater the weight of the URL and the stronger the relevance between the query and the text contained in the page corresponding to the URL, the higher the URL is ranked. The rank of the search result corresponding to the URL can thus be used to measure how well the URL satisfies the demand of the query to be identified. The resulting score is denoted rank_score and depends on N, the number of search results selected above, and n, the rank of the current search result.
The second: the page type corresponding to the URL.
The page type corresponding to a URL, such as video, picture, novel or recipe, can be obtained in advance by machine learning. If the page type of the URL is consistent with a demand type, the URL satisfies the demand of the user's query to a high degree for that demand type. The value of this feature can be 0 or 1: 1 if consistent, 0 if inconsistent.
The third: the ratio of the number of clicks on this URL for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
After a user searches for a query, if a search result satisfies the user well overall, the user will tend to click on that search result. Therefore, the more often a URL is clicked by users, the stronger the ability of that URL to satisfy the user's demand.
For example, in the search log, among the search results corresponding to the query "home cooking", the URL "www.meishij.net/chufang/diy/jiangchangcaipu/" was clicked 100 times and the URLs corresponding to the query were clicked 300 times in total, so the ratio of the clicks on this URL to the total clicks on the URLs corresponding to the query to be identified is 100/300 = 0.33.
After the classifier features of the title classifier, the abstract classifier and the URL classifier have been obtained, a training set is built for each of them. The training sets can be built manually or by machine learning, and the three classifiers are trained on them. The training sets can be built from samples of each demand type after feature extraction, so that the resulting training set contains the feature values of each classifier feature for each demand type.
Each classifier can be, but is not limited to, a maximum entropy classifier, a support vector machine (SVM) classifier, etc. After feature extraction, the obtained search results of the query to be identified are input to the title classifier, the abstract classifier and the URL classifier respectively, and each search result can then be classified. Since classifying text with a maximum entropy classifier, an SVM classifier or the like, once the classification features have been determined, is existing mature technology, it is not described further here.
An example of the classification of search results by the title classifier, the abstract classifier and the URL classifier is shown in Table 2.
Table 2
Step 103: fuse the demand classification results of the search results, and determine the demand type of the query to be identified according to the fusion result.
If only one classifier is used, then when the demand classification results of that classifier for the search results are fused, the demand type of the query to be identified is determined from the number of search results contained in each class. For example, voting can be used: the class containing the most search results is taken as the demand type of the query to be identified. If, among 100 search results, 70 are classified as the recipe category and 30 as the novel category, the query to be identified is determined to be of the recipe category. Alternatively, the class probability of each class can be calculated, and the classes whose class probability exceeds a set threshold are taken as the demand types of the query to be identified, where the class probability is the ratio of the number of search results in the class to the total number of search results.
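Both single-classifier fusion options (voting and thresholded class probability) can be sketched in Python as below. The function name `fuse_single_classifier` and the threshold value of 0.2 in the usage example are illustrative assumptions; the 70/30 split is the example from the text.

```python
from collections import Counter

def fuse_single_classifier(labels, threshold=None):
    """Fuse per-result labels into demand types for the query.

    With no threshold, return the majority class (voting). With a threshold,
    return every class whose share of the search results exceeds it.
    """
    if not labels:
        return [], {}
    counts = Counter(labels)
    probs = {label: n / len(labels) for label, n in counts.items()}
    if threshold is None:
        return [counts.most_common(1)[0][0]], probs
    return [label for label, p in probs.items() if p > threshold], probs

labels = ["recipe"] * 70 + ["novel"] * 30
print(fuse_single_classifier(labels))          # (['recipe'], {'recipe': 0.7, 'novel': 0.3})
print(fuse_single_classifier(labels, 0.2)[0])  # ['recipe', 'novel']
```

The thresholded variant lets a query carry more than one demand type, which matches the graded demands shown for some queries later in the text.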
When the classification results of multiple classifiers are fused in this step, an existing multi-classifier fusion method may be used, such as a boosting-based fusion method or a linearly weighted multi-classifier fusion method. Only the linearly weighted method is briefly described here. The probability $c^*$ of the search results on demand type $c_k$ is calculated according to the following formula:
$$c^* = \alpha P_{title}(c_k \mid q) + \beta P_{text}(c_k \mid q) + (1 - \alpha - \beta) P_{url}(c_k \mid q)$$
where $P_{title}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the title classifier, $P_{text}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the summary classifier, and $P_{url}(c_k \mid q)$ is the class probability of the query to be identified on demand type $c_k$ based on the URL classifier. $\alpha$ and $\beta$ are weight coefficients, which can be obtained by experiment or by training with a preset algorithm so as to achieve the best classification effect.
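The linearly weighted fusion can be sketched in Python as below. The function name `fuse_linear`, the dictionary representation of each classifier's class probabilities, and the example weights are assumptions made for illustration; the patent does not specify how the weights are chosen beyond experiment or training.

```python
def fuse_linear(p_title, p_text, p_url, alpha, beta):
    """Combine per-classifier class probabilities with linear weights.

    Each p_* maps a demand type c_k to P(c_k | q) for one classifier.
    The URL classifier receives the remaining weight 1 - alpha - beta.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("weights must satisfy alpha, beta >= 0 and alpha + beta <= 1")
    gamma = 1 - alpha - beta
    categories = set(p_title) | set(p_text) | set(p_url)
    return {
        c: alpha * p_title.get(c, 0.0)
        + beta * p_text.get(c, 0.0)
        + gamma * p_url.get(c, 0.0)
        for c in categories
    }


# Hypothetical probabilities for one query on two demand types.
fused = fuse_linear(
    p_title={"video": 0.8, "menu": 0.2},
    p_text={"video": 0.6, "menu": 0.4},
    p_url={"video": 0.5, "menu": 0.5},
    alpha=0.5,
    beta=0.3,
)
best = max(fused, key=fused.get)  # demand type with the highest fused score
print(best, round(fused[best], 2))  # video 0.68
```

Because the three weights sum to 1, the fused scores remain a probability distribution whenever each classifier's inputs are one.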
Finally, with the flow shown in this embodiment, the demand type of each query to be identified can be determined. Some examples are given in Table 3.
Table 3
query | Video | Menu | Picture | Restaurant |
---|---|---|---|---|
Home cooking | Strong demand | Secondary demand | No demand | Weak demand |
The way of home cooking | Weak demand | Strong demand | No demand | No demand |
Jewel in the Palace | Strong demand | No demand | No demand | Secondary demand |
Jewel in the Palace high definition watch online | Strong demand | No demand | No demand | No demand |
Embodiment two
Fig. 2 is a flowchart of the method for establishing the core word vector of a demand type according to embodiment two of the present invention. As shown in Fig. 2, the method comprises the following steps:
Step 201: Obtain the seed queries of the demand type.
A seed query set can be preset for each demand type; the seed queries reflect the need of the corresponding demand type. The seed query sets may be configured manually, or labeled manually in the search log. Preferably, seed queries may also be mined from the search log. For example, from the search log of the vertical search for the demand type, the queries whose number of searches exceeds a preset first threshold are taken as seed queries of the demand type. Alternatively, from the search log of the web search, the queries that led to clicks on websites of the demand type, or on titles containing feature words of the demand type, are obtained, and those among them whose number of searches exceeds a preset second threshold are taken as seed queries of the demand type; and so on.
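Mining seed queries from a vertical search log by a search-count threshold can be sketched as follows. The log format (one query string per search), the function name `mine_seed_queries`, and the sample data are assumptions for illustration only.

```python
from collections import Counter


def mine_seed_queries(vertical_log, first_threshold):
    """Return queries searched more than `first_threshold` times.

    vertical_log -- iterable of query strings, one entry per search issued
                    to the vertical search of a single demand type
    """
    counts = Counter(vertical_log)
    return sorted(q for q, n in counts.items() if n > first_threshold)


# Hypothetical log of a video vertical search.
log = ["jewel in the palace"] * 3 + ["home cooking"] * 2 + ["weather"]
print(mine_seed_queries(log, first_threshold=1))
# ['home cooking', 'jewel in the palace']
```

The web search variant would first filter the log to queries whose clicks landed on sites or titles of the demand type, then apply the same counting with the second threshold.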
Step 202: Search with each seed query of the demand type and obtain the top N1 search results for each, where N1 is a preset positive integer.
Step 203: Perform word segmentation on the text of the obtained search results to obtain all n-grams.
The text of a search result here may include, but is not limited to, the web page title and the web page summary.
Step 204: Determine the weight of each n-gram according to its term frequency (tf) × inverse document frequency (idf) value, sort all n-grams by weight, and take the top N2 n-grams as the core word vector of the demand type, where N2 is a preset positive integer.
The resulting core word vector of the demand type contains the n-grams and their weights.
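Steps 203 and 204 can be sketched in Python as below. Several choices here are assumptions, since the text does not fix them: the texts are taken to be already segmented into tokens, n-grams are limited to unigrams and bigrams, and idf is estimated with add-one smoothing from a separate background collection. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """All contiguous n-grams of length 1..max_n, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def core_word_vector(result_texts, background_texts, n2, max_n=2):
    """Top-n2 n-grams by tf*idf, returned as {n-gram: weight}.

    result_texts     -- token lists from the seed queries' search results
    background_texts -- token lists used only to estimate idf (assumption)
    """
    tf = Counter(g for toks in result_texts for g in ngrams(toks, max_n))
    df = Counter(g for toks in background_texts for g in set(ngrams(toks, max_n)))
    n_bg = len(background_texts)
    # Smoothed idf: n-grams unseen in the background get the largest value.
    weights = {g: f * math.log((1 + n_bg) / (1 + df[g])) for g, f in tf.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n2])


results = [["braised", "pork", "recipe"], ["pork", "recipe", "steps"]]
background = [["pork", "price"], ["news", "today"], ["weather", "today"]]
print(list(core_word_vector(results, background, n2=2)))
# ['pork recipe', 'recipe']
```

The word "pork" is frequent in the results but also appears in the background, so its weight falls below that of "recipe" and "pork recipe".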
The method provided by the present invention has been described in detail above. The device provided by the present invention is described in detail below through embodiment three.
Embodiment three
Fig. 3 is a structural diagram of the device for identifying a search need according to embodiment three of the present invention. As shown in Fig. 3, the device comprises: a result acquiring unit 300, a classifier 310, and a demand fusion unit 320.
After receiving the query to be identified, the result acquiring unit 300 obtains the search results of the query to be identified. Specifically, after receiving the query to be identified, the result acquiring unit 300 provides the query to be identified to a search engine for retrieval and obtains the top N search results from the search engine. Alternatively, after receiving the query to be identified, it expands the query to be identified, provides the combinations of the query to be identified and the expansion words to the search engine, and obtains from the search engine the top N search results corresponding to each combination, where the expansion words are the preset demand words of each demand category. N is a preset positive integer.
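The query expansion performed by the result acquiring unit can be sketched as follows. The function name `expand_query`, the demand words, and the use of a space to join the query and an expansion word are all assumptions made for illustration.

```python
def expand_query(query, demand_words):
    """Combine a query with the preset demand words of each demand category.

    demand_words -- {category: [demand word, ...]} (hypothetical preset table)
    Returns {category: [expanded query, ...]} to be sent to the search engine.
    """
    return {
        category: [f"{query} {word}" for word in words]
        for category, words in demand_words.items()
    }


demand_words = {"video": ["watch online"], "menu": ["recipe", "how to cook"]}
expanded = expand_query("home cooking", demand_words)
print(expanded["menu"])
# ['home cooking recipe', 'home cooking how to cook']
```

Each expanded query would then be searched separately and its top N results passed on for classification.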
The classifier 310 performs demand classification on each search result obtained by the result acquiring unit 300, based on preset search result text features.
The demand fusion unit 320 fuses the demand classification results of the search results and determines the demand type of the query to be identified according to the fusion result.
In the device, more than one classifier may be used, with each classifier using different search result text features. Specifically, the classifier 310 may include: a classifier 311 established for web page titles, a classifier 312 established for web page summaries, or a classifier 313 established for URLs.
The classifier 311 established for web page titles may use at least one of the following search result text features as a classifier feature:
whether the query to be identified occurs in the web page title, and the number of times it occurs;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
The classifier 312 established for web page summaries may use at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified occurs; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
The classifier 313 established for URLs may use at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
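As an illustration of the feature lists above, the three features of the title classifier could be computed as in the following Python sketch. The text does not define how the "overlap" with a core word vector is measured; summing the weights of the shared n-grams is an assumption, as are the function name `title_features` and every argument name.

```python
def title_features(query, title, title_ngrams, core_vectors,
                   title_clicks, total_clicks):
    """Build a feature dictionary for one search result's web page title.

    title_ngrams -- n-grams extracted from the title (segmentation assumed done)
    core_vectors -- {demand type: {n-gram: weight}} core word vectors
    title_clicks -- clicks on this title for the query in the search log
    total_clicks -- clicks on all titles for the query in the search log
    """
    occurrences = title.count(query)
    features = {
        "query_in_title": 1.0 if occurrences else 0.0,
        "query_count": float(occurrences),
        "click_ratio": title_clicks / total_clicks if total_clicks else 0.0,
    }
    for demand_type, vector in core_vectors.items():
        # Assumed overlap measure: total weight of n-grams shared with the vector.
        features["overlap_" + demand_type] = sum(
            vector.get(g, 0.0) for g in set(title_ngrams)
        )
    return features


feats = title_features(
    query="home cooking",
    title="home cooking recipes: easy home cooking",
    title_ngrams=["home", "cooking", "recipes", "easy"],
    core_vectors={"menu": {"recipes": 2.0, "cooking": 0.5},
                  "video": {"episode": 1.5}},
    title_clicks=30,
    total_clicks=120,
)
print(feats["query_count"], feats["overlap_menu"], feats["click_ratio"])
# 2.0 2.5 0.25
```

A feature dictionary of this kind could then be vectorized and passed to a maximum entropy or SVM classifier; the summary and URL classifiers would use analogous extractors.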
Because both the classifier 311 established for web page titles and the classifier 312 established for web page summaries use the core word vector of a demand type, the device may further include a vector establishing unit 330 for establishing the core word vector of the demand type. The vector establishing unit 330 specifically includes: a seed query obtaining subunit 331, a search result obtaining subunit 332, a phrase obtaining subunit 333, and a vector establishing subunit 334.
The seed query obtaining subunit 331 obtains the seed queries of the demand type. Specifically, it may obtain seed queries of the demand type that were configured manually; or obtain seed queries of the demand type that were labeled manually in the search log; or, from the search log of the vertical search for the demand type, take the queries whose number of searches exceeds a preset first threshold as seed queries of the demand type; or, from the search log of the web search, obtain the queries that led to clicks on websites of the demand type or on titles containing feature words of the demand type, and take those among them whose number of searches exceeds a preset second threshold as seed queries of the demand type.
The search result obtaining subunit 332 searches with each seed query of the demand type and obtains the top N1 search results for each, where N1 is a preset positive integer.
The phrase obtaining subunit 333 performs word segmentation on the text of the search results obtained by the search result obtaining subunit 332 to obtain all n-grams.
The vector establishing subunit 334 determines the weight of each n-gram according to its tf*idf value and takes the N2 n-grams with the highest weights as the core word vector of the demand type, where N2 is a preset positive integer.
As a preferred embodiment, the classifier 310 may be, but is not limited to, a maximum entropy classifier or a support vector machine classifier.
If there is one classifier, the demand fusion unit 320 determines the demand type of the query to be identified according to the number of search results contained in each category of the demand classification result.
If there are multiple classifiers, the demand fusion unit 320 may fuse their classification results with an existing multi-classifier fusion method, such as a boosting-based fusion method or a linearly weighted multi-classifier fusion method, which is not described again here.
After a demand type has been identified with the above method or device provided by the embodiments of the present invention, it can be used in, but is not limited to, the following application scenarios:
1) Ranking in general search ("big search"). After a user enters a query, the above method and device of the embodiments of the present invention can identify the demand type of the query, and the pages that match the demand type of the query are ranked higher in the general search results.
For example, when a user enters the query "home cooking high definition" in general search, the query can be identified as having a video demand. The results page of the general search can then contain video information related to the TV series "Home cooking". This video information may be provided by the video vertical search and inserted into the general search results, so that the video pages are placed at the front of the general search results, as shown in Fig. 4. This greatly improves user satisfaction and the search experience.
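Promoting pages of the identified demand type within a result list can be sketched as a stable sort. The result records, their `type` field, and the function name `promote` are hypothetical; in practice the page type would come from the per-result classification or from the vertical search that supplied the page.

```python
def promote(results, demand_type):
    """Move results of `demand_type` to the front, keeping relative order.

    Python's sort is stable, and False sorts before True, so matching
    results come first while the original ranking is preserved within
    each group.
    """
    return sorted(results, key=lambda r: r["type"] != demand_type)


results = [
    {"url": "a", "type": "menu"},
    {"url": "b", "type": "video"},
    {"url": "c", "type": "menu"},
    {"url": "d", "type": "video"},
]
print([r["url"] for r in promote(results, "video")])
# ['b', 'd', 'a', 'c']
```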
2) Vertical search. After a user enters a query, the above method and device of the embodiments of the present invention can identify the demand type of the query, and the query is dispatched to the most suitable content resource or application provider for processing, so that matching results are returned to the user accurately and efficiently.
For example, when a user enters "from Baidu Building to Wudaokou", the query can be identified as having a map demand. The query is provided to the map vertical search, which calculates the bus route, and the bus travel map from Baidu Building to Wudaokou is displayed directly together with the relevant bus information, as shown in Fig. 5.
3) Information recommendation. After a user enters a query, the above method and device of the embodiments of the present invention can identify the demand type of the query, and information is recommended to the user based on that demand type, such as advertisement recommendation, knowledge question-and-answer platform recommendation, or query recommendation.
For example, if a user enters the query "cheap MP3 player" and its demand type is identified as the shopping category, advertisements related to MP3 players can be recommended in the search results. Such advertisements match the user's actual need closely.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (20)
1. A method for identifying a search need, characterized in that the method comprises:
S1. after receiving a query to be identified, obtaining the search results of the query to be identified;
S2. using a classifier to perform demand classification on each search result based on preset search result text features;
S3. fusing the demand classification results of the search results, and determining the demand type of the query to be identified according to the fusion result, wherein, if there is one classifier, the demand type of the query to be identified is determined according to the number of search results contained in each category of the demand classification result; and if there are multiple classifiers, a boosting-based fusion method or a linearly weighted multi-classifier fusion method is used.
2. The method according to claim 1, characterized in that step S1 specifically comprises:
after receiving the query to be identified, providing the query to be identified to a search engine for searching, and obtaining from the search engine the top N search results; or,
after receiving the query to be identified, expanding the query to be identified, providing the combinations of the query to be identified and the expansion words to the search engine for searching, and obtaining from the search engine the top N search results corresponding to each combination of the query to be identified and an expansion word, the expansion words being the preset demand words of each demand category;
wherein N is a preset positive integer.
3. The method according to claim 1, characterized in that more than one classifier is used in step S2, and each classifier uses different search result text features.
4. The method according to claim 1, characterized in that the classifier includes: a classifier established for web page titles, a classifier established for web page summaries, or a classifier established for URLs.
5. The method according to claim 4, characterized in that the classifier established for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be identified occurs in the web page title, and the number of times the query to be identified occurs;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
6. The method according to claim 4, characterized in that the classifier established for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified occurs; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
7. The method according to claim 4, characterized in that the classifier established for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
8. The method according to claim 5 or 6, characterized in that establishing the core word vector of the demand type comprises:
A1. obtaining the seed queries of the demand type;
A2. searching with each seed query of the demand type and obtaining the top N1 search results for each, N1 being a preset positive integer;
A3. performing word segmentation on the text of the obtained search results to obtain all n-grams;
A4. determining the weight of each n-gram according to its term frequency (tf) × inverse document frequency (idf) value, and taking the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
9. The method according to claim 8, characterized in that step A1 comprises:
obtaining seed queries of the demand type that were configured manually; or,
obtaining seed queries of the demand type that were labeled manually in the search log; or,
from the search log of the vertical search for the demand type, taking the queries whose number of searches exceeds a preset first threshold as seed queries of the demand type; or,
from the search log of the web search, obtaining the queries that led to clicks on websites of the demand type or on titles containing feature words of the demand type, and taking those among the obtained queries whose number of searches exceeds a preset second threshold as seed queries of the demand type.
10. The method according to any one of claims 1 to 7, characterized in that the classifier is a maximum entropy classifier or a support vector machine classifier.
11. A device for identifying a search need, characterized in that the device comprises:
a result acquiring unit, configured to obtain the search results of a query to be identified after receiving the query to be identified;
a classifier, configured to perform demand classification, based on preset search result text features, on each search result obtained by the result acquiring unit;
a demand fusion unit, configured to fuse the demand classification results of the search results and determine the demand type of the query to be identified according to the fusion result, wherein, if there is one classifier, the demand type of the query to be identified is determined according to the number of search results contained in each category of the demand classification result; and if there are multiple classifiers, a boosting-based fusion method or a linearly weighted multi-classifier fusion method is used.
12. The device according to claim 11, characterized in that, after receiving the query to be identified, the result acquiring unit provides the query to be identified to a search engine for searching and obtains from the search engine the top N search results; or,
after receiving the query to be identified, expands the query to be identified, provides the combinations of the query to be identified and the expansion words to the search engine for searching, and obtains from the search engine the top N search results corresponding to each combination of the query to be identified and an expansion word, the expansion words being the preset demand words of each demand category;
wherein N is a preset positive integer.
13. The device according to claim 11, characterized in that the device uses more than one classifier, and each classifier uses different search result text features.
14. The device according to claim 11, characterized in that the classifier includes: a classifier established for web page titles, a classifier established for web page summaries, or a classifier established for URLs.
15. The device according to claim 14, characterized in that the classifier established for web page titles uses at least one of the following search result text features as a classifier feature:
whether the query to be identified occurs in the web page title, and the number of times the query to be identified occurs;
the overlap between the n-grams determined from the web page title and the core word vector of each demand type; and
the ratio of the number of times the web page title was clicked for the query to be identified in the search log to the total number of clicks on all web page titles corresponding to the query to be identified.
16. The device according to claim 14, characterized in that the classifier established for web page summaries uses at least one of the following search result text features as a classifier feature:
the number or proportion of sentences in the web page summary in which the query to be identified occurs; and
the overlap between the n-grams contained in the web page summary and the core word vector of each demand type.
17. The device according to claim 14, characterized in that the classifier established for URLs uses at least one of the following search result text features as a classifier feature:
the rank of the search result corresponding to the URL;
the page type corresponding to the URL; and
the ratio of the number of times the URL was clicked for the query to be identified in the search log to the total number of clicks on all URLs corresponding to the query to be identified.
18. The device according to claim 15 or 16, characterized in that the device further comprises a vector establishing unit for establishing the core word vector of the demand type;
the vector establishing unit specifically comprises:
a seed query obtaining subunit, configured to obtain the seed queries of the demand type;
a search result obtaining subunit, configured to search with each seed query of the demand type and obtain the top N1 search results for each, N1 being a preset positive integer;
a phrase obtaining subunit, configured to perform word segmentation on the text of the search results obtained by the search result obtaining subunit to obtain all n-grams;
a vector establishing subunit, configured to determine the weight of each n-gram according to its term frequency (tf) × inverse document frequency (idf) value and take the N2 n-grams with the highest weights as the core word vector of the demand type, N2 being a preset positive integer.
19. The device according to claim 18, characterized in that the seed query obtaining subunit obtains seed queries of the demand type that were configured manually; or,
obtains seed queries of the demand type that were labeled manually in the search log; or,
from the search log of the vertical search for the demand type, takes the queries whose number of searches exceeds a preset first threshold as seed queries of the demand type; or,
from the search log of the web search, obtains the queries that led to clicks on websites of the demand type or on titles containing feature words of the demand type, and takes those among the obtained queries whose number of searches exceeds a preset second threshold as seed queries of the demand type.
20. The device according to any one of claims 11 to 17, characterized in that the classifier is a maximum entropy classifier or a support vector machine classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110282840.8A CN103020066B (en) | 2011-09-21 | 2011-09-21 | A kind of method and apparatus identifying search need |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103020066A CN103020066A (en) | 2013-04-03 |
CN103020066B true CN103020066B (en) | 2016-09-07 |
Family
ID=47968682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110282840.8A Active CN103020066B (en) | 2011-09-21 | 2011-09-21 | A kind of method and apparatus identifying search need |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103020066B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN103020066A (en) | 2013-04-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |