CN103279486A - Method and device for providing related searches - Google Patents

Method and device for providing related searches

Info

Publication number
CN103279486A
Authority
CN
China
Prior art keywords
query
candidate
user
cluster
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101459585A
Other languages
Chinese (zh)
Other versions
CN103279486B (en)
Inventor
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310145958.5A
Publication of CN103279486A
Application granted
Publication of CN103279486B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for providing related searches (RS). In an offline RS mining process, each query in a historical search log is taken in turn as the current query and the following steps are performed: the other queries that appear in the same session as the current query form the candidate RS of the current query; the candidate RS of the current query are clustered according to similarity to obtain the candidate RS clusters corresponding to the current query, and those clusters are saved to a database. In an online RS provision process, the candidate RS clusters corresponding to a query in the database that expresses the same meaning as the query currently entered by the user are obtained; the candidate RS clusters whose search counts rank in the top N1 are selected, and within the selected clusters the candidate RS whose search counts rank in the top N2 are determined as the RS of the query currently entered by the user. The method and device can provide more effective RS for the user, save the user's search time, and conserve network resources.

Description

A method and apparatus for providing related searches
[Technical field]
The present invention relates to the field of computer application technology, and in particular to a method and apparatus for providing related searches.
[Background]
Whether a search is effective often depends on whether the search term (query) the user enters is appropriate. Users frequently cannot formulate a query that returns the results they want and would welcome some suggestions. Related searches (RS) are a set of queries, similar to the query the user entered, that the search engine provides; they are usually placed at the bottom of the search results page or below the input box.
In the prior art, RS are provided by looking up high-frequency queries that contain as many as possible of the terms in the user's query. For example, if the user enters the query "star X", the RS output are as shown in Fig. 1: "star X no-makeup photo", "star X mansion", "star X boyfriend", "star X bob", "star X no makeup", "star X period costume", "star X everyday photos", "star X bikini". However, this way of providing RS has the following defects:
First, the RS provided may be concentrated in one or a few semantic categories, with much semantic repetition, so they give the user little information. In Fig. 1, "star X no-makeup photo", "star X no makeup", "star X bob" and "star X everyday photos" all describe star X's appearance, while "star X period costume" and "star X bikini" both concern clothing. The page space available to RS is limited; if many RS repeat the same semantic category, the user cannot be offered a broad range of search suggestions. For example, the user may want related searches about star X's younger brother, but because their search frequency is lower than that of the RS above, they never get a chance to be shown.
Second, in the prior art the RS are only queries that literally contain the user's query. Queries that are closely related to it but do not contain it cannot serve as RS. For example, the name of star X's mansion cannot be shown to the user as an RS.
As a result, the user has to keep changing the query and searching the results of each one for the desired information, which is cumbersome and also wastes network resources.
[Summary of the invention]
In view of this, the invention provides a method and apparatus for providing related searches, so as to offer the user more effective RS, save the user's search time, and conserve network resources.
The specific technical solution is as follows:
A method for providing related searches (RS), comprising:
in an offline RS mining process, taking each search term (query) in a historical search log in turn as the current query and performing the following steps S01 to S02:
S01: forming the candidate RS of the current query from the other queries that co-occur with the current query in the same session;
S02: clustering the candidate RS of the current query according to similarity to obtain the candidate RS clusters corresponding to the current query, and saving those candidate RS clusters to a database;
in an online RS provision process:
S11: obtaining the query currently entered by the user;
S12: obtaining from the database the candidate RS clusters corresponding to a query that expresses the same meaning as the query currently entered by the user;
S13: selecting the candidate RS clusters whose search counts in the historical search log rank in the top N1, and determining, within each selected candidate RS cluster, the candidate RS whose search counts in the historical search log rank in the top N2 as the RS of the query currently entered by the user, where N1 and N2 are preset positive integers.
According to a preferred embodiment of the invention, step S01 further comprises:
normalizing queries that express the same meaning to the same wording.
According to a preferred embodiment of the invention, between step S01 and step S02 the method further comprises:
filtering out candidate RS whose number of co-occurrences with the current query in the same session is less than a preset count threshold.
According to a preferred embodiment of the invention, the similarity calculation used for clustering in step S02 comprises:
determining the list of queries that co-occur with RS_i in the same session, together with the number of times each of those queries co-occurs with RS_i in the same session, where RS_i is one of the candidate RS;
intersecting the query list of RS_i with the list formed by the candidate RS of the current query;
calculating the similarity P(RS_i, RS_j) between RS_i and RS_j by the following formula, where RS_j is an RS in the set obtained from the intersection:
$$P(RS_i, RS_j) = \frac{\mathrm{Co\_Count}(RS_i, RS_j)}{\sum_{RS_k \in R} \mathrm{Co\_Count}(RS_i, RS_k)}$$
where Co_Count(RS_i, RS_j) is the number of times RS_i and RS_j co-occur in the same session, and R is the set obtained from the intersection.
According to a preferred embodiment of the invention, before step S12 the method further comprises:
querying whether the database contains a query that expresses the same meaning as the query currently entered by the user, and if so, performing step S12.
According to a preferred embodiment of the invention, if the database contains the query currently entered by the user, or contains the query obtained by normalizing the query currently entered by the user, it is determined that the database contains a query expressing the same meaning as the query currently entered by the user.
According to a preferred embodiment of the invention, the normalization comprises at least one of the following:
removing stop words from the query;
replacing words in the query with designated synonyms;
converting misspellings in the query into the correct spelling.
According to a preferred embodiment of the invention, the search count of a candidate RS cluster in the historical search log is the sum of the search counts of the candidate RS that the cluster contains.
According to a preferred embodiment of the invention, after step S13 the method further comprises:
S14: displaying the RS of the query currently entered by the user on the search results page for that query.
According to a preferred embodiment of the invention, step S14 comprises:
displaying directly on the search results page the RS whose search count ranks first in each selected candidate RS cluster, and displaying the other RS of each selected candidate RS cluster on the search results page hidden in a drop-down box.
An apparatus for providing RS, comprising an offline RS mining unit and an online RS provision unit;
the offline RS mining unit processes each query in a historical search log in turn as the current query and comprises:
a candidate subunit, configured to form the candidate RS of the current query from the other queries that co-occur with the current query in the same session;
a clustering subunit, configured to cluster the candidate RS of the current query according to similarity to obtain the candidate RS clusters corresponding to the current query, and to save those candidate RS clusters to a database;
the online RS provision unit comprises:
a query acquisition subunit, configured to obtain the query currently entered by the user;
a candidate acquisition subunit, configured to obtain from the database the candidate RS clusters corresponding to a query that expresses the same meaning as the query currently entered by the user;
an RS determination subunit, configured to select the candidate RS clusters whose search counts in the historical search log rank in the top N1, and to determine, within each selected candidate RS cluster, the candidate RS whose search counts in the historical search log rank in the top N2 as the RS of the query currently entered by the user, where N1 and N2 are preset positive integers.
According to a preferred embodiment of the invention, the candidate subunit is further configured to normalize queries that express the same meaning to the same wording.
According to a preferred embodiment of the invention, the offline RS mining unit further comprises a filtering subunit, configured to filter out, from the candidate RS obtained by the candidate subunit, those whose number of co-occurrences with the current query in the same session is less than a preset count threshold.
According to a preferred embodiment of the invention, the similarity calculation used by the clustering subunit when clustering comprises:
determining the list of queries that co-occur with RS_i in the same session, together with the number of times each of those queries co-occurs with RS_i in the same session, where RS_i is one of the candidate RS;
intersecting the query list of RS_i with the list formed by the candidate RS of the current query;
calculating the similarity P(RS_i, RS_j) between RS_i and RS_j by the following formula, where RS_j is an RS in the set obtained from the intersection:
$$P(RS_i, RS_j) = \frac{\mathrm{Co\_Count}(RS_i, RS_j)}{\sum_{RS_k \in R} \mathrm{Co\_Count}(RS_i, RS_k)}$$
where Co_Count(RS_i, RS_j) is the number of times RS_i and RS_j co-occur in the same session, and R is the set obtained from the intersection.
According to a preferred embodiment of the invention, the online RS provision unit further comprises a judgment subunit, configured to query whether the database contains a query that expresses the same meaning as the query currently entered by the user and, if so, to trigger the candidate acquisition subunit to obtain from the database the candidate RS clusters corresponding to that query.
According to a preferred embodiment of the invention, if the database contains the query currently entered by the user, or contains the query obtained by normalizing the query currently entered by the user, the judgment subunit determines that the database contains a query expressing the same meaning as the query currently entered by the user.
According to a preferred embodiment of the invention, the normalization comprises at least one of the following:
removing stop words from the query;
replacing words in the query with designated synonyms;
converting misspellings in the query into the correct spelling.
According to a preferred embodiment of the invention, the search count of a candidate RS cluster in the historical search log is the sum of the search counts of the candidate RS that the cluster contains.
According to a preferred embodiment of the invention, the online RS provision unit further comprises an RS display subunit, configured to display the RS of the query currently entered by the user on the search results page for that query.
According to a preferred embodiment of the invention, when performing the display, the RS display subunit displays directly on the search results page the RS whose search count ranks first in each selected candidate RS cluster, and displays the other RS of each selected candidate RS cluster on the search results page hidden in a drop-down box.
As can be seen from the above technical solutions, in the offline RS mining process the invention selects candidate RS according to the co-occurrence of queries in the same session and determines, by clustering, the candidate RS clusters corresponding to each query. In the online RS provision process, the candidate RS clusters corresponding to the query currently entered by the user can then be determined, and the RS for that query are selected from them according to search counts. In this way, RS of more distinct semantic types can be offered within the limited space for displaying RS, along with several RS under the same semantic type. In other words, more effective RS can be provided, helping the user quickly find more useful RS; the user does not need to change the query repeatedly to obtain the desired information from different search results, which saves the user's search time and conserves network resources.
[Description of drawings]
Fig. 1 is an example of RS in the prior art;
Fig. 2 is a structural diagram of the system to which embodiments of the invention apply;
Fig. 3 is a flowchart of the method provided by Embodiment 1 of the invention;
Fig. 4 is an example of RS provided by Embodiment 1 of the invention;
Fig. 5 is a schematic diagram of the RS example of Fig. 4 with the drop-down box expanded;
Fig. 6 is a structural diagram of the apparatus provided by Embodiment 2 of the invention.
[Embodiments]
To make the purpose, technical solutions and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
The system to which the invention applies is shown in Fig. 2. A browser or client on the user device sends the query entered by the user to the search server. In most cases the user enters the query in the input box of a search page; the user may also trigger the browser or client to send the query to the search server in other ways, for example by entering it in the browser's address bar. The search engine in the search server searches according to the query entered by the user, and the RS generator provides RS. The search server sends a search results page containing the search results and the RS to the browser or client on the user device, which presents it to the user. The user device may be, but is not limited to, a mobile device such as a mobile phone, tablet or notebook computer, or a fixed computer such as a PC. The method provided by the invention is implemented by the RS generator on the search server side and is described in detail below through Embodiment 1.
Embodiment 1
The method provided by the invention comprises two stages: the first is the offline RS mining process and the second is the online RS provision process. The first stage mines the RS of each query from the historical search log and clusters them by semantic similarity; the second stage determines the RS clusters corresponding to the query entered by the user and provides RS according to those clusters. Fig. 3 is a flowchart of the method provided by Embodiment 1 of the invention. As shown in Fig. 3, steps 301 to 303 form the offline RS mining process, in which each query in the historical search log is taken in turn as the current query and steps 301 to 303 are performed. Preferably, the queries may be taken from the historical search log of a predetermined period, for example by mining RS once every other day, so that the results better reflect recent search popularity.
Step 301: form the candidate RS of the current query from the other queries that co-occur with the current query in the same session, and count the number of times each candidate RS co-occurs with the current query in the same session.
In embodiments of the invention, the historical search log may be segmented using a preset time granularity: queries are assigned to different sessions according to the time of the user's search, for example in 15-minute units, which yields a large number of user sessions. All other queries that co-occur with the current query in the same session are then taken as candidate RS of the current query. This step rests on the observation that, when searching, users usually try different related queries in order to find the information they want in the search results. In other words, queries searched one after another in the same session are usually highly correlated and are well suited to being recommended as RS; this is the theoretical basis of the invention.
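The segmentation described above can be sketched as follows. This is only one possible reading of "a preset time granularity": it assumes each log record carries a user identifier and that a session is everything one user searched within a fixed 15-minute window. The record layout, the function name `split_into_sessions` and the sample records are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # fixed reference point so window boundaries are stable


def split_into_sessions(log, granularity=timedelta(minutes=15)):
    """Group (user_id, timestamp, query) records into sessions.

    Assumption: one session = all queries of one user that fall into
    the same fixed time window of length `granularity`.
    """
    window_seconds = granularity.total_seconds()
    sessions = defaultdict(list)
    for user_id, ts, query in sorted(log, key=lambda r: (r[0], r[1])):
        window = int((ts - EPOCH).total_seconds() // window_seconds)
        sessions[(user_id, window)].append(query)
    return list(sessions.values())


# Made-up log records, for illustration only.
log = [
    ("u1", datetime(2013, 4, 1, 20, 35, 26), "Star X"),
    ("u1", datetime(2013, 4, 1, 20, 36, 24), "Star X walks red carpet"),
    ("u1", datetime(2013, 4, 1, 20, 50, 0), "Star X Cannes"),
    ("u2", datetime(2013, 4, 1, 20, 35, 0), "Star X villa"),
]
sessions = split_into_sessions(log)
```

A gap-based rule (start a new session after a period of inactivity) would be an equally plausible choice; the text does not say which is intended.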
Suppose that segmentation yields the six sessions shown in Tables 1 to 6:
Table 1
Search time | Query
20:35:26 | Star X
20:35:29 | Star X sells and sprouts
20:36:24 | Star X walks red carpet
20:36:45 | Star X fashionable dress week
20:37:02 | Star X Cannes
Table 2
Search time | Query
16:22:41 | Star X imperial robe dress
16:22:43 | Star X
17:10:49 | The car of star X
17:14:41 | The bold and unconstrained lamp decoration of star X hard iron
17:14:48 | The diamond wrist-watch of star X
17:15:05 | Star X villa
17:16:41 | Star X person of outstanding talent car
17:17:00 | Star X Benz
Table 3
Search time | Query
21:56:09 | Star X walks red carpet
21:56:42 | Star X is prize-winning
21:57:03 | Star X Cannes
Table 4
Search time | Query
10:58:57 | Star X cheongsam dress
10:59:20 | Star X walks red carpet
10:59:22 | Star X fashionable dress week
Table 5
Search time | Query
21:47:18 | Star X imperial robe dress
21:47:27 | Star X fashionable dress week
21:47:29 | Star X walks red carpet
Table 6
Search time | Query
18:50:56 | Star X Cannes
18:51:42 | Star X walks the t platform
18:52:46 | Star X walks red carpet
Suppose the current query to be mined is "star X". Every query that has co-occurred with "star X" in the same session is then mined as a candidate RS of "star X"; the resulting candidate RS are shown in Table 7.
Table 7
Candidate RS of "star X"
Star X sells and sprouts
Star X walks red carpet
Star X fashionable dress week
Star X Cannes
Star X imperial robe dress
The car of star X
The bold and unconstrained lamp decoration of star X hard iron
The diamond wrist-watch of star X
Star X villa
Star X person of outstanding talent car
Star X Benz
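The mining of candidate RS from sessions can be illustrated with a short sketch. It uses the six sessions of Tables 1 to 6 as input; the function name `candidate_rs` and the choice to count each session at most once per query pair are assumptions made for illustration.

```python
from collections import Counter

# The six sessions of Tables 1 to 6, in table order.
SESSIONS = [
    ["Star X", "Star X sells and sprouts", "Star X walks red carpet",
     "Star X fashionable dress week", "Star X Cannes"],
    ["Star X imperial robe dress", "Star X", "The car of star X",
     "The bold and unconstrained lamp decoration of star X hard iron",
     "The diamond wrist-watch of star X", "Star X villa",
     "Star X person of outstanding talent car", "Star X Benz"],
    ["Star X walks red carpet", "Star X is prize-winning", "Star X Cannes"],
    ["Star X cheongsam dress", "Star X walks red carpet",
     "Star X fashionable dress week"],
    ["Star X imperial robe dress", "Star X fashionable dress week",
     "Star X walks red carpet"],
    ["Star X Cannes", "Star X walks the t platform", "Star X walks red carpet"],
]


def candidate_rs(current_query, sessions):
    """Return {candidate RS: number of sessions shared with current_query}."""
    counts = Counter()
    for session in sessions:
        queries = set(session)  # count each session at most once per query
        if current_query in queries:
            counts.update(queries - {current_query})
    return counts


candidates = candidate_rs("Star X", SESSIONS)
```

For "Star X" only the sessions of Tables 1 and 2 contribute, so `candidates` holds the eleven queries listed in Table 7, each with a co-occurrence count of 1.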
When determining the candidate RS, queries that express the same meaning may be normalized to the same wording, which saves storage space and speeds up processing. After normalization, the number of times a candidate RS co-occurs with the current query in the same session (that is, the number of sessions in which both appear) is the sum of the co-occurrence counts of the source queries that were normalized into that candidate RS.
The normalization may include, but is not limited to, the following. Removing stop words from the query: for example, "the boyfriend of star X" is normalized to "star X boyfriend". Replacing words in the query with designated synonyms: for example, a query that uses a synonym of "boyfriend" is normalized to "star X boyfriend". Error correction, that is, converting misspellings into the correct spelling: for example, a misspelled form of "husband Song Huiqiao" is normalized to "husband Song Huiqiao".
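The three normalization operations can be sketched as a small token pipeline. The stop-word set, synonym table and correction table below are hypothetical placeholders; a real system would need language-specific resources, and this sketch does not reorder words, so it does not reproduce the exact wording of the examples in the text.

```python
# Hypothetical lookup tables, for illustration only.
STOP_WORDS = {"the", "of"}
SYNONYMS = {"bf": "boyfriend"}
CORRECTIONS = {"boyfreind": "boyfriend"}


def normalize(query):
    """Apply spelling correction, synonym replacement and stop-word removal."""
    tokens = query.split()
    tokens = [CORRECTIONS.get(t, t) for t in tokens]     # fix misspellings
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # map to designated synonym
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)


same_meaning = {
    normalize("the boyfriend of star X"),
    normalize("bf star X"),
    normalize("boyfreind star X"),
}
```

All three inputs collapse to one wording, so their co-occurrence counts can be summed under a single candidate RS.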
Step 302: filter out candidate RS whose number of co-occurrences with the current query in the same session is less than a preset count threshold.
If a candidate RS co-occurs with the current query in the same session only a few times, its appearance in the same session may be accidental, and it is less likely to be a related query. The preset count threshold may be an empirical or experimental value, for example 5. The purpose of this step is to improve the efficiency of the subsequent processing; it may also be omitted.
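Step 302 amounts to a simple threshold filter, sketched below. The function name and the sample counts are made up for illustration; the default of 5 is the example value mentioned in the text.

```python
def filter_candidates(co_counts, min_count=5):
    """Keep candidate RS whose co-occurrence count is at least min_count."""
    return {rs: n for rs, n in co_counts.items() if n >= min_count}


# Hypothetical co-occurrence counts with the current query.
co_counts = {"Star X walks red carpet": 12, "Star X villa": 5, "Star X Benz": 2}
kept = filter_candidates(co_counts)
```

Candidates with a count below the threshold are dropped; a candidate whose count equals the threshold is kept, since the text filters only those "less than" the threshold.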
Step 303: cluster the candidate RS of the current query according to similarity to obtain the candidate RS clusters corresponding to the current query, and save those candidate RS clusters to the database.
Clustering is a mature technique and is not described in detail here; an algorithm such as k-means may be used, and the value of k (the number of RS clusters finally obtained) can be set as required. After clustering, the objects within a cluster satisfy a certain similarity requirement, so the quality of clustering largely depends on how similarity between objects is measured; in this embodiment the objects in a cluster are the candidate RS. The conventional cosine similarity could also be used here, but it measures semantic relatedness through word overlap and does not reflect well the relationship between the user needs behind two objects. This embodiment therefore provides a method for measuring similarity between candidate RS. To determine the similarity between a given candidate RS of the current query, denoted RS_i, and the other candidate RS, the following approach may be used:
First, determine the list of queries that co-occur with RS_i in the same session and, for each such query, the number of times it co-occurs with RS_i in the same session. Taking RS_i = "star X walks red carpet" as an example, the resulting query list and the number of sessions in which each query co-occurs with RS_i (abbreviated as "co-occurrence count" in Table 8) are shown in Table 8.
Table 8
Query list | Co-occurrence count
Star X | 2
Star X sells and sprouts | 1
Star X fashionable dress week | 3
Star X Cannes | 3
Star X is prize-winning | 1
Star X cheongsam dress | 1
Star X imperial robe dress | 1
Star X walks the t platform | 1
Next, intersect the query list of RS_i with the list formed by all candidate RS of the current query. Suppose that, in the example above, the queries obtained from the intersection and their co-occurrence counts are as shown in Table 9.
Table 9
Query after intersection | Co-occurrence count
Star X sells and sprouts | 1
Star X fashionable dress week | 3
Star X Cannes | 3
Star X imperial robe dress | 1
Finally, calculate the similarity between RS_i and each query obtained from the intersection by the following formula:
$$P(RS_i, RS_j) = \frac{\mathrm{Co\_Count}(RS_i, RS_j)}{\sum_{RS_k \in R} \mathrm{Co\_Count}(RS_i, RS_k)}$$
where P(RS_i, RS_j) is the similarity between RS_i and RS_j, Co_Count(RS_i, RS_j) is the number of times RS_i and RS_j co-occur in the same session, and R is the set of all queries obtained from the intersection.
Continuing the example above, the similarity between "star X walks red carpet" and each query in Table 9 is shown in Table 10.
Table 10
Similarity | Value
P(Star X walks red carpet, Star X sells and sprouts) | 1/8 = 0.125
P(Star X walks red carpet, Star X fashionable dress week) | 3/8 = 0.375
P(Star X walks red carpet, Star X Cannes) | 3/8 = 0.375
P(Star X walks red carpet, Star X imperial robe dress) | 1/8 = 0.125
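The worked example can be checked with a few lines of code. The sketch takes the co-occurrence counts of Table 8 and the candidate RS of Table 7 as given data, intersects them, and applies the similarity formula; the function name `similarities` is an assumption made for illustration.

```python
# Co-occurrence counts for RS_i = "Star X walks red carpet" (Table 8).
CO_COUNT = {
    "Star X": 2,
    "Star X sells and sprouts": 1,
    "Star X fashionable dress week": 3,
    "Star X Cannes": 3,
    "Star X is prize-winning": 1,
    "Star X cheongsam dress": 1,
    "Star X imperial robe dress": 1,
    "Star X walks the t platform": 1,
}

# Candidate RS of the current query "star X" (Table 7).
CANDIDATES = {
    "Star X sells and sprouts",
    "Star X walks red carpet",
    "Star X fashionable dress week",
    "Star X Cannes",
    "Star X imperial robe dress",
    "The car of star X",
    "The bold and unconstrained lamp decoration of star X hard iron",
    "The diamond wrist-watch of star X",
    "Star X villa",
    "Star X person of outstanding talent car",
    "Star X Benz",
}


def similarities(co_count, candidates):
    """P(RS_i, RS_j) for every RS_j in the intersection set R."""
    r = {q: n for q, n in co_count.items() if q in candidates}  # Table 9
    total = sum(r.values())
    if total == 0:
        return {}
    return {q: n / total for q, n in r.items()}


sim = similarities(CO_COUNT, CANDIDATES)
```

The intersection keeps four queries with counts 1, 3, 3 and 1, so the denominator is 8 and `sim` contains 0.125, 0.375, 0.375 and 0.125, matching Table 10. Note that P is not symmetric in general, because the denominator depends on RS_i.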
After clustering, the candidate RS clusters corresponding to the current query "star X" are obtained. When they are stored, the search count of each candidate RS in a cluster and the search count of each candidate RS cluster may be stored as well, where the search count of a candidate RS cluster is the sum of the search counts of all candidate RS that the cluster contains. Suppose that after all sessions (not limited to the sessions in Tables 1 to 6) are processed by the above steps, the clustering result is as shown in Table 11.
Table 11
[Table 11 is an image in the original publication; its contents are not recoverable here.]
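To show how the pairwise similarity can drive grouping, the sketch below clusters candidate RS by linking any two whose similarity reaches a threshold. This is deliberately not k-means, which the text names as one option; it swaps in a simpler single-link threshold grouping purely for illustration. The function name, the threshold, the shortened query labels and the similarity values for the last two pairs are all hypothetical.

```python
def cluster_by_threshold(items, sim, threshold=0.3):
    """Group items; two items share a cluster if linked by sim >= threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in sim.items():
        if p >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())


items = ["red carpet", "fashion week", "Cannes", "villa", "Benz"]
sim = {
    ("red carpet", "fashion week"): 0.375,
    ("red carpet", "Cannes"): 0.375,
    ("villa", "Benz"): 0.5,        # made-up value
    ("red carpet", "villa"): 0.05,  # made-up value
}
clusters = cluster_by_threshold(items, sim)
```

With these inputs the event-related queries end up in one cluster and the property-related queries in another, which is the kind of semantic grouping the method relies on.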
Steps 301 to 303 are performed for every query, so that the clustering result of the candidate RS corresponding to each query is built up in the database. A query that has its own candidate RS may also be a candidate RS of other queries. The following steps form the online RS provision process:
Step 304: obtain the query currently entered by the user.
Step 305: query whether the database contains a query that expresses the same meaning as the query currently entered by the user; if so, perform step 306; otherwise, determine and provide the RS of the query currently entered by the user in the manner of the prior art.
Querying whether the database contains a query that expresses the same meaning as the query currently entered by the user covers two cases. In the first, the database is queried for the query currently entered by the user itself; if it is present, the database is determined to contain a query expressing the same meaning. In the second, the database is queried for the query obtained by normalizing the query currently entered by the user; if it is present, the database is likewise determined to contain a query expressing the same meaning. The normalization is the same as that described in step 301 and is not repeated here.
If the database does not contain a query that expresses the same meaning as the query currently entered by the user, the RS of that query may be determined and provided in the manner of the prior art or in some other manner, for example by taking as RS the queries in the historical search log that contain the core word of the query currently entered by the user and whose search frequency meets a preset requirement.
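The two-case lookup of step 305 can be sketched as below. The database is modelled as a plain dictionary from query to clusters, and `normalize` is passed in as a parameter; both choices, the function names and the sample data are assumptions made for illustration.

```python
def find_clusters(user_query, cluster_db, normalize):
    """Return stored candidate RS clusters for user_query, or None.

    Case 1: the query itself is in the database.
    Case 2: its normalized form is in the database.
    None tells the caller to fall back to another way of producing RS.
    """
    if user_query in cluster_db:
        return cluster_db[user_query]
    normalized = normalize(user_query)
    if normalized in cluster_db:
        return cluster_db[normalized]
    return None


def strip_stop_words(query):
    # Stand-in normalizer (hypothetical): drops "the" and "of".
    return " ".join(t for t in query.split() if t not in {"the", "of"})


# Made-up database content.
cluster_db = {"boyfriend star X": [{"star X wedding": 40}]}

hit_direct = find_clusters("boyfriend star X", cluster_db, strip_stop_words)
hit_normalized = find_clusters("the boyfriend of star X", cluster_db, strip_stop_words)
miss = find_clusters("star Y", cluster_db, strip_stop_words)
```

Checking the raw query first avoids the cost of normalization for queries that are already stored in their canonical wording.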
Step 306: obtain from the database the candidate RS clusters corresponding to the query that expresses the same meaning as the query currently entered by the user.
Step 307: select the candidate RS clusters whose search counts rank in the top N1, and determine, within each selected candidate RS cluster, the candidate RS whose search counts rank in the top N2 as the RS of the query currently entered by the user, where N1 and N2 are preset positive integers.
In other words, candidate RS clusters are selected according to a ranking by cluster search count, and the RS to display within a candidate RS cluster are selected according to a ranking by the search counts of the candidate RS, because the search count reflects the search popularity of a candidate RS or candidate RS cluster.
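The two-level selection of step 307 can be sketched as follows. Each cluster is modelled as a dictionary from candidate RS to its search count, and the cluster's own search count is the sum of its members' counts, as the text specifies. The function name and the sample counts are hypothetical.

```python
def select_rs(clusters, n1, n2):
    """Pick the top-n1 clusters by total search count, then the top-n2
    candidate RS by search count inside each selected cluster."""
    ranked = sorted(clusters, key=lambda c: sum(c.values()), reverse=True)[:n1]
    return [sorted(c, key=c.get, reverse=True)[:n2] for c in ranked]


# Made-up clusters: candidate RS -> search count.
clusters = [
    {"star X walks red carpet": 50, "star X Cannes": 30, "star X fashion week": 10},
    {"star X villa": 200},
    {"star X Benz": 5, "the car of star X": 4},
]
selected = select_rs(clusters, n1=2, n2=2)
```

Here the cluster totals are 90, 200 and 9, so the villa cluster and the red-carpet cluster are chosen, and at most two RS are taken from each.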
Step 308: display the determined RS on the search results page for the query currently entered by the user, where the RS whose search count ranks first in each selected candidate RS cluster is displayed directly on the search results page, and the other RS of each selected candidate RS cluster are displayed on the search results page hidden in a drop-down box.
Taking "star X" as the query currently entered by the user, suppose the selected candidate RS clusters are RS cluster 1 to RS cluster 5 in Table 11. The RS whose search count ranks first in each cluster is then displayed directly on the search results page, as shown in Fig. 4.
When the user clicks the drop-down triangle to the right of the RS "star X walks red carpet", the other determined RS of the candidate RS cluster to which "star X walks red carpet" belongs are shown in a drop-down box, as shown in Fig. 5.
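The display rule of step 308 can be expressed as a small data transformation that a page template could consume. The dictionary keys `shown` and `dropdown`, the function name and the sample data are assumptions for illustration; the patent does not prescribe any particular data structure.

```python
def layout(selected):
    """Split each cluster's ranked RS into the visible entry and the rest.

    `selected` is a list of RS lists, each already sorted by search count.
    """
    return [
        {"shown": rs_list[0], "dropdown": rs_list[1:]}
        for rs_list in selected
        if rs_list  # skip empty clusters
    ]


# Made-up selection result.
selected = [
    ["star X walks red carpet", "star X Cannes"],
    ["star X villa"],
]
page_entries = layout(selected)
```

Each entry carries one directly visible RS and the remaining RS of the same cluster for the drop-down box, which is empty when the cluster contributed only one RS.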
From top example as can be seen, not only can in limited RS showing resource, see the RS of how different semantic types by this method provided by the invention, can also see a plurality of RS under the same semantic type, thereby provide more effective RS, help the user to understand how useful RS fast, the user repeatedly query of conversion input comes to obtain the information of wanting from different Search Results, save user's operation and Internet resources.
When showing, except adopting above-mentioned exhibition method, can also adopt other exhibition methods, such as all directly being illustrated in the search results pages all RS that determine, perhaps the RS that searching times comes first in each candidate RS of Xuan Zeing bunch directly is illustrated on the search results pages, and other RS in each candidate RS of selection bunch are illustrated on the search results pages with the form of the frame that floats (showing the frame that should float when first RS goes up in this candidate RS bunch when mouse moves to).
The above is a detailed description of the method provided by the invention. The device provided by the invention is described in detail below with reference to Embodiment 2.
Embodiment 2
Fig. 6 is a structural diagram of the device provided by Embodiment 2 of the invention. This device corresponds to the RS providing device in the system shown in Figure 2. As shown in Figure 6, the device comprises an offline RS mining unit 00 and an online RS providing unit 10.
The offline RS mining unit 00 takes each query in the historical search log in turn as the current query and processes it. It comprises a candidate subunit 01, a filtering subunit 02, and a clustering subunit 03.
The candidate subunit 01 forms the candidate RS of the current query from the other queries that co-occur with the current query in the same session. Preferably, the candidate subunit 01 may also normalize queries expressing the same semantics into the same expression, to save storage space and speed up processing.
Normalization may include, but is not limited to: removing stop words from the query; replacing words in the query with designated synonyms; and error correction, that is, converting incorrect spellings into correct spellings.
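A minimal sketch of such normalization follows. The three small dictionaries are invented placeholders, and whitespace tokenization and lowercasing are simplifying assumptions (Chinese queries would need word segmentation instead); the text does not prescribe any particular order for the three steps.

```python
# Hypothetical resources; a real system would load much larger ones.
STOP_WORDS = {"the", "a", "of"}
SYNONYMS = {"film": "movie"}       # word -> designated synonym
CORRECTIONS = {"moive": "movie"}   # misspelling -> correct spelling


def normalize(query):
    """Map queries with the same meaning to one canonical string."""
    tokens = query.lower().split()
    tokens = [CORRECTIONS.get(token, token) for token in tokens]  # error correction
    tokens = [SYNONYMS.get(token, token) for token in tokens]     # synonym replacement
    tokens = [token for token in tokens if token not in STOP_WORDS]  # stop-word removal
    return " ".join(tokens)
```

With these placeholder dictionaries, "the film of star X" and "moive star X" both normalize to the same string, so they would share one database entry.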
The filtering subunit 02 then filters out, from the candidate RS obtained by the candidate subunit 01, those whose number of co-occurrences with the current query in the same session is less than a preset count threshold. A candidate that rarely co-occurs with the current query in the same session may have appeared there by chance and is less likely to be a related query. The preset count threshold may be an empirical or experimental value, for example 5. The purpose of the filtering subunit 02 is to improve the efficiency of subsequent processing; the device may also omit it.
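Candidate generation and filtering might be sketched as follows. The function names and the session representation (each session as a list of query strings) are assumptions, and counting a pair at most once per session is one reading of "number of co-occurrences in the same session" that the text does not spell out.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sessions):
    """Count, for every query, how many sessions it shares with each other query."""
    counts = defaultdict(Counter)
    for session in sessions:
        unique = set(session)  # assumption: a pair counts once per session
        for query in unique:
            for other in unique:
                if other != query:
                    counts[query][other] += 1
    return counts


def candidate_rs(current_query, co_counts, min_sessions=5):
    """Keep co-occurring queries that reach the preset count threshold."""
    return {
        query: n
        for query, n in co_counts.get(current_query, {}).items()
        if n >= min_sessions  # candidates below the threshold are filtered out
    }
```

The default of 5 mirrors the example threshold given in the text; in practice it would be tuned empirically.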
The clustering subunit 03 clusters the candidate RS of the current query according to similarity to obtain the candidate RS clusters corresponding to the current query, and saves those clusters to the database. An algorithm such as k-means may be used in this embodiment, and the value of k (the number of RS clusters finally obtained) can be set according to actual requirements. After clustering, the similarity between objects in the same cluster meets a certain requirement, so the quality of the clustering often depends on the similarity measure used between objects; in this embodiment the objects in a cluster are the candidate RS. The traditional cosine similarity method may also be used, but it measures semantic relatedness through word overlap and does not reflect well the relationship between the needs behind two objects. This embodiment therefore provides a method for measuring the similarity between different candidate RS. Suppose the similarity between a certain candidate RS of the current query, denoted RS_i, and the other candidate RS is to be determined; this can be done as follows:
First, determine the list of queries that co-occur with RS_i in the same session, together with the number of times each of those queries co-occurs with RS_i in the same session. Then intersect the query list of RS_i with the list formed by the candidate RS of the current query. Finally, compute the similarity P(RS_i, RS_j) between RS_i and RS_j by the following formula, where RS_j is one RS in the set obtained from the intersection:
$$P(RS_i, RS_j) = \frac{Co\_Count(RS_i, RS_j)}{\sum_{RS_k \in R} Co\_Count(RS_i, RS_k)}$$
where Co_Count(RS_i, RS_j) is the number of times RS_i and RS_j co-occur in the same session, and R is the set obtained from the intersection.
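The similarity computation described above can be sketched as follows. The function name `similarity` and the nested-dictionary layout of `co_counts` are assumptions for illustration; the arithmetic follows the definition of P(RS_i, RS_j) as the co-occurrence count of the pair divided by the total co-occurrence count of RS_i over the intersection set R.

```python
def similarity(rs_i, candidates, co_counts):
    """Return {rs_j: P(rs_i, rs_j)} for every rs_j in the intersection set R.

    candidates: the candidate RS of the current query.
    co_counts[q][other]: number of times q and other co-occur in a session.
    """
    neighbours = co_counts.get(rs_i, {})
    # R = queries co-occurring with rs_i that are also candidates of the current query.
    r = {query for query in neighbours if query in candidates and query != rs_i}
    total = sum(neighbours[query] for query in r)
    if total == 0:
        return {}  # rs_i co-occurs with no other candidate
    return {query: neighbours[query] / total for query in r}
```

Because the denominator depends only on RS_i, the values for a fixed RS_i sum to 1 over R, and P(RS_i, RS_j) generally differs from P(RS_j, RS_i). A clustering step that needs a symmetric measure would have to combine the two directions in some way, which the text does not specify.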
The online RS providing unit 10 comprises a query obtaining subunit 11, a candidate obtaining subunit 12, and an RS determining subunit 13.
The query obtaining subunit 11 is configured to obtain the query currently input by the user.
The candidate obtaining subunit 12 obtains from the database the candidate RS clusters corresponding to the query that expresses the same semantics as the query currently input by the user.
The RS determining subunit 13 selects the candidate RS clusters whose search counts in the historical search log rank in the top N1, and determines the candidate RS whose search counts in the historical search log rank in the top N2 within each selected candidate RS cluster as the RS of the query currently input by the user. N1 and N2 are preset positive integers, and the search count of a candidate RS cluster in the historical search log is the sum of the search counts of the candidate RS that the cluster contains. That is, candidate RS clusters are ranked by the search count of each cluster, and the RS to be shown from within a cluster are ranked by the search count of each candidate RS, because search count reflects the search popularity of candidate RS and candidate RS clusters.
In addition, the online RS providing unit 10 may further comprise a judging subunit 14, configured to query whether the database contains a query expressing the same semantics as the query currently input by the user. If so, it triggers the candidate obtaining subunit 12 to obtain from the database the candidate RS clusters corresponding to that query. Otherwise, the RS of the query currently input by the user can be determined and provided in the manner of the prior art or in some other manner, for example by taking as RS those queries in the historical search log that contain the core word of the user's current query and whose search frequency satisfies a preset search-frequency requirement.
Specifically, if the database contains the query currently input by the user, or contains the query obtained by normalizing the query currently input by the user, the judging subunit 14 determines that the database contains a query expressing the same semantics as the query currently input by the user. The normalization performed by the judging subunit 14 comprises at least one of the following:
removing stop words from the query;
replacing words in the query with designated synonyms;
converting incorrect spellings in the query into correct spellings.
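The check performed by the judging subunit can be sketched as a two-step lookup: first the raw query, then its normalized form. The function name `find_clusters`, the dictionary-backed `database`, and the `normalize` callable passed in by the caller are all assumptions; returning `None` is simply this sketch's way of signalling that the caller should use some other way of determining RS.

```python
def find_clusters(user_query, database, normalize):
    """Return the stored candidate RS clusters for user_query, or None.

    database: maps stored queries to their candidate RS clusters.
    normalize: caller-supplied function implementing the normalization steps.
    """
    if user_query in database:       # the query itself is stored
        return database[user_query]
    normalized = normalize(user_query)
    if normalized in database:       # its normalized form is stored
        return database[normalized]
    return None                      # no semantically identical query found
```

For example, with a `normalize` that lowercases and collapses whitespace, a lookup of "Movie  Star X" would find clusters stored under "movie star x".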
Further, the online RS providing unit 10 may also comprise an RS display subunit 15, configured to display the RS of the query currently input by the user on the search results page of that query. Specifically, the RS with the highest search count in each selected candidate RS cluster may be displayed directly on the search results page, with the other RS in each selected candidate RS cluster displayed on the search results page hidden in the form of a drop-down box. Besides this display manner, other display manners may be used. For example, all determined RS may be displayed directly on the search results page; or the RS with the highest search count in each selected candidate RS cluster may be displayed directly on the search results page, with the other RS in each selected candidate RS cluster displayed in a floating box (the floating box is shown when the mouse moves over the first RS of that candidate RS cluster).
In the several embodiments provided by the invention, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiment described above is only illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware, or in hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (20)

1. A method for providing related searches (RS), characterized in that the method comprises:
in an offline RS mining process, taking each query in a historical search log in turn as the current query and performing the following steps S01 to S02:
S01: forming the candidate RS of the current query from the other queries that co-occur with the current query in the same session;
S02: clustering the candidate RS of the current query according to similarity to obtain the candidate RS clusters corresponding to the current query, and saving the candidate RS clusters corresponding to the current query to a database;
in an online RS providing process:
S11: obtaining the query currently input by a user;
S12: obtaining from the database the candidate RS clusters corresponding to the query that expresses the same semantics as the query currently input by the user;
S13: selecting the candidate RS clusters whose search counts in the historical search log rank in the top N1, and determining the candidate RS whose search counts in the historical search log rank in the top N2 within each selected candidate RS cluster as the RS of the query currently input by the user, N1 and N2 being preset positive integers.
2. The method according to claim 1, characterized in that step S01 further comprises:
normalizing queries that express the same semantics into the same expression.
3. The method according to claim 1, characterized in that, between step S01 and step S02, the method further comprises:
filtering out the candidate RS whose number of co-occurrences with the current query in the same session is less than a preset count threshold.
4. The method according to claim 1, characterized in that the similarity calculation used for the clustering in step S02 specifically comprises:
determining the list of queries that co-occur with RS_i in the same session, and the number of times each of those queries co-occurs with RS_i in the same session, RS_i being one RS among the candidate RS;
intersecting the query list of RS_i with the list formed by the candidate RS of the current query;
computing the similarity P(RS_i, RS_j) between RS_i and RS_j by the following formula, RS_j being one RS in the set obtained from the intersection:
$$P(RS_i, RS_j) = \frac{Co\_Count(RS_i, RS_j)}{\sum_{RS_k \in R} Co\_Count(RS_i, RS_k)}$$
where Co_Count(RS_i, RS_j) is the number of times RS_i and RS_j co-occur in the same session, and R is the set obtained from the intersection.
5. The method according to claim 1, characterized in that, before step S12, the method further comprises:
querying whether the database contains a query expressing the same semantics as the query currently input by the user, and if so, performing step S12.
6. The method according to claim 5, characterized in that, if the database contains the query currently input by the user, or contains the query obtained by normalizing the query currently input by the user, it is determined that the database contains a query expressing the same semantics as the query currently input by the user.
7. The method according to claim 2 or 6, characterized in that the normalization comprises at least one of the following:
removing stop words from the query;
replacing words in the query with designated synonyms;
converting incorrect spellings in the query into correct spellings.
8. The method according to claim 1, characterized in that the search count of a candidate RS cluster in the historical search log is the sum of the search counts of the candidate RS contained in that cluster.
9. The method according to claim 1, characterized in that, after step S13, the method further comprises:
S14: displaying the RS of the query currently input by the user on the search results page of that query.
10. The method according to claim 9, characterized in that step S14 specifically comprises:
displaying directly on the search results page the RS with the highest search count in each selected candidate RS cluster, and displaying the other RS in each selected candidate RS cluster on the search results page hidden in the form of a drop-down box.
11. A device for providing RS, characterized in that the device comprises an offline RS mining unit and an online RS providing unit;
the offline RS mining unit takes each query in a historical search log in turn as the current query and processes it, and comprises:
a candidate subunit, configured to form the candidate RS of the current query from the other queries that co-occur with the current query in the same session;
a clustering subunit, configured to cluster the candidate RS of the current query according to similarity to obtain the candidate RS clusters corresponding to the current query, and to save the candidate RS clusters corresponding to the current query to a database;
the online RS providing unit comprises:
a query obtaining subunit, configured to obtain the query currently input by a user;
a candidate obtaining subunit, configured to obtain from the database the candidate RS clusters corresponding to the query that expresses the same semantics as the query currently input by the user;
an RS determining subunit, configured to select the candidate RS clusters whose search counts in the historical search log rank in the top N1, and to determine the candidate RS whose search counts in the historical search log rank in the top N2 within each selected candidate RS cluster as the RS of the query currently input by the user, N1 and N2 being preset positive integers.
12. The device according to claim 11, characterized in that the candidate subunit is further configured to normalize queries that express the same semantics into the same expression.
13. The device according to claim 11, characterized in that the offline RS mining unit further comprises a filtering subunit, configured to filter out, from the candidate RS obtained by the candidate subunit, those whose number of co-occurrences with the current query in the same session is less than a preset count threshold.
14. The device according to claim 11, characterized in that the similarity calculation used by the clustering subunit when clustering specifically comprises:
determining the list of queries that co-occur with RS_i in the same session, and the number of times each of those queries co-occurs with RS_i in the same session, RS_i being one RS among the candidate RS;
intersecting the query list of RS_i with the list formed by the candidate RS of the current query;
computing the similarity P(RS_i, RS_j) between RS_i and RS_j by the following formula, RS_j being one RS in the set obtained from the intersection:
$$P(RS_i, RS_j) = \frac{Co\_Count(RS_i, RS_j)}{\sum_{RS_k \in R} Co\_Count(RS_i, RS_k)}$$
where Co_Count(RS_i, RS_j) is the number of times RS_i and RS_j co-occur in the same session, and R is the set obtained from the intersection.
15. The device according to claim 11, characterized in that the online RS providing unit further comprises a judging subunit, configured to query whether the database contains a query expressing the same semantics as the query currently input by the user, and if so, to trigger the candidate obtaining subunit to obtain from the database the candidate RS clusters corresponding to the query that expresses the same semantics as the query currently input by the user.
16. The device according to claim 15, characterized in that, if the database contains the query currently input by the user, or contains the query obtained by normalizing the query currently input by the user, the judging subunit determines that the database contains a query expressing the same semantics as the query currently input by the user.
17. The device according to claim 12 or 16, characterized in that the normalization comprises at least one of the following:
removing stop words from the query;
replacing words in the query with designated synonyms;
converting incorrect spellings in the query into correct spellings.
18. The device according to claim 11, characterized in that the search count of a candidate RS cluster in the historical search log is the sum of the search counts of the candidate RS contained in that cluster.
19. The device according to claim 11, characterized in that the online RS providing unit further comprises an RS display subunit, configured to display the RS of the query currently input by the user on the search results page of that query.
20. The device according to claim 19, characterized in that, when performing the display, the RS display subunit specifically displays directly on the search results page the RS with the highest search count in each selected candidate RS cluster, and displays the other RS in each selected candidate RS cluster on the search results page hidden in the form of a drop-down box.
CN201310145958.5A 2013-04-24 2013-04-24 Method and device for providing related searches Active CN103279486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310145958.5A CN103279486B (en) 2013-04-24 2013-04-24 It is a kind of that the method and apparatus of relevant search are provided

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310145958.5A CN103279486B (en) 2013-04-24 2013-04-24 It is a kind of that the method and apparatus of relevant search are provided

Publications (2)

Publication Number Publication Date
CN103279486A true CN103279486A (en) 2013-09-04
CN103279486B CN103279486B (en) 2019-03-08

Family

ID=49062006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310145958.5A Active CN103279486B (en) 2013-04-24 2013-04-24 Method and device for providing related searches

Country Status (1)

Country Link
CN (1) CN103279486B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161311A1 (en) * 2009-12-28 2011-06-30 Yahoo! Inc. Search suggestion clustering and presentation
CN102479223A (en) * 2010-11-25 2012-05-30 中国移动通信集团浙江有限公司 Data query method and system
CN103136223A (en) * 2011-11-24 2013-06-05 北京百度网讯科技有限公司 Method and device for mining query with similar requirements
CN102609433A (en) * 2011-12-16 2012-07-25 北京大学 Method and system for recommending query based on user log
CN102419778A (en) * 2012-01-09 2012-04-18 中国科学院软件研究所 Information searching method for discovering and clustering sub-topics of query statement

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617266A (en) * 2013-12-03 2014-03-05 北京奇虎科技有限公司 Personalized extension search method, device and system
CN105446984A (en) * 2014-06-30 2016-03-30 阿里巴巴集团控股有限公司 Expansion word pair screening method and device
US10635678B2 (en) 2014-12-23 2020-04-28 Alibaba Group Holding Limited Method and apparatus for processing search data
US11347758B2 (en) 2014-12-23 2022-05-31 Alibaba Group Holding Limited Method and apparatus for processing search data
CN105955988A (en) * 2016-04-19 2016-09-21 百度在线网络技术(北京)有限公司 Information search method and apparatus
CN107330672A (en) * 2017-07-03 2017-11-07 北京拉勾科技有限公司 A kind of information processing method based on similarity, device and computing device
CN107330672B (en) * 2017-07-03 2021-02-26 北京拉勾科技有限公司 Similarity-based information processing method and device and computing equipment
CN107679030A (en) * 2017-09-04 2018-02-09 北京京东尚科信息技术有限公司 Method and apparatus based on user's operation behavior data extraction synonym
CN108647730B (en) * 2018-05-14 2020-11-24 中国科学院计算技术研究所 Data partitioning method and system based on historical behavior co-occurrence
CN108647730A (en) * 2018-05-14 2018-10-12 中国科学院计算技术研究所 A kind of data partition method and system based on historical behavior co-occurrence
CN109857926A (en) * 2019-03-05 2019-06-07 百度在线网络技术(北京)有限公司 The method and apparatus of information for rendering
CN110442760A (en) * 2019-07-24 2019-11-12 银江股份有限公司 A kind of the synonym method for digging and device of question and answer searching system
CN110442760B (en) * 2019-07-24 2022-02-15 银江技术股份有限公司 Synonym mining method and device for question-answer retrieval system
WO2021196934A1 (en) * 2020-04-02 2021-10-07 深圳壹账通智能科技有限公司 Question recommendation method and apparatus based on field similarity calculation, and server

Also Published As

Publication number Publication date
CN103279486B (en) 2019-03-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130904

Assignee: Beijing small mutual Entertainment Technology Co., Ltd.

Assignor: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Contract record no.: 2017990000088

Denomination of invention: Method and device for providing related searches

License type: Exclusive License

Record date: 20170315

GR01 Patent grant
GR01 Patent grant