CN102521350A

CN102521350A - Selection method of distributed information retrieval sets based on historical click data

Info

Publication number: CN102521350A
Application number: CN2011104122625A
Authority: CN
Inventors: 陈岭; 刘颖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2011-12-12
Filing date: 2011-12-12
Publication date: 2012-06-27
Anticipated expiration: 2031-12-12
Also published as: CN102521350B

Abstract

The invention discloses a collection selection method for distributed information retrieval based on historical click data. The method comprises the following steps: 1) a retrieval proxy server preprocesses a query log to extract historical queries and their click data; 2) the retrieval proxy server computes the relevance between each historical query and each information collection according to the click data; 3) the retrieval proxy server computes the comprehensive similarity between a new query and each historical query; 4) the retrieval proxy server selects the historical queries most similar to the new query according to the comprehensive similarity, and computes the relevance between the new query and each information collection from the selected historical queries and their relevance to each information collection; 5) the retrieval proxy server selects a plurality of information collections, sends retrieval requests to the corresponding information retrieval servers, merges the returned results, and outputs them to the user who issued the new query. The method offers high retrieval accuracy, low network bandwidth consumption, fast response, and economical, efficient retrieval.

Description

Collection selection method for distributed information retrieval based on historical click data

Technical field

The present invention relates to distributed information retrieval technology, and in particular to a method for selecting the information collections to be searched in a distributed information retrieval system.

Background art

With the rapid development of computer, communication, and network technology and the growing popularity of the Internet, the number of electronic documents is increasing sharply, turning them into a huge information repository. The explosive growth of WWW information has likewise made the Web a huge information repository. The question is how to manage such very-large-scale data so that users are not drowned in it and can quickly find the information they need. There are currently two main solutions. The first is centralized: a single high-performance server manages the mass data and serves all users. This scheme is structurally simple and easy to deploy and implement, but the performance of a single server always has an upper limit, system cost grows non-linearly, and the system is hard to scale. The second is distributed: many ordinary servers are deployed to manage the mass data and share the concurrent requests of multiple users. The greatest advantage of this scheme is that system resources can be configured dynamically according to actual performance requirements, load balancing prevents the system from breaking down under overload, the cost is relatively low, and the applicability is broader.

As shown in Fig. 1, a distributed information retrieval system consists of a retrieval proxy server and an information retrieval server unit. The retrieval proxy server provides the distributed information retrieval interface service to user 1, user 2, ..., user n over the network. The information retrieval server unit comprises a plurality of information retrieval servers in a distributed structure (information retrieval server 1, information retrieval server 2, ..., information retrieval server n), and the retrieval proxy server is connected to each information retrieval server through the network. Each information retrieval server acts as one information collection and stores part of the system's documents. During retrieval, the retrieval proxy server forwards the query to the information retrieval servers; each information retrieval server searches independently and returns its results to the proxy, which merges the results according to a certain rule and presents them to the user.
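The request flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the system of Fig. 1: each information retrieval server is modeled as a hypothetical Python callable that returns (document ID, score) pairs, and the merging rule (sort by score, highest first) is an assumption, since the text only speaks of "a certain rule".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]               # (document id, score); hypothetical result format
Server = Callable[[str], List[Hit]]   # hypothetical interface of one retrieval server


def broadcast_search(query: str, servers: Dict[str, Server], top_k: int = 10) -> List[Hit]:
    """Forward the query to every server and merge the returned lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        partial_lists = list(pool.map(lambda search: search(query), servers.values()))
    merged = [hit for hits in partial_lists for hit in hits]
    # ASSUMED merging rule: highest score first, document id as tie-breaker.
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:top_k]


def make_server(index: Dict[str, List[Hit]]) -> Server:
    """Toy in-memory stand-in for one information retrieval server."""
    return lambda query: index.get(query, [])


servers = {
    "s1": make_server({"jaguar": [("d1", 0.9), ("d2", 0.4)]}),
    "s2": make_server({"jaguar": [("d7", 0.7)]}),
}
```

Without collection selection, every query reaches every server, as in this sketch; the method described below aims to contact only the servers likely to hold relevant documents.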

Because the data scale of distributed retrieval is huge, many traditional methods cannot be applied directly to distributed systems. In addition, the processing capacity of the nodes differs, and each node can usually search only its local data subset. Distributed information retrieval therefore faces many challenges, for example: poor query result quality, mainly reflected in low recall and precision; a lack of necessary descriptive information; and the absence of a good ranking rule, all of which inconvenience users. How to provide efficient navigation over such huge information resources and help users quickly find the information they need in massive data is a problem that search engines urgently need to solve. Users usually care only about the top-ranked results returned by a search engine, yet the relevance of the results returned by current search engines to the user's needs is unsatisfactory. Relevance ranking, that is, ranking the documents indexed by a search engine by their relevance to the user query, has therefore become a focus and hotspot of current research.

The process of distributed information retrieval is mainly divided into the following three steps: collection selection, in which, for a given query, the most relevant document subsets are selected from the whole document collection and searched; retrieval within individual collections, in which the documents closely related to the user query are found in each document collection; and result merging, in which the intermediate results returned by the information collections are merged into a single result list and returned to the user. Collection selection is an important problem in distributed information retrieval research. Given several information collections, collection selection chooses the subset of collections relevant to the query and searches only those, while affecting retrieval effectiveness as little as possible. Collection selection avoids searching all information collections, which reduces network bandwidth consumption, improves the response speed of the system, and achieves efficient and economical retrieval.

Summary of the invention

The technical problem to be solved by the present invention is to provide a collection selection method for distributed information retrieval based on historical click data that offers high retrieval accuracy, low network bandwidth consumption, fast response, and economical, efficient retrieval.

To solve the above technical problem, the technical solution adopted by the present invention is a collection selection method for distributed information retrieval based on historical click data, implemented in the following steps:

1) the retrieval proxy server preprocesses the query log and extracts the historical queries and their corresponding click data;

2) the retrieval proxy server computes, according to the click data, the relevance between each historical query and each information collection stored on the information retrieval servers;

3) the retrieval proxy server obtains a new query sent by a user and computes the comprehensive similarity between the new query and each historical query;

4) the retrieval proxy server selects, according to said comprehensive similarity, a plurality of historical queries most similar to the new query, and computes the relevance between the new query and each information collection from the selected historical queries and their relevance to each information collection;

5) the retrieval proxy server selects a plurality of information collections according to the relevance between the new query and the information collections, sends retrieval requests to the information retrieval servers corresponding to the selected information collections, merges the results returned by the information retrieval servers, and outputs them to the user who sent the new query.

Further improvements of the above technical solution are as follows:

In said step 1), extracting the historical queries and their corresponding click data specifically means storing the historical queries and their click data and building an index over them; each index entry consists of a data segment storing the historical query and a pointer to the IDs of the corresponding clicked documents.
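One way to picture such an index is a sorted array of query entries, each carrying an offset (the "pointer") into a flat array of clicked document IDs. The class below is a sketch under that assumption; the class name `ClickIndex` and the in-memory layout are hypothetical and not taken from the text.

```python
from bisect import bisect_left
from typing import Dict, List, Set


class ClickIndex:
    """Sorted query entries, each pointing at a slice of a flat doc-id array."""

    def __init__(self, clicks: Dict[str, Set[str]]):
        self.queries: List[str] = sorted(clicks)  # data segments, ascending order
        self.offsets: List[int] = []              # one "pointer" per index entry
        self.doc_ids: List[str] = []              # clicked doc ids, sorted per query
        for query in self.queries:
            self.offsets.append(len(self.doc_ids))
            self.doc_ids.extend(sorted(clicks[query]))
        self.offsets.append(len(self.doc_ids))    # sentinel marking the end

    def clicked(self, query: str) -> List[str]:
        """Binary-search the query, then follow its pointer to the clicked ids."""
        pos = bisect_left(self.queries, query)
        if pos == len(self.queries) or self.queries[pos] != query:
            return []
        return self.doc_ids[self.offsets[pos]:self.offsets[pos + 1]]
```

Keeping the entries sorted lets a lookup use binary search instead of scanning the whole log.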

The detailed procedure of said step 2) is as follows: the retrieval proxy server first sends each historical query as a retrieval request to each information retrieval server, uses said index to count how many of the retrieval results returned by each retrieval server were clicked, and from these counts obtains the relevance $Rel(s_j|p)$ between the historical query and each information collection, where $p$ is the historical query, $T$ is the preset number of documents that each information retrieval server should return, $s_j$ is the information collection held by one retrieval server, $CTD(p)$ is the click data of the historical query, and $doc_i$ is a document in the information collection held by the retrieval server.
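The formula for $Rel(s_j|p)$ is not reproduced in this text, so the sketch below is one plausible reading rather than the patented formula: it assumes the relevance is the fraction of the top $T$ results returned by collection $s_j$ whose document IDs appear in $CTD(p)$. The function and parameter names are hypothetical.

```python
from typing import Dict, List, Set


def collection_relevance(results: Dict[str, List[str]],
                         clicked: Set[str],
                         t: int) -> Dict[str, float]:
    """Estimate Rel(s_j | p) for every collection s_j.

    results: collection id -> ranked doc ids returned for historical query p
    clicked: CTD(p), the doc ids that users clicked for p
    t:       T, the preset number of results considered per server
    """
    if t <= 0:
        raise ValueError("T must be positive")
    relevance = {}
    for collection_id, ranked_docs in results.items():
        # Count clicked documents among the first T results only.
        hits = sum(1 for doc_id in ranked_docs[:t] if doc_id in clicked)
        relevance[collection_id] = hits / t  # ASSUMED normalization by T
    return relevance
```

For example, with T = 3, a collection whose top three results contain two clicked documents receives a relevance of 2/3 under this assumption.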

The detailed steps for computing the comprehensive similarity between the new query and each historical query in said step 3) are:

A) compute the keyword similarity between the keywords of the new query and those of each historical query as the cosine of the angle between the query vectors;

B) for each historical query, collect a preset number of retrieval result documents from the information retrieval servers to form a central sample;

C) compute the result similarity between the new query and said central sample;

D) multiply the keyword similarity and the result similarity by their respective coefficients and sum the products to obtain the comprehensive similarity.

Said step A) specifically computes the keyword similarity $\mathrm{sim\_term}(p|q)$ between the keywords of the new query and those of each historical query according to

$$\mathrm{sim\_term}(p|q) = \frac{\sum_{i=1}^{l} w_{t_i,p} \times w_{t_i,q}}{\sqrt{\sum_{i=1}^{l} w_{t_i,p}^{2}} \times \sqrt{\sum_{i=1}^{l} w_{t_i,q}^{2}}},$$

$$w_{t_i,p} = tf_{i,p} \times iqf_i,$$

$$iqf_i = \log\left(\frac{n}{qf_i}\right),$$

where $p$ is the new query, $q$ is a historical query, $t_i$ is the $i$-th index term, $w_{t_i,p}$ is the weight of index term $t_i$ in query $p$, $w_{t_i,q}$ is the weight of index term $t_i$ in query $q$, $tf_{i,p}$ is the frequency with which index term $t_i$ occurs in query $p$, $iqf_i$ is the inverse query frequency, and $qf_i$ is the number of queries in which keyword $t_i$ occurs.
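The keyword similarity is a cosine similarity over tf × iqf weights and can be sketched as follows. The text does not define $n$; the sketch assumes it is the total number of queries in the log. Queries are represented as lists of terms, and giving weight 0 to terms that never occur in the log is a further assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def iqf_table(queries: List[List[str]]) -> Dict[str, float]:
    """iqf_i = log(n / qf_i); n is ASSUMED to be the number of queries in the log."""
    n = len(queries)
    qf = Counter(term for terms in queries for term in set(terms))
    return {term: math.log(n / count) for term, count in qf.items()}


def weights(terms: List[str], iqf: Dict[str, float]) -> Dict[str, float]:
    """w_{t_i} = tf_i * iqf_i; terms unseen in the log get weight 0 (assumption)."""
    tf = Counter(terms)
    return {term: freq * iqf.get(term, 0.0) for term, freq in tf.items()}


def sim_term(p: List[str], q: List[str], iqf: Dict[str, float]) -> float:
    """Cosine of the angle between the weighted term vectors of p and q."""
    wp, wq = weights(p, iqf), weights(q, iqf)
    dot = sum(w * wq.get(term, 0.0) for term, w in wp.items())
    norm_p = math.sqrt(sum(w * w for w in wp.values()))
    norm_q = math.sqrt(sum(w * w for w in wq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```

Because the cosine is symmetric, swapping the roles of the two queries does not change the value.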

Said step C) specifically computes the result similarity $\mathrm{sim\_result}(p|q)$ between the new query and said central sample according to

$$\mathrm{sim\_result}(p|q) = \frac{N(R(p) \cap R(q))}{N(R(p) \cup R(q))},$$

where $R(p)$ and $R(q)$ are the retrieval results of the new query $p$ and the historical query $q$ on the central sample, $N(R(p) \cap R(q))$ is the number of documents in the intersection of the retrieval results of the new query $p$ and the historical query $q$, and $N(R(p) \cup R(q))$ is the number of documents in the union of those retrieval results.
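The result similarity is the ratio of intersection size to union size of the two result sets (the Jaccard coefficient). A minimal sketch, in which result lists are plain sequences of document IDs and the value for two empty result lists is set to 0 by assumption:

```python
from typing import Iterable


def sim_result(results_p: Iterable[str], results_q: Iterable[str]) -> float:
    """N(R(p) ∩ R(q)) / N(R(p) ∪ R(q)) over document ids."""
    rp, rq = set(results_p), set(results_q)
    union = rp | rq
    # ASSUMPTION: two empty result lists are treated as having similarity 0.
    return len(rp & rq) / len(union) if union else 0.0
```

For instance, result lists d1, d2, d3 and d2, d3, d4 share two of four distinct documents, giving a similarity of 0.5.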

The detailed steps of said step D) comprise:

① obtain the comprehensive similarity $sim(p|q)$ according to $sim(p|q) = \alpha \times \mathrm{sim\_term}(p|q) + \beta \times \mathrm{sim\_result}(p|q)$, where $\mathrm{sim\_term}(p|q)$ is the keyword similarity, $\mathrm{sim\_result}(p|q)$ is the result similarity, $\alpha$ is the keyword similarity coefficient, and $\beta$ is the result similarity coefficient;

② normalize the comprehensive similarity $sim(p|q)$ to obtain the final comprehensive similarity $sim(p|q)$, where $cutSim$ is the preset normalization coefficient and $\sum sim(p|q)$ is the sum of the values of $sim(p|q)$ greater than $cutSim$.
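Step D) can be sketched as a weighted sum followed by a thresholded normalization. The normalization formula itself is not reproduced in this text; the sketch assumes that historical queries whose similarity does not exceed cutSim are discarded and that the remaining values are divided by their sum. The default values of `alpha` and `beta` are placeholders, not values from the text.

```python
from typing import Dict


def combined_similarity(term_sim: float, result_sim: float,
                        alpha: float = 0.5, beta: float = 0.5) -> float:
    """sim(p|q) = alpha * sim_term(p|q) + beta * sim_result(p|q)."""
    return alpha * term_sim + beta * result_sim


def normalize(sims: Dict[str, float], cut_sim: float) -> Dict[str, float]:
    """Keep similarities above cut_sim and rescale them to sum to 1.

    sims maps a historical query to its comprehensive similarity
    with the new query.
    """
    # ASSUMPTION: queries at or below the threshold are dropped entirely.
    kept = {query: sim for query, sim in sims.items() if sim > cut_sim}
    total = sum(kept.values())
    return {query: sim / total for query, sim in kept.items()} if total > 0 else {}
```

After normalization the retained similarities act as weights that sum to 1 when the collection relevances of the selected historical queries are combined.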

In said step 4), the retrieval proxy server computes the relevance between the new query and each information retrieval server as follows: from the similarity between the queries and the relevance between the similar historical queries and each information collection, the relevance $Rel(s_j|q)$ between the new query and information collection $s_j$ is computed as $Rel(s_j|q) = \sum Rel(s_j|p) \cdot sim(p|q)$, where $Rel(s_j|p)$ is the relevance between historical query $p$ and information collection $s_j$, and $sim(p|q)$ is the comprehensive similarity between the new query and the historical query. In this formula $q$ denotes the new query and $p$ a selected historical query, and the sum runs over the selected historical queries.
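The weighted sum in step 4) can be sketched as follows, with $q$ the new query and $p$ ranging over the selected historical queries. The dictionary-based data layout and the function name are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict


def new_query_relevance(similar: Dict[str, float],
                        historical_rel: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Rel(s_j | q) = sum over selected p of Rel(s_j | p) * sim(p | q).

    similar:        historical query p -> sim(p | q) for the new query q
    historical_rel: historical query p -> {collection id s_j -> Rel(s_j | p)}
    """
    scores: Dict[str, float] = defaultdict(float)
    for p, sim in similar.items():
        for collection_id, rel in historical_rel.get(p, {}).items():
            scores[collection_id] += rel * sim
    return dict(scores)
```

A collection scores highly when historical queries that resemble the new query led to many clicks on its documents.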

The present invention has the following advantages:

1. The present invention extracts the click data of each historical query and computes the relevance between each historical query and each information collection from the click data; computes the keyword similarity and the retrieval result similarity between the new query and each historical query to obtain their comprehensive similarity; selects a plurality of historical queries most similar to the new query according to the comprehensive similarity; computes the relevance between the new query and each information collection from the relevance between the selected similar historical queries and each information collection; selects a plurality of the information collections most relevant to the new query and sends retrieval requests to the corresponding information retrieval servers; and merges the results returned by the information retrieval servers and outputs them to the user who sent the new query. The method therefore offers high retrieval accuracy, low network bandwidth consumption, fast response, and economical, efficient retrieval.

2. When extracting the click data of each historical query, the present invention counts the clicks only among the first several results returned by each retrieval server, that is, only clicks within the top T retrieval results are considered. This estimates the relevance between each information collection and the historical query more accurately from the user's perspective, improves the accuracy of the top K retrieval results, and improves retrieval quality and efficiency.

3. The present invention obtains the comprehensive similarity between the new query and a historical query by combining keyword similarity and result similarity. Taking both the keyword similarity and the topical similarity between queries into account estimates the similarity between queries more accurately and can improve retrieval precision.

Brief description of the drawings

Fig. 1 is a schematic block diagram of a prior-art distributed information retrieval system.

Fig. 2 is the main flow chart of the embodiment of the invention.

Fig. 3 is a schematic block diagram of the retrieval proxy server in the embodiment of the invention.

Fig. 4 is a schematic diagram of the storage structure of the query log in the embodiment of the invention.

Fig. 5 is a brief flow chart of step 2) in the embodiment of the invention.

Fig. 6 is a brief flow chart of step 3) in the embodiment of the invention.

Fig. 7 is a brief flow chart of step 4) in the embodiment of the invention.

Detailed description of the embodiments

As shown in Fig. 2, the collection selection method for distributed information retrieval based on historical click data of this embodiment is implemented in the following steps:

4) the retrieval proxy server selects, according to the comprehensive similarity, a plurality of historical queries most similar to the new query, and computes the relevance between the new query and each information collection from the selected historical queries and their relevance to each information collection;

As shown in Fig. 3, the retrieval proxy server mainly comprises a query entry module, a data preparation module, and a real-time query collection selection module; the inputs of the data preparation module and of the real-time query collection selection module are each connected to the query entry module. The data preparation module comprises a "historical query and click data preprocessing module" and a "historical query and collection relevance computation module" connected in sequence; it preprocesses the historical queries and their click data and computes the relevance between the historical queries and each collection. The real-time query collection selection module comprises a "query similarity computation module" and a "query and collection relevance computation module"; it uses the similar queries among the historical queries to compute the relevance of each collection in real time and performs collection selection.

The query log contains a large number of user queries together with the results clicked for each query. The query log is preprocessed to extract the click data corresponding to each query, which is stored in the structure shown in Fig. 4, with the queries and the document IDs both arranged in ascending lexicographic order. After reading information such as a web page's title and summary and gaining some understanding of its content, the user decides whether to click through and read further; if the user clicks a web page, that page is probably relevant to the query. Click data thus reflects the relevance of retrieval results to the query. By jointly considering retrieval effectiveness and the distribution of clicked documents across the collections, the relevance of each collection can be estimated more accurately. Because a click occurs after the user has gained some understanding of the page content, and click data contains both the queries users submitted and the results they clicked, click data reflects users' preferences among the query results; the clicked results can be regarded as largely relevant to the query, so the relevance between a collection and a query can be estimated more accurately from the user's perspective. In step 1) of this embodiment, extracting the historical queries and their corresponding click data specifically means storing the historical queries and their click data and building an index over them; each index entry consists of a data segment storing the historical query and a pointer to the IDs of the corresponding clicked documents.
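The preprocessing of the query log can be sketched as grouping clicked document IDs by query. The log format is not specified in the text; the sketch assumes one tab-separated record per line, `query<TAB>clicked_doc_id`, and lowercases queries so that identical queries are grouped together. Both choices, and the function name, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def extract_clicks(log_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group clicked doc ids by query, both in ascending lexicographic order."""
    clicks: Dict[str, Set[str]] = defaultdict(set)
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed records (ASSUMED two-field format)
        query, doc_id = parts[0].strip().lower(), parts[1].strip()
        if query and doc_id:
            clicks[query].add(doc_id)  # repeated clicks are stored once
    return {query: sorted(ids) for query, ids in sorted(clicks.items())}
```

The sorted output matches the ascending arrangement of queries and document IDs described above.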

As shown in Fig. 5, the detailed procedure of step 2) in this embodiment is as follows: the retrieval proxy server first sends each historical query as a retrieval request to each information retrieval server, counts the clicks among the first several retrieval results returned by each retrieval server, and from these counts obtains the relevance $Rel(s_j|p)$ between the historical query and each information collection, where $p$ is the historical query, $T$ is the preset number of documents that each information retrieval server should return, and $s_j$ is the information collection held by one retrieval server.

As shown in Fig. 6, the detailed steps for computing the comprehensive similarity between the new query and each historical query in step 3) are:

C) compute the result similarity between the new query and the central sample;

The new query $p$ can be expressed as the vector $(\langle t_1, w_{t_1,p}\rangle, \langle t_2, w_{t_2,p}\rangle, \ldots, \langle t_l, w_{t_l,p}\rangle)$, where $t_i$ is the $i$-th index term and $w_{t_i,p}$ is the weight of index term $t_i$ in query $p$.

Step A) specifically computes the keyword similarity $\mathrm{sim\_term}(p|q)$ between the keywords of the new query and those of each historical query according to

$$\mathrm{sim\_term}(p|q) = \frac{\sum_{i=1}^{l} w_{t_i,p} \times w_{t_i,q}}{\sqrt{\sum_{i=1}^{l} w_{t_i,p}^{2}} \times \sqrt{\sum_{i=1}^{l} w_{t_i,q}^{2}}},$$

$$w_{t_i,p} = tf_{i,p} \times iqf_i,$$

$$iqf_i = \log\left(\frac{n}{qf_i}\right),$$

where $p$ is the new query, $q$ is a historical query, $t_i$ is the $i$-th index term, $w_{t_i,p}$ is the weight of index term $t_i$ in query $p$, $w_{t_i,q}$ is the weight of index term $t_i$ in query $q$, $tf_{i,p}$ is the frequency with which index term $t_i$ occurs in query $p$, $iqf_i$ is the inverse query frequency, and $qf_i$ is the number of queries in which keyword $t_i$ occurs.

Obtaining a query's results by searching the whole distributed system would defeat the purpose of collection selection. Step B) therefore uses a query-based sampling method: for each historical query, the first three documents returned by each collection are taken to form the central sample, and the similarity between queries is computed from their retrieval results on the central sample.
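Building the central sample can be sketched as follows. Each collection is modeled as a hypothetical callable that returns ranked document IDs for a query, and the sample is kept as a set of document IDs; how the sampled documents are stored and searched afterwards is not specified in the text.

```python
from typing import Callable, Dict, Iterable, List, Set

Search = Callable[[str], List[str]]  # hypothetical: query -> ranked doc ids


def build_central_sample(historical_queries: Iterable[str],
                         collections: Dict[str, Search],
                         per_collection: int = 3) -> Set[str]:
    """Collect the top documents each collection returns for each historical query."""
    sample: Set[str] = set()
    for query in historical_queries:
        for search in collections.values():
            # Only the first few results of each collection enter the sample.
            sample.update(search(query)[:per_collection])
    return sample
```

The sample is built offline from historical queries, so comparing a new query's results with those of a historical query requires searching only this small sample rather than every collection.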

Step C) specifically computes the result similarity $\mathrm{sim\_result}(p|q)$ between the new query and the central sample according to

$$\mathrm{sim\_result}(p|q) = \frac{N(R(p) \cap R(q))}{N(R(p) \cup R(q))},$$

where $R(q)$ is the retrieval result of the historical query $q$ on the central sample, $N(R(p) \cap R(q))$ is the number of documents in the intersection of the retrieval results of the new query $p$ and the historical query $q$, and $N(R(p) \cup R(q))$ is the number of documents in the union of those retrieval results.

The detailed steps of step D) comprise:

② normalize to obtain the final comprehensive similarity $sim(p|q)$, where $cutSim$ is the preset normalization coefficient and $\sum sim(p|q)$ is the sum of the values of $sim(p|q)$ greater than $cutSim$; after normalization, $\sum sim(p|q) = 1$.

A real retrieval system always receives many identical and similar queries; similar queries usually have similar retrieval results, and users tend to select similar retrieval results. The relevance of each collection to the historical queries can therefore be used to predict the relevance of each collection to a new query. As shown in Fig. 7, in step 4) the retrieval proxy server computes the relevance between the new query and each information retrieval server as follows: from the similarity between the queries and the relevance between the similar historical queries and each information collection, the relevance $Rel(s_j|q)$ between the new query and information collection $s_j$ is computed as $Rel(s_j|q) = \sum Rel(s_j|p) \cdot sim(p|q)$, where $Rel(s_j|p)$ is the relevance between historical query $p$ and information collection $s_j$, and $sim(p|q)$ is the comprehensive similarity between the new query and the historical query. In this formula $q$ denotes the new query and $p$ a selected historical query.

The above is merely a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions falling under the principle of the present invention belong to its scope of protection. For those skilled in the art, improvements and refinements made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A collection selection method for distributed information retrieval based on historical click data, characterized in that it is implemented in the following steps:

2. The collection selection method for distributed information retrieval based on historical click data according to claim 1, characterized in that: in said step 1), extracting the historical queries and their corresponding click data specifically means storing the historical queries and their click data and building an index over them, each index entry consisting of a data segment storing the historical query and a pointer to the IDs of the corresponding clicked documents.

3. The collection selection method for distributed information retrieval based on historical click data according to claim 1, characterized in that the detailed procedure of said step 2) is: the retrieval proxy server first sends each historical query as a retrieval request to each information retrieval server, uses said index to count how many of the retrieval results returned by each retrieval server were clicked, and from these counts obtains the relevance $Rel(s_j|p)$ between the historical query and each information collection, where $p$ is the historical query, $T$ is the preset number of documents that each information retrieval server should return, and $s_j$ is the information collection held by one retrieval server.

4. The collection selection method for distributed information retrieval based on historical click data according to claim 3, characterized in that the detailed steps for computing the comprehensive similarity between the new query and each historical query in said step 3) are:

5. The collection selection method for distributed information retrieval based on historical click data according to claim 4, characterized in that said step A) is specifically performed according to

$$\mathrm{sim\_term}(p|q) = \frac{\sum_{i=1}^{l} w_{t_i,p} \times w_{t_i,q}}{\sqrt{\sum_{i=1}^{l} w_{t_i,p}^{2}} \times \sqrt{\sum_{i=1}^{l} w_{t_i,q}^{2}}},$$

$$w_{t_i,p} = tf_{i,p} \times iqf_i,$$

$$iqf_i = \log\left(\frac{n}{qf_i}\right)$$

6. The collection selection method for distributed information retrieval based on historical click data according to claim 4, characterized in that said step C) is specifically performed according to

$$\mathrm{sim\_result}(p|q) = \frac{N(R(p) \cap R(q))}{N(R(p) \cup R(q))}$$

7. The collection selection method for distributed information retrieval based on historical click data according to claim 4, characterized in that the detailed steps of said step D) comprise:

②, according to

8. The collection selection method for distributed information retrieval based on historical click data according to any one of claims 3 to 7, characterized in that, in said step 4), the retrieval proxy server computes the relevance between the new query and each information retrieval server as follows: from the similarity between the queries and the relevance between the similar historical queries and each information collection, the relevance $Rel(s_j|q)$ between the new query and information collection $s_j$ is computed as $Rel(s_j|q) = \sum Rel(s_j|p) \cdot sim(p|q)$, where $Rel(s_j|p)$ is the relevance between historical query $p$ and information collection $s_j$, and $sim(p|q)$ is the comprehensive similarity between the new query and the historical query.