CN101853272A

CN101853272A - Search engine technology based on relevance feedback and clustering

Info

Publication number: CN101853272A
Application number: CN 201010165586
Authority: CN
Inventors: 李新叶
Original assignee: North China Electric Power University
Current assignee: North China Electric Power University
Priority date: 2010-04-30
Filing date: 2010-04-30
Publication date: 2010-10-06
Anticipated expiration: 2030-04-30
Also published as: CN101853272B

Abstract

The invention relates to a search engine technology based on relevance feedback and clustering. By using both the user's relevance feedback and the relevance ranking to guide the clustering of retrieval results, the invention ensures that the final partition of the retrieval results meets the user's query needs. During clustering, a large number of documents irrelevant to the user and duplicate web pages are removed, which speeds up clustering and at the same time optimizes the retrieval results. The cluster centers of clusters irrelevant to the user are not updated during clustering, which ensures that result documents relevant to the user are not lost to irrelevant document clusters through the introduction of noise.

Description

Search engine technology based on relevance feedback and clustering

Technical field

The present invention relates to the technical field of Internet information retrieval, and in particular to a method for optimizing Web retrieval results based on relevance feedback and clustering.

Background technology

At present, most search engines index and retrieve on the basis of keywords. Given the list of keywords entered by the user, the search engine looks up its index database and displays the matching documents in order of their relevance to the user's query. Because keywords are often polysemous, and users typically enter only a few keywords, the result list returned by the search engine usually mixes together many documents on unrelated topics. The user must browse the result list item by item to find the relevant documents, and many of the listed web pages duplicate one another's content, so extracting information from such results costs the user a great deal of time and effort.

To make browsing easier, some researchers have applied automatic clustering to partition Web retrieval results, placing documents with similar features (for example, documents on the same topic) in the same group, so that the user can narrow the search and browse only the few groups of interest. However, automatic clustering of retrieval results does not take relevance to the user into account, so the results cannot reflect the user's specific intent and professional field, and the user cannot choose how documents are clustered according to his or her own needs and interests. In addition, a Web search engine returns an enormous number of results, and existing automatic clustering approaches cluster the entire result set, including a large number of results irrelevant to the user. The clustering therefore takes a long time, which degrades search engine performance.

To make the clustering of retrieval results relevant to a specific user's query needs, a semi-supervised clustering method for retrieval results based on query logs has been proposed. This method derives must-link constraints from the records of users' result clicks in the query log. Specifically, if a user clicked two retrieval results on the same page, both are assumed to be relevant to the user's query, and a must-link constraint is inferred between them. Because must-link constraints that arise from individual users' choices may be noisy, the method first counts how often each constraint occurs in the query log and then keeps only those constraints whose frequency exceeds a certain threshold as the final must-link constraints. Traversing the query log in this way yields the must-link constraints for each query, and the retrieval results are finally clustered in a semi-supervised manner according to these constraints. However, the query log does not contain every query a user might issue, so no constraints can be obtained from it for a new query. Moreover, the clustering only guarantees that results under a must-link constraint fall in the same cluster and that results under a cannot-link constraint do not; it does not optimize the clustering process itself. When Web retrieval results are clustered with this method, all results, both relevant and irrelevant to the user, are still processed, so the clustering remains time-consuming and degrades search engine performance.

Another method incorporates user feedback into text clustering. The user must first designate some example documents as belonging to a cluster in order to guide the clustering process. The clustering result is then presented to the user, who inspects it and provides feedback, for example: document d should or should not belong to cluster S; document d should be moved from cluster S_i to cluster S_j; two documents should or should not be in the same cluster. The next round of clustering is guided by this feedback, followed by further interaction with the user, until a clustering result that satisfies the user is obtained. When each cluster is modeled, local feature weights are used to reflect the importance of a feature to that cluster. Adding more, and more accurate, constraints improves the quality of the local feature weights and thereby the clustering quality. This method mainly addresses the effectiveness of text clustering, but it requires the user to supply feedback repeatedly, which increases the user's burden. In particular, for the first round of clustering the user must designate example documents belonging to a cluster, which is difficult for the user. The clustering process is also time-consuming and is not suitable for clustering Web retrieval results.

Summary of the invention

The above methods require the user to supply complex feedback repeatedly, or rely on query logs that are of no use for new queries. They also cluster the entire result set, which is time-consuming, and the resulting partition still contains irrelevant document classes or documents and a large amount of duplicate content within clusters. To address these drawbacks, the present invention provides a method that optimizes Web retrieval results under the guidance of only a small amount of user feedback indicating which results are relevant and which are irrelevant to the query need.

The present invention adopts the following technical solution:

(1) Determine the initial number of clusters and the initial cluster center vector of each class, comprising:

The relevant documents that the user selects from the retrieval results are grouped into one class, called the relevant document class, and the initial cluster center of this class is determined. The initial cluster center vector of the relevant document class is obtained by averaging the weight of each keyword over the documents in the class.

The irrelevant documents are divided into one or more irrelevant document classes, and the initial cluster center of each class is determined, comprising:

- Select one irrelevant document as the first irrelevant document class; the feature vector of this document is the cluster center vector of the class.

- Compute the similarity between each remaining irrelevant document and the existing irrelevant classes. Depending on the similarity value, assign the document to the closest irrelevant class or to a new irrelevant class; if it forms a new class, its feature vector is the cluster center vector of that class.

(2) Form the initial partition and determine the final number of clusters:

Compute the similarity between each document in the retrieval result list that the user did not select and the relevant document class and each irrelevant document class, and handle the document according to the similarity value in one of the following ways:

- assign it to the closest document class;

- or assign it to a new document class, with the document's feature vector as the cluster center vector of that class;

- or judge it to be a duplicate-content document and delete it.

(3) Remove documents with duplicate content from each document class (cluster) of the initial partition:

Starting from a document d1 in the class, compute the similarity between its feature vector and the vector of each document after it, and judge from the similarity value whether any of those documents duplicates the content of d1. If so, delete the document that duplicates d1 from both the retrieval result list and the document class.

Then, starting from the next document in the updated retrieval result list, compute the similarity between its feature vector and the feature vector of each document after it, and judge in the same way whether any of them is a duplicate.

Repeat the above process until the end of the retrieval result list is reached.

(4) Update the cluster center vectors of all classes other than the irrelevant document classes:

The cluster center vector of a class is obtained by averaging the weight of each keyword over the documents in that class.

(5) Recompute the similarity between each of the other documents in the retrieval result list that the user did not select and each cluster center, and repartition, comprising:

- Compute the similarity between the feature vector of each document and the cluster center vector of each class, and assign the document to the closest class.

- If a document belongs to an irrelevant document class and ranks low in relevance to the query, delete it from both the irrelevant document class and the retrieval result list.

(6) Repeat steps (4) and (5) until the termination condition is satisfied.

The present invention uses both the user's relevance feedback and the relevance ranking to guide the clustering of retrieval results, so that the final partition of the results meets the user's query needs. During clustering, a large number of documents irrelevant to the user and duplicate pages are removed, which speeds up clustering and at the same time optimizes the retrieval results. The cluster centers of the clusters irrelevant to the user are not updated during clustering, which ensures that result documents relevant to the user are not lost to the irrelevant document clusters through the introduction of noise.

Description of drawings

The present invention is described in detail below with reference to the accompanying drawing:

Fig. 1 is a flowchart of the present invention.

Embodiment

Step S101: the user selects relevant documents and irrelevant documents from the search engine's retrieval results;

Step S102: determine the initial number of clusters and the initial cluster centers;

Suppose the documents in the retrieval result list are d1, d2, ..., ds (s is the number of documents), and suppose the keywords indexed in the retrieval system's index database contain no stop words. The weight of a keyword is the frequency with which it occurs in a document. From the index database, the keywords t1, t2, t3, ..., tn (n is the number of keywords) whose weight in documents d1, d2, ..., ds is greater than a preset threshold δk are selected to form the dimensions of the vectors in the vector space model. The feature vector of document di is then defined as:

d_i = (w_{i1}, w_{i2}, \ldots, w_{in}) \qquad (1)

where w_ij = tf_ij (i = 1, 2, ..., s; j = 1, 2, ..., n), and tf_ij is the frequency with which the j-th keyword occurs in the i-th document di.
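The construction of the feature vectors in formula (1) can be sketched in Python as follows. This is only an illustrative sketch: the function name `build_feature_vectors`, the token-list input format, and the reading that a keyword becomes a dimension when its frequency exceeds δk in at least one document are assumptions, not details given in the description.

```python
from collections import Counter


def build_feature_vectors(docs, delta_k):
    """Build term-frequency feature vectors as in formula (1).

    docs    -- list of token lists, stop words already removed (assumed input format)
    delta_k -- keyword weight threshold
    """
    counts = [Counter(tokens) for tokens in docs]
    # Assumption: a keyword becomes a dimension if its frequency exceeds
    # delta_k in at least one document of the result list.
    keywords = sorted({t for c in counts for t, f in c.items() if f > delta_k})
    # w_ij = tf_ij: frequency of keyword j in document i (0 if absent).
    vectors = [[c[t] for t in keywords] for c in counts]
    return keywords, vectors


docs = [["grid", "grid", "power"], ["power", "power", "apple"]]
keywords, vectors = build_feature_vectors(docs, delta_k=1)
print(keywords)  # ['grid', 'power']
print(vectors)   # [[2, 1], [0, 2]]
```

In this toy input, "apple" never exceeds the threshold and is therefore dropped from the vector space.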

1. Extract the common feature vector of the relevant documents:

The relevant documents selected by the user are treated as one relevant document class, denoted C1. Suppose the relevant documents in class C1 are d1, d2, ..., dm (m is the number of relevant documents selected by the user). The weights of the keywords t1, t2, t3, ..., tn in class C1 are then:

a_{1j} = \frac{\sum_{i=1}^{m} tf_{ij}}{m}, \quad j = 1, 2, \ldots, n \qquad (2)

The initial cluster center vector of class C1 is defined as:

C1_{center} = (a_{11}, a_{12}, \ldots, a_{1n})

At this point, the number of clusters is k = 1.
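As a minimal sketch of formula (2), the cluster center of a class can be computed as the component-wise mean of the term-frequency vectors of its documents. The function name `class_center` is hypothetical, and the vectors are assumed to share the same keyword dimensions.

```python
def class_center(vectors):
    """Component-wise mean of the feature vectors in one class (formula (2))."""
    m = len(vectors)
    if m == 0:
        raise ValueError("a class must contain at least one document")
    n = len(vectors[0])
    # a_j = (sum over the m documents of tf_ij) / m, for each keyword j.
    return [sum(v[j] for v in vectors) / m for j in range(n)]


# Two relevant documents chosen by the user, three keyword dimensions.
c1_center = class_center([[2, 0, 1], [4, 2, 1]])
print(c1_center)  # [3.0, 1.0, 1.0]
```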

2. Divide the irrelevant documents into classes:

The irrelevant documents selected by the user may belong to the same document class or to different document classes. The t irrelevant documents are divided according to the following steps:

- Pick any one irrelevant document, denoted di (i = 1, 2, ..., t), and set k = k + 1. Document di is assigned to document class Ck, and the feature vector of di is taken as the cluster center vector Ck_center of class Ck.

- Repeat the following process for each of the remaining t - 1 irrelevant documents:

Compute the similarity between the feature vector of document di and the cluster center vector of each irrelevant document class, using the cosine of the angle between the two vectors:

\mathrm{sim}(d_i, Cj_{center}) = \frac{\sum_{v=1}^{n} w_{iv} \times a_{jv}}{\sqrt{\left(\sum_{v=1}^{n} w_{iv}^{2}\right) \times \left(\sum_{v=1}^{n} a_{jv}^{2}\right)}} \qquad (3)

where Cj_center is the cluster center vector of the j-th document class; w_iv is the weight of the v-th keyword in the i-th document di, as defined in formula (1); and a_jv is the weight of the v-th keyword in the j-th cluster Cj.

If the similarity between di and the cluster center vector Cg_center of an existing irrelevant document class Cg (g = 2, 3, ..., k) is the largest and exceeds the preset threshold δ1, document di is assigned to class Cg;

If the similarity between di and the cluster center vector of every current irrelevant class is less than the preset threshold δ1, set k = k + 1 and assign document di to a new document class Ck, taking the feature vector of di as the cluster center vector Ck_center of class Ck.

When this process finishes, the irrelevant documents have been divided into d = k - 1 classes. The initial number of clusters is k.
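The division of the irrelevant documents can be sketched as follows, using the cosine similarity of formula (3). The function names are hypothetical, and two details are assumptions because the description leaves them open: a similarity exactly equal to δ1 opens a new class, and a zero vector is given similarity 0.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (formula (3))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0


def partition_irrelevant(vectors, delta1):
    """Group the user-marked irrelevant documents into one or more classes.

    Returns (centers, labels); labels[i] is the class index of vectors[i].
    The center of a class is the feature vector of its first document.
    """
    centers, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centers]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] > delta1:
            labels.append(best)            # join the closest existing class
        else:
            centers.append(list(vec))      # open a new irrelevant class
            labels.append(len(centers) - 1)
    return centers, labels


centers, labels = partition_irrelevant([[1, 0], [2, 0], [0, 1]], delta1=0.5)
print(labels)   # [0, 0, 1]
print(centers)  # [[1, 0], [0, 1]]
```

With the relevant class C1 counted separately, the number of clusters after this step would be `1 + len(centers)`.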

Step S103: form the initial partition and determine the final number of clusters k;

For each document di in the result list that the user did not select, repeat the following process:

1. Compute the similarity between the feature vector of document di and the cluster center vector of each document class using formula (3).

2. If the similarity between the feature vector of di and the cluster center vector Cr_center of the r-th document class Cr is the largest and exceeds the preset threshold δ1:

- if the similarity value is less than the preset threshold δ2 (δ2 > δ1), assign document di to class Cr;

- otherwise, consider document di a duplicate page and delete it from the retrieval result list.

3. If the similarity between the feature vector of di and the cluster center vector of every one of the current k document classes is less than the preset threshold δ1, set k = k + 1 and assign document di to a new document class Ck, taking the feature vector of di as the cluster center vector Ck_center of class Ck.

When this process finishes, the initial partition has been formed and the final number of clusters is k.
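The three-way decision of step S103 can be sketched as follows. The function name `initial_partition` and its return format are hypothetical; the sketch assumes at least one class (C1) already exists and, since the description does not cover equality, that a similarity exactly equal to δ1 opens a new class.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (formula (3))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def initial_partition(unselected, centers, delta1, delta2):
    """Assign, open a new class for, or drop each document the user did not select.

    Returns (centers, assignments, duplicates), where assignments maps a
    document index to a class index and duplicates lists dropped indices.
    """
    centers = [list(c) for c in centers]   # do not mutate the caller's list
    assignments, duplicates = {}, []
    for i, vec in enumerate(unselected):
        sims = [cosine(vec, c) for c in centers]
        r = max(range(len(sims)), key=sims.__getitem__)
        if sims[r] > delta1:
            if sims[r] < delta2:
                assignments[i] = r          # close enough: join class r
            else:
                duplicates.append(i)        # too similar: treated as a duplicate page
        else:
            centers.append(list(vec))       # far from every class: new class
            assignments[i] = len(centers) - 1
    return centers, assignments, duplicates


centers, assignments, duplicates = initial_partition(
    [[1, 1], [1, 0], [0, 1]], centers=[[1, 0]], delta1=0.3, delta2=0.95)
print(assignments)  # {0: 0, 2: 1}
print(duplicates)   # [1]
print(centers)      # [[1, 0], [0, 1]]
```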

Step S104: remove web pages with duplicate content from the k document classes;

Suppose a document class consists of p documents forming the document list d1, d2, ..., dp.

Starting from document d1, compute the similarity between its vector and each of the p - 1 document vectors after it. If the similarity with the feature vector of some document dx is greater than the preset threshold δ2, the two are considered duplicate pages; dx is deleted from both the retrieval result list and this document class, and p is updated to p = p - 1;

Then, starting from d2 in the updated retrieval result list, compute the similarity between its vector and each of the p - 2 document vectors after it. If the similarity with the feature vector of some document dy is greater than the preset threshold δ2, the two are considered duplicate pages; dy is deleted from both the retrieval result list and this cluster, and p is updated to p = p - 1;

Repeat the above process until the end of the retrieval result list is reached.
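The pairwise duplicate removal within one document class can be sketched as follows. The function name `remove_duplicates` is hypothetical, and returning the indices of the surviving documents (rather than editing a result list in place) is a simplification made for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (formula (3))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def remove_duplicates(vectors, delta2):
    """Return the indices of the documents kept in one class.

    Each surviving document is compared with every later document that is
    still present; a later document whose similarity exceeds delta2 is removed.
    """
    removed = set()
    kept = []
    for x in range(len(vectors)):
        if x in removed:
            continue                        # already deleted as a duplicate
        kept.append(x)
        for y in range(x + 1, len(vectors)):
            if y not in removed and cosine(vectors[x], vectors[y]) > delta2:
                removed.add(y)
    return kept


print(remove_duplicates([[1, 0], [2, 0], [0, 1], [0, 3]], delta2=0.95))  # [0, 2]
```

Because cosine similarity ignores vector length, `[2, 0]` counts as a duplicate of `[1, 0]` here even though its term frequencies are doubled.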

Step S105: update the cluster center vectors of all document classes other than the d irrelevant classes;

Step S106: recompute the similarity between each of the other documents in the retrieval result list that the user did not select and the k classes, and repartition;

For each document di in the updated retrieval result list that the user did not select, repeat the following process:

1. Compute the similarity between the feature vector of document di and the cluster center vector of each document class according to formula (3), and assign di to the closest document class.

2. If document di belongs to an irrelevant document class and ranks low in relevance to the query, delete it from both the irrelevant document class and the retrieval result list.

Step S107: repeat steps S105 and S106 until the termination condition is satisfied.

The termination condition is that the objective function value is minimized or that the preset number of iterations has been reached.
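The loop over steps S105 to S107 can be sketched as follows. Several parts are assumptions, because the description does not specify them: the objective function is not defined, so the sketch stops when the assignments no longer change or after `max_iter` rounds; "ranks low" is modeled with a hypothetical `rank_cutoff` (larger rank number means lower relevance); and the names `refine`, `frozen`, and `pinned` are illustrative only.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_center(vectors):
    m = len(vectors)
    return [sum(v[j] for v in vectors) / m for j in range(len(vectors[0]))]


def refine(docs, assign, centers, frozen, pinned, rank_cutoff, max_iter=10):
    """Iterate S105 (update centers) and S106 (reassign) until stable.

    docs    -- {doc_id: (rank, vector)} for documents the user did not select
    assign  -- {doc_id: class_index} from the initial partition
    frozen  -- class indices of the irrelevant classes; their centers never change
    pinned  -- {class_index: [vectors]} user-selected documents fixed in a class
    """
    docs, assign = dict(docs), dict(assign)
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # S105: update every center except those of the irrelevant classes.
        for r in range(len(centers)):
            if r in frozen:
                continue
            members = pinned.get(r, []) + [docs[i][1] for i, c in assign.items() if c == r]
            if members:
                centers[r] = mean_center(members)
        # S106: reassign each remaining document to its closest class.
        new_assign = {}
        for i, (rank, vec) in list(docs.items()):
            sims = [cosine(vec, c) for c in centers]
            r = max(range(len(centers)), key=sims.__getitem__)
            if r in frozen and rank > rank_cutoff:
                del docs[i]                 # irrelevant and low-ranked: drop it
            else:
                new_assign[i] = r
        if new_assign == assign:            # assumed termination condition
            break
        assign = new_assign
    return centers, assign


centers, assign = refine(
    docs={"a": (1, [3, 1]), "b": (2, [0, 2]), "c": (9, [1, 4])},
    assign={"a": 0, "b": 1, "c": 1},
    centers=[[1, 0], [0, 1]], frozen={1}, pinned={0: [[1, 0]]}, rank_cutoff=5)
print(assign)   # {'a': 0, 'b': 1}
print(centers)  # [[2.0, 0.5], [0, 1]]
```

Keeping the irrelevant centers frozen mirrors the stated design goal: documents drifting into an irrelevant class cannot pull its center toward relevant content.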

Claims

1. A search engine technology based on relevance feedback and clustering, characterized by comprising the following steps:

Step 1: determine the initial number of clusters and the initial cluster center vector of each class, comprising:

The relevant documents that the user selects from the retrieval results are grouped into a relevant document class, and the initial cluster center vector of this relevant document class is determined; said initial cluster center vector is obtained by averaging the weight of each keyword over the documents in the relevant document class;

The irrelevant documents are divided into one or more irrelevant document classes, and the initial cluster center vector of each said irrelevant document class is determined, comprising:

- selecting one irrelevant document as the first irrelevant document class, and defining the feature vector of this irrelevant document as the cluster center vector of said irrelevant document class;

- computing the similarity between each remaining irrelevant document and the above irrelevant document classes, and, according to the resulting similarity value, either assigning the irrelevant document to the current irrelevant document class closest to it, or assigning it to a new irrelevant document class and defining its feature vector as the cluster center vector of said new irrelevant document class;

Step 2: form the initial partition and determine the final number of clusters;

Compute the similarity between each document in the retrieval result list that the user did not select and said relevant document class and each irrelevant document class, and handle the document according to the similarity value in one of the following ways:

- assign the document to the current document class closest to it;

- or assign the document to a new document class, and define the feature vector of the document as the cluster center vector of said new document class;

- or judge the document to be a duplicate-content document and delete it;

Step 3: remove documents with duplicate content from each document class of the initial partition;

Starting from the first document in the document class, compute the similarity between its feature vector and the feature vector of each document after it, and judge from the similarity value whether any of those documents duplicates its content; if so, delete the duplicating document from both the retrieval result list and the document class;

Then, starting from the next document in the updated retrieval result list, compute the similarity between its feature vector and the feature vector of each document after it, and on that basis identify and delete documents with duplicate content;

Repeat the above process until the end of the retrieval result list is reached;

Step 4: update the cluster center vectors of all document classes other than the irrelevant document classes;

Said cluster center vector is obtained by averaging the weight of each keyword over the documents in the document class;

Step 5: recompute the similarity between the feature vector of each of the other documents in the retrieval result list that the user did not select and the cluster center vector of each current document class, and repartition accordingly, comprising:

- assigning the document to the document class closest to it;

- if a document belongs to an irrelevant document class and ranks low in relevance to the query, deleting the document from both the irrelevant document class and the retrieval result list;

Step 6: repeat steps 4 and 5 until the termination condition is satisfied.

2. the method for claim 1 is characterized in that, the proper vector of described document is defined as:

di＝(w _i1，w _i2，...，w _in)

Wherein, di is the proper vector of document di, w _Ij=tf _Ij(j=1,2...n, n are the keyword number), tf _IjBe j the frequency that keyword occurs in document di.

3. the method for claim 1 is characterized in that, the computing formula of the weighted mean of keyword j in each document of document class r is in described step 1 and the step 4:

a_{rj} = \frac{Σ_{i = 1}^{m} {tf}_{ij}}{m}

Wherein, m is the number of files in the document class.

4. the method for claim 1 is characterized in that, the cluster centre vector representation of a certain document class r is:

Cr _center＝(a _r1，a _r2，..a _rn)。

5. the method for claim 1 is characterized in that, the calculation of similarity degree in the described step 1 to five between the cluster centre vector of the proper vector of document and document class adopts the vector angle cosine formula:

sim (di, {Cj}_{center}) = \frac{Σ_{v = 1}^{n} w_{iv} \times a_{jv}}{\sqrt{(Σ_{v = 1}^{n} w_{iv}^{2}) \times (Σ_{v = 1}^{n} a_{jv}^{2})}} .

6. The method according to claim 1 or 5, characterized in that, in said step 2, the handling of a document is determined by setting thresholds and comparing them with the similarity; specifically, if said similarity value exceeds the preset threshold δ1:

- if said similarity value is less than the preset threshold δ2 (δ2 > δ1), the document is assigned to the document class with which it has the highest similarity;

- otherwise, the document is judged to be a duplicate-content document.