CN103324707A

CN103324707A - Query expansion method based on semi-supervised clustering

Info

Publication number: CN103324707A
Application number: CN2013102413856A
Authority: CN
Inventors: 杨静; 刘宁; 张健沛
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2013-06-18
Filing date: 2013-06-18
Publication date: 2013-09-25

Abstract

The invention provides a query expansion method based on semi-supervised clustering. The method comprises the following steps: (1) performing an initial retrieval of the user query with a query-likelihood language model and returning the top n documents of the retrieval result; (2) manually annotating the top k documents of the initial retrieval result and dividing them into a relevant document set and an irrelevant document set; (3) analyzing the top n documents with a semi-supervised clustering algorithm that integrates constraints and distance, and extracting the documents relevant to the query as feedback documents; (4) selecting expansion words from the feedback documents with an expansion word selection module, and combining the expansion words with the original query to form a new query. By learning the relevance between a small number of annotated documents and the query, the relevance between a large number of unlabeled documents and the query can be estimated accurately, which improves the quality of the feedback documents and thereby the recall and precision of retrieval.

Description

A query expansion method based on semi-supervised clustering

Technical field

The present invention relates to a query expansion method based on semi-supervised clustering.

Background art

With the development of information technology and the growth in the volume of information, information retrieval is becoming more and more important in work and daily life. Retrieval lets people quickly find the information they need. However, because people often know little about the information they are looking for, the queries they submit are often too short to describe that information fully and accurately. How to make a retrieval system return as much information relevant to the user's query intent as possible, while returning as little irrelevant or weakly relevant information as possible, has become the primary problem to be solved in the retrieval field.

Query expansion is an effective technique for addressing this problem. It resolves the mismatch between query words and document words and improves the retrieval performance of information systems. Its main methods include global analysis, local analysis, methods based on user query logs, and methods based on semantic knowledge. Pseudo-relevance feedback is a commonly used local analysis method: it revises the original query according to feedback from the top n documents of the initial retrieval result, which overcomes the lack of information in the user's query request and improves retrieval performance. Traditional pseudo-relevance feedback models all assume that the top n documents of the initial retrieval result are relevant to the query, for example Okapi and the model proposed by Lavrenko and Croft. However, because the top n documents often contain some irrelevant documents (noise), traditional pseudo-relevance feedback models do not achieve the desired effect.

To address problems of the pseudo-relevance feedback model such as the poor quality of the feedback documents and the topic drift caused by improper selection of expansion words, some scholars have used clustering to improve it. Clustering algorithms can organize large numbers of documents without manual annotation, which greatly reduces the workload, so they are widely used in information retrieval. Most current clustering methods still rely on traditional partition-based, hierarchy-based, model-based or density-based approaches. These methods handle data with similar class structures well. In practice, however, the distribution of text categories is often diverse, and it is difficult for a clustering algorithm to guarantee that the text clusters it generates stay highly consistent with the categories that people define.

Summary of the invention

The object of the invention is to provide a query expansion method based on semi-supervised clustering that estimates the classes of a large number of unlabeled samples more accurately by learning from a small number of labeled samples, and thus solves the problem of organizing unlabeled samples while greatly reducing manual labor.

The technical solution that achieves the object of the invention is as follows:

A query expansion method based on semi-supervised clustering, characterized by the following steps:

Step 1: a query-likelihood language model performs an initial retrieval of the user query and returns the top n documents of the retrieval result;

Step 2: the top k documents of the initial retrieval result are manually annotated and divided into two classes, a relevant document set and an irrelevant document set;

Step 3: the top n documents are analyzed with a semi-supervised clustering algorithm that integrates constraints and distance, and the documents relevant to the query are extracted as feedback documents;

Step 4: expansion words are selected from the feedback documents with an expansion word selection module, and the expansion words are combined with the original query to form a new query.

In step 3, the feedback documents are obtained as follows:

Step 3.1: two documents are chosen at random from the top n documents as cluster centers;

Step 3.2: the Euclidean distance between each remaining document and each cluster center is calculated;

Step 3.3: according to the Euclidean distance between each remaining document and each cluster center, each remaining document is assigned to the nearest cluster center;

Step 3.4: the cluster centers are recalculated;

Steps 3.1 to 3.4 are repeated until the objective function converges, and the class whose cluster center is nearest to the original query is taken as the relevant document class.

The objective function in step 3 is:

J = \sum_{x_{i} \in X} (x_{i} - \mu_{l_{i}})^{2} + \sum_{(x_{i}, x_{j}) \in M} w_{ij} Z(l_{i} \neq l_{j}) + \sum_{(x_{i}, x_{j}) \in C} s_{ij} Z(l_{i} = l_{j}) - \log(\det(A))

In the formula, M and C are the relevant document set and the irrelevant document set respectively, w_{ij} and s_{ij} are the corresponding weights in these two sets, l_{i} is the class label of document x_{i}, Z is the indicator function, \mu_{l_{i}} is the center of cluster l_{i}, x_{i} and x_{j} denote documents, and A is a symmetric positive definite matrix.

Step 4 comprises the following steps:

Step 4.1: the weight of each word is calculated from the relevant document class;

Step 4.2: the words are sorted by weight, and the expansion words are then extracted;

Step 4.3: the expansion words are combined with the original query to form a new query.

In step 4.1, the weight w_{j} of a word is calculated by the following formulas:

tf_{i, j}^{'} \overset{\Delta}{=} \frac{tf_{i, j}}{tf_{i, j} + \frac{dl_{i}}{dl^{'}}}

w_{j} = \frac{1}{F} \times \sum_{i = 1}^{F} tf_{i, j}^{'}

In the formulas, F is the number of feedback documents, dl_{i} is the length of document i, dl^{'} is the average length of all feedback documents, and tf_{i, j} is the frequency of word j in document i.
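As an illustration only, with hypothetical numbers that do not come from the source: suppose there are F = 2 feedback documents of lengths dl_{1} = 150 and dl_{2} = 450, so that dl^{'} = 300, and suppose word j occurs 3 times in document 1 and once in document 2. Then:

tf_{1, j}^{'} = \frac{3}{3 + \frac{150}{300}} = \frac{3}{3.5} \approx 0.857

tf_{2, j}^{'} = \frac{1}{1 + \frac{450}{300}} = \frac{1}{2.5} = 0.4

w_{j} = \frac{1}{2} \times (0.857 + 0.4) \approx 0.629

In this example the normalized frequency stays below 1 however often the word occurs, and an occurrence in a document longer than the average contributes less than the same occurrence in a shorter one.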

Step 1 comprises the following steps:

Step 1.1: the whole document collection is preprocessed, and a text database based on the vector space model and a feature dictionary for the whole document collection are built;

Step 1.2: the query submitted by the user is preprocessed to form the vector form of the query;

Step 1.3: the documents are ranked by the probability that each document generates the query.

The beneficial effects of the invention are:

1. The pseudo-relevance feedback model is improved: the quality of the pseudo-relevant document set is raised, which in turn raises the quality of the new query expression and ultimately the quality of the retrieval results.

2. By learning the relevance between a small number of annotated documents and the query, the relevance between a large number of unlabeled documents and the query can be estimated more accurately.

3. The pseudo-relevant document set formed by the new semi-supervised approach better matches the actual situation, with both high efficiency and high accuracy.

Brief description of the drawings

Fig. 1 is a system flowchart of the query expansion method based on semi-supervised clustering according to the invention;

Fig. 2 is a flowchart of the semi-supervised clustering algorithm of the invention.

Detailed description of the embodiments

The implementation of the invention is described in further detail below with reference to the drawings and specific embodiments.

With reference to Fig. 1 and Fig. 2, the invention proposes a query expansion method based on semi-supervised clustering, which comprises the following steps:

Step 1: a query-likelihood language model performs an initial retrieval of the user query and returns the top n documents of the retrieval result. This specifically comprises the following steps:

Step 1.1: the whole document collection is preprocessed, including stop-word removal and stemming, and a text database based on the vector space model and a feature dictionary for the whole document collection are built.

Step 1.2: the input query is preprocessed in the same way (stop-word removal, stemming), and the remaining words form the vector form Q of the query.

Step 1.3: the documents are ranked by the probability P(Q|D) that the document generates the query; the larger P(Q|D) is, the more relevant the document is to the query and the higher it ranks. The formulas are as follows:

P(Q | D) = \prod_{i = 1}^{m} P(q_{i} | D)

P(w | D) = \frac{|D|}{|D| + \mu} P_{ML}(w | D) + \frac{\mu}{|D| + \mu} P_{ML}(w | Coll)

P_{ML}(w | D) = \frac{freq(w, D)}{|D|}

P_{ML}(w | Coll) = \frac{freq(w, Coll)}{|Coll|}

In the formulas, q_{i} is the i-th query word, m is the number of query words in the query Q, D is a document, P_{ML}(w|D) is the maximum likelihood estimate of word w in document D, Coll is the whole collection, \mu is the smoothing factor, set to 500, |D| and |Coll| are the lengths of D and Coll respectively, and freq(w, D) and freq(w, Coll) are the frequencies of word w in D and Coll respectively.
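The ranking in step 1.3 can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: it assumes documents and queries are already preprocessed into lists of tokens, it computes collection statistics from the documents passed in, and it sums logarithms instead of multiplying probabilities, which leaves the ranking unchanged. The function names query_log_likelihood and rank_documents are hypothetical.

```python
import math
from collections import Counter

MU = 500  # smoothing factor; the value stated in the text


def query_log_likelihood(query, doc, coll_tf, coll_len, mu=MU):
    """Return log P(Q|D) for a tokenized query and document."""
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for q in query:
        p_ml_doc = doc_tf[q] / doc_len if doc_len else 0.0   # P_ML(w|D)
        p_ml_coll = coll_tf[q] / coll_len                    # P_ML(w|Coll)
        p = (doc_len / (doc_len + mu)) * p_ml_doc \
            + (mu / (doc_len + mu)) * p_ml_coll
        if p == 0.0:
            # The query word occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank_documents(query, docs, n):
    """Return the indices of the top n documents, best first."""
    coll_tf = Counter(token for doc in docs for token in doc)
    coll_len = sum(len(doc) for doc in docs)
    scored = [
        (query_log_likelihood(query, doc, coll_tf, coll_len), index)
        for index, doc in enumerate(docs)
    ]
    # Higher score first; ties are broken by the original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:n]]
```

For example, rank_documents(["apple"], docs, 2) returns the indices of the two documents most likely to generate the query word "apple" under the smoothed model.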

Step 2: the top k documents of the initial retrieval result are manually annotated, forming the set C_{r} of documents relevant to the query and the set C_{ir} of documents irrelevant to the query.
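The source does not spell out how the annotated documents are turned into the pairwise constraints used by the clustering step. One possible reading, stated here purely as an assumption, is that two annotated documents with the same label form a must-link pair and two annotated documents with different labels form a cannot-link pair. The hypothetical helper build_constraints below sketches that reading.

```python
from itertools import combinations


def build_constraints(labels):
    """Derive pairwise constraints from manual relevance labels.

    labels maps a document index to True (relevant) or False (irrelevant).
    Assumption, not taken from the source: same label gives a must-link
    pair, different labels give a cannot-link pair.
    """
    must_link, cannot_link = [], []
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            must_link.append((a, b))
        else:
            cannot_link.append((a, b))
    return must_link, cannot_link
```

With k annotated documents this produces k(k-1)/2 pairs in total, so a small k already yields a useful number of constraints.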

Step 3: the top n documents are analyzed with a semi-supervised clustering algorithm that integrates constraints and distance, and the documents relevant to the query are extracted as the feedback documents Clu_{f}. This specifically comprises the following steps:

Step 3.1: two documents x_{i} and x_{j} are chosen at random from the top n documents as the cluster centers \mu_{1} and \mu_{2}.

Step 3.2: the Euclidean distance op between each remaining document x_{i} and each cluster center is calculated.

op = \sqrt{\sum_{i = 1}^{m} (y_{i} - z_{i})^{2}}

In the formula, y_{i} and z_{i} are the i-th coordinates of the vector forms of documents y and z respectively.

Step 3.3: according to the Euclidean distance op between each remaining document x_{i} and each cluster center, each remaining document x_{i} is assigned to the nearest cluster center.

Step 3.4: the cluster centers \mu_{1} and \mu_{2} are recalculated.

Steps 3.1 to 3.4 are repeated until the objective function J converges, and the class whose center, among \mu_{1} and \mu_{2}, is nearest to the original query is taken as the relevant document class Clu_{f}.

J = \sum_{x_{i} \in X} (x_{i} - \mu_{l_{i}})^{2} + \sum_{(x_{i}, x_{j}) \in M} w_{ij} Z(l_{i} \neq l_{j}) + \sum_{(x_{i}, x_{j}) \in C} s_{ij} Z(l_{i} = l_{j}) - \log(\det(A))

In the formula, M and C are the must-link constraint set (relevant document set) and the cannot-link constraint set (irrelevant document set) respectively, w_{ij} and s_{ij} are the corresponding weights in these two sets, l_{i} is the class label of document x_{i}, Z is the indicator function (Z(true) = 1, Z(false) = 0), \mu_{l_{i}} is the center of cluster l_{i}, x_{i} and x_{j} denote documents, and A is a symmetric positive definite matrix.
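The clustering loop of step 3 can be sketched as follows. This is a simplified illustration under several assumptions that are not in the source: the matrix A is fixed to the identity, so that \log(\det(A)) = 0 and no metric is learned; all constraint weights w_{ij} and s_{ij} are set to the same constants w and s; the initial centers are drawn once and only the assignment and update steps are repeated; and each document is assigned greedily given the current labels of the others. All function names are hypothetical.

```python
import math
import random


def sq_dist(y, z):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(y, z))


def penalty(i, c, labels, must, cannot, w, s):
    """Constraint cost of putting document i into cluster c."""
    cost = 0.0
    for a, b in must:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] is not None and labels[other] != c:
                cost += w  # must-link pair split across clusters
    for a, b in cannot:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] == c:
                cost += s  # cannot-link pair placed together
    return cost


def objective(X, labels, centers, must, cannot, w=1.0, s=1.0):
    """J with A fixed to the identity, so the log-det term is zero."""
    j = sum(sq_dist(x, centers[l]) for x, l in zip(X, labels))
    j += sum(w for a, b in must if labels[a] != labels[b])
    j += sum(s for a, b in cannot if labels[a] == labels[b])
    return j


def cluster(X, must, cannot, k=2, iters=20, seed=0, w=1.0, s=1.0):
    """Constrained two-cluster k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(X[i]) for i in rng.sample(range(len(X)), k)]
    labels = [None] * len(X)
    previous = math.inf
    for _ in range(iters):
        for i, x in enumerate(X):
            labels[i] = min(
                range(k),
                key=lambda c: sq_dist(x, centers[c])
                + penalty(i, c, labels, must, cannot, w, s),
            )
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        j = objective(X, labels, centers, must, cannot, w, s)
        if previous - j < 1e-9:  # objective no longer decreasing
            break
        previous = j
    return labels, centers


def relevant_cluster(labels, centers, query_vec):
    """Indices of the documents in the cluster nearest to the query."""
    best = min(range(len(centers)), key=lambda c: sq_dist(centers[c], query_vec))
    return [i for i, l in enumerate(labels) if l == best]
```

The documents returned by relevant_cluster play the role of the feedback documents Clu_{f}. A fuller implementation would also update A and use per-pair weights, which this sketch leaves out.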

Step 4: according to the feedback documents, the expansion word selection module selects the expansion words T = {w_{1}t_{1}, w_{2}t_{2}, ..., w_{n}t_{n}}, which are combined with the original query Q to form the new query Q_{0}. This specifically comprises the following steps:

Step 4.1: according to the relevant document class, the weight w_{j} of each word is calculated by the following formulas:

tf_{i, j}^{'} \overset{\Delta}{=} \frac{tf_{i, j}}{tf_{i, j} + \frac{dl_{i}}{dl^{'}}}

w_{j} = \frac{1}{F} \times \sum_{i = 1}^{F} tf_{i, j}^{'}

Step 4.2: the words are sorted by their weight w_{j}, and the expansion words T = {w_{1}t_{1}, w_{2}t_{2}, ..., w_{n}t_{n}} are then extracted.

Step 4.3: the expansion words T = {w_{1}t_{1}, w_{2}t_{2}, ..., w_{n}t_{n}} and the original query Q are combined to form the new query Q_{0}.

The new query is Q_{0} = \alpha Q + (1 - \alpha) T, where \alpha = 0.7.
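Steps 4.1 to 4.3 can be sketched as follows. This is a minimal illustration that assumes the feedback documents are lists of tokens and that both the original query and the expansion words are represented as dictionaries mapping a word to its weight; that representation, and the function names term_weights, select_expansion_terms and combine_query, are assumptions and not part of the source.

```python
from collections import Counter

ALPHA = 0.7  # interpolation weight; the value stated in the text


def term_weights(feedback_docs):
    """Weight w_j of every word, averaged over the F feedback documents."""
    F = len(feedback_docs)
    avg_len = sum(len(doc) for doc in feedback_docs) / F  # dl'
    totals = Counter()
    for doc in feedback_docs:
        ratio = len(doc) / avg_len  # dl_i / dl'
        for word, tf in Counter(doc).items():
            totals[word] += tf / (tf + ratio)  # normalized tf'
    # Words absent from a document contribute 0 for that document.
    return {word: total / F for word, total in totals.items()}


def select_expansion_terms(feedback_docs, n):
    """Top n words by weight, as a dictionary word -> w_j."""
    weights = term_weights(feedback_docs)
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])


def combine_query(query, expansion, alpha=ALPHA):
    """Q0 = alpha * Q + (1 - alpha) * T over weighted word vectors."""
    words = set(query) | set(expansion)
    return {
        word: alpha * query.get(word, 0.0)
        + (1 - alpha) * expansion.get(word, 0.0)
        for word in words
    }
```

For example, combine_query({"apple": 1.0}, select_expansion_terms(feedback_docs, 10)) keeps the original query word dominant while adding the ten highest-weighted feedback words with smaller weights.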

Claims

1. A query expansion method based on semi-supervised clustering, characterized in that:

2. The query expansion method based on semi-supervised clustering according to claim 1, characterized in that in step 3 the feedback documents are obtained as follows:

Step 3.3: according to the Euclidean distance between each remaining document and each cluster center, each remaining document is assigned to the nearest cluster center;

Step 3.4: the cluster centers are recalculated;

3. The query expansion method based on semi-supervised clustering according to claim 2, characterized in that the objective function in step 3 is:

J = \sum_{x_{i} \in X} (x_{i} - \mu_{l_{i}})^{2} + \sum_{(x_{i}, x_{j}) \in M} w_{ij} Z(l_{i} \neq l_{j}) + \sum_{(x_{i}, x_{j}) \in C} s_{ij} Z(l_{i} = l_{j}) - \log(\det(A))

In the formula, M and C are the relevant document set and the irrelevant document set respectively, w_{ij} and s_{ij} are the corresponding weights in these two sets, l_{i} is the class label of document x_{i}, Z is the indicator function, \mu_{l_{i}} is the center of cluster l_{i}, x_{i} and x_{j} denote documents, and A is a symmetric positive definite matrix.

4. The query expansion method based on semi-supervised clustering according to claim 3, characterized in that step 4 comprises the following steps:

Step 4.3: the expansion words are combined with the original query to form a new query.

5. The query expansion method based on semi-supervised clustering according to claim 4, characterized in that in step 4.1 the weight w_{j} of a word is calculated by the following formulas:

tf_{i, j}^{'} \overset{\Delta}{=} \frac{tf_{i, j}}{tf_{i, j} + \frac{dl_{i}}{dl^{'}}}

w_{j} = \frac{1}{F} \times \sum_{i = 1}^{F} tf_{i, j}^{'}

6. The query expansion method based on semi-supervised clustering according to any one of claims 1 to 5, characterized in that step 1 comprises the following steps: