CN103324707A - Query expansion method based on semi-supervised clustering - Google Patents

Query expansion method based on semi-supervised clustering

Info

Publication number
CN103324707A
Authority
CN
China
Prior art keywords
document
documents
semi
inquiry
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102413856A
Other languages
Chinese (zh)
Inventor
杨静
刘宁
张健沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN2013102413856A priority Critical patent/CN103324707A/en
Publication of CN103324707A publication Critical patent/CN103324707A/en
Pending legal-status Critical Current

Abstract

The invention provides a query expansion method based on semi-supervised clustering. The method comprises the following steps: (1) a query-likelihood language model performs an initial retrieval for the user query and returns the top n documents of the result; (2) the top k documents of the initial result are manually annotated and divided into a relevant document set and an irrelevant document set; (3) the top n documents are analyzed by a semi-supervised clustering algorithm that integrates constraints and distance, and the documents related to the query are extracted as feedback documents; (4) an expansion word selection module selects expansion words from the feedback documents, and the expansion words are combined with the original query to form a new query. By learning the query relevance of a small number of annotated documents, the method can estimate the query relevance of a large number of unlabeled documents more accurately, which improves the quality of the feedback documents and thereby the recall and precision of retrieval.

Description

A query expansion method based on semi-supervised clustering
Technical field
The present invention relates to a query expansion method based on semi-supervised clustering.
Background art
With the development of information technology and the growth in the volume of information, information retrieval has become increasingly important in work and daily life. Retrieval lets people find the information they need quickly. However, users often know little about the information they are looking for, so the queries they submit are frequently too short to describe their information need fully and accurately. How to make a retrieval system return as much information relevant to the user's query intent as possible, while returning as little irrelevant or weakly relevant information as possible, has become a primary problem in the retrieval field. Query expansion is an effective technique for addressing this problem. It mitigates the mismatch between query words and document words and improves the retrieval performance of information systems. The main approaches are global analysis, local analysis, methods based on user query logs, and methods based on semantic knowledge. Pseudo-relevance feedback is a commonly used local analysis method: it revises the original query according to feedback from the top n documents of the initial retrieval result, which compensates for the lack of information in the user's query and improves retrieval performance. Traditional pseudo-relevance feedback models, such as Okapi and the model proposed by Lavrenko and Croft, assume that the top n documents of the initial retrieval result are all relevant to the query. In practice, the top n documents often contain irrelevant documents (noise), so traditional pseudo-relevance feedback models fall short of the desired effect.
To address problems of the pseudo-relevance feedback model such as the poor quality of feedback documents and the query drift caused by unsuitable expansion word selection, some researchers have used clustering to improve it. Clustering does not require manual labeling to organize large numbers of documents, which greatly reduces the workload, so it is widely used in information retrieval. Most current clustering methods are still traditional ones: partition-based, hierarchical, model-based, or density-based. These methods handle data whose classes have similar structure well. In practice, however, the distribution of text categories is often diverse, and it is difficult for a clustering algorithm to guarantee that the clusters it generates are highly consistent with the categories defined by people.
Summary of the invention
The object of the present invention is to provide a query expansion method based on semi-supervised clustering that estimates the classes of a large number of unlabeled samples more accurately by learning from a small number of labeled samples, and so organizes unlabeled samples with greatly reduced manual effort.
The technical solution that achieves the object of the invention is as follows:
A query expansion method based on semi-supervised clustering, characterized in that:
Step 1: a query-likelihood language model performs an initial retrieval for the user's query and returns the top n documents of the retrieval result;
Step 2: the top k documents of the initial retrieval result are manually annotated and divided into two classes, a relevant document set and an irrelevant document set;
Step 3: the top n documents are analyzed by a semi-supervised clustering algorithm that integrates constraints and distance, and the documents relevant to the query are extracted as feedback documents;
Step 4: according to the feedback documents, an expansion word selection module selects expansion words, and the expansion words and the original query are combined into a new query.
In step 3, the feedback documents are obtained as follows:
Step 3.1: randomly choose two documents from the top n documents as cluster centers;
Step 3.2: compute the Euclidean distance between each remaining document and each cluster center;
Step 3.3: according to the Euclidean distance between each remaining document and each cluster center, assign each remaining document to the nearest cluster center;
Step 3.4: recompute the cluster centers;
Repeat steps 3.1 to 3.4 until the objective function converges, then take the cluster whose center is nearest to the original query as the relevant document class.
The objective function in step 3 is:
$$J = \sum_{x_i \in X} (x_i - \mu_{l_i})^2 + \sum_{(x_i, x_j) \in M} w_{ij} Z(l_i \neq l_j) + \sum_{(x_i, x_j) \in C} s_{ij} Z(l_i = l_j) - \log(\det(A))$$
In the formula, $M$ and $C$ are the relevant document set and the irrelevant document set respectively, $w_{ij}$ and $s_{ij}$ are the weights corresponding to these two sets, $l_i$ is the cluster label, $Z$ is the indicator function, $\mu_{l_i}$ is a cluster center, $x_i$ and $x_j$ denote documents, and $A$ is a symmetric positive definite matrix.
Step 4 comprises the following steps:
Step 4.1: compute the weight of each word from the relevant document class;
Step 4.2: sort the words by weight, then extract the expansion words;
Step 4.3: combine the expansion words and the original query into a new query.
In step 4.1, the weight $w_j$ of a word is computed by the following formulas:
$$tf'_{i,j} \triangleq \frac{tf_{i,j}}{tf_{i,j} + \frac{dl_i}{dl'}}$$
$$w_j = \frac{1}{F} \times \sum_{i=1}^{F} tf'_{i,j}$$
In the formulas, $F$ is the number of feedback documents, $dl_i$ is the length of document $i$, $dl'$ is the average length of all feedback documents, and $tf_{i,j}$ is the term frequency of word $j$ in document $i$.
Step 1 comprises the following steps:
Step 1.1: preprocess the whole document collection, and build a text database based on the vector space model together with the overall feature dictionary of the whole collection;
Step 1.2: preprocess the query submitted by the user to form the vector form of the query;
Step 1.3: rank the documents by the likelihood that each document generates the query.
The beneficial effects of the invention are:
1. It improves the pseudo-relevance feedback model and the quality of the pseudo-relevant document set, which in turn improves the quality of the new query expression and ultimately the quality of the retrieval results.
2. By learning the query relevance of a small number of annotated documents, it can estimate the query relevance of a large number of unlabeled documents more accurately.
3. The pseudo-relevant document set formed by semi-supervised clustering better matches the actual situation, with high efficiency and high accuracy.
Description of the drawings
Fig. 1 is the system flowchart of the query expansion method based on semi-supervised clustering according to the present invention;
Fig. 2 is the flowchart of the semi-supervised clustering algorithm of the present invention.
Embodiment
The implementation of the present invention is described in further detail below with reference to the drawings and a specific embodiment.
With reference to Fig. 1 and Fig. 2, the present invention proposes a query expansion method based on semi-supervised clustering, which comprises the following steps:
Step 1: A query-likelihood language model performs an initial retrieval for the user's query and returns the top n documents of the retrieval result, specifically comprising the following steps:
Step 1.1: Preprocess the whole document collection (stop-word removal, stemming, and so on), and build a text database based on the vector space model together with the overall feature dictionary of the whole collection.
Step 1.2: Preprocess the input query in the same way (stop-word removal, stemming, and so on); the remaining words form the vector form Q of the query.
Step 1.3: Rank the documents by the likelihood P(Q|D) that a document generates the query. The larger P(Q|D) is, the more relevant the document is to the query and the higher it ranks. The formulas are as follows:
$$P(Q|D) = \prod_{i=1}^{m} P(q_i|D)$$
$$P(w|D) = \frac{|D|}{|D| + \mu} P_{ML}(w|D) + \frac{\mu}{|D| + \mu} P_{ML}(w|Coll)$$
$$P_{ML}(w|D) = \frac{freq(w, D)}{|D|}$$
$$P_{ML}(w|Coll) = \frac{freq(w, Coll)}{|Coll|}$$
In the formulas, $q_i$ is the $i$-th query word, $m$ is the number of query words in the query $Q$, $D$ is a document, $P_{ML}(w|D)$ is the maximum likelihood estimate of word $w$ occurring in document $D$, $Coll$ is the whole collection, $\mu$ is the smoothing factor with value 500, $|D|$ and $|Coll|$ are the lengths of $D$ and $Coll$ respectively, and $freq(w, D)$ and $freq(w, Coll)$ are the frequencies with which word $w$ occurs in $D$ and $Coll$ respectively.
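The ranking in step 1.3 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: documents are assumed to be already preprocessed into token lists, the function names are hypothetical, and the product P(Q|D) is accumulated in log space to avoid numerical underflow, which is an implementation choice that does not change the ranking.

```python
import math
from collections import Counter

MU = 500  # smoothing factor value stated in the text


def dirichlet_prob(word, doc_tf, doc_len, coll_tf, coll_len, mu=MU):
    """Smoothed P(w|D): mix the document estimate with the collection estimate."""
    p_ml_doc = doc_tf[word] / doc_len if doc_len else 0.0
    p_ml_coll = coll_tf[word] / coll_len
    lam = doc_len / (doc_len + mu)
    return lam * p_ml_doc + (1.0 - lam) * p_ml_coll


def rank_by_query_likelihood(query_terms, docs, mu=MU):
    """Return document indices sorted by descending log P(Q|D).

    docs is a list of token lists (assumed non-empty overall).
    """
    doc_tfs = [Counter(tokens) for tokens in docs]
    coll_tf = Counter()
    for tf in doc_tfs:
        coll_tf.update(tf)
    coll_len = sum(coll_tf.values())

    scores = []
    for idx, (tokens, tf) in enumerate(zip(docs, doc_tfs)):
        log_p = 0.0
        for q in query_terms:
            p = dirichlet_prob(q, tf, len(tokens), coll_tf, coll_len, mu)
            if p == 0.0:  # query word absent from the whole collection
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores.append((log_p, idx))
    # A larger likelihood means a higher rank.
    return [idx for _, idx in sorted(scores, key=lambda s: -s[0])]
```

Taking the first n indices of the returned list gives the top n documents that the later steps operate on.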
Step 2: Manually annotate the top k documents of the initial retrieval result, forming the set $C_r$ of documents relevant to the query and the set $C_{ir}$ of documents irrelevant to the query.
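The text does not spell out how the two annotated sets become pairwise constraints for the clustering step. One plausible reading, offered here only as an assumption, is that two documents with the same label form a must-link pair and a relevant document paired with an irrelevant one forms a cannot-link pair. The function name and the use of integer document indices are hypothetical.

```python
from itertools import combinations


def build_constraints(relevant_ids, irrelevant_ids):
    """Derive pairwise constraints from the two manually annotated sets.

    Returns (must_link, cannot_link), each a set of (smaller_id, larger_id).
    """
    must_link = set()
    # Assumption: documents that share a label should share a cluster.
    for group in (relevant_ids, irrelevant_ids):
        for a, b in combinations(sorted(group), 2):
            must_link.add((a, b))
    # Assumption: a relevant and an irrelevant document should be separated.
    cannot_link = {
        (min(a, b), max(a, b)) for a in relevant_ids for b in irrelevant_ids
    }
    return must_link, cannot_link
```

For example, annotating documents 0 and 2 as relevant and 1 and 3 as irrelevant yields two must-link pairs and four cannot-link pairs.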
Step 3: Analyze the top n documents with a semi-supervised clustering algorithm that integrates constraints and distance, and extract the documents relevant to the query as the feedback documents $Clu_f$, specifically comprising the following steps:
Step 3.1: Randomly choose two documents $x_i$, $x_j$ from the top n documents as the cluster centers $\mu_1$, $\mu_2$ respectively.
Step 3.2: Compute the Euclidean distance $op$ between each remaining document $x_i$ and each cluster center.
$$op = \sqrt{\sum_{i=1}^{m} (y_i - z_i)^2}$$
In the formula, $y_i$ and $z_i$ are the $i$-th coordinates of the vector forms of documents $y$ and $z$ respectively.
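As a small illustration, the distance in step 3.2 can be computed with the standard Euclidean formula (the square root of the sum of squared coordinate differences). The function name is hypothetical, and documents are assumed to be dense vectors of equal length.

```python
import math


def euclidean_distance(y, z):
    """Euclidean distance between two document vectors of equal length."""
    if len(y) != len(z):
        raise ValueError("vectors must have the same number of coordinates")
    return math.sqrt(sum((yi - zi) ** 2 for yi, zi in zip(y, z)))
```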
Step 3.3: According to the Euclidean distance $op$ between each remaining document $x_i$ and each cluster center, assign each remaining document $x_i$ to the nearest cluster center.
Step 3.4: Recompute the cluster centers $\mu_1$, $\mu_2$.
Repeat steps 3.1 to 3.4 until the objective function $J$ converges, then take the cluster whose center ($\mu_1$ or $\mu_2$) is nearest to the original query as the relevant document class $Clu_f$.
$$J = \sum_{x_i \in X} (x_i - \mu_{l_i})^2 + \sum_{(x_i, x_j) \in M} w_{ij} Z(l_i \neq l_j) + \sum_{(x_i, x_j) \in C} s_{ij} Z(l_i = l_j) - \log(\det(A))$$
In the formula, $M$ and $C$ are the must-link constraint set (relevant document set) and the cannot-link constraint set (irrelevant document set) respectively, $w_{ij}$ and $s_{ij}$ are the weights corresponding to these two sets, $l_i$ is the cluster label, $Z$ is the indicator function ($Z(\text{true}) = 1$, $Z(\text{false}) = 0$), $\mu_{l_i}$ is a cluster center, $x_i$ and $x_j$ denote documents, and $A$ is a symmetric positive definite matrix.
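The loop of steps 3.1 to 3.4 and the objective $J$ can be sketched as below. This is a simplified illustration under several assumptions that are not in the source: $A$ is taken to be the identity matrix, so the $-\log(\det(A))$ term is zero and no metric is learned; all $w_{ij}$ and $s_{ij}$ share uniform values `w` and `s`; documents are reassigned greedily, adding constraint penalties to the squared distance; and the random centers are drawn once rather than on every pass. All function names are hypothetical.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def objective(docs, labels, centers, must, cannot, w=1.0, s=1.0):
    """J with A = identity, so the -log(det(A)) term vanishes (assumption)."""
    j = sum(sq_dist(x, centers[l]) for x, l in zip(docs, labels))
    j += sum(w for a, b in must if labels[a] != labels[b])    # violated must-links
    j += sum(s for a, b in cannot if labels[a] == labels[b])  # violated cannot-links
    return j


def penalty(idx, cluster, labels, must, cannot, w, s):
    """Constraint cost of placing document idx in the given cluster."""
    cost = 0.0
    for a, b in must:
        if idx in (a, b) and labels[b if a == idx else a] != cluster:
            cost += w
    for a, b in cannot:
        if idx in (a, b) and labels[b if a == idx else a] == cluster:
            cost += s
    return cost


def constrained_two_means(docs, must, cannot, w=1.0, s=1.0, seed=0, max_iter=100):
    """Two-cluster constrained k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    first, second = rng.sample(range(len(docs)), 2)  # step 3.1
    centers = [list(docs[first]), list(docs[second])]
    labels = [0 if sq_dist(x, centers[0]) <= sq_dist(x, centers[1]) else 1
              for x in docs]
    previous = None
    for _ in range(max_iter):
        for idx, x in enumerate(docs):               # steps 3.2 and 3.3
            costs = [sq_dist(x, centers[c])
                     + penalty(idx, c, labels, must, cannot, w, s)
                     for c in (0, 1)]
            labels[idx] = 0 if costs[0] <= costs[1] else 1
        for c in (0, 1):                             # step 3.4
            members = [x for x, l in zip(docs, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        current = objective(docs, labels, centers, must, cannot, w, s)
        if previous is not None and abs(previous - current) < 1e-9:
            break                                    # J has converged
        previous = current
    return labels, centers


def feedback_cluster(labels, centers, query_vec):
    """Indices of documents in the cluster whose center is nearest the query."""
    d0, d1 = sq_dist(query_vec, centers[0]), sq_dist(query_vec, centers[1])
    nearest = 0 if d0 <= d1 else 1
    return [i for i, l in enumerate(labels) if l == nearest]
```

A full treatment of the objective would also update the matrix $A$; with $A$ fixed to the identity, the sketch reduces to pairwise-constrained k-means with two clusters.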
Step 4: According to the feedback documents, select the expansion words $T = \{w_1 t_1, w_2 t_2, \ldots, w_n t_n\}$ with the expansion word selection module and combine them with the original query $Q$ to form the new query $Q_0$, specifically comprising the following steps:
Step 4.1: From the relevant document class, compute the weight $w_j$ of each word by the following formulas:
$$tf'_{i,j} \triangleq \frac{tf_{i,j}}{tf_{i,j} + \frac{dl_i}{dl'}}$$
$$w_j = \frac{1}{F} \times \sum_{i=1}^{F} tf'_{i,j}$$
In the formulas, $F$ is the number of feedback documents, $dl_i$ is the length of document $i$, $dl'$ is the average length of all feedback documents, and $tf_{i,j}$ is the term frequency of word $j$ in document $i$.
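The weight computation of step 4.1, followed by the sort in step 4.2, can be sketched as below. The function names are hypothetical, feedback documents are assumed to be non-empty token lists, and ties in weight are broken alphabetically, which is an arbitrary choice not taken from the source.

```python
from collections import Counter


def term_weights(feedback_docs):
    """Map each word j to w_j, the mean length-normalized term frequency."""
    num_docs = len(feedback_docs)                          # F
    avg_len = sum(len(d) for d in feedback_docs) / num_docs  # dl'
    totals = Counter()
    for doc in feedback_docs:
        doc_len = len(doc)                                 # dl_i
        for term, tf in Counter(doc).items():
            # tf'_{i,j} = tf / (tf + dl_i / dl'); absent words contribute 0.
            totals[term] += tf / (tf + doc_len / avg_len)
    return {term: total / num_docs for term, total in totals.items()}


def top_expansion_terms(feedback_docs, n):
    """Return the n highest-weighted (word, weight) pairs."""
    weights = term_weights(feedback_docs)
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

With the two feedback documents `["a", "b"]` and `["a", "a"]`, the average length is 2, so word `a` receives (1/2 + 2/3) / 2 = 7/12 and word `b` receives (1/2) / 2 = 1/4.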
Step 4.2: Sort the words by their weights $w_j$, then extract the expansion words $T = \{w_1 t_1, w_2 t_2, \ldots, w_n t_n\}$.
Step 4.3: Combine the expansion words $T = \{w_1 t_1, w_2 t_2, \ldots, w_n t_n\}$ and the original query $Q$ into the new query $Q_0$.
The new query is $Q_0 = \alpha Q + (1 - \alpha) T$, where $\alpha = 0.7$.
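The interpolation in step 4.3 can be illustrated as follows, assuming that both the original query and the expansion words are represented as sparse word-to-weight dictionaries. That representation and the function name are assumptions made for this sketch.

```python
def expand_query(query_vec, expansion_terms, alpha=0.7):
    """Q0 = alpha * Q + (1 - alpha) * T over sparse word-to-weight dicts."""
    new_query = {term: alpha * weight for term, weight in query_vec.items()}
    for term, weight in expansion_terms.items():
        new_query[term] = new_query.get(term, 0.0) + (1.0 - alpha) * weight
    return new_query
```

A word that appears in both the original query and the expansion set receives contributions from both terms of the sum, so original query words keep the larger share of the weight when $\alpha = 0.7$.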

Claims (6)

1. A query expansion method based on semi-supervised clustering, characterized in that:
Step 1: a query-likelihood language model performs an initial retrieval for the user's query and returns the top n documents of the retrieval result;
Step 2: the top k documents of the initial retrieval result are manually annotated and divided into two classes, a relevant document set and an irrelevant document set;
Step 3: the top n documents are analyzed by a semi-supervised clustering algorithm that integrates constraints and distance, and the documents relevant to the query are extracted as feedback documents;
Step 4: according to the feedback documents, an expansion word selection module selects expansion words, and the expansion words and the original query are combined into a new query.
2. The query expansion method based on semi-supervised clustering according to claim 1, characterized in that, in step 3, the feedback documents are obtained as follows:
Step 3.1: randomly choose two documents from the top n documents as cluster centers;
Step 3.2: compute the Euclidean distance between each remaining document and each cluster center;
Step 3.3: according to the Euclidean distance between each remaining document and each cluster center, assign each remaining document to the nearest cluster center;
Step 3.4: recompute the cluster centers;
Repeat steps 3.1 to 3.4 until the objective function converges, then take the cluster whose center is nearest to the original query as the relevant document class.
3. The query expansion method based on semi-supervised clustering according to claim 2, characterized in that the objective function in step 3 is:
$$J = \sum_{x_i \in X} (x_i - \mu_{l_i})^2 + \sum_{(x_i, x_j) \in M} w_{ij} Z(l_i \neq l_j) + \sum_{(x_i, x_j) \in C} s_{ij} Z(l_i = l_j) - \log(\det(A))$$
In the formula, $M$ and $C$ are the relevant document set and the irrelevant document set respectively, $w_{ij}$ and $s_{ij}$ are the weights corresponding to these two sets, $l_i$ is the cluster label, $Z$ is the indicator function, $\mu_{l_i}$ is a cluster center, $x_i$ and $x_j$ denote documents, and $A$ is a symmetric positive definite matrix.
4. The query expansion method based on semi-supervised clustering according to claim 3, characterized in that step 4 comprises the following steps:
Step 4.1: compute the weight of each word from the relevant document class;
Step 4.2: sort the words by weight, then extract the expansion words;
Step 4.3: combine the expansion words and the original query into a new query.
5. The query expansion method based on semi-supervised clustering according to claim 4, characterized in that, in step 4.1, the weight $w_j$ of a word is computed by the following formulas:
$$tf'_{i,j} \triangleq \frac{tf_{i,j}}{tf_{i,j} + \frac{dl_i}{dl'}}$$
$$w_j = \frac{1}{F} \times \sum_{i=1}^{F} tf'_{i,j}$$
In the formulas, $F$ is the number of feedback documents, $dl_i$ is the length of document $i$, $dl'$ is the average length of all feedback documents, and $tf_{i,j}$ is the term frequency of word $j$ in document $i$.
6. The query expansion method based on semi-supervised clustering according to any one of claims 1 to 5, characterized in that step 1 comprises the following steps:
Step 1.1: preprocess the whole document collection, and build a text database based on the vector space model together with the overall feature dictionary of the whole collection;
Step 1.2: preprocess the query submitted by the user to form the vector form of the query;
Step 1.3: rank the documents by the likelihood that each document generates the query.
CN2013102413856A 2013-06-18 2013-06-18 Query expansion method based on semi-supervised clustering Pending CN103324707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102413856A CN103324707A (en) 2013-06-18 2013-06-18 Query expansion method based on semi-supervised clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102413856A CN103324707A (en) 2013-06-18 2013-06-18 Query expansion method based on semi-supervised clustering

Publications (1)

Publication Number Publication Date
CN103324707A true CN103324707A (en) 2013-09-25

Family

ID=49193450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102413856A Pending CN103324707A (en) 2013-06-18 2013-06-18 Query expansion method based on semi-supervised clustering

Country Status (1)

Country Link
CN (1) CN103324707A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100145961A1 (en) * 2008-12-05 2010-06-10 International Business Machines Corporation System and method for adaptive categorization for use with dynamic taxonomies
CN102346753A (en) * 2010-08-01 2012-02-08 青岛理工大学 Semi-supervised text clustering method and device fusing pairwise constraints and keywords
CN102968410A (en) * 2012-12-04 2013-03-13 江南大学 Text classification method based on RBF (Radial Basis Function) neural network algorithm and semantic feature selection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯平,黄名选: "特征词抽取和相关性融合的伪相关反馈查询扩展", 《现代图书情报技术》 *
李昆仑,曹铮,曹丽苹,张超,刘明: "半监督聚类的若干新进展", 《模式识别与人工智能》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902694A (en) * 2014-03-28 2014-07-02 哈尔滨工程大学 Clustering and query behavior based retrieval result sorting method
CN103902694B (en) * 2014-03-28 2017-04-12 哈尔滨工程大学 Clustering and query behavior based retrieval result sorting method
CN105023026A (en) * 2015-08-18 2015-11-04 苏州大学张家港工业技术研究院 Semi-supervised clustering method and semi-supervised clustering system based on nonnegative matrix factorization
CN105023026B (en) * 2015-08-18 2018-08-17 苏州大学张家港工业技术研究院 A kind of Novel semi-supervised and system based on Non-negative Matrix Factorization
CN107169045A (en) * 2017-04-19 2017-09-15 中国人民解放军国防科学技术大学 A kind of query word method for automatically completing and device based on temporal signatures
CN107247745A (en) * 2017-05-23 2017-10-13 华中师范大学 A kind of information retrieval method and system based on pseudo-linear filter model
CN107247745B (en) * 2017-05-23 2018-07-03 华中师范大学 A kind of information retrieval method and system based on pseudo-linear filter model
CN107544962A (en) * 2017-09-07 2018-01-05 电子科技大学 Social media text query extended method based on Similar Text feedback

Similar Documents

Publication Publication Date Title
CN103778227B (en) The method screening useful image from retrieval image
CN103810299B (en) Image retrieval method on basis of multi-feature fusion
CN100570611C (en) A kind of methods of marking of the information retrieval document based on viewpoint searching
CN102902806B (en) A kind of method and system utilizing search engine to carry out query expansion
CN103593425B (en) Preference-based intelligent retrieval method and system
CN106649272B (en) A kind of name entity recognition method based on mixed model
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN106598950B (en) A kind of name entity recognition method based on hybrid laminated model
CN108846029B (en) Information correlation analysis method based on knowledge graph
CN106156272A (en) A kind of information retrieval method based on multi-source semantic analysis
CN105975596A (en) Query expansion method and system of search engine
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN105653706A (en) Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN102890711B (en) A kind of retrieval ordering method and system
CN102737042B (en) Method and device for establishing question generation model, and question generation method and device
CN101963971A (en) Use relevance feedback to carry out the method and the corresponding storage medium of database search
CN103838833A (en) Full-text retrieval system based on semantic analysis of relevant words
CN104008109A (en) User interest based Web information push service system
CN104484380A (en) Personalized search method and personalized search device
CN103123653A (en) Search engine retrieving ordering method based on Bayesian classification learning
CN102663447B (en) Cross-media searching method based on discrimination correlation analysis
CN102693316B (en) Linear generalization regression model based cross-media retrieval method
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN103324707A (en) Query expansion method based on semi-supervised clustering
CN102156728B (en) Improved personalized summary system based on user interest model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130925