CN106960021B

CN106960021B - A query expansion method based on multi-source positive and negative external feedback information

Info

Publication number: CN106960021B
Application number: CN201710142050.7A
Authority: CN
Inventors: 杨震; 李超阳
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-03-10
Filing date: 2017-03-10
Publication date: 2019-06-21
Anticipated expiration: 2037-03-10
Also published as: CN106960021A

Abstract

The invention discloses a query expansion method based on multi-source positive and negative external feedback information. The method reduces expansion risk by introducing regularization constraints while fusing external query information, and it can construct new queries quickly and effectively so that search results better meet user needs. Compared with traditional feedback retrieval methods, the technical solution of the invention yields a marked improvement in performance.

Description

A query expansion method based on multi-source positive and negative external feedback information

Technical field

The invention belongs to the field of text extraction, and in particular relates to a query expansion method based on multi-source positive and negative external feedback information.

Background art

The emergence of social media (such as Twitter, Facebook, and Google+) has profoundly changed the way people produce and consume information. Its biggest difference from mainstream news media websites (such as CNN or nytimes) is that people in a social network are both consumers and producers of information. Because information in social networks comes from diverse sources and is disorganized, it is harder for users to obtain the information they need.

Conventional query expansion methods fall into two categories according to the expansion source: 1) local query expansion methods, which use the locally retrieved document set as the expansion source; and 2) global expansion methods, which use an external knowledge base as the expansion source. The first category mostly expands queries with methods such as text clustering, latent semantic indexing (LSI), and similarity thesauri; but because the local corpus is relatively fixed in composition and small in scale, it cannot reflect the user's true query intent well. The second category commonly uses public data resources such as WordNet and Wikipedia as external expansion sources and can express the user's query in more detail, but introducing external information inevitably brings in ambiguous or erroneous information, which increases the risk of query expansion.

Summary of the invention

The technical problem to be solved by the invention is to provide a query expansion method based on multi-source positive and negative external feedback information that reduces expansion risk by introducing regularization constraints while fusing external query information, and that can construct new queries quickly and effectively so that search results better meet user needs.

To achieve the above object, the invention adopts the following technical solution:

A query expansion method based on multi-source positive and negative external feedback information comprises the following steps:

Step (1): obtain the tweets

Step (2): obtain the user interest words

Step (3): preprocess the tweets

Step (4): build the local search engine

Apache's open-source retrieval framework Lucene is used as the main program of the local search engine. The preprocessed tweets serve as the index content and the tweet IDs as the index targets, and the local search engine is built accordingly.

Step (5): expand the query, which comprises the following steps:

Step (5.1): obtain the first query feedback using the user interest words

Let Q denote the user's query interest words. Q is submitted to the local search engine, and the top 100 results are taken as the first query feedback, from which the term-document matrix L is built. Each row of L represents a word, each column represents a feedback document, and each entry is the number of times the word occurs in the document.

Step (5.2): obtain external information

Using web crawler technology, Q is submitted to multiple external search engines, and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. The top m results of the n-th external feedback set are taken as positive feedback, from which the term-document matrix P_n is built; the results ranked 2m to 3m of the n-th external feedback set are taken as negative feedback, from which the term-document matrix N_n is built. Here m and n are natural numbers with values from 1 to positive infinity.

Step (5.3): cluster the external feedback information

The sparse matrices L, P_n, and N_n are each decomposed into the product of two dense matrices, as shown in formula (1), where the decomposed matrices U, A_n, and B_n represent the distribution of the feedback results and the matrix U represents the user's query intent. The distribution of the original feedback L is expected to be as close as possible to that of the positive feedback P_n and as far as possible from that of the negative feedback N_n; at the same time, the decompositions are constrained by a shared cluster-center matrix H to ensure that they are stable and valid. The final sparse-learning optimization objective of the multi-source query expansion model is therefore formula (1), where α, β, and γ are parameters that adjust the strength of the regularization terms.

Applying the KKT (Karush-Kuhn-Tucker) conditions to this optimization objective, while keeping the matrices non-negative, gives the iterative update rules in formulas (2) to (5), where i and j denote the i-th row and j-th column of a matrix.

The optimization function (1) is solved with the general NMF (non-negative matrix factorization) solution procedure. To guarantee that every term stays positive during the decomposition, the Karush-Kuhn-Tucker (KKT) conditions are used to derive the iterative formulas (2) to (5). The KKT conditions are a necessary and sufficient condition for a nonlinear programming problem to have an optimal solution, provided certain regularity conditions are satisfied.

Step (5.4): select the cluster-center vector and find the expansion words

Let U be the matrix with i rows and j columns built from the user interest feedback. Each column of U represents one user query intent, and the column in which the query words account for the largest share is selected as the user's final true query intent vector. The values in this vector indicate how strongly each word relates to the user's query intent. The words are sorted by value in descending order, and the top k are selected as query expansion words.

Step (6): retrieve again

The original query words and the expansion words together form the new query terms, which are submitted to the local search engine; the returned results are the final search results.

Preferably, each tweet consists of two parts: an ID and the text.

Preferably, the user interest file consists of three parts: an ID, the query words, and an interest measure; the user's query words are parsed from it and used as the user interest words.

Preferably, the tweet preprocessing of step (3) comprises:

Step (3.1): filter out non-English tweets and tweets shorter than two words.

Step (3.2): remove punctuation marks, digits, and URLs from the tweets, and convert all letters to lowercase.

Step (3.3): tokenize the tweets by splitting on spaces, and remove stop words.

The query expansion method of the invention, based on multi-source positive and negative external feedback information, reduces expansion risk by introducing regularization constraints while fusing external query information. It can construct new queries quickly and effectively, so that search results better meet user needs. Compared with traditional feedback retrieval methods, the technical solution of the invention yields a marked improvement in performance.

Brief description of the drawings

Fig. 1 is a schematic diagram of the query expansion method of the invention;

Fig. 2 is a histogram comparing the performance of different query expansion methods;

Fig. 3 is a histogram comparing performance under different parameters.

Specific embodiment

As shown in Fig. 1, an embodiment of the invention provides a query expansion method based on multi-source positive and negative external feedback information, comprising the following steps:

Step (1): obtain the tweets

The tweets are obtained; each tweet consists of two parts: an ID and the text.

Step (2): obtain the user interest words

The user interest file consists of three parts: an ID, the query words, and an interest measure. The user's query words are parsed from it and used as the user interest words.

Step (3): preprocess the tweets

Step (3.3): the tweets are tokenized by splitting on spaces, and stop words are removed. Different morphological forms of an English word are treated as different words; for example, "organ" and "organs" are treated as two different words.
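The preprocessing of steps (3.1) to (3.3) can be sketched in Python as follows. This is only an illustrative sketch: the stop-word list, the ASCII-based English check, and the function names are assumptions and are not specified by the invention. URLs are removed before punctuation so that a URL is dropped as a whole, and no stemming is applied, so "organ" and "organs" remain distinct.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def is_english(text):
    # Crude assumption: treat a tweet as English if it is pure ASCII.
    return text.isascii()


def preprocess(text):
    """Return the token list for one tweet, or None if the tweet is filtered out."""
    # Step (3.1): drop non-English tweets and tweets shorter than two words.
    if not is_english(text) or len(text.split()) < 2:
        return None
    # Step (3.2): remove URLs, lowercase, then strip punctuation and digits.
    text = URL_RE.sub(" ", text).lower()
    text = NON_LETTER_RE.sub(" ", text)
    # Step (3.3): split on whitespace and remove stop words (no stemming).
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

A real system would likely replace the ASCII check with a proper language detector and use a fuller stop-word list.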

Step (4): build the local search engine

Lucene is used to build the local search engine, with the preprocessed tweet content as the index content, the tweet IDs as the index, and BM25 as the similarity model.
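To make the role of the BM25 similarity model concrete, the sketch below scores tokenized documents against a query with the textbook Okapi BM25 formula. It is a stand-in for illustration, not Lucene's implementation; the function name, the parameter values k1 = 1.2 and b = 0.75, and the non-negative IDF variant are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(tok for d in docs for tok in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (assumed choice).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Sorting the tweet IDs by these scores in descending order and keeping the first 100 would give a feedback list of the kind used in step (5.1).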

Step (5): query expansion

Step (5.1): obtain the first query feedback, using the user interest words as the query

Let Q denote the user's query interest words. Q is submitted to the local search engine, and the top 100 results are taken as the first query feedback, from which the term-document matrix L is built. Each row of L represents a word, each column represents a feedback document, and each entry is the number of times the word occurs in the document.
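As a minimal sketch of how such a term-document matrix could be built from tokenized feedback documents (the function name and the alphabetical ordering of the vocabulary are assumptions):

```python
from collections import Counter


def term_document_matrix(docs):
    """docs is a list of token lists.

    Returns (vocab, matrix), where matrix[r][c] is the number of times
    vocab[r] occurs in docs[c]: rows are words, columns are documents.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix
```

For example, the two documents ["a", "b", "a"] and ["b", "c"] give the vocabulary ["a", "b", "c"] and the matrix [[2, 0], [1, 1], [0, 1]]. In practice the matrix is sparse and would be stored in a sparse format.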

Step (5.2): obtain external information

Using web crawler technology, Q is submitted to multiple external search engines, and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. The top m results of the n-th external feedback set are taken as positive feedback, from which the term-document matrix P_n is built; the results ranked 2m to 3m of the n-th external feedback set are taken as negative feedback, from which the term-document matrix N_n is built. Here m and n are natural numbers with values from 1 to positive infinity.
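The split of one engine's ranked results into positive and negative feedback can be sketched as below. The function name is hypothetical, and reading "2m to 3m" as the m results that follow rank 2m is an assumption; the text does not say whether rank 2m itself is included.

```python
def split_feedback(ranked_docs, m):
    """Split one external engine's ranked result list into feedback sets."""
    positive = ranked_docs[:m]           # ranks 1..m: positive feedback
    negative = ranked_docs[2 * m:3 * m]  # ranks 2m+1..3m (assumed reading): negative feedback
    return positive, negative
```

With 100 results and m = 10, results 1 to 10 become positive feedback and results 21 to 30 become negative feedback; results 11 to 20 are left unused as a buffer between the two sets.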

Step (5.3): cluster the feedback

The sparse matrices L, P_n, and N_n are each decomposed into the product of two dense matrices, as shown in formula (1), where the decomposed matrices U, A_n, and B_n represent the distribution of the feedback results and the matrix U represents the user's query intent. We want the distribution of the original feedback L to be as close as possible to that of the positive feedback P_n and as far as possible from that of the negative feedback N_n; at the same time, the decompositions are constrained by a shared cluster-center matrix H to ensure that they are stable and valid. The final sparse-learning optimization objective of the multi-source query expansion model is therefore formula (1), where α, β, and γ are parameters that adjust the strength of the regularization terms.

For this optimization objective, we apply the KKT (Karush-Kuhn-Tucker) conditions while keeping the matrices non-negative, which gives the iterative update rules in formulas (2) to (5), where i and j denote the i-th row and j-th column of a matrix.

We solve the optimization function (1) with the general NMF (non-negative matrix factorization) solution procedure. To guarantee that every term stays positive during the decomposition, we use the Karush-Kuhn-Tucker (KKT) conditions to derive the iterative formulas (2) to (5). The KKT conditions are a necessary and sufficient condition for a nonlinear programming problem to have an optimal solution, provided certain regularity conditions are satisfied.
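The sketch below does not implement formulas (1) to (5). As a rough illustration of what KKT-derived multiplicative update rules look like, it runs the standard Lee-Seung updates for plain NMF, which minimize the squared Frobenius error of V ≈ WH while keeping W and H non-negative. The regularization terms with α, β, and γ and the shared cluster-center matrix are omitted, and all names are hypothetical.

```python
import random


def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, iters=200, eps=1e-9, seed=0):
    """Plain NMF by multiplicative updates; v must be non-negative."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    # Strictly positive initialization so the updates are well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(rows)]
    return w, h
```

Because each update multiplies the current entry by a ratio of non-negative quantities, the factors never become negative, which is the property the KKT-based derivation is meant to guarantee.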

Step (5.4): select the cluster-center vector and find the expansion words

Because U is the matrix with i rows and j columns built from the user interest feedback, each column of U can be taken to represent one user query intent; the column in which the query words account for the largest share is then selected as the user's final true query intent vector. The values in this vector indicate how strongly each word relates to the user's query intent. The words are sorted by value in descending order, and the top 3 are selected as query expansion words.
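A minimal sketch of this selection step follows. It assumes that "share" means the summed weight of the query words in a column divided by the column total, and that words already present in the query are skipped when picking expansion words; neither detail is stated in the text, and the function name is hypothetical.

```python
def pick_expansion_words(u, vocab, query_words, k):
    """u has one row per word in vocab and one column per candidate intent."""
    query_rows = [i for i, w in enumerate(vocab) if w in query_words]
    n_cols = len(u[0])

    def query_share(j):
        # Assumed definition: query-word weight over total weight in column j.
        total = sum(row[j] for row in u)
        return sum(u[i][j] for i in query_rows) / total if total > 0 else 0.0

    # Column in which the query words account for the largest share.
    best = max(range(n_cols), key=query_share)
    # Sort words by their value in that column, largest first.
    ranked = sorted(range(len(vocab)), key=lambda i: u[i][best], reverse=True)
    # Assumption: words already in the query are not reused as expansion words.
    words = [vocab[i] for i in ranked if vocab[i] not in query_words]
    return best, words[:k]
```

The returned words, appended to the original query words, would form the new query used in step (6).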

Step (6): retrieve again

The original query words and the expansion words together form the new query terms, which are submitted to the local search engine; the returned data are the final search results. Fig. 2 gives the NDCG, MAP, and F-measure evaluation scores for tweet retrieval; the figure shows that this method performs considerably better than the other methods. Fig. 3 compares the performance of the algorithm with different numbers of expansion sources and different parameters. Performance is best when two expansion sources are selected, because too many expansion sources introduce more noise, which makes it harder to approximate the user's true interest.
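For readers unfamiliar with NDCG, one of the metrics reported in Fig. 2, the following sketch computes NDCG@k from a list of graded relevance values given in ranked order. It uses the common log2 discount and, as a simplifying assumption, derives the ideal ranking from the returned list alone rather than from all judged documents; the evaluation setup actually used for Fig. 2 is not described in the text.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of relevance values in ranked order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains, k):
    """NDCG@k; the ideal ordering is taken from the given gains (assumption)."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0, and a list that places less relevant results first scores lower.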

Claims

1. A query expansion method based on multi-source positive and negative external feedback information, characterized by comprising the following steps:

Step (1): obtain the tweets

Step (2): obtain the user interest words

Step (3): preprocess the tweets

Step (4): build the local search engine

Apache's open-source retrieval framework Lucene is used as the main program of the local search engine; the preprocessed tweets serve as the index content and the tweet IDs as the index targets, and the local search engine is built accordingly;

Step (5): expand the query, which comprises the following steps:

Step (5.1): obtain the first query feedback using the user interest words

Let Q denote the user's query interest words; Q is submitted to the local search engine, and the top 100 results are taken as the first query feedback, from which the term-document matrix L is built; each row of L represents a word, each column represents a feedback document, and each entry is the number of times the word occurs in the document;

Step (5.2): obtain external information

Using web crawler technology, Q is submitted to multiple external search engines, and the top 100 results from each are taken as external feedback information; the document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n; the top m results of the n-th external feedback set are taken as positive feedback, from which the term-document matrix P_n is built; the results ranked 2m to 3m of the n-th external feedback set are taken as negative feedback, from which the term-document matrix N_n is built; here m and n are natural numbers with values from 1 to positive infinity;

Step (5.3): cluster the external feedback information

The sparse matrices L, P_n, and N_n are each decomposed into the product of two dense matrices, as shown in formula (1), where the decomposed matrices U, A_n, and B_n represent the distribution of the feedback results and the matrix U represents the user's query intent; the distribution of the original feedback L is expected to be as close as possible to that of the positive feedback P_n and as far as possible from that of the negative feedback N_n, and the decompositions are at the same time constrained by a shared cluster-center matrix H to ensure that they are stable and valid; the final sparse-learning optimization objective of the multi-source query expansion model is therefore formula (1), where α, β, and γ are parameters that adjust the strength of the regularization terms;

Applying the KKT (Karush-Kuhn-Tucker) conditions to this optimization objective, while keeping the matrices non-negative, gives the iterative update rules in formulas (2) to (5), where i and j denote the i-th row and j-th column of a matrix;

The optimization function (1) is solved with the general NMF (non-negative matrix factorization) solution procedure; to guarantee that every term stays positive during the decomposition, the Karush-Kuhn-Tucker (KKT) conditions are used to derive the iterative formulas (2) to (5), where the KKT conditions are a necessary and sufficient condition for a nonlinear programming problem to have an optimal solution, provided certain regularity conditions are satisfied;

Step (5.4): select the cluster-center vector and find the expansion words

Let U be the matrix with i rows and j columns built from the user interest feedback; each column of U represents one user query intent, and the column in which the query words account for the largest share is selected as the user's final true query intent vector; the values in this vector indicate how strongly each word relates to the user's query intent; the words are sorted by value in descending order, and the top k are selected as query expansion words;

Step (6): retrieve again

The original query words and the expansion words together form the new query terms, which are submitted to the local search engine; the returned results are the final search results.

2. The query expansion method based on multi-source positive and negative external feedback information according to claim 1, characterized in that each tweet consists of two parts: an ID and the text.

3. The query expansion method based on multi-source positive and negative external feedback information according to claim 1, characterized in that the tweet preprocessing of step (3) comprises:

Step (3.1): filter out non-English tweets and tweets shorter than two words;

Step (3.3): tokenize the tweets by splitting on spaces, and remove stop words.