CN106960021B - A query expansion method based on multi-source positive and negative external feedback information - Google Patents
A query expansion method based on multi-source positive and negative external feedback information
- Publication number
- CN106960021B (application number CN201710142050.7A)
- Authority
- CN
- China
- Prior art keywords
- feedback
- matrix
- query
- word
- tweets
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
- G06F16/3326—Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a query expansion method based on multi-source positive and negative external feedback information. Expansion risk is reduced by introducing regularization constraints while external query information is merged, so that a new query can be constructed quickly and effectively and the search results better meet the user's needs. Compared with traditional feedback retrieval methods, the technical solution of the invention markedly improves performance.
Description
Technical field
The invention belongs to the field of text extraction, and more particularly relates to a query expansion method based on multi-source positive and negative external feedback information.
Background art
The emergence of social media (such as Twitter, Facebook, and Google+) has profoundly changed the way people produce and consume information. Its biggest difference from mainstream news websites (such as CNN or nytimes) is that people in a social network are both consumers and producers of information. Because the information in social networks comes from diverse sources and is disorganized, it is harder for users to obtain the information they need.
Conventional query expansion methods fall into two categories according to the expansion source: 1) local query expansion methods, which use the locally retrieved document set as the expansion source; and 2) global expansion methods, which use an external knowledge base as the expansion source. The first category mostly expands queries with methods such as text clustering, latent semantic indexing (LSI), and similarity thesauri, but because the local corpus is relatively fixed in composition and small in scale, it cannot reflect the user's true query intent well. The second category commonly uses public data resources such as WordNet and Wikipedia as the external expansion source and can express the user's query in more detail, but introducing external information inevitably brings in ambiguous or erroneous information, which increases the risk of query expansion.
Summary of the invention
The technical problem to be solved by the present invention is to provide a query expansion method based on multi-source positive and negative external feedback information. The method reduces expansion risk by introducing regularization constraints while external query information is merged, and can construct a new query quickly and effectively so that the search results better meet the user's needs.
To achieve the above object, the present invention adopts the following technical solution:
A query expansion method based on multi-source positive and negative external feedback information, comprising the following steps:
Step (1): obtain tweets
Step (2): obtain the user interest words
Step (3): preprocess the tweets
Step (4): construct the local search engine
The Apache open-source retrieval framework Lucene is used as the main program of the local search engine. The preprocessed tweets are the indexed content and the tweet id is the index target, and the local search engine is constructed from them;
Step (5): expand the query, comprising the following steps:
Step (5.1): obtain the first query feedback using the user interest words
The user's query interest words are denoted Q. Q is submitted to the local search engine and the top 100 results are taken as the first query feedback, from which the term-document matrix L is built. Each row of L represents a word, each column represents a feedback document, and each value is the number of times the word occurs in the document;
Step (5.2): obtain external information
Using crawler technology, Q is submitted to multiple external search engines, and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. For the n-th external feedback set, the top m results are taken as positive feedback to build the term-document matrix P_n, and the results ranked 2m to 3m are taken as negative feedback to build the term-document matrix N_n, where m and n are natural numbers with values from 1 to positive infinity;
Step (5.3): cluster the external feedback information
The sparse matrices L, P_n, and N_n are each decomposed into the product of two dense matrices, as shown in formula (1), where the decomposed matrices U, A_n, and B_n represent the distribution of the feedback results and the matrix U represents the user's query intent. The distribution of the original feedback L is expected to be as similar as possible to the distribution of the positive feedback P_n and as far as possible from the distribution of the negative feedback N_n. In addition, the same cluster-center matrix H is used to constrain every decomposition, which guarantees the stability and validity of the decomposition. The final sparse-learning optimization objective of the multi-source query expansion model is therefore formula (1), where α, β, and γ are parameters that adjust the strength of the regularization constraints.
Applying the KKT (Karush-Kuhn-Tucker) conditions to this optimization objective while keeping the matrices non-negative gives the following iterative update rules; in formulas (2) to (5), i and j denote the i-th row and j-th column of a matrix, respectively.
The optimization function (1) is solved with the general solution method for NMF (non-negative matrix factorization). To guarantee that every term in the formulas stays positive during decomposition, the Karush-Kuhn-Tucker (KKT) conditions are used to obtain the iterative formulas (2) to (5). Here the KKT conditions refer to a necessary and sufficient condition for a nonlinear programming problem to have an optimal solution, provided that certain regularity conditions are satisfied.
Step (5.4): select the class-center vector and find the expansion words
Let U be the matrix of i rows and j columns built from the user interest feedback. Each column of U represents one user query intent, and the column in which the query words account for the largest share is selected as the final vector of the user's true query intent. The values in this vector indicate how strongly each word is related to the user's query intent. The words are sorted by value in descending order and the top k are selected as query expansion words;
Step (6): retrieve again
The original query words and the expansion words together form the new query terms, which are submitted to the local search engine; the returned results are the final search results.
Preferably, each tweet consists of two parts: a number and a text body.
Preferably, the user interest file consists of three parts: a number, the query words, and an interest measure; the user's query words are parsed from it and used as the user interest words.
Preferably, the tweet preprocessing of step (3) includes:
Step (3.1): filter out non-English tweets and tweets shorter than two words.
Step (3.2): remove punctuation marks, digits, and URLs from the tweets, and convert all letters to lowercase;
Step (3.3): tokenize the tweets by splitting on whitespace and remove stop words.
The query expansion method of the invention, based on multi-source positive and negative external feedback information, reduces expansion risk by introducing regularization constraints while external query information is merged. It can construct a new query quickly and effectively, so that the search results better meet the user's needs. Compared with traditional feedback retrieval methods, the technical solution of the invention markedly improves performance.
Description of the drawings
Fig. 1 is a schematic diagram of the query expansion method of the invention;
Fig. 2 is a histogram comparing the performance of different query expansion methods;
Fig. 3 is a histogram comparing performance under different parameters.
Specific embodiment
As shown in Fig. 1, an embodiment of the present invention provides a query expansion method based on multi-source positive and negative external feedback information, comprising the following steps:
Step (1): obtain tweets
The tweets are obtained; each tweet consists of two parts: a number and a text body.
Step (2): obtain the user interest words
The user interest file consists of three parts: a number, the query words, and an interest measure. The user's query words are parsed from it and used as the user interest words.
Step (3): preprocess the tweets
Step (3.1): filter out non-English tweets and tweets shorter than two words.
Step (3.2): remove punctuation marks, digits, and URLs from the tweets, and convert all letters to lowercase;
Step (3.3): tokenize the tweets by splitting on whitespace and remove stop words. Different morphological forms of an English word are treated as different words; for example, "organ" and "organs" are treated as two different words.
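Steps (3.1) to (3.3) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the text does not say how non-English tweets are detected or which stop-word list is used, so the ASCII-only language check and the short `STOP_WORDS` set below are assumptions, and the names `preprocess` and `looks_english` are hypothetical.

```python
import re
import string

# Illustrative stop-word list (assumption); the patent does not name its list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "for", "on", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every punctuation mark and digit to a space.
DROP_TABLE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})


def looks_english(text):
    """Crude language check (assumption): the tweet contains only ASCII characters."""
    return text.isascii()


def preprocess(text):
    """Return the token list for one tweet, or None if the tweet is filtered out."""
    # Step (3.1): drop non-English tweets and tweets shorter than two words.
    if not looks_english(text) or len(text.split()) < 2:
        return None
    # Step (3.2): remove URLs first, then punctuation and digits, then lowercase.
    text = URL_RE.sub(" ", text)
    text = text.translate(DROP_TABLE).lower()
    # Step (3.3): whitespace tokenization and stop-word removal. There is no
    # stemming, so "organ" and "organs" remain distinct words.
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

The ASCII check would also discard English tweets that contain emoji or accented characters, so a real system would likely replace it with a proper language identifier.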
Step (4): construct the local search engine
The local search engine is constructed with Lucene, using the preprocessed tweet content as the indexed content, the tweet id as the index key, and BM25 as the similarity model.
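The embodiment relies on Lucene for indexing and ranking. Purely to illustrate what a BM25 ranking over the preprocessed tweets computes, the following Python sketch scores an in-memory collection. It does not use or imitate the Lucene API; the function name `bm25_search`, the dictionary input format, and the parameter values `k1=1.2` and `b=0.75` are assumptions, since the text gives no parameter settings.

```python
import math
from collections import Counter


def bm25_search(query_tokens, docs, k1=1.2, b=0.75, top_n=100):
    """Rank docs (dict: tweet id -> token list) against query_tokens with BM25.

    Returns up to top_n (tweet id, score) pairs, best first.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: in how many tweets each term occurs.
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer tweets need more matches to score as high.
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        if score > 0.0:
            scores[doc_id] = score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

The default `top_n=100` mirrors the 100 feedback results that step (5.1) takes from the local search engine.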
Step (5): query expansion
Step (5.1): obtain the first query feedback using the user interest words as the query
The user's query interest words are denoted Q. Q is submitted to the local search engine and the top 100 results are taken as the first query feedback, from which the term-document matrix L is built. Each row of L represents a word, each column represents a feedback document, and each value is the number of times the word occurs in the document.
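A term-document matrix of this kind can be built as in the sketch below, which assumes the feedback documents are already tokenized. The function name `term_document_matrix` and the optional shared `vocab` argument are assumptions added for illustration; sharing one vocabulary is simply one way to keep the rows of several such matrices aligned.

```python
import numpy as np


def term_document_matrix(docs, vocab=None):
    """Build a term-document count matrix.

    docs  : list of token lists, one per feedback document.
    vocab : optional fixed word list; reusing it keeps rows aligned across matrices.
    Returns (matrix, vocab), where matrix[i, j] counts vocab[i] in docs[j].
    """
    if vocab is None:
        vocab = sorted({tok for doc in docs for tok in doc})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc:
            if tok in row:  # words outside a fixed vocabulary are ignored
                matrix[row[tok], j] += 1
    return matrix, vocab
```

For two documents `["a", "b", "a"]` and `["b", "c"]`, the rows are ordered a, b, c and the first column holds the counts 2, 1, 0.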
Step (5.2): obtain external information
Using crawler technology, Q is submitted to multiple external search engines, and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. For the n-th external feedback set, the top m results are taken as positive feedback to build the term-document matrix P_n, and the results ranked 2m to 3m are taken as negative feedback to build the term-document matrix N_n, where m and n are natural numbers with values from 1 to positive infinity;
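The split of one engine's ranked results into positive and negative feedback might look like the following. The text does not say whether the range from 2m to 3m includes rank 2m itself; this sketch assumes the negative set is the m results that follow the first 2m, so that both sets have the same size. The function name `split_feedback` is hypothetical.

```python
def split_feedback(ranked_docs, m):
    """Split one engine's ranked results into positive and negative feedback.

    ranked_docs : results in rank order, best first.
    m           : size of the positive set.
    """
    # Top m results: assumed to be relevant (positive feedback).
    positive = ranked_docs[:m]
    # Results after the first 2m, up to 3m: assumed to be off-topic (negative
    # feedback). Reading "2m to 3m" as this half-open range is an assumption.
    negative = ranked_docs[2 * m:3 * m]
    return positive, negative
```

The gap between rank m and rank 2m is left unused, which keeps borderline results out of both sets.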
Step (5.3): cluster the feedback
The sparse matrices L, P_n, and N_n are each decomposed into the product of two dense matrices, as shown in formula (1), where the decomposed matrices U, A_n, and B_n represent the distribution of the feedback results and the matrix U represents the user's query intent. We want the distribution of the original feedback L to be as similar as possible to the distribution of the positive feedback P_n and as far as possible from the distribution of the negative feedback N_n. In addition, the same cluster-center matrix H is used to constrain every decomposition, which guarantees the stability and validity of the decomposition. The final sparse-learning optimization objective of the multi-source query expansion model is therefore formula (1), where α, β, and γ are parameters that adjust the strength of the regularization constraints.
For this optimization objective we apply the KKT (Karush-Kuhn-Tucker) conditions and, while keeping the matrices non-negative, obtain the following iterative update rules; in formulas (2) to (5), i and j denote the i-th row and j-th column of a matrix, respectively.
We solve the optimization function (1) with the general solution method for NMF (non-negative matrix factorization). To guarantee that every term in the formulas stays positive during decomposition, we use the Karush-Kuhn-Tucker (KKT) conditions to obtain the iterative formulas (2) to (5). Here the KKT conditions refer to a necessary and sufficient condition for a nonlinear programming problem to have an optimal solution, provided that certain regularity conditions are satisfied.
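Formulas (1) to (5) are not reproduced in this text, so the patent's regularized objective cannot be shown here. As a point of reference only, the sketch below implements plain NMF with the standard multiplicative updates for the squared-error objective, which is the general technique the description builds on. It omits the positive and negative feedback terms and the α, β, γ regularizers entirely, and the names `W` and `C` are placeholders that are not mapped onto the patent's U, A_n, B_n, or H.

```python
import numpy as np


def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (words x docs) as W @ C with W, C >= 0.

    Plain NMF minimizing the squared reconstruction error; no feedback or
    regularization terms are included (unlike the patent's formula (1)).
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = V.shape
    W = rng.random((n_words, k)) + eps
    C = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: each factor is scaled by a ratio of
        # non-negative quantities, so non-negativity is preserved.
        C *= (W.T @ V) / (W.T @ W @ C + eps)
        W *= (V @ C.T) / (W @ C @ C.T + eps)
    return W, C
```

Multiplicative rules of this form follow from setting the KKT stationarity and complementary-slackness conditions of the non-negativity-constrained problem, which is presumably the route by which the description arrives at its own update formulas.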
Step (5.4): select the class-center vector and find the expansion words
Because U is the matrix of i rows and j columns built from the user interest feedback, each column of U can be taken to represent one user query intent. The column in which the query words account for the largest share is selected as the final vector of the user's true query intent. The values in this vector indicate how strongly each word is related to the user's query intent. The words are sorted by value in descending order and the top 3 are selected as query expansion words.
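The selection in step (5.4) can be sketched as follows, assuming the rows of U are aligned with a known vocabulary list. How the "share" of the query words in a column is measured is not specified, so the sketch uses the fraction of the column's total weight that falls on the query words. Skipping the original query words when picking expansion words is also an assumption, and the name `select_expansion_words` is hypothetical.

```python
import numpy as np


def select_expansion_words(U, vocab, query_words, k=3):
    """Pick k expansion words from the column of U dominated by the query words.

    U           : non-negative matrix, one row per word in vocab, one column per intent.
    vocab       : list of words aligned with the rows of U.
    query_words : the original query words.
    """
    query = set(query_words)
    query_rows = np.array([i for i, w in enumerate(vocab) if w in query], dtype=int)
    # Fraction of each column's total weight carried by the query words.
    share = U[query_rows, :].sum(axis=0) / (U.sum(axis=0) + 1e-12)
    intent = U[:, int(np.argmax(share))]
    # Sort words by weight, largest first, and skip the query words themselves
    # (skipping them is an assumption; the text does not say).
    order = np.argsort(-intent)
    return [vocab[i] for i in order if vocab[i] not in query][:k]
```

The default `k=3` matches the three expansion words used in this embodiment.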
Step (6): retrieve again
The original query words and the expansion words together form the new query terms, which are submitted to the local search engine; the returned data are the final search results. Fig. 2 shows the tweet evaluation metrics NDCG, MAP, and F-measure; as can be seen from the figure, this method performs considerably better than the other methods. Fig. 3 compares the performance of the algorithm for different numbers of expansion sources and different parameters. Performance is best when two expansion sources are selected, because too many expansion sources introduce more noise, which makes it harder to approximate the user's true interest.
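For readers unfamiliar with NDCG, one of the metrics reported in Fig. 2, the sketch below shows a common formulation. The text does not state which NDCG variant or cutoff was used in the experiments, so the logarithmic discount, the cutoff parameter `k`, and the choice to compute the ideal ranking from the retrieved list alone are all assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a list of relevance grades in rank order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(relevances, k):
    """NDCG at cutoff k; the ideal ordering is taken from the same list (simplification)."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A ranking that already lists its relevant results first scores 1.0, and a ranking with no relevant results scores 0.0.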
Claims (3)
1. A query expansion method based on multi-source positive and negative external feedback information, characterized by comprising the following steps:
Step (1): obtain tweets
Step (2): obtain the user interest words
Step (3): preprocess the tweets
Step (4): construct the local search engine
The Apache open-source retrieval framework Lucene is used as the main program of the local search engine. The preprocessed tweets are the indexed content and the tweet id is the index target, and the local search engine is constructed from them;
Step (5): expand the query, comprising the following steps:
Step (5.1): obtain the first query feedback using the user interest words
The user's query interest words are denoted Q. Q is submitted to the local search engine and the top 100 results are taken as the first query feedback, from which the term-document matrix L is built. Each row of L represents a word, each column represents a feedback document, and each value is the number of times the word occurs in the document;
Step (5.2): obtain external information
Using crawler technology, Q is submitted to multiple external search engines, and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. For the n-th external feedback set, the top m results are taken as positive feedback to build the term-document matrix P_n, and the results ranked 2m to 3m are taken as negative feedback to build the term-document matrix N_n, where m and n are natural numbers with values from 1 to positive infinity;
Step (5.3): cluster the external feedback information
The sparse matrices L, P_n, and N_n are each decomposed into the product of two dense matrices, as shown in formula (1), where the decomposed matrices U, A_n, and B_n represent the distribution of the feedback results and the matrix U represents the user's query intent. The distribution of the original feedback L is expected to be as similar as possible to the distribution of the positive feedback P_n and as far as possible from the distribution of the negative feedback N_n. In addition, the same cluster-center matrix H is used to constrain every decomposition, which guarantees the stability and validity of the decomposition. The final sparse-learning optimization objective of the multi-source query expansion model is therefore formula (1), where α, β, and γ are parameters that adjust the strength of the regularization constraints;
Applying the KKT (Karush-Kuhn-Tucker) conditions to this optimization objective while keeping the matrices non-negative gives the following iterative update rules; in formulas (2) to (5), i and j denote the i-th row and j-th column of a matrix, respectively,
The optimization function (1) is solved with the general solution method for NMF (non-negative matrix factorization). To guarantee that every term in the formulas stays positive during decomposition, the Karush-Kuhn-Tucker (KKT) conditions are used to obtain the iterative formulas (2) to (5), where the KKT conditions refer to a necessary and sufficient condition for a nonlinear programming problem to have an optimal solution, provided that certain regularity conditions are satisfied;
Step (5.4): select the class-center vector and find the expansion words
Let U be the matrix of i rows and j columns built from the user interest feedback. Each column of U represents one user query intent, and the column in which the query words account for the largest share is selected as the final vector of the user's true query intent. The values in this vector indicate how strongly each word is related to the user's query intent. The words are sorted by value in descending order and the top k are selected as query expansion words;
Step (6): retrieve again
The original query words and the expansion words together form the new query terms, which are submitted to the local search engine; the returned results are the final search results.
2. The query expansion method based on multi-source positive and negative external feedback information according to claim 1, characterized in that each tweet consists of two parts: a number and a text body.
3. The query expansion method based on multi-source positive and negative external feedback information according to claim 1, characterized in that the tweet preprocessing of step (3) includes:
Step (3.1): filter out non-English tweets and tweets shorter than two words;
Step (3.2): remove punctuation marks, digits, and URLs from the tweets, and convert all letters to lowercase;
Step (3.3): tokenize the tweets by splitting on whitespace and remove stop words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710142050.7A CN106960021B (en) | 2017-03-10 | 2017-03-10 | A query expansion method based on multi-source positive and negative external feedback information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710142050.7A CN106960021B (en) | 2017-03-10 | 2017-03-10 | A query expansion method based on multi-source positive and negative external feedback information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106960021A CN106960021A (en) | 2017-07-18 |
CN106960021B true CN106960021B (en) | 2019-06-21 |
Family
ID=59470088
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710142050.7A Active CN106960021B (en) | 2017-03-10 | 2017-03-10 | A kind of enquiry expanding method based on the positive and negative external feedback information of multi-source |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106960021B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110888946A (en) * | 2019-12-05 | 2020-03-17 | 电子科技大学广东电子信息工程研究院 | Entity linking method based on knowledge-driven query |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156272A (en) * | 2016-06-21 | 2016-11-23 | 北京工业大学 | A kind of information retrieval method based on multi-source semantic analysis |
CN106294688A (en) * | 2016-08-05 | 2017-01-04 | 浪潮软件集团有限公司 | Query expansion method, device and system based on user characteristic analysis |
-
2017
- 2017-03-10 CN CN201710142050.7A patent/CN106960021B/en active Active
Non-Patent Citations (3)
Title |
---|
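BJUT at TREC 2015 Microblog Track: Real-Time Filtering Using Non-negative Matrix Factorization; Li Chaoyang, et al.; BJUT at TREC 2015 Microblog Track; 2015-12-31; pp. 1-3
A constrained non-negative matrix factorization method; Huang Gangshi, et al.; Journal of Southeast University (Natural Science Edition); 2004-03-31; Vol. 34, No. 2; pp. 189-193
Information recommendation based on a user implicit interest model; Yang Zhen, et al.; Journal of Shandong University (Natural Science); 2017-01-31; Vol. 52, No. 1; pp. 15-22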
Also Published As
Publication number | Publication date |
---|---|
CN106960021A (en) | 2017-07-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xu et al. | CN-DBpedia: A never-ending Chinese knowledge extraction system | |
CN107066553B (en) | Short text classification method based on convolutional neural network and random forest | |
CN103390051B (en) | A kind of topic detection and tracking method based on microblog data | |
CN102929873B (en) | Method and device for extracting searching value terms based on context search | |
Nabli et al. | Efficient cloud service discovery approach based on LDA topic modeling | |
Wu et al. | Structured microblog sentiment classification via social context regularization | |
CN107885749B (en) | Ontology semantic expansion and collaborative filtering weighted fusion process knowledge retrieval method | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
El-Fishawy et al. | Arabic summarization in twitter social network | |
CN105447080B (en) | A kind of inquiry complementing method in community's question and answer search | |
CN102708100A (en) | Method and device for digging relation keyword of relevant entity word and application thereof | |
CN104281565A (en) | Semantic dictionary constructing method and device | |
Zhao et al. | Keyword extraction for social media short text | |
Liu et al. | Sentiment classification of micro‐blog comments based on Randomforest algorithm | |
Marujo et al. | Hourly traffic prediction of news stories | |
Jalil et al. | Comparative study of clustering algorithms in text mining context | |
Wohlgenannt | Leveraging and balancing heterogeneous sources of evidence in ontology learning | |
CN106960021B (en) | A query expansion method based on multi-source positive and negative external feedback information | |
CN109858035A (en) | A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing | |
Tang et al. | Labeled phrase latent Dirichlet allocation | |
Zhang et al. | A deep recommendation framework for completely new users in mashup creation | |
Poibeau et al. | Generating navigable semantic maps from social sciences corpora | |
El Abdouli et al. | A distributed approach for mining moroccan hashtags using Twitter platform | |
CN104331472A (en) | Construction method and device of word segmentation training data | |
US20230146292A1 (en) | Multi-task machine learning with heterogeneous data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||