CN106960021B - A query expansion method based on multi-source positive and negative external feedback information

Info

Publication number
CN106960021B
Authority
CN
China
Prior art keywords
feedback
matrix
query
word
tweets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710142050.7A
Other languages
Chinese (zh)
Other versions
CN106960021A (en)
Inventor
杨震
李超阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201710142050.7A
Publication of CN106960021A
Application granted
Publication of CN106960021B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a query expansion method based on multi-source positive and negative external feedback information. Regularization constraints are introduced while merging external query information in order to reduce the risk of expansion, so that a new query can be constructed quickly and effectively and the search results better meet the user's needs. Compared with traditional feedback retrieval methods, the technical solution of the invention achieves a marked improvement in performance.

Description

A query expansion method based on multi-source positive and negative external feedback information
Technical field
The invention belongs to the field of text extraction, and in particular relates to a query expansion method based on multi-source positive and negative external feedback information.
Background art
The emergence of social media (such as Twitter, Facebook and Google+) has profoundly changed the way people produce and consume information. The biggest difference from mainstream news websites (such as CNN or nytimes) is that people in a social network are both consumers and producers of information. Because the information in social networks comes from many sources and is disorganized, it is harder for users to obtain the information they need.
Conventional query expansion methods fall into two categories according to the expansion source: 1) local query expansion methods, which use the local retrieval document set as the expansion source; and 2) global expansion methods, which use an external knowledge base as the expansion source. The first category mostly uses text clustering, latent semantic indexing (LSI), similarity thesauri and similar techniques to expand the query, but because a local corpus is relatively fixed in composition and small in scale, it cannot reflect the user's true query intent well. The second category commonly uses public data resources such as WordNet and Wikipedia as external expansion sources. It can describe the user's query in more detail, but introducing external information inevitably brings in ambiguous or erroneous information, which increases the risk of query expansion.
Summary of the invention
The technical problem to be solved by the invention is to provide a query expansion method based on multi-source positive and negative external feedback information that reduces the risk of expansion by introducing regularization constraints while merging external query information, and that can quickly and effectively construct a new query so that the search results better meet the user's needs.
To achieve the above object, the invention adopts the following technical solution:
A query expansion method based on multi-source positive and negative external feedback information, comprising the following steps:
Step (1): obtain tweets
Step (2): obtain user interest terms
Step (3): preprocess the tweets
Step (4): build a local search engine
The Apache open-source retrieval framework Lucene is used as the main program of the local search engine. The preprocessed tweets are the index content and the tweet id is the index target;
Step (5): expand the query, comprising the following steps:
Step (5.1): obtain the first query feedback using the user interest terms
Let Q denote the user's query interest terms. Q is submitted to the local search engine and the top 100 results are taken as the first query feedback, from which a term-document matrix L is built. Each row of L represents a term, each column represents a feedback document, and each entry is the number of times the term occurs in the document;
Step (5.2): obtain external information
Using web crawling, Q is submitted to multiple external search engines and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. For the n-th external feedback source, the top m results are taken as positive feedback to build the term-document matrix P_n, and the results ranked from 2m to 3m are taken as negative feedback to build the term-document matrix N_n, where m and n are positive integers;
Step (5.3): cluster the external feedback information
The sparse matrices L, P_n and N_n are each decomposed into the product of two dense matrices, as shown in formula (1). The factor matrices U, A_n and B_n represent the distribution of the feedback results, and U represents the user's query intent. The distribution of the original feedback L should be as similar as possible to that of the positive feedback P_n and as far as possible from that of the negative feedback N_n. In addition, the same cluster-center matrix H constrains all of the decompositions, which keeps the decomposition stable and effective. The final sparse-learning optimization objective for multi-source query expansion is therefore formula (1), where α, β and γ are parameters that adjust the strength of the regularization terms.
Applying the Karush-Kuhn-Tucker (KKT) conditions to this optimization objective, with all matrices constrained to be non-negative, gives the iterative update rules in formulas (2)-(5), where i and j denote the i-th row and j-th column of a matrix.
Optimization function (1) is solved with the general solution method for non-negative matrix factorization (NMF). To ensure that every term stays non-negative during the decomposition, the KKT conditions are used to derive the iterative formulas (2)-(5). Under certain regularity conditions, the KKT conditions are necessary (and, for convex problems, also sufficient) for a solution of a nonlinear programming problem to be optimal.
Step (5.4): select the class-center vector and find the expansion terms
U is an i-row by j-column matrix built from the user interest feedback, and each column of U represents one user query intent. The column in which the query terms account for the largest share is selected as the final vector of the user's true query intent. The values in this vector express the relationship between each term and the user's query intent. The terms are sorted by value in descending order and the top k are selected as query expansion terms;
Step (6): retrieve again
The original query terms and the expansion terms together form the new query, which is submitted to the local search engine. The returned results are the final search results.
Preferably, each tweet consists of two parts: a number and a text.
Preferably, the user interest file consists of three parts: a number, the query terms and an interest measure; the user's query terms are parsed from it and used as the user interest terms.
Preferably, the tweet preprocessing of step (3) comprises:
Step (3.1): filter out non-English tweets and tweets shorter than two words.
Step (3.2): remove punctuation marks, digits and URLs from the tweets, and convert all letters to lowercase;
Step (3.3): tokenize the tweets on whitespace and remove stop words.
The query expansion method of the invention, based on multi-source positive and negative external feedback information, reduces the risk of expansion by introducing regularization constraints while merging external query information. It can quickly and effectively construct a new query so that the search results better meet the user's needs. Compared with traditional feedback retrieval methods, the technical solution of the invention achieves a marked improvement in performance.
Brief description of the drawings
Fig. 1 is a schematic diagram of the query expansion method of the invention;
Fig. 2 is a bar chart comparing the performance of different query expansion methods;
Fig. 3 is a bar chart comparing performance under different parameters.
Detailed description of the embodiments
As shown in Fig. 1, an embodiment of the invention provides a query expansion method based on multi-source positive and negative external feedback information, comprising the following steps:
Step (1): obtain tweets
Tweets are obtained; each tweet consists of two parts, a number and a text.
Step (2): obtain user interest terms
The user interest file consists of three parts: a number, the query terms and an interest measure. The user's query terms are parsed from it and used as the user interest terms.
Step (3): preprocess the tweets
Step (3.1): filter out non-English tweets and tweets shorter than two words.
Step (3.2): remove punctuation marks, digits and URLs from the tweets, and convert all letters to lowercase;
Step (3.3): tokenize the tweets on whitespace and remove stop words. Different inflected forms of an English word are treated as different words; for example, "organ" and "organs" are treated as two different words.
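The preprocessing of steps (3.1)-(3.3) can be sketched roughly as follows. This is only an illustration under stated assumptions: the patent does not say how non-English tweets are detected, so the `looks_english` heuristic, the tiny `STOP_WORDS` set and all function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def looks_english(text, threshold=0.9):
    """Crude language check (assumption): most alphabetic characters are ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= threshold


def preprocess(text):
    """Return the token list for one tweet, or None if the tweet is filtered out."""
    # Step (3.1): drop non-English tweets and tweets shorter than two words.
    if not looks_english(text) or len(text.split()) < 2:
        return None
    # Step (3.2): strip URLs, lowercase, then remove punctuation and digits.
    text = URL_RE.sub(" ", text).lower()
    text = NON_LETTER_RE.sub(" ", text)
    # Step (3.3): whitespace tokenization and stop-word removal, with no stemming,
    # so "organ" and "organs" remain distinct tokens.
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

URLs are removed before punctuation so that a link does not leave fragments such as "http" or "co" behind as tokens.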
Step (4): build a local search engine
Lucene is used to build the local search engine, with the preprocessed tweet content as the index content, the tweet id as the index target, and the BM25 similarity model for ranking.
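To make the ranking step concrete, the following is a simplified stand-alone sketch of BM25 scoring. It is not Lucene code and does not use Lucene's API; the function name `bm25_rank`, the in-memory `docs` dictionary and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, docs, k1=1.2, b=0.75, top_n=100):
    """Rank tweets by a common BM25 variant.

    docs maps tweet id -> list of tokens (must be non-empty).
    Returns at most top_n tweet ids, best match first.
    """
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: in how many tweets each term occurs.
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        if score > 0.0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

With `top_n=100` the function returns the same number of results that step (5.1) takes as the first query feedback.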
Step (5): query expansion
Step (5.1): obtain the first query feedback, using the user interest terms as the query
Let Q denote the user's query interest terms. Q is submitted to the local search engine and the top 100 results are taken as the first query feedback, from which a term-document matrix L is built. Each row of L represents a term, each column represents a feedback document, and each entry is the number of times the term occurs in the document.
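A term-document matrix of this kind can be built as in the minimal sketch below. The function name and the use of nested Python lists are illustrative choices, not part of the patent; a real system would likely use a sparse matrix type.

```python
def term_document_matrix(docs):
    """Build a term-document count matrix.

    docs is a list of token lists, one per feedback document.
    Returns (matrix, vocab): matrix[i][j] is the number of times
    vocab[i] occurs in docs[j].
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    row_of = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for tok in doc:
            matrix[row_of[tok]][j] += 1
    return matrix, vocab
```

For two feedback documents `["organ", "donor"]` and `["organ", "organ", "music"]`, the vocabulary is `["donor", "music", "organ"]` and the row for "organ" is `[1, 2]`.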
Step (5.2): obtain external information
Using web crawling, Q is submitted to multiple external search engines and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. For the n-th external feedback source, the top m results are taken as positive feedback to build the term-document matrix P_n, and the results ranked from 2m to 3m are taken as negative feedback to build the term-document matrix N_n, where m and n are positive integers;
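The split into positive and negative feedback can be illustrated as below. The exact boundaries of "2m to 3m" are not spelled out in the text, so the slice used here (ranks 2m+1 through 3m) is one assumed reading, and the function name is hypothetical.

```python
def split_feedback(ranked_docs, m):
    """Split one engine's ranked results into positive and negative feedback."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    positive = ranked_docs[:m]            # ranks 1..m
    negative = ranked_docs[2 * m:3 * m]   # ranks 2m+1..3m (assumed reading)
    return positive, negative
```

Leaving a gap of m results between the two slices keeps borderline documents out of both sets, which appears to be the intent of skipping ranks m to 2m.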
Step (5.3): cluster the feedback information
The sparse matrices L, P_n and N_n are each decomposed into the product of two dense matrices, as shown in formula (1). The factor matrices U, A_n and B_n represent the distribution of the feedback results, and U represents the user's query intent. We want the distribution of the original feedback L to be as similar as possible to that of the positive feedback P_n and as far as possible from that of the negative feedback N_n. In addition, the same cluster-center matrix H constrains all of the decompositions, which keeps the decomposition stable and effective. The final sparse-learning optimization objective for multi-source query expansion is therefore formula (1), where α, β and γ are parameters that adjust the strength of the regularization terms.
For this optimization objective we apply the Karush-Kuhn-Tucker (KKT) conditions, with all matrices constrained to be non-negative, and obtain the iterative update rules in formulas (2)-(5), where i and j denote the i-th row and j-th column of a matrix.
We solve optimization function (1) with the general solution method for non-negative matrix factorization (NMF). To ensure that every term stays non-negative during the decomposition, we use the KKT conditions to derive the iterative formulas (2)-(5). Under certain regularity conditions, the KKT conditions are necessary (and, for convex problems, also sufficient) for a solution of a nonlinear programming problem to be optimal.
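Formulas (1)-(5) are not reproduced in this text, so the sketch below does not implement the patent's regularized objective. It only shows the standard multiplicative update rules for plain NMF with a squared Frobenius error, which is the kind of non-negativity-preserving iteration that the KKT conditions yield. The shared matrix H, the positive and negative feedback terms and the α, β, γ regularizers are all omitted, and every name here is an assumption.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, rank, iters=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix v into w (rows x rank) and h (rank x cols)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iters):
        # h <- h * (w^T v) / (w^T w h); all factors are non-negative, so h stays non-negative.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of v - w h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra)) ** 0.5
```

Because each update multiplies the current entry by a ratio of non-negative quantities, no projection step is needed to keep the factors non-negative, which is why this style of update is attractive for term-count data.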
Step (5.4): select the class-center vector and find the expansion terms
U is an i-row by j-column matrix built from the user interest feedback, so each column of U can be taken to represent one user query intent. The column in which the query terms account for the largest share is selected as the final vector of the user's true query intent. The values in this vector express the relationship between each term and the user's query intent. The terms are sorted by value in descending order and the top 3 are selected as query expansion terms.
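The selection in step (5.4) might look like the sketch below. It assumes that the rows of U are aligned with a vocabulary list and that the "share" of the query terms is their fraction of a column's total weight; both points, the function name, and the choice to skip the original query terms when picking the top k, are assumptions not stated in the text.

```python
def select_expansion_words(u, vocab, query_words, k=3):
    """Pick k expansion terms from the column of u dominated by the query terms.

    u has one row per term (aligned with vocab) and one column per candidate intent.
    """
    query_rows = [i for i, term in enumerate(vocab) if term in query_words]
    n_cols = len(u[0])
    best_col, best_share = 0, -1.0
    for j in range(n_cols):
        total = sum(row[j] for row in u)
        share = sum(u[i][j] for i in query_rows) / total if total > 0 else 0.0
        if share > best_share:
            best_col, best_share = j, share
    # Sort terms by their weight in the chosen intent vector, largest first.
    ranked = sorted(range(len(vocab)), key=lambda i: u[i][best_col], reverse=True)
    # Assumption: original query terms are skipped so that the k picks are new terms.
    return [vocab[i] for i in ranked if vocab[i] not in query_words][:k]
```

The new query of step (6) would then be the original query terms followed by the returned expansion terms.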
Step (6): retrieve again
The original query terms and the expansion terms together form the new query, which is submitted to the local search engine. The returned data are the final search results. Fig. 2 gives the tweet evaluation metrics NDCG, MAP and F-measure; as can be seen from the figure, this method performs considerably better than the other methods. Fig. 3 compares the performance of the algorithm for different numbers of expansion sources and different parameters. Performance is best when two expansion sources are selected, because too many expansion sources bring in more noise, which makes it harder to approximate the user's true interest.

Claims (3)

1. A query expansion method based on multi-source positive and negative external feedback information, characterized by comprising the following steps:
Step (1): obtain tweets
Step (2): obtain user interest terms
Step (3): preprocess the tweets
Step (4): build a local search engine
The Apache open-source retrieval framework Lucene is used as the main program of the local search engine. The preprocessed tweets are the index content and the tweet id is the index target;
Step (5): expand the query, comprising the following steps:
Step (5.1): obtain the first query feedback using the user interest terms
Let Q denote the user's query interest terms. Q is submitted to the local search engine and the top 100 results are taken as the first query feedback, from which a term-document matrix L is built. Each row of L represents a term, each column represents a feedback document, and each entry is the number of times the term occurs in the document;
Step (5.2): obtain external information
Using web crawling, Q is submitted to multiple external search engines and the top 100 results from each are taken as external feedback information. The document sets obtained from the search engines are denoted E_1, E_2, E_3, …, E_n. For the n-th external feedback source, the top m results are taken as positive feedback to build the term-document matrix P_n, and the results ranked from 2m to 3m are taken as negative feedback to build the term-document matrix N_n, where m and n are positive integers;
Step (5.3): cluster the external feedback information
The sparse matrices L, P_n and N_n are each decomposed into the product of two dense matrices, as shown in formula (1). The factor matrices U, A_n and B_n represent the distribution of the feedback results, and U represents the user's query intent. The distribution of the original feedback L should be as similar as possible to that of the positive feedback P_n and as far as possible from that of the negative feedback N_n. In addition, the same cluster-center matrix H constrains all of the decompositions, which keeps the decomposition stable and effective. The final sparse-learning optimization objective for multi-source query expansion is therefore formula (1), where α, β and γ are parameters that adjust the strength of the regularization terms;
Applying the Karush-Kuhn-Tucker (KKT) conditions to this optimization objective, with all matrices constrained to be non-negative, gives the iterative update rules in formulas (2)-(5), where i and j denote the i-th row and j-th column of a matrix,
Optimization function (1) is solved with the general solution method for non-negative matrix factorization (NMF). To ensure that every term stays non-negative during the decomposition, the KKT conditions are used to derive the iterative formulas (2)-(5). Under certain regularity conditions, the KKT conditions are necessary (and, for convex problems, also sufficient) for a solution of a nonlinear programming problem to be optimal;
Step (5.4): select the class-center vector and find the expansion terms
U is an i-row by j-column matrix built from the user interest feedback, and each column of U represents one user query intent. The column in which the query terms account for the largest share is selected as the final vector of the user's true query intent. The values in this vector express the relationship between each term and the user's query intent. The terms are sorted by value in descending order and the top k are selected as query expansion terms;
Step (6): retrieve again
The original query terms and the expansion terms together form the new query, which is submitted to the local search engine. The returned results are the final search results.
2. The query expansion method based on multi-source positive and negative external feedback information according to claim 1, characterized in that each tweet consists of two parts, a number and a text.
3. The query expansion method based on multi-source positive and negative external feedback information according to claim 1, characterized in that the tweet preprocessing of step (3) comprises:
Step (3.1): filter out non-English tweets and tweets shorter than two words;
Step (3.2): remove punctuation marks, digits and URLs from the tweets, and convert all letters to lowercase;
Step (3.3): tokenize the tweets on whitespace and remove stop words.
CN201710142050.7A 2017-03-10 2017-03-10 A query expansion method based on multi-source positive and negative external feedback information Active CN106960021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710142050.7A CN106960021B (en) 2017-03-10 2017-03-10 A query expansion method based on multi-source positive and negative external feedback information

Publications (2)

Publication Number Publication Date
CN106960021A CN106960021A (en) 2017-07-18
CN106960021B true CN106960021B (en) 2019-06-21

Family

ID=59470088

Country Status (1)

Country Link
CN (1) CN106960021B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant