CN106294662A - Query representation and hybrid retrieval model building method based on context-aware topics - Google Patents

Info

Publication number
CN106294662A
Authority
CN
China
Prior art keywords
context
query
topic
aware
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610634174.2A
Other languages
Chinese (zh)
Inventor
贺樑
陈琴
胡琴敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201610634174.2A priority Critical patent/CN106294662A/en
Publication of CN106294662A publication Critical patent/CN106294662A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for building a query representation and hybrid retrieval model based on context-aware topics, comprising the following steps. Step one: based on the keyword set of the query, obtain the pseudo-relevance feedback documents of the query, and select from them the context relevant to the query. Step two: introduce a context-aware topic model, incorporate the context into it, mine the topic information implied by the context windows on the basis of the corpus topics, and obtain the corresponding topic vector. Step three: represent the query jointly by the topic vector and the keyword set, and build a hybrid retrieval model based on the topic vector and the keyword set to obtain the final retrieval score.

Description

Query representation and hybrid retrieval model building method based on context-aware topics
Technical Field
The invention relates to the technical field of internet information retrieval, and in particular to a method for building a query representation and a hybrid retrieval model based on a context-aware topic model.
Background
Query representation has long been at the core of information retrieval. The most common problem is that user queries are too short (containing only a few keywords), so relevant documents often fail to match the query during retrieval. For example, for the user query "water shortage", a document that contains query-related words such as "drought" may be highly relevant, yet its final matching score will be low because it does not contain the original query keywords "water shortage", which in turn hurts retrieval accuracy.
A common solution is query expansion based on pseudo-relevance feedback. Starting from the initial retrieval results, this method assumes that the top K ranked documents (called "pseudo-relevance feedback documents") are relevant to the original query, and extracts keywords from them with a suitable algorithm to expand the query representation. However, the method is unsupervised and tends to introduce terms that are irrelevant to the query. In theory, a supervised classification method that considers multiple features of the expansion words could be used to pick out the words truly relevant to the query, but this depends on feature engineering and labeled training sets, so its cost in practice is high.
Some recent studies focus on using various kinds of contextual information to mitigate the introduction of irrelevant expansion words into query representations. The sources of contextual information are mainly high-quality external data sources (such as encyclopedias and domain ontologies) and pseudo-relevance feedback documents drawn from the data set itself. The former suits only some queries, and external data sources are mostly slow to update and difficult to obtain, so they are not widely used in practice. The latter also provide a contextual background description of the query and have greater research potential. For example, for the query "water shortage", pseudo-relevance feedback document 1 states: "The UK will face the problem of water shortage in the coming years, so please save water and repair your faucet."; pseudo-relevance feedback document 2 states: "Dry farming: a method to alleviate the problems of drought and water deficit." Both describe countermeasures to the problem of water shortage, and this contextual information can assist query representation. However, existing methods for selecting expansion words consider only the degree of co-occurrence between an expansion word and the original query words within the context windows of the pseudo-relevance feedback documents, and the following problems remain: (1) the words to be used as the final query expansion must be selected explicitly, and without supervision some irrelevant or even "harmful" words are still introduced. For example, in articles about various environmental resources the keyword "water shortage" appears frequently, but terms such as "hydroelectric power generation" and "natural gas" also appear in those articles, which can cause the original query to drift and reduce retrieval accuracy; (2) the final query representation is still based on a dictionary space and ignores the semantic information implied by the query, such as latent topics; (3) retrieval models built on such query representations mainly consider keyword matching and ignore matching between document and query at the semantic level.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a method for designing a query representation and a hybrid retrieval model based on a context-aware topic model. Context topic information obtained from pseudo-relevance feedback is integrated into the query representation, so that topic matching is added on top of the original keyword-matching retrieval model and the accuracy of the retrieval results is improved.
The invention provides a method for building a query representation and hybrid retrieval model based on context-aware topics, which comprises the following steps:
step one: acquiring the pseudo-relevance feedback documents of the query based on the keyword set of the query, and selecting the context relevant to the query from the pseudo-relevance feedback documents;
step two: introducing a context-aware topic model, incorporating the context into the context-aware topic model, mining the topic information implied by the context windows on the basis of the corpus topics, and obtaining the corresponding topic vector;
step three: representing the query jointly by the topic vector and the keyword set, and building a hybrid retrieval model based on the topic vector and the keyword set to obtain the final retrieval score.
In the method, in step one, each pseudo-relevance feedback document is divided into a plurality of sliding windows, the relevance between each window and the query is calculated, and the windows whose relevance is higher than a threshold are taken as the context windows relevant to the query.
In the method, the threshold for selecting query-relevant context is the average relevance of all windows under the query.
In the method, the context-aware topic model is designed according to the query-related context and the whole corpus; during topic modeling, the model assumes that a context window and the pseudo-relevance feedback document containing it share the same topic distribution, which yields the topic vector of the context.
In the method, the pseudo-relevance feedback documents are obtained by calculating keyword matching scores with a retrieval model.
In the method, the retrieval score is expressed by the following formula:
$$S = (1 - \lambda) \sum_{q_i \in Q} s(q_i, d) + \lambda \cdot s'(Q', d)$$
where $s$ is the keyword matching score of the traditional retrieval model, $s'$ is the topic matching score based on the new query representation $Q'$, and $\lambda$ is the weighting parameter that balances the two scores, that is, the two matching modes.
The beneficial effects of the invention are as follows. The method makes full use of the context information that pseudo-relevance feedback extracts from the corpus itself, which avoids the problem that high-quality external data sources are difficult to obtain. The pseudo-relevance feedback documents are divided into context windows, and only the context segments relevant to the query are selected for query representation; this reduces the introduction of noise and query drift, and is a new measure for controlling the quality of the query representation. The context-aware topic model proposed by the invention mines the topic information corresponding to the query-related context, goes beyond the traditional keyword-level understanding, and helps to understand the user query more comprehensively and deeply. Traditional retrieval models are mainly based on keyword matching and ignore deeper semantic relevance. The hybrid retrieval model designed by the invention considers both keyword matching and topic matching, and this diversified matching helps to improve retrieval effectiveness. The query representation method and the hybrid retrieval model proposed by the invention were shown to be effective on the Microblog Track 2011-2014 data sets: with context topic information incorporated into the query, the final retrieval MAP exceeds that of some recent query representation methods.
Drawings
FIG. 1 is a flowchart of the method for building a query representation and hybrid retrieval model based on context-aware topics according to the present invention.
FIG. 2 is a flowchart of context selection based on pseudo-relevance feedback.
FIG. 3 is the graphical model representation of the context-aware topic model.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the procedures, conditions, and experimental methods for carrying out the invention are common general knowledge in the art, and the invention places no particular limitation on them.
As shown in FIG. 1, the method for building a query representation and hybrid retrieval model based on context-aware topics comprises the following steps:
step one: acquiring the pseudo-relevance feedback documents of the query based on the keyword set of the query, and selecting the context relevant to the query from the pseudo-relevance feedback documents;
step two: introducing a context-aware topic model, incorporating the context into the context-aware topic model, mining the topic information implied by the context windows on the basis of the corpus topics, and obtaining the corresponding topic vector;
step three: representing the query jointly by the topic vector and the keyword set, and building a hybrid retrieval model based on the topic vector and the keyword set to obtain the final retrieval score.
(I) Relevant context selection based on pseudo-relevance feedback
Because pseudo-relevance feedback documents are easy to obtain and contain much content relevant to the query, the context relevant to the query is selected from them and used for query representation. The specific flow is shown in FIG. 2.
First, the pseudo-relevance feedback documents are segmented into context windows of size n. Define $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$ as a query, where $q_i$ is a query keyword and $|Q|$ is the number of keywords in the query. The pseudo-relevance feedback document set corresponding to query $Q$ consists of the documents ranked in the top k in the initial retrieval. Each pseudo-relevance feedback document is divided, in sliding-window fashion as shown in FIG. 2, into several context windows of size n (each containing n words), namely $Q_{c_1}, Q_{c_2}, \ldots, Q_{c_l}$, where $l$ is the number of context windows.
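The segmentation step can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `sliding_windows` and the `step` parameter are hypothetical, since the text only states that windows of n words are produced in sliding-window fashion.

```python
from typing import List


def sliding_windows(tokens: List[str], n: int, step: int = 1) -> List[List[str]]:
    """Split a tokenized document into context windows of n words each.

    `step` (an assumed parameter) is how far the window advances each time.
    """
    if n <= 0 or step <= 0:
        raise ValueError("n and step must be positive")
    if len(tokens) <= n:
        # A document no longer than one window yields a single window.
        return [list(tokens)] if tokens else []
    return [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, step)]


doc = "the uk will face water shortage so please save water".split()
windows = sliding_windows(doc, n=4, step=2)
```

With `step=1` consecutive windows overlap in all but one word; a larger step trades coverage for fewer windows to score.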
Second, the relevance between each context window and the original query is computed. For a query and context window pair $(Q, Q_c)$, the invention combines several measures to compute their relevance $R(Q, Q_c)$, such as pointwise mutual information (PMI) based on word co-occurrence, Jaccard similarity based on word sets, and semantic similarity based on word2vec word vectors, and takes the mean of these measures as the final value.
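A rough sketch of this scoring step is given below. It covers only two of the named measures, Jaccard similarity and a PMI-style co-occurrence score, and omits the word2vec measure because that needs trained embeddings. How PMI is estimated (here from window-level co-occurrence, clipped at zero) is an assumption, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def jaccard(query: Sequence[str], window: Sequence[str]) -> float:
    """Jaccard similarity between the word sets of the query and the window."""
    q, w = set(query), set(window)
    union = q | w
    return len(q & w) / len(union) if union else 0.0


def mean_pmi(query: Sequence[str], window: Sequence[str],
             all_windows: List[Sequence[str]]) -> float:
    """Average positive PMI over (query word, window word) pairs.

    Probabilities are estimated from co-occurrence within `all_windows`;
    identical words and pairs that never co-occur contribute 0.
    """
    sets = [set(w) for w in all_windows]
    total = len(sets)
    if total == 0:
        return 0.0

    def df(*words: str) -> int:
        # Number of windows that contain every given word.
        return sum(1 for s in sets if all(x in s for x in words))

    scores = []
    for q in set(query):
        for t in set(window):
            joint = df(q, t)
            if q == t or joint == 0:
                scores.append(0.0)
                continue
            pmi = math.log((joint * total) / (df(q) * df(t)))
            scores.append(max(pmi, 0.0))
    return sum(scores) / len(scores) if scores else 0.0


def relevance(query: Sequence[str], window: Sequence[str],
              all_windows: List[Sequence[str]]) -> float:
    """R(Q, Q_c) as the mean of the individual measures."""
    measures = [jaccard(query, window), mean_pmi(query, window, all_windows)]
    return sum(measures) / len(measures)
```

The measures live on different scales, which is one reason the relevance values are normalized before thresholding.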
Finally, the contexts relevant to the query are screened out. The relevance values obtained above are first normalized. The threshold is then set to the mean relevance of all windows under the query; context windows whose relevance is below the threshold are filtered out, and the remaining contexts, which are more relevant to the query, are passed on to context-aware topic modeling.
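The screening step might look like the sketch below. Min-max scaling is an assumption, since the text only says the relevance values are normalized, and keeping windows that are exactly at the threshold is likewise a choice made here; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def select_context_windows(
    windows: Sequence[Sequence[str]],
    scores: Sequence[float],
) -> List[Tuple[Sequence[str], float]]:
    """Min-max normalize the scores, then drop windows below the mean score."""
    if len(windows) != len(scores):
        raise ValueError("windows and scores must have the same length")
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If all scores are equal, treat every window as equally relevant.
    normalized = [(s - lo) / span if span > 0 else 1.0 for s in scores]
    threshold = sum(normalized) / len(normalized)  # mean relevance under the query
    return [(w, r) for w, r in zip(windows, normalized) if r >= threshold]


kept = select_context_windows([["a"], ["b"], ["c"], ["d"]], [0.1, 0.5, 0.9, 0.2])
```

Because the threshold is the per-query mean, no global cutoff has to be tuned: each query keeps its own above-average windows.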
(II) Context-aware topic modeling and query representation
Given the query-related context obtained in the first part and the entire corpus, the invention designs a context-aware topic model that incorporates the query-related context information into the topic model to generate a new query representation.
Inspired by related research, and because the context windows selected in the first part and the pseudo-relevance feedback documents containing them are both closely related to the query, they are assumed to share the same topic distribution. Under this assumption, the traditional LDA topic model is extended to obtain the context-aware topic model CAT, shown in FIG. 3. The symbols used in the model are explained in Table 1. CAT is a generative model, and its modeling process is given in Algorithm 1.
Table 1. Description of the symbols used in the context-aware topic model CAT
To solve for the parameters of the model, the invention employs the widely used Gibbs sampling algorithm.
First, according to the Gibbs sampling algorithm, the probability that the i-th word in document d is assigned to topic k is given by formula (1):
where the topic assignment vector covers all words other than the current i-th word; the document-level count is the number of words in document d assigned to topic k (excluding the current word); and the word-level count is the number of times the word $w_i$ is assigned to topic k in the entire corpus (excluding the current word). Where a superscript or subscript is missing in the notation (as in $n^{d}_{(\cdot)}$ in formula (3)), it denotes summation over the missing dimension, and $\mathbf{1}$ is a vector whose elements are all 1.
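For orientation only, the collapsed Gibbs update of standard LDA, the model that CAT is described as extending, has the well-known form below. It is written with the count notation of formula (3) and should be read as an assumption about the general shape of such an update, not as the patent's own formula (1). The word-side prior $\boldsymbol{\beta}$, the topic-wide word count $n^{(\cdot)}_{k,-i}$, and the $-i$ subscript convention are standard LDA symbols that this text does not define.

$$p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \left(n^{d}_{k,-i} + \alpha_k\right) \cdot \frac{n^{w_i}_{k,-i} + \beta_{w_i}}{n^{(\cdot)}_{k,-i} + \boldsymbol{\beta}^{T}\mathbf{1}}$$

The first factor favors topics already common in document d, and the second favors topics in which the word $w_i$ is already common across the corpus.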
Similarly, the probability that the j-th query-related context window in document d is assigned to topic k is given by formula (2):
where the topic assignment vector covers all other windows, excluding the current j-th query-related context window; the window count is the number of context windows associated with query Q that are assigned to topic k (excluding the current window); and $\theta_{d,k}$ is the probability of topic k in document d, which can be further calculated by the following formula:
$$\theta_{d,k} = \frac{n^{d}_{k} + \alpha_k}{n^{d}_{(\cdot)} + \boldsymbol{\alpha}^{T}\mathbf{1}} \qquad (3)$$
where $n^{d}_{k}$ is the total number of words in document d that are assigned to topic k.
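Formula (3) is a smoothed relative frequency, and a small numeric sketch makes that concrete. The function name and the example counts and prior values are illustrative assumptions.

```python
from typing import List, Sequence


def document_topic_distribution(topic_counts: Sequence[float],
                                alpha: Sequence[float]) -> List[float]:
    """theta_{d,k} = (n_k^d + alpha_k) / (n_(.)^d + alpha^T 1) for one document d."""
    if len(topic_counts) != len(alpha):
        raise ValueError("topic_counts and alpha must have the same length")
    # n_(.)^d is the sum of counts over topics; alpha^T 1 is the sum of the prior.
    denom = sum(topic_counts) + sum(alpha)
    return [(n + a) / denom for n, a in zip(topic_counts, alpha)]


# 10 words in the document: 6 on topic 0, 3 on topic 1, 1 on topic 2.
theta_d = document_topic_distribution([6, 3, 1], [0.5, 0.5, 0.5])
```

The prior $\alpha_k$ keeps every topic probability strictly positive, even for topics with no words assigned in the document.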
When the model converges or reaches a preset number of iterations, the following distributions are obtained: the "document-topic" distribution θ, the "topic-word" distribution Φ, and the "topic-query context" distribution η. Each column of η is the distribution over topics of all relevant contexts of one query, and this column is the new query representation. The representation naturally fuses context information and topic information, and in theory should be superior to representation methods that model each separately.
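As a hedged illustration of what one column of η could look like, the sketch below turns the final topic assignments of a query's context windows into a smoothed distribution over topics. The text does not give the estimator for η, so the count-and-normalize form, the symmetric smoothing value `gamma`, and the function name are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def query_topic_vector(window_topics: Sequence[int], num_topics: int,
                       gamma: float = 0.1) -> List[float]:
    """Smoothed distribution over topics of one query's context windows.

    `window_topics[j]` is the topic assigned to the j-th context window;
    `gamma` is an assumed symmetric smoothing prior.
    """
    if gamma <= 0 or num_topics <= 0:
        raise ValueError("gamma and num_topics must be positive")
    if any(not 0 <= k < num_topics for k in window_topics):
        raise ValueError("topic index out of range")
    counts = Counter(window_topics)
    denom = len(window_topics) + gamma * num_topics
    return [(counts.get(k, 0) + gamma) / denom for k in range(num_topics)]


# Five context windows of one query, assigned to topics 0, 0, 2, 0, 1.
q_vec = query_topic_vector([0, 0, 2, 0, 1], num_topics=3)
```

The resulting vector lives in the same topic space as a document's θ, which is what allows the two to be compared directly during retrieval.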
(III) Design of the hybrid retrieval model
Based on the new query representation, the invention designs a hybrid retrieval model that considers keyword matching and topic matching at the same time. The retrieval score is calculated as follows:
$$S = (1 - \lambda) \sum_{q_i \in Q} s(q_i, d) + \lambda \cdot s'(Q', d) \qquad (4)$$
where $s$ is the keyword matching score of a conventional retrieval model, such as a language model retrieval score or a BM25 retrieval score, $s'$ is the topic matching score based on the new query representation $Q'$, and $\lambda$ is the weighting parameter that balances the two scores, that is, the two matching modes.
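Formula (4) is a linear interpolation and can be sketched in a few lines. The per-keyword scores are assumed to come from some existing keyword model such as BM25, and restricting λ to the interval [0, 1] is an assumption implied by the (1 − λ) weight rather than stated in the text.

```python
from typing import Sequence


def hybrid_score(keyword_scores: Sequence[float], topic_score: float,
                 lam: float) -> float:
    """S = (1 - lam) * sum_i s(q_i, d) + lam * s'(Q', d).

    `keyword_scores` holds s(q_i, d) for each query keyword q_i;
    `topic_score` is the topic matching score s'(Q', d).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return (1.0 - lam) * sum(keyword_scores) + lam * topic_score


score = hybrid_score([1.2, 0.8], topic_score=0.5, lam=0.3)
```

Setting `lam=0` recovers pure keyword retrieval and `lam=1` pure topic matching. Since the two scores may be on different scales, a real system would likely tune λ on held-out queries.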
The topic matching score can be calculated in various ways. Specifically, given the topic distribution vectors of the new query representation and of the document, the score can be derived from the similarity between the two distributions, for example using Jensen-Shannon divergence (JSD) or cosine similarity.
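Both measures can be written in plain Python, as in the sketch below. Using base-2 logarithms, so that JSD lies in [0, 1], and converting the divergence into a score as 1 − JSD are assumptions, because the text does not say how the divergence is mapped to a matching score. The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine of the angle between two topic vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD between two probability distributions, in bits (range [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i = 0 contribute 0; y_i > 0 whenever x_i > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_match_score(query_topics: Sequence[float],
                      doc_topics: Sequence[float]) -> float:
    """Assumed conversion: identical distributions score 1, disjoint ones 0."""
    return 1.0 - jensen_shannon_divergence(query_topics, doc_topics)
```

Unlike KL divergence, JSD is symmetric and always finite, which makes it convenient when some topics have zero probability in one of the two vectors.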
The protection of the present invention is not limited to the above embodiments. Variations and advantages that occur to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the invention, and the scope of protection is defined by the appended claims.

Claims (6)

1. A method for building a query representation and hybrid retrieval model based on context-aware topics, characterized by comprising the following steps:
step one: acquiring the pseudo-relevance feedback documents of the query based on the keyword set of the query, and selecting the context relevant to the query from the pseudo-relevance feedback documents;
step two: introducing a context-aware topic model, incorporating the context into the context-aware topic model, mining the topic information implied by the context windows on the basis of the corpus topics, and obtaining the corresponding topic vector;
step three: representing the query jointly by the topic vector and the keyword set, and building a hybrid retrieval model based on the topic vector and the keyword set to obtain the final retrieval score.
2. The method according to claim 1, wherein the pseudo-relevance feedback document is divided into a plurality of sliding windows, the relevance of each window to the query is calculated, and the windows whose relevance is higher than a threshold are taken as the context windows relevant to the query.
3. The method according to claim 2, wherein the threshold for selecting query-relevant context is the average relevance of all windows under the query.
4. The method according to claim 1, wherein the context-aware topic model is designed according to the query-related context and the whole corpus, and the topic vector of a context is obtained by using the context-aware topic model, which assumes during topic modeling that a context window and the pseudo-relevance feedback document containing it share the same topic distribution.
5. The method according to claim 1, wherein the pseudo-relevance feedback documents are obtained by calculating keyword matching scores with a retrieval model.
6. The method according to claim 1, wherein the retrieval score is expressed by the following formula:
$$S = (1 - \lambda) \sum_{q_i \in Q} s(q_i, d) + \lambda \cdot s'(Q', d)$$
where $s$ is the keyword matching score of the traditional retrieval model, $s'$ is the topic matching score based on the new query representation $Q'$, and $\lambda$ is the weighting parameter that balances the two scores, that is, the two matching modes.
CN201610634174.2A 2016-08-05 2016-08-05 Inquiry based on context-aware theme represents and mixed index method for establishing model Pending CN106294662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610634174.2A CN106294662A (en) 2016-08-05 2016-08-05 Inquiry based on context-aware theme represents and mixed index method for establishing model

Publications (1)

Publication Number Publication Date
CN106294662A true CN106294662A (en) 2017-01-04

Family

ID=57664982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610634174.2A Pending CN106294662A (en) 2016-08-05 2016-08-05 Inquiry based on context-aware theme represents and mixed index method for establishing model

Country Status (1)

Country Link
CN (1) CN106294662A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750315A (en) * 2012-04-25 2012-10-24 北京航空航天大学 Rapid discovering method of conceptual relations based on sovereignty iterative search
CN103678412A (en) * 2012-09-21 2014-03-26 北京大学 Document retrieval method and device
CN104050235A (en) * 2014-03-27 2014-09-17 浙江大学 Distributed information retrieval method based on set selection
CN103927177A (en) * 2014-04-18 2014-07-16 扬州大学 Characteristic-interface digraph establishment method based on LDA model and PageRank algorithm
CN104391942A (en) * 2014-11-25 2015-03-04 中国科学院自动化研究所 Short text characteristic expanding method based on semantic atlas

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804443A (en) * 2017-04-27 2018-11-13 安徽富驰信息技术有限公司 A kind of judicial class case searching method based on multi-feature fusion
CN108121699A (en) * 2017-12-21 2018-06-05 北京百度网讯科技有限公司 For the method and apparatus of output information
CN108520033A (en) * 2018-03-28 2018-09-11 华中师范大学 Enhancing pseudo-linear filter model information search method based on superspace simulation language
CN108710611B (en) * 2018-05-17 2021-08-03 南京大学 Short text topic model generation method based on word network and word vector
CN108710611A (en) * 2018-05-17 2018-10-26 南京大学 A kind of short text topic model generation method of word-based network and term vector
CN110333700A (en) * 2019-05-24 2019-10-15 蓝炬兴业(赤壁)科技有限公司 Industrial computer server remote management platform system and method
CN110427400A (en) * 2019-06-21 2019-11-08 贵州电网有限责任公司 Search method is excavated based on operation of power networks information interactive information user's demand depth
WO2021250488A1 (en) * 2020-06-08 2021-12-16 International Business Machines Corporation Refining a search request to a content provider
US11238052B2 (en) 2020-06-08 2022-02-01 International Business Machines Corporation Refining a search request to a content provider
GB2611237A (en) * 2020-06-08 2023-03-29 Ibm Refining a search request to a content provider
AU2021289542B2 (en) * 2020-06-08 2023-06-01 International Business Machines Corporation Refining a search request to a content provider
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN112685440A (en) * 2020-12-31 2021-04-20 王程 Structural query information expression method for marking search semantic role
CN112685440B (en) * 2020-12-31 2022-03-22 上海欣兆阳信息科技有限公司 Structural query information expression method for marking search semantic role

Similar Documents

Publication Publication Date Title
CN106294662A (en) Inquiry based on context-aware theme represents and mixed index method for establishing model
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
Fang et al. Word-sentence co-ranking for automatic extractive text summarization
CN109241538B (en) Chinese entity relation extraction method based on dependency of keywords and verbs
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
Min et al. Nonparametric masked language modeling
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN103150382B (en) Automatic short text semantic concept expansion method and system based on open knowledge base
Jafari et al. Automatic text summarization using fuzzy inference
CN103544242A (en) Microblog-oriented emotion entity searching system
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN109597995A (en) A kind of document representation method based on BM25 weighted combination term vector
Liu et al. Enhanced word embedding similarity measures using fuzzy rules for query expansion
CN106372122A (en) Wiki semantic matching-based document classification method and system
Zhao et al. Keyword extraction for social media short text
CN114265943A (en) Causal relationship event pair extraction method and system
CN106776569A (en) Tourist hot spot and its Feature Extraction Method and system in mass text
El Mahdaouy et al. Semantically enhanced term frequency based on word embeddings for Arabic information retrieval
Ezzikouri et al. Fuzzy-semantic similarity for automatic multilingual plagiarism detection
Pang et al. Query expansion and query fuzzy with large-scale click-through data for microblog retrieval
Darling et al. Pathsum: A summarization framework based on hierarchical topics
Albathan et al. Enhanced n-gram extraction using relevance feature discovery

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170104