CN106294662A - Method for establishing a context-aware topic-based query representation and hybrid retrieval model - Google Patents
- Publication number
- CN106294662A (application CN201610634174.2A)
- Authority
- CN
- China
- Prior art keywords
- context
- query
- topic
- aware
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Abstract
The invention discloses a method for establishing a context-aware topic-based query representation and hybrid retrieval model, comprising the following steps. Step 1: based on the keyword set of a query, obtain the pseudo-relevance feedback documents of the query and select the contexts relevant to the query from them. Step 2: introduce a context-aware topic model, incorporate the contexts into it, and mine the topic information implied by the context windows on the basis of the corpus topics to obtain the corresponding topic vector. Step 3: represent the query jointly by the topic vector and the keyword set, and build a hybrid retrieval model on both to obtain the final retrieval score.
Description
Technical field
The invention relates to the technical field of Internet information retrieval, and in particular to a method for establishing a query representation and hybrid retrieval model based on a context-aware topic model.
Background
Query representation has always been central to information retrieval. The most common problem is that user queries are too short (only a few keywords), so relevant documents easily fail to match the query during retrieval. For example, for the user query "water shortage", a document containing query-related words such as "drought" is highly relevant, but because it does not contain the original query keyword "water shortage", its final matching score will be low, which hurts retrieval accuracy.
A common solution is query expansion based on pseudo-relevance feedback. The method builds on the initial retrieval results and assumes that the top K documents (the "pseudo-relevance feedback documents") are relevant to the original query, so that keywords can be extracted from them with a suitable algorithm and used to expand the query. However, the method is unsupervised and tends to introduce words unrelated to the query. In theory, a supervised classifier could weigh multiple features of the candidate expansion words and pick out those truly related to the query, but that approach depends on feature engineering and a labeled training set, which makes it costly in practice.
Some recent studies have begun to examine how various kinds of context information can reduce the introduction of irrelevant expansion words into the query representation. The sources of context information are mainly high-quality external data sources (encyclopedias, domain ontologies, and so on) and pseudo-relevance feedback documents drawn from the dataset itself. The former apply to only some queries, and external data sources are in most cases slow to update and hard to obtain, so they are not widely used in practice. The latter, pseudo-relevance feedback documents from the dataset itself, also describe the context of the query and offer better research prospects. For example, for the query "water shortage", pseudo-relevance feedback document 1 reads: "The UK will face water shortages in the next few years, so please save water and fix your taps."; pseudo-relevance feedback document 2 reads: "Dryland farming: a way to mitigate drought and water scarcity". Both describe responses to water shortage, and this context information can be used to support query representation.
However, existing methods for selecting expansion words generally consider only the co-occurrence of an expansion word and the original query words within the context windows of the pseudo-relevance feedback, and the following problems remain: (1) The words used for the final query expansion must be chosen explicitly, and without supervision some irrelevant or even "harmful" words are still introduced. For example, in articles about environmental resources the keyword "water shortage" appears frequently, but words such as "hydroelectric power" and "natural gas" also appear in its context; these drift away from the original query and reduce its accuracy. (2) The final query representation is still based on the vocabulary space and ignores the semantic information implied by the query, such as its latent topics. (3) Retrieval models built on such a query representation mainly consider keyword matching and ignore matching between documents and the query at the semantic level.
Summary of the invention
The purpose of the invention is to address the shortcomings of the prior art by proposing a method for designing a query representation and hybrid retrieval model based on a context-aware topic model. The method incorporates context topic information derived from pseudo-relevance feedback into the query representation, thereby adding topic matching on top of the existing keyword-matching retrieval model and improving the accuracy of the retrieval results.
The invention proposes a method for establishing a context-aware topic-based query representation and hybrid retrieval model, comprising the following steps:
Step 1: based on the keyword set of a query, obtain the pseudo-relevance feedback documents of the query, and select the contexts relevant to the query from the pseudo-relevance feedback documents;
Step 2: introduce a context-aware topic model, incorporate the contexts into the context-aware topic model, and mine the topic information implied by the context windows on the basis of the corpus topics to obtain the corresponding topic vector;
Step 3: represent the query jointly by the topic vector and the keyword set; based on the topic vector and the keyword set, build a hybrid retrieval model to obtain the final retrieval score.
In the method proposed by the invention, Step 1 divides each pseudo-relevance feedback document into multiple sliding windows, computes the relevance of each window to the query, and takes the windows whose relevance exceeds a threshold as the context windows relevant to the query.
In the method proposed by the invention, the threshold for selecting query-relevant contexts is the mean relevance of all windows for that query.
In the method proposed by the invention, the context-aware topic model is designed from the query-relevant contexts and the entire corpus. During topic modeling it assumes that a context window and the pseudo-relevance feedback document containing it share the same topic distribution, which yields the topic vector of the context.
In the method proposed by the invention, the pseudo-relevance feedback documents are obtained by ranking with the keyword-matching score of a retrieval model.
In the method proposed by the invention, the retrieval score is expressed by a formula (not reproduced in this text) in which s is the keyword-matching score of a traditional retrieval model, s′ is the topic-matching score based on the new query representation Q′, and λ is the weight parameter that trades off the two scores and hence the two matching modes.
The beneficial effects of the invention are as follows. The invention makes full use of the context information that the corpus itself provides through pseudo-relevance feedback, which avoids the difficulty of obtaining high-quality external data sources. By splitting the pseudo-relevance feedback documents into context windows and selecting the context fragments most relevant to the query for the query representation, it reduces the introduction of noise and query drift, which is a novel way to control the quality of the query representation. The proposed context-aware topic model mines the topic information of the query-relevant contexts, goes beyond the traditional keyword-level understanding, and helps interpret user queries more comprehensively and deeply. Traditional retrieval models rely mainly on keyword matching and ignore deeper semantic relevance. The hybrid retrieval model designed by the invention considers both keyword matching and topic matching, and this diversified matching helps improve retrieval effectiveness. The proposed query representation method and hybrid retrieval model were shown to be effective on the Microblog Track 2011-2014 datasets: with context topic information incorporated into the query, the MAP of the final retrieval exceeds that of some of the latest query representation methods.
Description of drawings
Fig. 1 is a flowchart of the method of the invention for establishing a context-aware topic-based query representation and hybrid retrieval model.
Fig. 2 is a flowchart of context selection based on pseudo-relevance feedback.
Fig. 3 is the graphical-model representation of the context-aware topic model.
Detailed description
The invention is described in further detail below with reference to the following specific embodiments and the accompanying drawings. Except for what is specifically mentioned below, the processes, conditions, experimental methods, and so on for implementing the invention are general and common knowledge in the field, and the invention places no particular restriction on them.
As shown in Fig. 1, the method of the invention for establishing a context-aware topic-based query representation and hybrid retrieval model comprises the following steps:
Step 1: based on the keyword set of a query, obtain the pseudo-relevance feedback documents of the query, and select the contexts relevant to the query from the pseudo-relevance feedback documents;
Step 2: introduce a context-aware topic model, incorporate the contexts into the context-aware topic model, and mine the topic information implied by the context windows on the basis of the corpus topics to obtain the corresponding topic vector;
Step 3: represent the query jointly by the topic vector and the keyword set; based on the topic vector and the keyword set, build a hybrid retrieval model to obtain the final retrieval score.
(1) Relevant context selection based on pseudo-relevance feedback
Because pseudo-relevance feedback documents are easy to obtain and contain much content related to the query, the invention selects from them the contexts most relevant to the query and uses these for the query representation. The procedure is shown in Fig. 2.
First, the pseudo-relevance feedback documents are segmented into multiple context windows of size n. Define Q = {q_1, q_2, ..., q_|Q|} as a query, where q_i is a query keyword and |Q| is the number of keywords in the query. The set of pseudo-relevance feedback documents corresponding to Q consists of the documents ranked in the top k in the first retrieval. Each pseudo-relevance feedback document is split, in sliding-window fashion, into several context windows of size n (each containing n words) as shown in Fig. 2, namely Q_c1, Q_c2, ..., Q_cl, where l is the number of context windows.
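The segmentation step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `sliding_windows` and the `stride` parameter are assumptions, since the text fixes only the window size n and does not say how far the window advances each time.

```python
from typing import List


def sliding_windows(tokens: List[str], n: int, stride: int = 1) -> List[List[str]]:
    """Split a tokenized document into context windows of n words each.

    `stride` (how far the window moves each step) is an assumed parameter;
    the source only specifies the window size n.
    """
    if n <= 0 or stride <= 0:
        raise ValueError("n and stride must be positive")
    if len(tokens) <= n:
        # A document no longer than one window yields a single window.
        return [list(tokens)] if tokens else []
    # Windows start at 0, stride, 2*stride, ...; a tail shorter than n is dropped.
    return [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, stride)]


doc = "the uk will face water shortages in the next few years".split()
windows = sliding_windows(doc, n=5, stride=3)
```

With the 11-token example above, windows start at positions 0, 3, and 6, so three windows are produced.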
Second, the relevance of each context window to the original query is computed. For a query and context-window pair (Q, Q_c), the invention combines several measures to compute their relevance R(Q, Q_c), such as the average pointwise mutual information (PMI) based on word co-occurrence, the Jaccard similarity between word sets, and the semantic similarity based on word2vec word vectors, and takes the mean of these measures.
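A minimal sketch of averaging several relevance measures is shown below. It implements only two of the measures named in the text, Jaccard similarity and a cosine similarity over averaged word vectors; the PMI term is omitted because it needs corpus co-occurrence counts. The function names and the `vectors` dictionary (standing in for a trained word2vec model) are assumptions for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def jaccard(query: Sequence[str], window: Sequence[str]) -> float:
    """Jaccard similarity between the word sets of the query and the window."""
    a, b = set(query), set(window)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_vector(words: Sequence[str],
                vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of the words that have one; None if none do."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def relevance(query: Sequence[str], window: Sequence[str],
              vectors: Dict[str, List[float]]) -> float:
    """R(Q, Qc) as the mean of the measures that can be computed."""
    scores = [jaccard(query, window)]
    q_vec, w_vec = mean_vector(query, vectors), mean_vector(window, vectors)
    if q_vec is not None and w_vec is not None:
        scores.append(cosine(q_vec, w_vec))
    return sum(scores) / len(scores)


# Toy 2-dimensional vectors, purely illustrative.
toy_vectors = {"water": [1.0, 0.0], "shortage": [0.0, 1.0], "drought": [0.0, 1.0]}
r = relevance(["water", "shortage"], ["water", "drought"], toy_vectors)
```

Because the measures live on different scales in practice (PMI is unbounded, for example), a real system would likely rescale each measure before averaging; the text does not say how this is handled.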
Then, the contexts relevant to the query are selected. The relevance scores obtained above are first normalized. The threshold is then set to the mean relevance of all windows for the query, and context windows whose relevance falls below the threshold are filtered out. The remaining, more relevant contexts are used for context-aware topic modeling.
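The normalize-then-threshold filter can be sketched as below. Min-max scaling is an assumption, since the text says only that the relevance scores are normalized; the function names are likewise hypothetical.

```python
from typing import List, Sequence, TypeVar

W = TypeVar("W")


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All windows are equally relevant; treat them all as fully relevant.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_relevant(windows: Sequence[W], scores: Sequence[float]) -> List[W]:
    """Keep windows whose normalized relevance is not below the mean."""
    if not windows:
        return []
    norm = min_max_normalize(scores)
    threshold = sum(norm) / len(norm)  # mean relevance of all windows for the query
    return [w for w, s in zip(windows, norm) if s >= threshold]


kept = select_relevant(["w1", "w2", "w3", "w4"], [0.1, 0.4, 0.7, 1.0])
```

Using the per-query mean as the threshold means the cut-off adapts to each query, so no global constant has to be tuned.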
(2) Context-aware topic modeling and query representation
Given the query-relevant contexts obtained in (1) and the entire corpus, the invention designs a context-aware topic model that incorporates the query-relevant context information into the topic model and generates a new query representation.
Inspired by related work, and because both a context window selected in (1) and the pseudo-relevance feedback document containing it are closely related to the query, the two are assumed to share the same topic distribution. Under this assumption, the traditional LDA topic model is modified to obtain the context-aware topic model CAT, whose graphical-model representation is shown in Fig. 3. The symbols used in the model are explained in Table 1. The model is a generative model; the modeling procedure is given in Algorithm 1.
Table 1: Symbols used in the context-aware topic model CAT
To estimate the parameters of the model, the invention uses the widely used Gibbs sampling algorithm.
First, under Gibbs sampling, the probability that the i-th word in document d is assigned to topic k is given by formula (1) (not reproduced in this text). Its terms are: the topic-assignment vector of all words other than the current i-th word; the number of words in document d assigned to topic k (excluding the current word); and the number of times word w_i is assigned to topic k across the entire corpus (excluding the current word). A superscript or subscript omitted from a symbol denotes summation over that dimension, and 1 is a vector whose elements are all 1.
Similarly, the probability that the j-th query-relevant context window in document d is assigned to topic k is given by formula (2) (not reproduced in this text). Its terms are: the topic-assignment vector of all windows other than the current j-th query-relevant context window; the number of context windows relevant to query Q in topic k (excluding the current window); and θ_{d,k}, the probability of topic k in document d. The latter is computed by a further formula from the total number of words in document d assigned to topic k.
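For orientation, the textbook collapsed Gibbs update for plain LDA and the usual estimate of the document-topic probability are written out below. These are not the patent's formulas (1) and (2), which are missing from this text; they are the standard LDA forms that match the counts described above. The symmetric Dirichlet priors α and β, the vocabulary size V, and the number of topics K are assumed symbols, and the context-window update is not attempted here.

```latex
% Textbook collapsed Gibbs update for plain LDA.
% alpha, beta (symmetric priors) and V (vocabulary size) are assumed symbols.
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
  \left( n_{d,k}^{\neg i} + \alpha \right)
  \frac{n_{k,w_i}^{\neg i} + \beta}{n_{k,\cdot}^{\neg i} + V\beta}

% Usual estimate of the probability of topic k in document d (K topics).
\theta_{d,k} = \frac{n_{d,k} + \alpha}{n_{d,\cdot} + K\alpha}
```

Here n_{d,k} counts the words in document d assigned to topic k, n_{k,w_i} counts the assignments of word w_i to topic k over the corpus, a dot denotes summation over the omitted index, and the superscript ¬i excludes the current word.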
When the model converges or reaches the preset number of iterations, the following distributions are obtained: the "document-topic" distribution θ, the "topic-word" distribution Φ, and the "topic-query context" distribution η. Each column of η gives the distribution over topics of all relevant contexts of one query, and this column is the new query representation. The representation thus fuses context information and topic information naturally and should, in theory, outperform representations that model the two separately.
(3) Hybrid retrieval model design
Based on the new query representation, the invention designs a hybrid retrieval model that considers keyword matching and topic matching together. Its retrieval score is computed by a formula (not reproduced in this text) in which s is the keyword-matching score of a traditional retrieval model, such as a language-model or BM25 retrieval score, s′ is the topic-matching score based on the new query representation Q′, and λ is the weight parameter that trades off the two scores and hence the two matching modes.
The topic-matching score can be computed in several ways. Specifically, given the new query representation and the topic distribution vector of a document, it is obtained from the similarity between the two topic distributions, for example with the Jensen-Shannon divergence (JSD) or cosine similarity.
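The scoring step might look like the sketch below. Two points are assumptions rather than statements from the source: the combination is taken to be a linear interpolation with λ weighting the keyword score, and the JSD is turned into a similarity as 1 minus the divergence. All function names are hypothetical.

```python
import math
from typing import Sequence


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD between two probability vectors, base-2 logs, so the value is in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_match_score(query_topics: Sequence[float],
                      doc_topics: Sequence[float]) -> float:
    """Assumed similarity: 1 - JSD, so identical distributions score 1."""
    return 1.0 - jensen_shannon_divergence(query_topics, doc_topics)


def hybrid_score(s: float, s_topic: float, lam: float) -> float:
    """Assumed linear interpolation of keyword score s and topic score s'."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * s + (1.0 - lam) * s_topic


s_topic = topic_match_score([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
final = hybrid_score(s=0.8, s_topic=0.4, lam=0.5)
```

A raw BM25 or language-model score is not bounded to [0, 1], so in practice the keyword score would probably need rescaling before it is mixed with a topic similarity; the text does not describe that step.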
The scope of protection of the invention is not limited to the above embodiments. Variations and advantages that those skilled in the art can conceive without departing from the spirit and scope of the inventive concept are included in the invention, and the scope of protection is defined by the appended claims.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610634174.2A CN106294662A (en) | 2016-08-05 | 2016-08-05 | Method for establishing a context-aware topic-based query representation and hybrid retrieval model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106294662A (en) | 2017-01-04 |
Family
ID=57664982
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610634174.2A (CN106294662A, pending) | Method for establishing a context-aware topic-based query representation and hybrid retrieval model | 2016-08-05 | 2016-08-05 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294662A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20170104 |
|
WD01 | Invention patent application deemed withdrawn after publication |