CN108491462A

CN108491462A - Semantic query expansion method and device based on word2vec

Info

Publication number: CN108491462A
Application number: CN201810179478.3A
Authority: CN
Inventors: 章露露; 贾连印; 李孟娟; 丁家满; 李晓武; 陈文焰; 吕晓伟
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2018-03-05
Filing date: 2018-03-05
Publication date: 2018-09-04
Anticipated expiration: 2038-03-05
Also published as: CN108491462B

Abstract

The invention discloses a semantic query expansion method and device based on word2vec, belonging to the technical field of information retrieval. The method comprises: a preprocessing step for the query given by the user, in which the query is segmented into words, stop words are removed, and stemming is performed; an expansion-word candidate set selection step, in which initial expansion words are chosen with the word2vec tool; an expansion vocabulary establishment step, in which the candidate set is filtered to build the final expansion vocabulary; and an expanded retrieval step, in which the user's query and its expansion words are matched against the index set and the relevant documents are returned in ranked order. The invention proposes an expansion-word-oriented query vector generation method for filtering candidate expansion words and building the expansion vocabulary, so that the relevance between each expansion word and the whole query is better reflected, thereby improving the effect of query expansion.

Description

A semantic query expansion method and device based on word2vec

Technical field

The invention relates to a word2vec-based semantic query expansion method and device, and belongs to the technical field of information retrieval.

Background

Query expansion is an important topic in information retrieval. In current retrieval models and systems, information is stored as characters, words, or phrases, so after a user submits a query, a relevant document can be retrieved only if the query terms appear in it. In natural language, however, the same concept often has many different expressions. For example, when searching for "automobile" without expansion, documents containing "car", "sedan", or "Ford" are highly relevant to the original query but cannot be retrieved because they use different words, and the user does not obtain satisfactory results. Because of this term mismatch problem, users sometimes have to rephrase their queries to find the information they need. To reduce this burden, the retrieval system should automatically select additional terms related to the query to assist retrieval; that is, query expansion is used to address the term mismatch problem.

To improve user satisfaction, search engines usually treat query expansion as an indispensable module once a user submits a query. The commonly used query expansion methods are mainly the following:

1. Query expansion based on a semantic knowledge dictionary:

Dictionary-based methods rely on semantic knowledge dictionaries such as WordNet, HowNet, or other synonym thesauri such as Cilin to select words that are semantically related to the query terms. Selection is generally based on the hypernyms, hyponyms, and synonyms of the query terms. These methods depend heavily on a complete semantic system and are independent of the corpus to be searched, so the selected expansion words usually fail to reflect the characteristics of the corpus and rarely yield good retrieval results.

2. Query expansion based on global analysis:

Global analysis first performs a correlation analysis over the words or phrases in all documents, computes the degree of association for every pair of words, and then adds the words most strongly associated with the query terms to the initial query to form a new query. Its advantage is that relationships between words are explored as fully as possible, and once the term-relationship dictionary has been built, queries can be expanded efficiently. Its drawback is that, for a large document collection, building the complete dictionary is often infeasible in both time and space, and updating it when the collection changes is even more costly.

3. Query expansion based on local analysis:

Local analysis solves the expansion problem through a second retrieval pass. The initial query is run directly, the n documents most relevant to it are taken as the source of expansion words, and the words in these n documents that are most relevant to the original query are added to the initial query to form a new query. The most popular local-analysis method at present is pseudo-relevance feedback, which developed from relevance feedback. The two differ in that relevance feedback requires the user to judge the results of the first retrieval, and the documents the user marks as relevant become the source of expansion words, whereas pseudo-relevance feedback needs no user interaction and simply treats the top n returned documents as relevant. Although local analysis is currently the most widely used query expansion approach, its weakness is that when the top-ranked documents of the first retrieval are only weakly related to the original query, many irrelevant words are easily added to the query, causing the "query drift" problem.

With the introduction of semantic models such as Word2Vec and GloVe, word embedding has attracted wide attention across many areas of natural language processing in recent years. Word vectors trained with the models provided by word2vec and GloVe reflect semantic and syntactic relationships in natural language, and the similarity between terms can be judged by the cosine between their word vectors, which makes them well suited to query expansion.

Research on Word2Vec-based query expansion already exists, but most of it has the following two main shortcomings:

(1) When building the expansion vocabulary, only words related to individual query terms are selected as expansion words, without considering their relevance to the whole query.

(2) Even work that does consider relevance to the whole query mostly assumes that the query vector is fixed for all replacement (expansion) words, so the query vector is usually a simple sum or mean of the query term vectors.

In general, however, for an expansion word of query term q, the influence of the other query terms on that expansion word should not be equal to the influence of q itself. The idea of generating different query vectors centered on different words of the query is widely used in other word-embedding-based areas of information retrieval, such as word sense disambiguation, and has achieved better results there, but it has not yet been applied effectively to query expansion.

Summary of the invention

The technical problem to be solved by the invention is to provide a word2vec-based semantic query expansion method and device that build an expansion vocabulary more relevant to the query, so that documents related to the user's query are returned more comprehensively.

The technical solution of the invention is a word2vec-based semantic query expansion method, comprising:

Query and document preprocessing step: the query submitted by the user is segmented into words, stop words are removed, and the keywords of the query are extracted and stemmed to form the query Q; the same preprocessing is applied to the document collection to obtain the document set D;

Expansion-word candidate set selection step: for the preprocessed query Q, word vectors trained with the word2vec model are used to compute and obtain the n most similar terms for each query keyword, forming the expansion-word candidate set C;

Expansion vocabulary establishment step: for each term in C, its similarity to the whole query is computed, and the k expansion words with the highest similarity are selected to construct the expansion vocabulary T;

Inverted index establishment step: an inverted index is built for the preprocessed document set D;

Expanded retrieval step: the relevance between the expanded query and the documents in the corresponding inverted index is computed, and the documents are ranked by relevance.

The query and document preprocessing step specifically comprises the following steps:

(1) The query submitted by the user is segmented into words at whitespace and punctuation marks;

(2) After segmentation, stop words are removed, filtering out words that do not represent concepts;

(3) After the stop words are removed, stemming is performed to generate the query Q;

(4) The same preprocessing is applied to the document collection to generate the new document set D.
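The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-word list `STOP_WORDS` and the plural-stripping `naive_stem` function are hypothetical stand-ins for a full stop-word list and a real stemmer such as Porter's.

```python
import re

# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "that", "the", "to", "with"}


def tokenize(text):
    """Step (1): split at anything that is not a letter and lowercase the tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Toy stemmer (assumption): strips a plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Steps (1) to (3): tokenize, drop stop words, then stem."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in kept]


query = preprocess("problems associated with high speed aircraft")
print(query)
```

The sketch keeps "associated" and "high" because it does not perform the noun selection that the embodiment applies after stop-word removal; the same `preprocess` function would be applied to every document to obtain D.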

The expansion-word candidate set selection step specifically comprises the following steps:

(1) Given a corpus, word vectors are trained with the training model provided by word2vec. Word vectors are multidimensional real-valued vectors that reflect semantic and syntactic relationships in natural language, so the similarity between terms can be judged by the cosine between their word vectors;

(2) After the word vectors are obtained, for each keyword q_i in Q, the n words most similar to q_i are obtained by cosine similarity of the word vectors, forming the expansion-word candidate set of the query.
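Step (2) can be sketched in plain Python as below. The two-dimensional vectors in `TOY_VECTORS` are invented for illustration only; in practice they would come from a trained word2vec model, and the function names are hypothetical.

```python
import math

# Invented 2-dimensional vectors, standing in for trained word2vec vectors.
TOY_VECTORS = {
    "aircraft": [1.0, 0.0],
    "airplane": [0.9, 0.1],
    "helicopter": [0.8, 0.3],
    "speed": [0.0, 1.0],
    "velocity": [0.1, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_similar(word, vectors, n):
    """Return the n vocabulary words most similar to `word`, excluding the word itself."""
    target = vectors[word]
    scored = [(cosine(target, vec), other) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:n]]


def candidate_set(query, vectors, n):
    """Map each query keyword q_i to its n most similar terms (the candidate set C)."""
    return {q: top_n_similar(q, vectors, n) for q in query if q in vectors}


candidates = candidate_set(["speed", "aircraft"], TOY_VECTORS, n=2)
print(candidates)
```

A linear scan over the vocabulary is used here for clarity; a large vocabulary would normally call for a vectorized or approximate nearest-neighbour search.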

The expansion vocabulary establishment step specifically comprises the following steps:

(1) For the query Q formed by the above processing, and for each keyword q_i in Q, a query vector vec(Q_qi) of Q relative to q_i is generated by a formula that is not reproduced in this text. In that formula, vec(q_i) denotes the vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j.

(2) For each candidate expansion word t of q_i, the similarity between t and the query Q is computed as:

Sim(t, Q) = cos(vec(t), vec(Q_qi))

For the candidate expansion words of different query terms, different query vectors vec(Q_qi) are used to compute the similarity between the expansion word and the query Q. The invention therefore calls the method of generating vec(Q_qi) the expansion-word-oriented query vector generation method, and vec(Q_qi) is correspondingly called the expansion-word-oriented query vector;

(3) The expansion words of every query term are scored against the whole query Q with the above model and then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;

(4) The expanded query Q_exp = Q ∪ T is generated.
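The filtering procedure can be sketched as below. Because the formula for the query vector is not available in this text, the sketch assumes one plausible form, a sum of the query term vectors weighted by sim(q_i, q_j); this is an assumption, not the patent's confirmed formula. The toy vectors and function names are likewise hypothetical.

```python
import math

# Invented 2-dimensional vectors for illustration only.
VECTORS = {
    "speed": [0.0, 1.0],
    "aircraft": [1.0, 0.0],
    "airplane": [0.9, 0.1],
    "velocity": [0.1, 0.9],
    "banana": [-1.0, -1.0],
}
QUERY = ["speed", "aircraft"]
CANDIDATES = {"aircraft": ["airplane", "banana"], "speed": ["velocity", "banana"]}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_vector(q_i, query, vectors):
    """ASSUMED form: vec(Q_qi) = sum over q_j in Q of sim(q_i, q_j) * vec(q_j)."""
    result = [0.0] * len(vectors[q_i])
    for q_j in query:
        weight = cosine(vectors[q_i], vectors[q_j])
        for d, value in enumerate(vectors[q_j]):
            result[d] += weight * value
    return result


def build_expansion_table(query, candidates, vectors, k):
    """Score each candidate t of q_i with cos(vec(t), vec(Q_qi)) and keep the top k."""
    scores = {}
    for q_i, terms in candidates.items():
        qv = query_vector(q_i, query, vectors)
        for t in terms:
            if t in query:
                continue  # query terms are already part of Q
            score = cosine(vectors[t], qv)
            scores[t] = max(score, scores.get(t, -1.0))
    ranked = sorted(scores, key=lambda t: scores[t], reverse=True)
    return ranked[:k]


T = build_expansion_table(QUERY, CANDIDATES, VECTORS, k=2)
Q_exp = QUERY + T  # Q_exp = Q ∪ T
print(T, Q_exp)
```

The point of the design is that each candidate is compared with a query vector centered on the query term it came from, so the unrelated candidate "banana" is filtered out even though it appears in both candidate lists.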

The inverted index establishment step specifically comprises the following steps:

(1) For the preprocessed document set D, all words in D are collected and deduplicated to generate the document word set V;

(2) For each term v in V, an inverted list is constructed from the ID (d_id) of every document d (d ∈ D) that contains v together with the number of occurrences tf_v,d of v in d. Each entry in the list is a pair <d_id, tf_v,d>, and the set of all inverted lists forms the inverted index set I;

(3) For each term v, the number m of documents in which it occurs is counted, and the idf score of v is computed by a formula, not reproduced in this text, in which |D| denotes the total number of documents in D.

The expanded retrieval step specifically comprises the following steps:

(1) For each keyword in Q_exp, the inverted index set I is queried to obtain the inverted list of that keyword; the set of these inverted lists is denoted I_Qexp;

(2) For each document d appearing in I_Qexp, its tf-idf scores over the lists in I_Qexp are accumulated to obtain the relevance R(Q_exp, d) between Q_exp and d. In the formula for R(Q_exp, d), which is not reproduced in this text, λ is a tuning parameter that controls the weights of the query terms and the expansion words in the relevance computation.

(3) The documents are ranked by relevance, and the N documents most relevant to the original query are returned.

A word2vec-based semantic query expansion device, comprising:

a query and document set preprocessing module, configured to perform word segmentation, stop-word removal, stemming, and similar processing on the document collection and on the query submitted by the user to form the query Q and the document set D;

an expansion-word candidate set selection module, configured to use word vectors trained with the word2vec model to compute and obtain, for each keyword in the query Q, its n most similar terms, forming the expansion-word candidate set C;

an expansion vocabulary construction module, configured to compute, for each term in the expansion-word candidate set, its similarity to the whole query, and to select the expansion words with higher similarity to construct the expansion vocabulary T;

a document set inverted index module, configured to build an inverted index for the preprocessed document set D;

an expanded retrieval module, configured to compute the relevance between the expanded query and the documents in the corresponding inverted index and to obtain the relevant documents.

The beneficial effects of the invention are as follows: a word2vec-based semantic query expansion method is proposed that considers the similarity of each replacement word to the whole query and introduces an expansion-word-oriented query vector generation method, which generates different query vectors for the expansion words of different query terms. This yields an expansion word set more relevant to the query and, in turn, a better query expansion effect.

Brief description of the drawings

Fig. 1 is a functional block diagram of the word2vec-based semantic query expansion of the invention;

Fig. 2 shows the expansion-word candidate set of each keyword in the query;

Fig. 3 shows the inverted index set of the invention.

Detailed description

The invention is further described below with reference to the accompanying drawings and specific embodiments.

Embodiment 1: As shown in Figs. 1 to 3, a word2vec-based semantic query expansion method comprises:

Query and document preprocessing step:

(3) After the stop words are removed, stemming is performed to generate the query Q.

Example 1: Query preprocessing. Suppose the query submitted by the user is "problems associated with high speed aircraft".

(1) The submitted query is first segmented into words, giving {problems, associated, with, high, speed, aircraft};

(2) Stop words are removed and the nouns in the query are then selected to form the final query, giving {problems, speed, aircraft};

(3) The keywords in the query are stemmed. "problems" is a plural noun, so the resulting query keyword set is Q = {problem, speed, aircraft}.

Example 2: Document set preprocessing. Suppose the document set consists of the following four documents:

D₀ = "The main problem limiting the high velocity performance of helicopter is resistance"

D₁ = "high altitude and high speed flying aircraft are often more slender shape"

D₂ = "There are many airplanes in the sky that make up a row"

D₃ = "whether to fly today is a problem"

All words in each string are found by splitting at spaces and delimiters, stop words are removed, and stemming is performed. The resulting document set is:

D₀ = "problem, limit, velocity, performance, helicopter, resistance"

D₁ = "altitude, speed, fly, aircraft, slender, shape"

D₂ = "airplane, sky, row"

D₃ = "fly, problem"

Expansion-word candidate set selection step:

(1) The Wikipedia corpus is selected, and a file of 200-dimensional word vectors is trained with the CBOW model provided by word2vec;

(2) After the word vectors are obtained, for each keyword in Q, the n most similar words are obtained by computing the cosine similarity of the word vectors and serve as the expansion-word candidate set of the query.

For each keyword in the query Q = {problem, speed, aircraft}, the 10 most semantically related expansion words are selected with the trained word vectors; the resulting expansion-word candidate set is shown in Fig. 2.

Expansion vocabulary T construction step:

(1) For each keyword q_i in Q, a query vector vec(Q_qi) of Q relative to q_i is generated by a formula that is not reproduced in this text;

(3) The expansion words of every query term are scored against the whole query Q with the above model and then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;

Example:

(1) First, the 200-dimensional word vector of each keyword in the query Q is obtained from the trained word vectors:

vec(problem) = [0.29686138, 1.71120727, ..., -0.6585713, -1.86508703]

vec(speed) = [-2.00363445, 1.05960512, ..., -0.475373, -4.39991331]

vec(aircraft) = [-3.54158616, 3.28720021, ..., -2.34602952, -3.29022384]

The expansion-word-oriented query vector of each keyword in Q is then computed (the calculation is not reproduced in this text).

(2) Taking the keyword aircraft in the query Q as an example, that is, q₃ = aircraft, the similarity between each expansion word t of q₃ and the query Q is computed:

......

(3) In the same way, the similarity between every expansion word in Fig. 2 and the original query Q is computed, and the expansion words in the candidate set are then ranked by similarity to obtain the k expansion words most similar to Q. Taking k = 4 as an example, the resulting expansion vocabulary T is:

T = {helicopter, airplane, velocity, altitude}

(4) The query terms and the expansion words are merged to obtain the expanded query Q_exp:

Q_exp = Q ∪ T

= {problem, speed, aircraft} ∪ {helicopter, airplane, velocity, altitude}

= {problem, speed, aircraft, helicopter, airplane, velocity, altitude}

Building the inverted index of the document set comprises the following steps:

(1) For the preprocessed document set D, the distinct terms in D are collected to generate the vocabulary V;

Example:

(1) After preprocessing such as word segmentation and stop-word removal, the following document set D is obtained:

D₀ = "problem, limit, velocity, performance, helicopter, resistance"

D₁ = "altitude, speed, fly, aircraft, slender, shape"

D₂ = "airplane, sky, row"

D₃ = "fly, problem"

The distinct terms in D are collected to generate the vocabulary V:

V = {altitude, speed, fly, aircraft, slender, shape, problem, limit, velocity, performance, helicopter, resistance, airplane, sky, row}

(2) Taking the word velocity in the vocabulary V as an example, the document set D is traversed and the only document found to contain velocity is D₀. Its ID, D₀, is recorded, and velocity is counted as occurring once in D₀, so the inverted list of velocity is <D₀, 1>. The inverted lists of all terms in V are built in the same way, and together they form the inverted index set I;

(3) For each word v in V, the number m of documents in which it occurs (that is, the length of the inverted list of v) is counted, and the idf score is computed.

For example, for v = velocity the inverted list has length 1, that is, only one document in the collection contains velocity, so m = 1, from which the idf score of velocity is computed.

The idf scores of all words are computed in this way and recorded in the index. The final inverted index set I is shown in Fig. 3.
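The index construction for the four example documents can be sketched as below. The idf formula is not available in this text, so the sketch assumes the common form idf(v) = ln(|D| / m); that choice and the function names are assumptions rather than the patent's stated method.

```python
import math
from collections import Counter

# The preprocessed example documents, keyed by document ID.
DOCS = {
    "D0": ["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
    "D1": ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
    "D2": ["airplane", "sky", "row"],
    "D3": ["fly", "problem"],
}


def build_inverted_index(docs):
    """Map each term v to its inverted list of (d_id, tf_v,d) pairs."""
    index = {}
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            index.setdefault(term, []).append((doc_id, tf))
    return index


def idf_scores(index, num_docs):
    """ASSUMED form: idf(v) = ln(|D| / m), where m is the length of v's inverted list."""
    return {term: math.log(num_docs / len(postings)) for term, postings in index.items()}


index = build_inverted_index(DOCS)
idf = idf_scores(index, len(DOCS))
print(index["velocity"], idf["velocity"])
print(index["problem"], idf["problem"])
```

With this assumed form, a term that occurs in a single document, such as velocity, receives a higher idf than a term such as problem that occurs in two of the four documents.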

Expanded retrieval step:

(1) For each keyword in Q_exp, the inverted index set I is queried to obtain the inverted list of that keyword; the set of these inverted lists is denoted I_Qexp.

Example:

(1) For the Q_exp generated above, the inverted index set in Fig. 3 is queried to obtain the inverted lists of all keywords in Q_exp, and their union I_Qexp is taken:

I_Qexp = I(problem) ∪ I(speed) ∪ ... ∪ I(airplane) ∪ I(altitude)

= {D₀, D₃} ∪ {D₁} ∪ ... ∪ {D₂} ∪ {D₁}

= {D₀, D₁, D₂, D₃}

(2) For documents D₀, D₁, D₂, and D₃, the relevance R(Q_exp, d) to Q_exp is computed with the tuning parameter set to λ = 0.6 (the calculation is not reproduced in this text);

(3) The documents are ranked by relevance, giving D₁ > D₀ > D₂ > D₃; if N = 3, documents D₁, D₀, and D₂ are returned.
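The retrieval step for this example can be sketched as below. The relevance formula is not available in this text, so the sketch assumes that query terms are weighted by λ and expansion words by 1 − λ over tf·idf scores, with idf(v) = ln(|D| / m). Both forms are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

DOCS = {
    "D0": ["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
    "D1": ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
    "D2": ["airplane", "sky", "row"],
    "D3": ["fly", "problem"],
}
QUERY = ["problem", "speed", "aircraft"]
EXPANSION = ["helicopter", "airplane", "velocity", "altitude"]


def build_index(docs):
    """Inverted index: term -> list of (d_id, tf) pairs."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            index[term].append((doc_id, tf))
    return index


def rank(query, expansion, docs, lam, top_n):
    """ASSUMED scoring: R = lam * sum(tf*idf over Q) + (1 - lam) * sum(tf*idf over T)."""
    index = build_index(docs)
    scores = defaultdict(float)
    for terms, weight in ((query, lam), (expansion, 1.0 - lam)):
        for term in terms:
            postings = index.get(term, [])
            if not postings:
                continue
            idf = math.log(len(docs) / len(postings))  # assumed idf form
            for doc_id, tf in postings:
                scores[doc_id] += weight * tf * idf
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


for doc_id, score in rank(QUERY, EXPANSION, DOCS, lam=0.6, top_n=3):
    print(doc_id, round(score, 4))
```

Under these assumptions the resulting order agrees with the ranking D₁ > D₀ > D₂ > D₃ stated above, although that agreement does not confirm the exact formulas used in the patent.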

Embodiment 2: A word2vec-based semantic query expansion device, comprising:

The specific embodiments of the invention have been described in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the invention.

Claims

1. A semantic query expansion method based on word2vec, characterized in that the method comprises the following steps:

(1) query and document preprocessing: the query submitted by the user is segmented into words, stop words are removed, and the keywords of the query are extracted and stemmed to form the query Q; the same preprocessing is applied to the document collection to obtain the document set D;

(2) expansion-word candidate set selection: for the preprocessed query Q, word vectors trained with the word2vec model are used to compute and obtain the n most similar terms for each query keyword, forming the expansion-word candidate set C;

(3) expansion vocabulary establishment: for each term in C, its similarity to the whole query is computed, and the k expansion words with the highest similarity are selected to construct the expansion vocabulary T;

(4) document set inverted index establishment: an inverted index is built for the preprocessed document set D;

(5) expanded retrieval: the relevance between the expanded query and the documents in the corresponding inverted index is computed, and the documents are ranked by relevance.

2. The semantic query expansion method based on word2vec according to claim 1, characterized in that the query and document preprocessing step specifically comprises the following steps:

(1) the query submitted by the user is segmented into words at whitespace and punctuation marks;

(2) after segmentation, stop words are removed, filtering out words that do not represent concepts;

(3) after the stop words are removed, stemming is performed to generate the query Q;

(4) the same preprocessing is applied to the document collection to generate the new document set D.

3. The semantic query expansion method based on word2vec according to claim 1, characterized in that the expansion-word candidate set selection step specifically comprises the following steps:

(1) given a corpus, word vectors are trained with the training model provided by word2vec; word vectors are multidimensional real-valued vectors that reflect semantic and syntactic relationships in natural language, so the similarity between terms can be judged by the cosine between their word vectors;

(2) after the word vectors are obtained, for each keyword q_i in Q, the n words most similar to q_i are obtained by cosine similarity of the word vectors, forming the expansion-word candidate set of the query.

4. The semantic query expansion method based on word2vec according to claim 1, characterized in that the expansion vocabulary establishment step specifically comprises the following steps:

(1) for the query Q formed by the above processing, and for each keyword q_i in Q, a query vector vec(Q_qi) of Q relative to q_i is generated by a formula that is not reproduced in this text, in which vec(q_i) denotes the vector of query term q_i and sim(q_i, q_j) denotes the similarity between q_i and q_j;

(2) for each candidate expansion word t of q_i, the similarity between t and the query Q is computed as:

Sim(t, Q) = cos(vec(t), vec(Q_qi))

for the candidate expansion words of different query terms, different query vectors vec(Q_qi) are used to compute the similarity between the expansion word and the query Q; the method of generating vec(Q_qi) is called the expansion-word-oriented query vector generation method, and vec(Q_qi) is correspondingly called the expansion-word-oriented query vector;

(3) the expansion words of every query term are scored against the whole query Q with the above model and then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;

(4) the expanded query Q_exp = Q ∪ T is generated.

5. The semantic query expansion method based on word2vec according to claim 1, characterized in that the document set inverted index establishment step specifically comprises the following steps:

(1) for the preprocessed document set D, all words in D are collected and deduplicated to generate the document word set V;

(2) for each term v in V, an inverted list is constructed from the ID (d_id) of every document d (d ∈ D) that contains v together with the number of occurrences tf_v,d of v in d; each entry in the list is a pair <d_id, tf_v,d>, and the set of all inverted lists forms the inverted index set I;

(3) for each term v, the number m of documents in which it occurs is counted, and the idf score of v is computed by a formula, not reproduced in this text, in which |D| denotes the total number of documents in D.

6. The semantic query expansion method based on word2vec according to claim 1, characterized in that the expanded retrieval step specifically comprises the following steps:

(1) for each keyword in Q_exp, the inverted index set I is queried to obtain the inverted list of that keyword; the set of these inverted lists is denoted I_Qexp;

(2) for each document d appearing in I_Qexp, its tf-idf scores over the lists in I_Qexp are accumulated to obtain the relevance R(Q_exp, d) between Q_exp and d; in the formula for R(Q_exp, d), which is not reproduced in this text, λ is a tuning parameter that controls the weights of the query terms and the expansion words in the relevance computation;

(3) the documents are ranked by relevance, and the N documents most relevant to the original query are returned.

7. A semantic query expansion device based on word2vec, characterized by comprising:

a query and document set preprocessing module, configured to perform word segmentation, stop-word removal, stemming, and similar processing on the document collection and on the query submitted by the user to form the query Q and the document set D;

an expansion-word candidate set selection module, configured to use word vectors trained with the word2vec model to compute and obtain, for each keyword in the query Q, its n most similar terms, forming the expansion-word candidate set C;

an expansion vocabulary construction module, configured to compute, for each term in the expansion-word candidate set, its similarity to the whole query, and to select the expansion words with higher similarity to construct the expansion vocabulary T;

a document set inverted index module, configured to build an inverted index for the preprocessed document set D;

an expanded retrieval module, configured to compute the relevance between the expanded query and the documents in the corresponding inverted index and to obtain the relevant documents.