Retrieval method and system based on word vector similarity
Technical Field
The present invention relates to the field of information retrieval, and in particular to a retrieval method based on word vector similarity and a retrieval system based on word vector similarity.
Background
Existing techniques for searching and matching resumes usually retrieve by multiple keywords. The user supplies a set of keywords, which are looked up in the search library; the number of keyword hits serves as the matching score, and the results are output in descending order of matching score, on the assumption that higher-ranked results better meet the user's needs. However, this approach has the following disadvantages:
(1) It does not take into account the wording characteristics of different search libraries, such as English letter case and full-width versus half-width characters;
(2) It does not take into account the relationships between words, so the retrieval cannot match other words that are strongly related to a keyword; for example, when the keyword is "program", information in the search library about "software" cannot be matched;
(3) It places high demands on keyword selection and has poor robustness; a missing or mistyped keyword greatly affects the final search results.
In summary, existing keyword-based retrieval methods achieve unsatisfactory recall and result accuracy, and suffer from poor robustness and adaptability.
Summary of the Invention
In view of this, the present invention provides a retrieval method and system based on word vector similarity that can improve retrieval accuracy and robustness.
In one aspect, the present invention provides a retrieval method based on word vector similarity, comprising:
performing word vector training on a search library to build a training model corresponding to the search library;
receiving input search keywords, and obtaining from the training model the related words of the search keywords together with the similarity between each related word and the search keyword;
searching and matching the search library with the related words, and computing, according to the similarities, a matching score between each file in the search library and the related words;
sorting the files in the search library in descending order of matching score, and outputting the search results according to the sorted order.
Preferably, before the word vector training is performed on the search library, the method comprises:
preprocessing each file in the search library and storing the preprocessed data of each file in a corresponding training sample file, the preprocessing comprising data cleaning and data description extraction;
and performing word vector training on the search library comprises:
performing word vector training on the search library based on the training sample file.
Preferably, the data cleaning comprises at least one of unifying letter case, removing redundant spaces, unifying punctuation marks, and unifying full-width and half-width formats;
and the data description extraction comprises word segmentation with an added user dictionary.
Preferably, performing word vector training on the search library comprises:
performing word vector training on the training sample file with word2vec.
Preferably, searching and matching the search library with the related words and computing, according to the similarities, the matching score between each file in the search library and the related words comprises:
searching and matching each file in the search library with the related words to obtain a matching result between each file and the related words;
using the similarity of each related word as an accumulation weight and combining it with the matching results to obtain the matching score between each file and the related words.
In another aspect, the present invention provides a retrieval system based on word vector similarity, comprising:
a model training unit, configured to perform word vector training on a search library to build a training model corresponding to the search library;
a related word generating unit, configured to receive input search keywords, and to obtain from the training model the related words of the search keywords together with the similarity between each related word and the search keyword;
a search matching unit, configured to search and match the search library with the related words, and to compute, according to the similarities, a matching score between each file in the search library and the related words;
a result output unit, configured to sort the files in the search library in descending order of matching score and to output the search results according to the sorted order.
Preferably, the model training unit is further configured to, before the word vector training is performed on the search library, preprocess each file in the search library and store the preprocessed data of each file in a corresponding training sample file, the preprocessing comprising data cleaning and data description extraction;
and performing word vector training on the search library comprises:
performing word vector training on the search library based on the training sample file.
Preferably, the data cleaning comprises at least one of unifying letter case, removing redundant spaces, unifying punctuation marks, and unifying full-width and half-width formats;
and the data description extraction comprises word segmentation with an added user dictionary.
Preferably, performing word vector training on the search library comprises:
performing word vector training on the training sample file with word2vec.
Preferably, the search matching unit comprises:
a matching module, configured to search and match each file in the search library with the related words to obtain a matching result between each file and the related words;
a statistics module, configured to use the similarity of each related word as an accumulation weight and combine it with the matching results to obtain the matching score between each file and the related words.
In the retrieval method and system based on word vector similarity of the above technical solution, word vector training is performed on the search library to build a training model corresponding to the search library; input search keywords are received, and the related words of the search keywords, together with the similarity between each related word and the search keyword, are obtained from the training model; the search library is searched and matched with the related words, and a matching score between each file in the search library and the related words is computed according to the similarities; the files in the search library are sorted in descending order of matching score, and the search results are output according to the sorted order. First, because the training model is trained on the search library itself, it reflects the wording characteristics of the search library well, which helps improve retrieval accuracy. Second, the keywords are represented as word vectors and the search is matched against the related words of the keywords, which adds the ability to match related words and thereby improves retrieval robustness.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a retrieval system based on word vector similarity according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The embodiments provided by the present invention include an embodiment of the retrieval method based on word vector similarity and a corresponding embodiment of the retrieval system based on word vector similarity. Each is described in detail below.
FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention. As shown in FIG. 1, the retrieval method of this embodiment includes the following steps S1 to S4, detailed as follows:
S1: Perform word vector training on the search library to build a training model corresponding to the search library.
To turn a natural language understanding problem into a machine learning problem, the first step is to find a way to represent the symbols mathematically, for example by representing each word as its own vector. "Word vector" (词向量) is the common Chinese name for "Word Representation" or "Word Embedding".
The word vectors in this embodiment should have the following property: related or similar words lie closer together. For example, the distance between "mic" and "microphone" is far smaller than the distance between "mic" and "weather". The distance between vectors can be measured by the conventional Euclidean distance or by the cosine of the angle between them.
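The two distance measures mentioned above can be sketched as follows. This is only an illustration: the three-dimensional vectors are hypothetical values chosen by hand, not output of any trained model, and the function names are assumptions.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, made up for illustration only.
mic = [0.9, 0.1, 0.0]
microphone = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

# Related words should be closer (smaller distance, larger cosine).
print(euclidean_distance(mic, microphone) < euclidean_distance(mic, weather))
print(cosine_similarity(mic, microphone) > cosine_similarity(mic, weather))
```

With Euclidean distance a smaller value means "closer", whereas with the cosine measure a larger value means "closer"; the rest of this description uses the cosine measure.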
Preferably, the word vector may be a word vector in Distributed Representation form. Such a word vector is a low-dimensional real-valued vector, typically of the form [0.792, -0.177, -0.107, 0.109, -0.542, …]; dimensions of 50 and 100 are the most common.
As a preferred implementation, before the word vector training is performed on the search library, each file in the search library may be preprocessed, and the preprocessed data of each file stored in a corresponding training sample file.
Preferably, the preprocessing comprises data cleaning and data description extraction. Data cleaning mainly serves to make the data in the search library consistent, and may specifically comprise at least one of unifying letter case, removing redundant spaces, unifying punctuation marks, and unifying full-width and half-width formats. Data description extraction comprises word segmentation with an added user dictionary; specifically, a user dictionary may be added and segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
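The data cleaning step can be sketched as below. This is a minimal illustration under stated assumptions, not the cleaning procedure of the embodiment: it uses Unicode NFKC normalization to fold full-width characters into half-width ones, lowercases the text, and collapses whitespace. The function name `clean_text` is hypothetical, and word segmentation with NLPIR is deliberately not shown.

```python
import re
import unicodedata

def clean_text(text):
    """Normalize one line of text before it is written to a training sample file."""
    # NFKC folds full-width letters, digits, spaces and many punctuation
    # marks (e.g. the full-width comma) into their half-width equivalents.
    # Some CJK punctuation, such as the ideographic full stop, is left as-is.
    text = unicodedata.normalize("NFKC", text)
    # Unify letter case.
    text = text.lower()
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Ｃ＋＋　 Software  ENGINEER "))
```

Lowercasing is one possible way to unify case; a library whose terms are case-sensitive might instead prefer uppercasing or a per-term mapping.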
Further, word vector training is performed on the search library based on the training sample file to build the training model corresponding to the search library. Specifically, word vector training may be performed on the training sample file with word2vec, using the following training settings:
./word2vec -train result_cropus.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min-count 3
The parameters have the following meanings:
-train is followed by the name of the training sample file used for training;
-cbow selects the model: 1 uses the continuous bag-of-words (CBOW) model, and 0, as set here, uses the skip-gram model;
-size is the dimension of the word vectors;
-window is the length of the context window;
-negative sets the number of negative samples used for negative sampling; 0 means negative sampling is not used;
-hs indicates whether the hierarchical softmax (HS) method is used; 0 means not used, 1 means used;
-sample 1e-3 sets the sampling threshold to 10^-3; the more frequently a word occurs in the training samples, the more likely its occurrences are to be randomly down-sampled;
-threads is the number of threads to start;
-binary indicates whether the output is a binary file; 0 means no, 1 means yes;
-min-count is the minimum frequency, 5 by default; a word that occurs fewer times than this threshold is discarded.
At this point, the training model corresponding to the search library is obtained.
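The effect of the minimum-frequency setting can be illustrated with a short sketch. It is not word2vec's own code; the function name `build_vocabulary` and the sample token lists are hypothetical. It simply counts tokens across already-segmented documents and keeps those that reach the threshold.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_count=5):
    """Count tokens over all documents and drop those seen fewer than min_count times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical segmented training samples.
docs = [
    ["c++", "mfc", "c++"],
    ["c++", "software"],
    ["software", "mfc", "software"],
]

# With a threshold of 3, "mfc" (seen twice) is discarded.
print(build_vocabulary(docs, min_count=3))
```

Words removed at this stage receive no vector, so they can never be returned later as related words; a small search library may therefore call for a lower threshold than the default.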
S2: Receive the input search keywords, and obtain from the training model the related words of the search keywords together with the similarity between each related word and the search keyword.
In this embodiment, the similarity between two word vectors is their cosine similarity. Its maximum is 1, and this embodiment treats 0 as its minimum (in general, cosine similarity can be negative, down to -1). Because the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well. Specifically, the related words and similarities can be produced with the ./distance vectors.bin command, and the generation can be automated with a sh script and an expect script.
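The idea behind obtaining related words can be sketched as a nearest-neighbour lookup by cosine similarity. This is a simplified illustration, not the code of the distance tool: the vector table is a hypothetical hand-written dictionary rather than a loaded vectors.bin, and the names `related_words` and `toy_vectors` are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_words(keyword, vectors, top_n=3):
    """Return up to top_n (word, similarity) pairs, most similar first."""
    if keyword not in vectors:
        return []  # keyword was never seen in training
    query = vectors[keyword]
    scored = [
        (word, cosine_similarity(query, vec))
        for word, vec in vectors.items()
        if word != keyword
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy vectors standing in for a trained model.
toy_vectors = {
    "software": [0.9, 0.1, 0.0],
    "program": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

for word, sim in related_words("software", toy_vectors, top_n=2):
    print(word, round(sim, 3))
```

A keyword missing from the model yields an empty list here; a caller could fall back to plain keyword matching in that case.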
For example, to search a resume library for C++ software development engineers, the input keywords are C++, software, MFC, and data structure. The training model of that resume library yields the following related-word list and similarities; see the table below for details:
S3: Search and match the search library with the related words, and compute, according to the similarities, a matching score between each file in the search library and the related words.
In this embodiment, the related words obtained in the preceding step are used to search and match each file in the search library, giving a matching result between each file and the related words. The similarity of each related word is then used as an accumulation weight and combined with the matching results to obtain the matching score between each file and the related words.
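One way to read the similarity-weighted accumulation is sketched below. The description does not fix the exact formula, so this sketch makes two assumptions: a related word contributes its similarity once if it appears anywhere in the file, and the original keyword itself carries a weight of 1.0. The names `match_score`, `weighted_terms`, `resume_a`, and `resume_b` are hypothetical.

```python
def match_score(doc_tokens, weighted_terms):
    """Sum the weights of all terms that occur at least once in the document."""
    present = set(doc_tokens)
    return sum(weight for term, weight in weighted_terms.items() if term in present)

# Hypothetical related words for the keyword "software" with their similarities;
# the keyword itself is given weight 1.0 (an assumption of this sketch).
weighted_terms = {"software": 1.0, "program": 0.8, "mfc": 0.5}

resume_a = ["c++", "program", "mfc"]   # no literal "software", still matches
resume_b = ["software", "program"]

print(match_score(resume_a, weighted_terms))
print(match_score(resume_b, weighted_terms))
```

Under this reading, a file that never mentions the keyword can still receive a non-zero score through related words, which is the behaviour that plain hit counting lacks. Counting every occurrence instead of presence would be an equally plausible variant.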
S4: Sort the files in the search library in descending order of matching score, and output the search results according to the sorted order.
Preferably, a score threshold may be set so that only search results whose matching score exceeds the threshold are sorted and output in descending order of matching score. Filtering the search results further with a score threshold makes them easier for the user to review.
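The thresholding and ordering in this step can be sketched in a few lines. The function name `rank_results` and the sample scores are hypothetical; only files scoring strictly above the threshold are kept, following the wording "higher than the threshold".

```python
def rank_results(scores, threshold=0.0):
    """Keep files scoring above the threshold and sort them best-first."""
    kept = [(name, score) for name, score in scores.items() if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept

# Hypothetical matching scores per file.
scores = {"resume_a": 1.3, "resume_b": 1.8, "resume_c": 0.2}

for name, score in rank_results(scores, threshold=0.5):
    print(name, score)
```

Python's sort is stable, so files with equal scores keep their original relative order; a secondary key could be added if a deterministic tie-break is needed.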
In the retrieval method based on word vector similarity of the above embodiment, word vector training is performed on the search library to build a training model corresponding to the search library; input search keywords are received, and the related words of the search keywords, together with the similarity between each related word and the search keyword, are obtained from the training model; the search library is searched and matched with the related words, and a matching score between each file in the search library and the related words is computed according to the similarities; the files in the search library are sorted in descending order of matching score, and the search results are output according to the sorted order. First, because the training model is trained on the search library itself, it reflects the wording characteristics of the search library well, which helps improve retrieval accuracy. Second, the keywords are represented as word vectors and the search is matched against the related words of the keywords, which adds the ability to match related words and thereby improves retrieval robustness.
It should be noted that, for brevity, the foregoing method embodiment is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in a different order or simultaneously.
An embodiment of a retrieval system based on word vector similarity that can be used to carry out the above retrieval method is described below. For ease of explanation, the structural diagram of the system embodiment shows only the parts relevant to the embodiments of the present invention. Those skilled in the art will understand that the system structure shown in the figure does not limit the system, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
FIG. 2 is a schematic structural diagram of a retrieval system based on word vector similarity according to an embodiment of the present invention. As shown in FIG. 2, the retrieval system of this embodiment includes a model training unit 210, a related word generating unit 220, a search matching unit 230, and a result output unit 240, detailed as follows:
The model training unit 210 is configured to perform word vector training on the search library to build a training model corresponding to the search library.
The word vectors in this embodiment should have the following property: related or similar words lie closer together. For example, the distance between "mic" and "microphone" is far smaller than the distance between "mic" and "weather". The distance between vectors can be measured by the conventional Euclidean distance or by the cosine of the angle between them.
Preferably, the word vector may be a word vector in Distributed Representation form. Such a word vector is a low-dimensional real-valued vector, typically of the form [0.792, -0.177, -0.107, 0.109, -0.542, …]; dimensions of 50 and 100 are the most common.
As a preferred implementation, the model training unit 210 is further configured to, before the word vector training is performed on the search library, preprocess each file in the search library and store the preprocessed data of each file in a corresponding training sample file, so that word vector training is performed on the search library based on the training sample file. The preprocessing comprises data cleaning and data description extraction. The data cleaning comprises at least one of unifying letter case, removing redundant spaces, unifying punctuation marks, and unifying full-width and half-width formats. The data description extraction comprises word segmentation with an added user dictionary; specifically, a user dictionary may be added and segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
Preferably, word vector training may be performed on the training sample file with word2vec, using the following training settings:
./word2vec -train result_cropus.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min-count 3
The parameters have the following meanings:
-train is followed by the name of the training sample file used for training;
-cbow selects the model: 1 uses the continuous bag-of-words (CBOW) model, and 0, as set here, uses the skip-gram model;
-size is the dimension of the word vectors;
-window is the length of the context window;
-negative sets the number of negative samples used for negative sampling; 0 means negative sampling is not used;
-hs indicates whether the hierarchical softmax (HS) method is used; 0 means not used, 1 means used;
-sample 1e-3 sets the sampling threshold to 10^-3;
-threads is the number of threads to start;
-binary indicates whether the output is a binary file; 0 means no, 1 means yes;
-min-count is the minimum frequency, 5 by default.
Further, the related word generating unit 220 is configured to receive the input search keywords, and to obtain from the training model the related words of the search keywords together with the similarity between each related word and the search keyword.
In this embodiment, the similarity between two word vectors is their cosine similarity. Its maximum is 1, and this embodiment treats 0 as its minimum (in general, cosine similarity can be negative, down to -1). Because the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well.
The search matching unit 230 is configured to search and match the search library with the related words, and to compute, according to the similarities, a matching score between each file in the search library and the related words.
Preferably, the search matching unit 230 may specifically include: a matching module, configured to search and match each file in the search library with the related words to obtain a matching result between each file and the related words; and a statistics module, configured to use the similarity of each related word as an accumulation weight and combine it with the matching results to obtain the matching score between each file and the related words.
The result output unit 240 is configured to sort the files in the search library in descending order of matching score and to output the search results according to the sorted order.
In this embodiment, a score threshold may also be set so that only search results whose matching score exceeds the threshold are sorted and output in descending order of matching score. Filtering the search results further with a score threshold makes them easier for the user to review.
It should be noted that, in the above implementation of the retrieval system based on word vector similarity, the information exchange between modules/units, their execution processes, and similar matters are based on the same concept as the foregoing method embodiment of the present invention and bring the same technical effects. For details, refer to the description of the method embodiment, which is not repeated here.
In addition, in the above implementation of the retrieval system, the logical division into functional modules is only an example. In practice, the functions may be assigned to different functional modules as needed, for example to meet the configuration requirements of the hardware or for convenience of software implementation; that is, the internal structure of the retrieval system may be divided into different functional modules to carry out all or part of the functions described above.
Furthermore, in the above implementation of the retrieval system, the functional modules may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module.
If the integrated module is implemented as a software functional module and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Those of ordinary skill in the art will understand that all or part of the steps of the method specified in any embodiment of the present invention may be carried out by a program instructing the relevant hardware (a personal computer, a server, a network device, or the like). The program may be stored in a computer-readable storage medium and, when executed, may perform all or part of the steps of the method specified in any of the above embodiments. The storage medium may be any medium capable of storing program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
The embodiments described above represent only several implementations of the present invention and should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, and all of these fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be determined by the appended claims.