WO2017107566A1 - Retrieval method and system based on word vector similarity - Google Patents

Info

Publication number
WO2017107566A1
WO2017107566A1 PCT/CN2016/098234 CN2016098234W WO2017107566A1 WO 2017107566 A1 WO2017107566 A1 WO 2017107566A1 CN 2016098234 W CN2016098234 W CN 2016098234W WO 2017107566 A1 WO2017107566 A1 WO 2017107566A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
word vector
training
matching
file
Prior art date
Application number
PCT/CN2016/098234
Other languages
French (fr)
Chinese (zh)
Inventor
李贤�
Original Assignee
广州视源电子科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司
Publication of WO2017107566A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3347: Query execution using vector based model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133: Distances to prototypes

Definitions

  • The invention relates to the field of information retrieval, and in particular to a retrieval method and a retrieval system based on word vector similarity.
  • Existing techniques for resume search and matching usually retrieve by multiple keywords.
  • The user provides a set of keywords to search the search library; the number of keyword hits is used as the matching score, and the search results are output in descending order of matching score, on the assumption that higher-ranked results better meet the user's requirements.
  • However, this search method has the following disadvantages:
  • (1) it does not take into account the wording characteristics of different search libraries, such as English letter case or full-width versus half-width characters;
  • (2) it does not consider the relationships between words, so it cannot match other words that are strongly related to a keyword; for example, with the keyword "program", information about "software" in the search library cannot be matched;
  • (3) it places high demands on keyword selection and has poor robustness: a missing or mistyped keyword strongly affects the final search results.
  • In summary, existing keyword-based retrieval methods have unsatisfactory recall and result accuracy, and suffer from poor robustness and adaptability.
  • In view of this, the present invention provides a retrieval method and system based on word vector similarity that can improve retrieval accuracy and robustness.
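  • One aspect of the present invention provides a retrieval method based on word vector similarity, including:
  • performing word vector training on the search library to establish a training model corresponding to the search library;
  • receiving an input search keyword, and obtaining through the training model the related words of the search keyword and the similarity between each related word and the search keyword;
  • searching and matching the search library with the related words, and computing, according to the similarity, the matching score between each file in the search library and the related words;
  • sorting the files in the search library by matching score from high to low, and outputting the search result according to the sorting result.
  • Preferably, before performing word vector training on the search library, the method includes:
  • pre-processing each file in the search library and storing the pre-processed data of each file in a corresponding training sample file; the pre-processing includes data cleaning and extracting data descriptions;
  • the word vector training on the search library includes:
  • performing word vector training on the search library based on the training sample file.
  • Preferably, the data cleaning includes at least one of unifying letter case, removing extra spaces, unifying punctuation, and unifying full-width/half-width format;
  • the extracting of data descriptions includes word segmentation with an added user dictionary.
  • Preferably, the word vector training on the search library includes:
  • performing word vector training on the training sample file with word2vec.
  • Preferably, searching and matching the search library with the related words, and computing the matching score between each file in the search library and the related words according to the similarity, includes:
  • searching and matching each file in the search library with the related words to obtain the matching result between each file and the related words;
  • using the similarity of each related word as an accumulation weight and combining it with the matching results to obtain the matching score between each file and the related words.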
  • Another aspect of the present invention provides a retrieval system based on word vector similarity, comprising:
  • a model training unit configured to perform word vector training on the search library and establish a training model corresponding to the search library;
  • a related-word generating unit configured to receive an input search keyword and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword;
  • a search matching unit configured to search and match the search library with the related words, and to compute, according to the similarity, the matching score between each file in the search library and the related words;
  • a result output unit configured to sort the files in the search library by matching score from high to low, and output the search result according to the sorting result.
  • Preferably, the model training unit is further configured to pre-process each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file in a corresponding training sample file;
  • the pre-processing includes data cleaning and extracting data descriptions;
  • the word vector training on the search library includes:
  • performing word vector training on the search library based on the training sample file.
  • Preferably, the data cleaning includes at least one of unifying letter case, removing extra spaces, unifying punctuation, and unifying full-width/half-width format;
  • the extracting of data descriptions includes word segmentation with an added user dictionary.
  • Preferably, the word vector training on the search library includes:
  • performing word vector training on the training sample file with word2vec.
  • Preferably, the search matching unit comprises:
  • a matching module configured to search and match each file in the search library with the related words to obtain the matching result between each file and the related words;
  • a statistics module configured to use the similarity of each related word as an accumulation weight and combine it with the matching results to obtain the matching score between each file and the related words.
  • The retrieval method and system based on word vector similarity of the above technical solution perform word vector training on the search library to establish a training model corresponding to the search library; receive an input search keyword and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword; search and match the search library with the related words, computing the matching score between each file and the related words according to the similarity; and sort the files in the search library by matching score from high to low, outputting the search result according to the sorting result.
  • First, because the training model is trained on the search library itself, it reflects the wording characteristics of the search library well, which helps improve retrieval accuracy.
  • Second, keywords are represented as word vectors and retrieval matches on the related words of each keyword, which adds the ability to match related words and thus improves retrieval robustness.
  • FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention;
  • FIG. 2 is a schematic structural diagram of a retrieval system based on word vector similarity according to an embodiment of the present invention.
  • The embodiments provided by the present invention include an embodiment of a retrieval method based on word vector similarity and a corresponding embodiment of a retrieval system based on word vector similarity. Each is described in detail below.
  • As shown in FIG. 1, the retrieval method based on word vector similarity of this embodiment includes steps S1 to S4, detailed as follows:
  • S1: perform word vector training on the search library and establish a training model corresponding to the search library;
  • To turn a natural language understanding problem into a machine learning problem, the first step is to find a way to represent these symbols mathematically, for example by representing each word as a distinct vector.
  • "Word vector" is the common Chinese term for "Word Representation" or "Word Embedding".
  • The word vectors in this embodiment should place related or similar words closer together; for example, the distance between "mic" and "microphone" will be far smaller than the distance between "mic" and "weather".
  • The distance between vectors can be measured with the traditional Euclidean distance or with the cosine of the angle between them.
  • Preferably, the word vector may be a word vector in Distributed Representation form.
  • A word vector in Distributed Representation form is a low-dimensional real-valued vector.
  • Such a vector generally looks like [0.792, -0.177, -0.107, 0.109, -0.542, ...], and 50 and 100 dimensions are the most common.
  • As a preferred implementation, before word vector training is performed on the search library, each file in the search library may be pre-processed separately, and the pre-processed data of each file stored in a corresponding training sample file.
  • Preferably, the pre-processing includes data cleaning and extracting data descriptions.
  • The data cleaning mainly ensures consistency of the data in the search library, and may include at least one of unifying letter case, removing extra spaces, unifying punctuation, and unifying full-width/half-width format;
  • the extracting of data descriptions includes word segmentation with an added user dictionary.
  • Specifically, a user dictionary may be added and word segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
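The effect of adding a user dictionary before segmentation can be illustrated with a minimal sketch. It does not use NLPIR; it substitutes a simple forward maximum matching segmenter, and the dictionaries, the sample string, and the `segment` function are hypothetical and chosen only for illustration.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position, take the longest
    substring found in the dictionary; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical dictionaries: a small base vocabulary plus a domain term
# supplied by the user.
base_dictionary = {"软件", "工程师"}
user_dictionary = {"词向量"}

without_user_dict = segment("词向量软件工程师", base_dictionary)
with_user_dict = segment("词向量软件工程师", base_dictionary | user_dictionary)
```

Without the user dictionary the domain term is split into single characters; with it, the term survives as one token, which is what lets word vector training learn a vector for it.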
  • Further, word vector training is performed on the search library based on the training sample file to establish the training model corresponding to the search library.
  • Specifically, word2vec may be used to train word vectors on the training sample file, with the following training settings:
  • -negative indicates whether negative sampling is used: 0 means not used, 1 means used;
  • -hs indicates whether the hierarchical softmax (HS) method is used: 0 means not used, 1 means used;
  • -sample 1e-3 sets the sampling threshold to 10^-3; the more frequently a word appears in the training sample, the more heavily it is subsampled;
  • -binary indicates whether the output is a binary file: 0 means no, 1 means yes;
  • -min_count sets the minimum frequency, 5 by default; a word that appears in the documents fewer times than this threshold is discarded.
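As a rough sketch, the settings above could be assembled into a command line for the word2vec C tool. The executable path, the input file name, and the chosen values for `-negative` and `-hs` are assumptions; only the `1e-3` threshold, the default minimum count of 5, and the `vectors.bin` output name come from the text. The text writes the last flag as `-min_count`, while the upstream tool, as far as I know, spells it `-min-count`, so check the usage message of the installed build before relying on either.

```python
def build_word2vec_command(train_file, output_file, negative=0, hs=1,
                           sample=1e-3, binary=1, min_count=5):
    """Return an argument list for the word2vec command-line tool.

    Nothing is executed here; the list can be passed to subprocess.run.
    """
    return [
        "./word2vec",                   # assumed location of the compiled tool
        "-train", train_file,           # training sample file
        "-output", output_file,         # where the trained vectors are written
        "-negative", str(negative),     # negative sampling setting
        "-hs", str(hs),                 # hierarchical softmax on/off
        "-sample", str(sample),         # subsampling threshold
        "-binary", str(binary),         # 1 = binary output file
        "-min-count", str(min_count),   # assumed spelling; text says -min_count
    ]


# "training_sample.txt" is a hypothetical file name.
command = build_word2vec_command("training_sample.txt", "vectors.bin")
```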
  • S2: receive an input search keyword, and obtain through the training model the related words of the search keyword and the similarity between each related word and the search keyword;
  • The similarity of two word vectors here is their cosine similarity, with a maximum of 1 and a minimum of 0. Because the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well. Specifically, the related words and their similarities can be produced with the ./distance vectors.bin command, and the process can be automated with an sh script and an expect script.
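What step S2 computes can be sketched without the `./distance` tool: rank every other word in the vocabulary by cosine similarity to the keyword and keep the top few. The three-dimensional vectors and the `related_words` helper below are invented for illustration; real trained vectors would have tens or hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_words(keyword, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to the keyword."""
    query = vectors[keyword]
    scored = [
        (word, cosine_similarity(query, vec))
        for word, vec in vectors.items()
        if word != keyword
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, made up for this example.
vectors = {
    "program": [0.9, 0.1, 0.0],
    "software": [0.8, 0.3, 0.1],
    "code": [0.7, 0.1, 0.3],
    "weather": [0.0, 0.2, 0.9],
}

top_two = related_words("program", vectors, top_n=2)
```

With these toy values, "software" and "code" rank ahead of "weather", which mirrors the behaviour the text expects from a model trained on the search library.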
  • S3: search and match the search library with the related words, and compute, according to the similarity, the matching score between each file in the search library and the related words. Each related word obtained in the previous step is used to search and match each file in the search library, giving the matching result between each file and the related words; the similarity of each related word is then used as an accumulation weight and combined with the matching results to obtain the matching score between each file and the related words.
  • S4: sort the files in the search library by matching score from high to low, and output the search result according to the sorting result. A score threshold can also be set so that only search results whose matching scores exceed the threshold are sorted and output in descending order of matching score. Filtering the search results with a score threshold makes them easier for the user to review.
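The threshold filtering and descending sort can be sketched as follows. The file names, scores, and threshold value are hypothetical, and `rank_files` is an illustrative helper rather than a name from the text.

```python
def rank_files(scores, score_threshold=0.0):
    """Keep files scoring above the threshold, sorted from high to low."""
    kept = [(name, score) for name, score in scores.items()
            if score > score_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical matching scores produced by the scoring step.
scores = {"file_1": 1.3, "file_2": 1.0, "file_3": 0.0, "file_4": 0.4}

ranked = rank_files(scores, score_threshold=0.5)
```

Here only the two files scoring above 0.5 are returned, highest first; a threshold of 0 would additionally keep "file_4" and drop only the file with no matches.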
  • The retrieval method based on word vector similarity of the above embodiment performs word vector training on the search library to establish a training model corresponding to the search library; receives an input search keyword and obtains, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword; searches and matches the search library with the related words, computing the matching score between each file and the related words according to the similarity; and sorts the files in the search library by matching score from high to low, outputting the search result according to the sorting result.
  • First, because the training model is trained on the search library itself, it reflects the wording characteristics of the search library well, which helps improve retrieval accuracy.
  • Second, keywords are represented as word vectors and retrieval matches on the related words of each keyword, which adds the ability to match related words and thus improves retrieval robustness.
  • The retrieval system based on word vector similarity of this embodiment includes a model training unit 210, a related-word generating unit 220, a search matching unit 230, and a result output unit 240, detailed as follows:
  • the model training unit 210 is configured to perform word vector training on the search library, and establish a training model corresponding to the search library;
  • The word vectors in this embodiment should place related or similar words closer together; for example, the distance between "mic" and "microphone" will be far smaller than the distance between "mic" and "weather".
  • The distance between vectors can be measured with the traditional Euclidean distance or with the cosine of the angle between them.
  • Preferably, the word vector may be a word vector in Distributed Representation form.
  • A word vector in Distributed Representation form is a low-dimensional real-valued vector.
  • Such a vector generally looks like [0.792, -0.177, -0.107, 0.109, -0.542, ...], and 50 and 100 dimensions are the most common.
  • Preferably, the model training unit 210 is further configured to pre-process each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file in a corresponding training sample file.
  • Word vector training is then performed on the search library based on the training sample file.
  • The pre-processing includes data cleaning and extracting data descriptions.
  • The data cleaning includes at least one of unifying letter case, removing extra spaces, unifying punctuation, and unifying full-width/half-width format;
  • the extracting of data descriptions includes word segmentation with an added user dictionary; specifically, a user dictionary may be added and word segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
  • Word vectors can be trained on the training sample file with word2vec, with the following training settings:
  • -negative indicates whether negative sampling is used: 0 means not used, 1 means used;
  • -hs indicates whether the hierarchical softmax (HS) method is used: 0 means not used, 1 means used;
  • -sample 1e-3 sets the sampling threshold to 10^-3;
  • -binary indicates whether the output is a binary file: 0 means no, 1 means yes;
  • -min_count sets the minimum frequency, 5 by default.
  • The related-word generating unit 220 is configured to receive the input search keyword and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword;
  • The similarity of two word vectors here is their cosine similarity, with a maximum of 1 and a minimum of 0. Because the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well.
  • The search matching unit 230 is configured to search and match the search library with the related words, and to compute, according to the similarity, the matching score between each file in the search library and the related words;
  • The search matching unit 230 may specifically include:
  • a matching module configured to search and match each file in the search library with the related words to obtain the matching result between each file and the related words;
  • a statistics module configured to use the similarity of each related word as an accumulation weight and combine it with the matching results to obtain the matching score between each file and the related words.
  • The result output unit 240 is configured to sort the files in the search library by matching score from high to low, and output the search result according to the sorting result.
  • A score threshold may also be set so that only search results whose matching scores exceed the threshold are sorted and output in descending order of matching score. Filtering the search results with a score threshold makes them easier for the user to review.
  • The division into functional modules described above is merely an example. In practice, depending on requirements such as the configuration of the corresponding hardware or the convenience of software implementation, the functions may be allocated to different functional modules; that is, the internal structure of the word vector similarity-based retrieval system may be divided into different functional modules to complete all or part of the functions described above.
  • The functional modules may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one.
  • The integrated modules can be implemented in hardware or as software functional modules.
  • If implemented as software functional modules and sold or used as separate products, the integrated modules may be stored in a computer-readable storage medium.
  • All or part of the steps of the methods in the above embodiments may be carried out by a program that instructs related hardware (a personal computer, server, network device, etc.).
  • The program can be stored in a computer-readable storage medium.
  • When executed, the program may perform all or part of the steps of the method specified in any of the above embodiments.
  • The storage medium may include any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Abstract

A retrieval method and system based on a word vector similarity. The method comprises: performing word vector training on a retrieval library, and establishing a training model corresponding to the retrieval library (S1); receiving an input retrieval keyword, and obtaining related words of the retrieval keyword and the similarity between each of the related words and the retrieval keyword by means of the training model (S2); retrieving and matching the retrieval library using the related words, and respectively counting scores of matching between various files in the retrieval library and the related words according to the similarity (S3); and sorting the files in the retrieval library according to the matching scores from high to low, and outputting a retrieval result according to the sorting result (S4). By means of the method, the capabilities of retrieving and matching related words can be enhanced in combination with the lexical characteristics in various retrieval libraries, thereby improving the accuracy rate and the robustness of retrieval.

Description

Retrieval method and system based on word vector similarity
Technical field
The invention relates to the field of information retrieval, and in particular to a retrieval method and a retrieval system based on word vector similarity.
Background
Existing techniques for resume search and matching usually retrieve by multiple keywords. The user provides a set of keywords to search the search library; the number of keyword hits is used as the matching score, and the search results are output in descending order of matching score, on the assumption that higher-ranked results better meet the user's requirements. However, this search method has the following disadvantages:
(1) it does not take into account the wording characteristics of different search libraries, such as English letter case or full-width versus half-width characters;
(2) it does not consider the relationships between words, so during retrieval it cannot match other words that are strongly related to a keyword; for example, with the keyword "program", information about "software" in the search library cannot be matched;
(3) it places high demands on keyword selection and has poor robustness: a missing or mistyped keyword strongly affects the final search results.
In summary, existing keyword-based retrieval methods have unsatisfactory recall and result accuracy, and suffer from poor robustness and adaptability.
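The hit-count baseline criticized above, and disadvantage (2) in particular, can be reproduced in a few lines. This is a minimal sketch with hypothetical, already-segmented resumes; `hit_count_score` is an illustrative name.

```python
def hit_count_score(keywords, document_words):
    """Baseline score: how many of the query keywords occur in the document."""
    words = set(document_words)
    return sum(1 for keyword in keywords if keyword in words)


# Hypothetical resumes, already split into words.
resumes = {
    "resume_a": ["program", "design", "testing"],
    "resume_b": ["software", "design", "testing"],
}

scores = {name: hit_count_score(["program"], words)
          for name, words in resumes.items()}
```

A query for "program" scores "resume_b" at zero even though it mentions "software", because exact keyword matching has no notion of related words.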
Summary of the invention
In view of this, the present invention provides a retrieval method and system based on word vector similarity that can improve retrieval accuracy and robustness.
One aspect of the present invention provides a retrieval method based on word vector similarity, including:
performing word vector training on the search library to establish a training model corresponding to the search library;
receiving an input search keyword, and obtaining through the training model the related words of the search keyword and the similarity between each related word and the search keyword;
searching and matching the search library with the related words, and computing, according to the similarity, the matching score between each file in the search library and the related words;
sorting the files in the search library by matching score from high to low, and outputting the search result according to the sorting result.
Preferably, before performing word vector training on the search library, the method includes:
pre-processing each file in the search library and storing the pre-processed data of each file in a corresponding training sample file; the pre-processing includes data cleaning and extracting data descriptions;
the word vector training on the search library includes:
performing word vector training on the search library based on the training sample file.
Preferably, the data cleaning includes at least one of unifying letter case, removing extra spaces, unifying punctuation, and unifying full-width/half-width format;
the extracting of data descriptions includes word segmentation with an added user dictionary.
Preferably, the word vector training on the search library includes:
performing word vector training on the training sample file with word2vec.
Preferably, searching and matching the search library with the related words, and computing the matching score between each file in the search library and the related words according to the similarity, includes:
searching and matching each file in the search library with the related words to obtain the matching result between each file and the related words;
using the similarity of each related word as an accumulation weight and combining it with the matching results to obtain the matching score between each file and the related words.
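One way to read the accumulation-weight scoring is sketched below. Two points are assumptions, since the text does not settle them: the matching result is treated as simple presence or absence of a related word in the file, and the keyword itself is included among the related words with similarity 1.0. The words, similarities, and file contents are invented.

```python
def matching_score(related, document_words):
    """Sum the similarity of every related word that occurs in the document.

    related: dict mapping each related word to its similarity with the keyword.
    document_words: the document's words after segmentation.
    """
    present = set(document_words)
    return sum(similarity for word, similarity in related.items()
               if word in present)


# Hypothetical related words for the keyword "program".
related = {"program": 1.0, "software": 0.8, "code": 0.5}

files = {
    "file_1": ["software", "code", "review"],
    "file_2": ["program", "manager"],
    "file_3": ["weather", "report"],
}

scores = {name: matching_score(related, words) for name, words in files.items()}
```

Under these assumptions a file that never mentions the keyword can still outrank one that does, provided it contains several closely related words. Counting occurrences instead of presence would be an equally plausible variant.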
Another aspect of the present invention provides a retrieval system based on word vector similarity, comprising:
a model training unit configured to perform word vector training on the search library and establish a training model corresponding to the search library;
a related-word generating unit configured to receive an input search keyword and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword;
a search matching unit configured to search and match the search library with the related words, and to compute, according to the similarity, the matching score between each file in the search library and the related words;
a result output unit configured to sort the files in the search library by matching score from high to low, and output the search result according to the sorting result.
Preferably, the model training unit is further configured to pre-process each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file in a corresponding training sample file; the pre-processing includes data cleaning and extracting data descriptions;
the word vector training on the search library includes:
performing word vector training on the search library based on the training sample file.
Preferably, the data cleaning includes at least one of unifying letter case, removing extra spaces, unifying punctuation, and unifying full-width/half-width format;
the extracting of data descriptions includes word segmentation with an added user dictionary.
Preferably, the word vector training on the search library includes:
performing word vector training on the training sample file with word2vec.
Preferably, the search matching unit comprises:
a matching module configured to search and match each file in the search library with the related words to obtain the matching result between each file and the related words;
a statistics module configured to use the similarity of each related word as an accumulation weight and combine it with the matching results to obtain the matching score between each file and the related words.
The retrieval method and system based on word vector similarity of the above technical solution perform word vector training on the search library to establish a training model corresponding to the search library; receive an input search keyword and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword; search and match the search library with the related words, computing the matching score between each file and the related words according to the similarity; and sort the files in the search library by matching score from high to low, outputting the search result according to the sorting result. First, because the training model is trained on the search library itself, it reflects the wording characteristics of the search library well, which helps improve retrieval accuracy. Second, keywords are represented as word vectors and retrieval matches on the related words of each keyword, which adds the ability to match related words and thus improves retrieval robustness.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a retrieval system based on word vector similarity according to an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The embodiments provided by the present invention include an embodiment of a retrieval method based on word vector similarity and a corresponding embodiment of a retrieval system based on word vector similarity. Each is described in detail below.
FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention. As shown in FIG. 1, the retrieval method of this embodiment includes steps S1 to S4, detailed as follows:
S1: perform word vector training on the search library, and establish a training model corresponding to the search library;
To turn a natural language understanding problem into a machine learning problem, the first step is to find a way to represent the symbols mathematically, for example by representing each word as its own vector. "Word vector" (词向量) is the common Chinese name for "Word Representation" or "Word Embedding".
The word vectors in this embodiment should have the following property: related or similar words are closer together in distance. For example, the distance between "麦克" (mic) and "话筒" (microphone) is far smaller than the distance between "麦克" (mic) and "天气" (weather). The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.
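The two distance measures named above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the three-dimensional vectors are invented for the example, whereas real word vectors would come from the trained model.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for illustration only.
mic = [0.9, 0.1, 0.0]
microphone = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(euclidean_distance(mic, microphone), euclidean_distance(mic, weather))
print(cosine_similarity(mic, microphone), cosine_similarity(mic, weather))
```

With these toy values, the related pair is closer under both measures: its Euclidean distance is smaller and its cosine similarity is larger.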
Preferably, the word vector may be a word vector in a Distributed Representation. A word vector in a Distributed Representation is a low-dimensional real-valued vector, typically of the form [0.792, -0.177, -0.107, 0.109, -0.542, …]; dimensions of 50 and 100 are the most common.
As a preferred implementation, before word vector training is performed on the search library, each file in the search library may be preprocessed separately, and the preprocessed data of each file is stored in a corresponding training sample file.
Preferably, the preprocessing includes data cleaning and data description extraction. Data cleaning mainly serves to make the data in the search library consistent, and may specifically include at least one of: unifying letter case, removing redundant spaces, unifying punctuation, and unifying full-width and half-width forms. Data description extraction includes word segmentation with an added user dictionary; specifically, a user dictionary may be added and segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
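The data cleaning operations listed above could look roughly like the sketch below. It assumes that Unicode NFKC normalization is an acceptable way to unify full-width and half-width forms, which the text does not specify, and the function name `clean_text` is hypothetical. Word segmentation with NLPIR is deliberately left out, since its API is not described here.

```python
import re
import unicodedata

def clean_text(text):
    """Normalize one document so the same token is always written the same way."""
    # NFKC folds full-width letters, digits, punctuation and the ideographic
    # space into their half-width compatibility forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify letter case.
    text = text.lower()
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Full-width "C++", an ideographic space, and a full-width comma.
sample = "  \uff23\uff0b\uff0b\u3000 Developer\uff0cMFC  "
print(clean_text(sample))
```

The cleaned text would then be segmented into words and written to the training sample file.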
Further, word vector training is performed on the search library based on the training sample file, to establish the training model corresponding to the search library. Specifically, word vector training may be performed on the training sample file with word2vec, with the following training settings:
./word2vec -train result_cropus.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min_count 3
The parameters have the following meanings:
-train is followed by the name of the training sample file used for training,
-cbow selects the training architecture: 1 uses the continuous bag-of-words (CBOW) model, and 0, as set here, uses the skip-gram model,
-size is the dimension of the word vectors,
-window is the length of the context window,
-negative indicates whether negative sampling is used: 0 means not used, 1 means used,
-hs indicates whether the HS (hierarchical softmax) method is used: 0 means not used, 1 means used,
-sample 1e-3 sets the sampling threshold to 10^-3; the more frequently a word occurs in the training sample, the more likely it is to be down-sampled,
-threads is the number of threads to start,
-binary indicates whether the output is a binary file: 0 means not used, 1 means used,
-min_count sets the minimum frequency, with a default of 5; a word that occurs in the documents fewer times than this threshold is discarded.
At this point, the training model corresponding to the search library is obtained.
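When training is driven from a script, the command line above can be assembled from a settings table. The sketch below only builds and prints the argument list. The helper `build_word2vec_command` and the `settings` dictionary are hypothetical names, and the flag spellings are copied from the command given in the text rather than checked against any particular word2vec build.

```python
def build_word2vec_command(train_file, output_file, settings):
    """Return the argument list for the word2vec invocation described in the text."""
    command = ["./word2vec", "-train", train_file, "-output", output_file]
    for flag, value in settings.items():
        command += ["-" + flag, str(value)]
    return command

# Values taken from the training settings listed in the text.
settings = {
    "cbow": 0,
    "size": 50,
    "window": 5,
    "negative": 0,
    "hs": 1,
    "sample": "1e-3",
    "threads": 4,
    "binary": 1,
    "min_count": 3,
}

command = build_word2vec_command("result_cropus.txt", "vectors.bin", settings)
print(" ".join(command))
# To actually train, one could pass `command` to subprocess.run(command, check=True),
# assuming a compiled word2vec binary is present in the working directory.
```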
S2: receive an input search keyword, and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword;
In this embodiment, the similarity of two word vectors is their cosine similarity, with a maximum of 1 and a minimum of 0. Because the training model is trained on the search library, the related words obtained from it reflect the vocabulary characteristics of the search library well. Specifically, the related words and similarities may be produced with the ./distance vectors.bin command, and generated automatically by means of an sh script and an expect script.
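The lookup performed in this step can be sketched in pure Python. The sketch below is a stand-in for the ./distance tool, not a description of it: it assumes the trained vectors have already been loaded into a dictionary, and both the vocabulary and the vector values are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def related_words(keyword, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to the keyword."""
    query = vectors[keyword]
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word != keyword]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical vectors; a real model would supply 50-dimensional ones.
vectors = {
    "c++": [0.9, 0.1, 0.1],
    "mfc": [0.8, 0.3, 0.1],
    "stl": [0.85, 0.15, 0.2],
    "weather": [0.0, 0.2, 0.9],
}

for word, similarity in related_words("c++", vectors, top_n=2):
    print(word, round(similarity, 3))
```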
For example, to search a résumé library for a C++ software development engineer, the input keywords are C++, software, MFC, and data structure. Based on the training model of the résumé library, the following related-word list and similarities are obtained, as shown in the table below:
[Table image: Figure PCTCN2016098234-appb-000001, the related-word list and similarities for the example keywords]
S3: search and match the search library with the related words, and compute, according to the similarities, the matching score between each file in the search library and the related words;
In this embodiment, each file in the search library is searched and matched with the related words obtained in the above step, yielding a matching result between each file and the related words; the similarity of each related word is used as an accumulation weight and combined with the matching results to obtain the matching score between each file and the related words.
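A minimal sketch of this scoring step follows. It assumes the matching result is binary (a related word either occurs in the file text or does not) and that the keyword itself is included with weight 1.0; the text leaves both choices open. The related words and documents are invented.

```python
def matching_score(document_text, related):
    """Sum the similarity weights of the related words found in the document."""
    score = 0.0
    for word, similarity in related.items():
        if word in document_text:  # binary match: an assumption of this sketch
            score += similarity
    return score

# Hypothetical related words with their similarities to the keyword "c++".
related = {"c++": 1.0, "mfc": 0.8, "stl": 0.7}

# Hypothetical, already cleaned file contents.
documents = {
    "resume_a.txt": "experienced in c++ and mfc",
    "resume_b.txt": "familiar with stl containers",
    "resume_c.txt": "sales and marketing background",
}

scores = {name: matching_score(text, related) for name, text in documents.items()}
print(scores)
```

A count-based variant, which multiplies each similarity by the number of occurrences, would fit the same description; which one is intended is not stated.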
S4: sort the files in the search library by matching score from high to low, and output the search results according to the sorting result.
Preferably, a score threshold may be set, so that only search results whose matching score is higher than the threshold are sorted and then output in descending order of matching score. Filtering the search results further with a score threshold makes them easier for the user to review.
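The thresholding and ordering described here can be sketched as below. The function name `rank_results` and the scores are hypothetical.

```python
def rank_results(scores, threshold=0.0):
    """Keep files scoring above the threshold, highest score first."""
    kept = [(name, score) for name, score in scores.items() if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept

# Hypothetical matching scores per file.
scores = {"resume_b.txt": 0.7, "resume_a.txt": 1.8, "resume_c.txt": 0.0}

for name, score in rank_results(scores, threshold=0.5):
    print(name, score)
```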
With the retrieval method based on word vector similarity of the above embodiment, word vector training is performed on the search library to establish a training model corresponding to the search library; an input search keyword is received, and the related words of the search keyword, together with the similarity between each related word and the search keyword, are obtained through the training model; the search library is searched and matched using the related words, and a matching score between each file in the search library and the related words is computed according to the similarities; the files in the search library are sorted by matching score from high to low, and the search results are output according to the sorting result. First, because the training model is trained on the search library itself, it reflects the vocabulary characteristics of the search library well, which helps improve retrieval accuracy. Second, because keywords are represented as word vectors and retrieval matches on the related words of each keyword, the ability to match related words is added, which improves retrieval robustness.
It should be noted that, for brevity, the foregoing method embodiment is described as a series of combined actions. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously.
An embodiment of a retrieval system based on word vector similarity, which can be used to perform the above retrieval method, is described below. For ease of description, the structural diagram of the system embodiment shows only the parts related to the embodiments of the present invention. Those skilled in the art will understand that the system structure shown in the figure does not limit the system, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
FIG. 2 is a schematic structural diagram of the retrieval system based on word vector similarity according to an embodiment of the present invention. As shown in FIG. 2, the retrieval system of this embodiment includes a model training unit 210, a related-word generation unit 220, a search matching unit 230, and a result output unit 240, each detailed below:
The model training unit 210 is configured to perform word vector training on the search library and establish a training model corresponding to the search library;
The word vectors in this embodiment should have the following property: related or similar words are closer together in distance. For example, the distance between "麦克" (mic) and "话筒" (microphone) is far smaller than the distance between "麦克" (mic) and "天气" (weather). The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.
Preferably, the word vector may be a word vector in a Distributed Representation. A word vector in a Distributed Representation is a low-dimensional real-valued vector, typically of the form [0.792, -0.177, -0.107, 0.109, -0.542, …]; dimensions of 50 and 100 are the most common.
As a preferred implementation, the model training unit 210 is further configured to preprocess each file in the search library before word vector training is performed on the search library, and to store the preprocessed data of each file in a corresponding training sample file, so that word vector training is performed on the search library based on the training sample file. The preprocessing includes data cleaning and data description extraction. The data cleaning includes at least one of: unifying letter case, removing redundant spaces, unifying punctuation, and unifying full-width and half-width forms. The data description extraction includes word segmentation with an added user dictionary; specifically, a user dictionary may be added and segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
Preferably, word vector training may be performed on the training sample file with word2vec, with the following training settings:
./word2vec -train result_cropus.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min_count 3
The parameters have the following meanings:
-train is followed by the name of the training sample file used for training,
-cbow selects the training architecture: 1 uses the continuous bag-of-words (CBOW) model, and 0, as set here, uses the skip-gram model,
-size is the dimension of the word vectors,
-window is the length of the context window,
-negative indicates whether negative sampling is used: 0 means not used, 1 means used,
-hs indicates whether the HS (hierarchical softmax) method is used: 0 means not used, 1 means used,
-sample 1e-3 sets the sampling threshold to 10^-3,
-threads is the number of threads to start,
-binary indicates whether the output is a binary file: 0 means not used, 1 means used,
-min_count sets the minimum frequency, with a default of 5.
Further, the related-word generation unit 220 is configured to receive an input search keyword, and to obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword;
In this embodiment, the similarity of two word vectors is their cosine similarity, with a maximum of 1 and a minimum of 0. Because the training model is trained on the search library, the related words obtained from it reflect the vocabulary characteristics of the search library well.
The search matching unit 230 is configured to search and match the search library with the related words, and to compute, according to the similarities, the matching score between each file in the search library and the related words;
Preferably, the search matching unit 230 may specifically include: a matching module, configured to search and match each file in the search library with the related words, to obtain a matching result between each file and the related words; and a statistics module, configured to use the similarity of each related word as an accumulation weight and combine it with the matching results to obtain the matching score between each file and the related words.
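One possible reading of the statistics module can be written as a weighted sum. The notation is an assumption introduced here and does not appear in the original: $w_1, \dots, w_n$ are the related words, $s_i$ is the similarity of $w_i$ to the search keyword, and $m(d, w_i) \ge 0$ is the matching result of file $d$ for $w_i$ (for instance 1 for a hit and 0 otherwise, which the text does not specify).

```latex
\mathrm{score}(d) = \sum_{i=1}^{n} s_i \, m(d, w_i)
```

Under this reading, a file that matches many highly similar related words receives a high score, while a file that matches only weakly related words receives a low one.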
The result output unit 240 is configured to sort the files in the search library by matching score from high to low, and to output the search results according to the sorting result.
In this embodiment, a score threshold may also be set, so that only search results whose matching score is higher than the threshold are sorted and then output in descending order of matching score. Filtering the search results further with a score threshold makes them easier for the user to review.
It should be noted that, in the above implementation of the retrieval system based on word vector similarity, the information exchange between the modules/units, their execution processes, and similar content are based on the same concept as the foregoing method embodiment of the present invention and bring the same technical effects. For details, refer to the description in the method embodiment; they are not repeated here.
In addition, in the above implementation of the retrieval system, the logical division into functional modules is only an example. In practice, the functions may be assigned to different functional modules as needed, for example to meet the configuration requirements of the corresponding hardware or for convenience of software implementation. That is, the internal structure of the retrieval system may be divided into different functional modules to perform all or part of the functions described above.
Furthermore, in the above implementation of the retrieval system, the functional modules may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. An integrated module may be implemented in hardware or as a software functional module.
If the integrated module is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. A person of ordinary skill in the art will understand that all or part of the steps of the method specified in any embodiment of the present invention may be completed by a program instructing the relevant hardware (a personal computer, a server, a network device, or the like). The program may be stored in a computer-readable storage medium, and when executed it may perform all or part of the steps of the method specified in any of the above embodiments. The storage medium may be any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
The embodiments described above express only several implementations of the present invention and should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (10)

  1. A retrieval method based on word vector similarity, characterized by comprising:
    performing word vector training on a search library, and establishing a training model corresponding to the search library;
    receiving an input search keyword, and obtaining, through the training model, related words of the search keyword and the similarity between each related word and the search keyword;
    searching and matching the search library with the related words, and computing, according to the similarities, a matching score between each file in the search library and the related words;
    sorting the files in the search library by matching score from high to low, and outputting search results according to the sorting result.
  2. The retrieval method based on word vector similarity according to claim 1, characterized in that, before the word vector training is performed on the search library, the method comprises:
    preprocessing each file in the search library separately, and storing the preprocessed data of each file in a corresponding training sample file, wherein the preprocessing comprises data cleaning and data description extraction;
    and the performing word vector training on the search library comprises:
    performing word vector training on the search library based on the training sample file.
  3. The retrieval method based on word vector similarity according to claim 2, characterized in that the data cleaning comprises at least one of: unifying letter case, removing redundant spaces, unifying punctuation, and unifying full-width and half-width forms;
    and the data description extraction comprises word segmentation with an added user dictionary.
  4. The retrieval method based on word vector similarity according to claim 2, characterized in that the performing word vector training on the search library comprises:
    performing word vector training on the training sample file with word2vec.
  5. The retrieval method based on word vector similarity according to claim 1, characterized in that the searching and matching the search library with the related words, and computing, according to the similarities, a matching score between each file in the search library and the related words comprises:
    searching and matching each file in the search library with the related words, to obtain a matching result between each file and the related words;
    using the similarity of each related word as an accumulation weight, and combining it with the matching results to obtain the matching score between each file and the related words.
  6. A retrieval system based on word vector similarity, characterized by comprising:
    a model training unit, configured to perform word vector training on a search library and establish a training model corresponding to the search library;
    a related-word generation unit, configured to receive an input search keyword, and to obtain, through the training model, related words of the search keyword and the similarity between each related word and the search keyword;
    a search matching unit, configured to search and match the search library with the related words, and to compute, according to the similarities, a matching score between each file in the search library and the related words;
    a result output unit, configured to sort the files in the search library by matching score from high to low, and to output search results according to the sorting result.
  7. The retrieval system based on word vector similarity according to claim 6, characterized in that the model training unit is further configured to preprocess each file in the search library separately before word vector training is performed on the search library, and to store the preprocessed data of each file in a corresponding training sample file, wherein the preprocessing comprises data cleaning and data description extraction;
    and the performing word vector training on the search library comprises:
    performing word vector training on the search library based on the training sample file.
  8. The retrieval system based on word vector similarity according to claim 7, characterized in that the data cleaning comprises at least one of: unifying letter case, removing redundant spaces, unifying punctuation, and unifying full-width and half-width forms;
    and the data description extraction comprises word segmentation with an added user dictionary.
  9. The retrieval system based on word vector similarity according to claim 7, characterized in that the performing word vector training on the search library comprises:
    performing word vector training on the training sample file with word2vec.
  10. The retrieval system based on word vector similarity according to claim 6, characterized in that the search matching unit comprises:
    a matching module, configured to search and match each file in the search library with the related words, to obtain a matching result between each file and the related words;
    a statistics module, configured to use the similarity of each related word as an accumulation weight and combine it with the matching results to obtain the matching score between each file and the related words.
PCT/CN2016/098234 2015-12-25 2016-09-06 Retrieval method and system based on word vector similarity WO2017107566A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201511003865.4 2015-12-25
CN201511003865.4A CN105631009A (en) 2015-12-25 2015-12-25 Word vector similarity based retrieval method and system

Publications (1)

Publication Number Publication Date
WO2017107566A1 (en) 2017-06-29

Family

ID=56045942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/098234 WO2017107566A1 (en) 2015-12-25 2016-09-06 Retrieval method and system based on word vector similarity

Country Status (2)

Country Link
CN (1) CN105631009A (en)
WO (1) WO2017107566A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165382A (en) * 2018-08-03 2019-01-08 南京工业大学 A kind of similar defect report recommended method that weighted words vector sum latent semantic analysis combines
CN109582771A (en) * 2018-11-26 2019-04-05 国网湖南省电力有限公司 Smart client exchange method towards power domain based on mobile application
CN109933779A (en) * 2017-12-18 2019-06-25 苏宁云商集团股份有限公司 User's intension recognizing method and system
CN110084658A (en) * 2018-01-26 2019-08-02 北京京东尚科信息技术有限公司 The matched method and apparatus of article
CN111104488A (en) * 2019-12-30 2020-05-05 广州广电运通信息科技有限公司 Method, device and storage medium for integrating retrieval and similarity analysis
CN111625468A (en) * 2020-06-05 2020-09-04 中国银行股份有限公司 Test case duplicate removal method and device
CN112711648A (en) * 2020-12-23 2021-04-27 航天信息股份有限公司 Database character string ciphertext storage method, electronic device and medium
CN113515621A (en) * 2021-04-02 2021-10-19 中国科学院深圳先进技术研究院 Data retrieval method, device, equipment and computer readable storage medium
CN113569006A (en) * 2021-06-17 2021-10-29 国家电网有限公司 Large-scale data quality anomaly detection method based on data characteristics
CN116431838A (en) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631009A (en) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 Word vector similarity based retrieval method and system
CN106407311B (en) * 2016-08-30 2020-07-24 北京百度网讯科技有限公司 Method and device for obtaining search result
CN106886567B (en) * 2017-01-12 2019-11-08 北京航空航天大学 Microblogging incident detection method and device based on semantic extension
CN107330023B (en) * 2017-06-21 2021-02-12 北京百度网讯科技有限公司 Text content recommendation method and device based on attention points
CN110610695B (en) * 2018-05-28 2022-05-17 宁波方太厨具有限公司 Speech recognition method based on isolated words and range hood applying same
CN109190046A (en) * 2018-09-18 2019-01-11 北京点网聚科技有限公司 Content recommendation method, device and content recommendation service device
CN110110333A (en) * 2019-05-08 2019-08-09 上海数据交易中心有限公司 A kind of search method and system interconnecting object
CN110309278B (en) * 2019-05-23 2021-11-16 泰康保险集团股份有限公司 Keyword retrieval method, device, medium and electronic equipment
CN110674087A (en) * 2019-09-03 2020-01-10 平安科技(深圳)有限公司 File query method and device and computer readable storage medium
CN110909789A (en) * 2019-11-20 2020-03-24 精硕科技(北京)股份有限公司 Sound volume prediction method and device, electronic equipment and storage medium
CN111625621B (en) * 2020-04-27 2023-05-09 中国铁道科学研究院集团有限公司电子计算技术研究所 Document retrieval method and device, electronic equipment and storage medium
CN112650833A (en) * 2020-12-25 2021-04-13 哈尔滨工业大学(深圳) API (application program interface) matching model establishing method and cross-city government affair API matching method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN105005589A (en) * 2015-06-26 2015-10-28 腾讯科技(深圳)有限公司 Text classification method and text classification device
CN105631009A (en) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 Word vector similarity based retrieval method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
CN105005589A (en) * 2015-06-26 2015-10-28 腾讯科技(深圳)有限公司 Text classification method and text classification device
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN105631009A (en) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 Word vector similarity based retrieval method and system

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933779A (en) * 2017-12-18 2019-06-25 苏宁云商集团股份有限公司 User's intension recognizing method and system
CN110084658A (en) * 2018-01-26 2019-08-02 北京京东尚科信息技术有限公司 The matched method and apparatus of article
CN110084658B (en) * 2018-01-26 2024-01-16 北京京东尚科信息技术有限公司 Method and device for matching articles
CN109165382B (en) * 2018-08-03 2022-08-23 南京工业大学 Similar defect report recommendation method combining weighted word vector and potential semantic analysis
CN109165382A (en) * 2018-08-03 2019-01-08 南京工业大学 Similar defect report recommendation method combining weighted word vectors and latent semantic analysis
CN109582771A (en) * 2018-11-26 2019-04-05 国网湖南省电力有限公司 Smart client exchange method towards power domain based on mobile application
CN111104488A (en) * 2019-12-30 2020-05-05 广州广电运通信息科技有限公司 Method, device and storage medium for integrating retrieval and similarity analysis
CN111104488B (en) * 2019-12-30 2023-10-24 广州广电运通信息科技有限公司 Method, device and storage medium for integrating retrieval and similarity analysis
CN111625468A (en) * 2020-06-05 2020-09-04 中国银行股份有限公司 Test case duplicate removal method and device
CN111625468B (en) * 2020-06-05 2024-04-16 中国银行股份有限公司 Test case duplicate removal method and device
CN112711648A (en) * 2020-12-23 2021-04-27 航天信息股份有限公司 Database character string ciphertext storage method, electronic device and medium
CN113515621A (en) * 2021-04-02 2021-10-19 中国科学院深圳先进技术研究院 Data retrieval method, device, equipment and computer readable storage medium
CN113515621B (en) * 2021-04-02 2024-03-29 中国科学院深圳先进技术研究院 Data retrieval method, device, equipment and computer readable storage medium
CN113569006A (en) * 2021-06-17 2021-10-29 国家电网有限公司 Large-scale data quality anomaly detection method based on data characteristics
CN116431838A (en) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium
CN116431838B (en) * 2023-06-15 2024-01-30 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium

Also Published As

Publication number Publication date
CN105631009A (en) 2016-06-01

Similar Documents

Publication Publication Date Title
WO2017107566A1 (en) Retrieval method and system based on word vector similarity
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN109960724B (en) Text summarization method based on TF-IDF
CN109101479B (en) Clustering method and device for Chinese sentences
JP5608817B2 (en) Target word recognition using specified characteristic values
US8073877B2 (en) Scalable semi-structured named entity detection
US9087297B1 (en) Accurate video concept recognition via classifier combination
Zhang et al. Extractive document summarization based on convolutional neural networks
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
US10482146B2 (en) Systems and methods for automatic customization of content filtering
US11928875B2 (en) Layout-aware, scalable recognition system
CN106557777B (en) Improved K-means document clustering method based on SimHash
CN108052500B (en) Text key information extraction method and device based on semantic analysis
CN108038099B (en) Low-frequency keyword identification method based on word clustering
Saenko et al. Unsupervised learning of visual sense models for polysemous words
US20190318191A1 (en) Noise mitigation in vector space representations of item collections
CN109446299B (en) Method and system for searching e-mail content based on event recognition
CN112989802A (en) Barrage keyword extraction method, device, equipment and medium
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
Twinandilla et al. Multi-document summarization using k-means and latent dirichlet allocation (lda)–significance sentences
CN116738988A (en) Text detection method, computer device, and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
WO2021227951A1 (en) Naming of front-end page element
US20050149846A1 (en) Apparatus, method, and program for text classification using frozen pattern

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 16877377; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 16877377; Country of ref document: EP; Kind code of ref document: A1
32PN EP: Public notification in the EP bulletin as address of the addressee cannot be established. Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 15/11/2018)
122 EP: PCT application non-entry in European phase. Ref document number: 16877377; Country of ref document: EP; Kind code of ref document: A1