WO2017107566A1

WO2017107566A1 - Retrieval method and system based on word vector similarity

Info

Publication number: WO2017107566A1
Application number: PCT/CN2016/098234
Authority: WO
Inventors: 李贤�
Original assignee: 广州视源电子科技股份有限公司
Priority date: 2015-12-25
Filing date: 2016-09-06
Publication date: 2017-06-29
Also published as: CN105631009A

Abstract

A retrieval method and system based on a word vector similarity. The method comprises: performing word vector training on a retrieval library, and establishing a training model corresponding to the retrieval library (S1); receiving an input retrieval keyword, and obtaining related words of the retrieval keyword and the similarity between each of the related words and the retrieval keyword by means of the training model (S2); retrieving and matching the retrieval library using the related words, and respectively counting scores of matching between various files in the retrieval library and the related words according to the similarity (S3); and sorting the files in the retrieval library according to the matching scores from high to low, and outputting a retrieval result according to the sorting result (S4). By means of the method, the capabilities of retrieving and matching related words can be enhanced in combination with the lexical characteristics in various retrieval libraries, thereby improving the accuracy rate and the robustness of retrieval.

Description

Search method and system based on word vector similarity

Technical field

The invention relates to the field of information retrieval technology, in particular to a retrieval method based on word vector similarity and a retrieval system based on word vector similarity.

Background art

Existing techniques for resume search and matching usually rely on multi-keyword search. The user provides a set of keywords to search the search library, the number of keyword hits is used as the matching score, and the search results are output in order of matching score from high to low, on the assumption that the top-ranked results best meet the user's requirements. However, this search method has the following disadvantages:

(1) It fails to take into account formatting differences in the text, such as English capitalization and full-width versus half-width characters;

(2) It cannot take the relationships between words into account, so the retrieval process cannot match other words that are strongly related to the keywords; for example, when the keyword is set to "program", information about "software" in the search library cannot be retrieved and matched;

(3) The requirements for keyword selection are high, and retrieval robustness is poor; if keywords are missing or mistyped, the final search results are greatly affected.

In summary, the existing keyword-based retrieval method has a poor recall rate and poor result accuracy, and suffers from poor robustness and adaptability.

Summary of the invention

Based on this, the present invention provides a retrieval method and system based on word vector similarity, which can improve retrieval accuracy and robustness.

An aspect of the present invention provides a retrieval method based on word vector similarity, including:

Performing a word vector training on the search library, and establishing a training model corresponding to the search library;

Receiving an input search keyword, and obtaining, by the training model, a related word of the search keyword, and a similarity between each related word and the search keyword;

Searching and matching the search library with the related words, and separately counting matching scores of each file in the search library and the related words according to the similarity;

Sorting the files in the search library according to the matching scores from high to low, and outputting the search results according to the sorting result.

Preferably, the performing word vector training on the search library comprises:

Performing pre-processing on each file in the search library, and storing the pre-processed data of each file into a corresponding training sample file; the pre-processing includes data cleaning and extracting data description;

The word vector training for the search library includes:

Word vector training is performed on the search library based on the training sample file.

Preferably, the data cleaning comprises at least one of uniform case, elimination of extra spaces, unified punctuation, and unified full-width format;

The extracting the data description includes segmentation by adding a user dictionary.

Preferably, performing word vector training on the search library comprises: performing word vector training on the training sample file by word2vec.

Preferably, the search library is searched and matched by using the related words, and the matching scores of each file in the search library and the related words are respectively counted according to the similarity, including:

Searching and matching each file in the search library by using the related words, and obtaining matching results of each file and the related words;

The similarity corresponding to each related word is taken as a cumulative weighting value, and the matching scores of each file and the related words are respectively obtained by combining the matching results.

Another aspect of the present invention provides a retrieval system based on word vector similarity, comprising:

a model training unit, configured to perform word vector training on the search library, and establish a training model corresponding to the search library;

a related word generating unit, configured to receive an input search keyword, and obtain, through the training model, related words of the search keyword and a similarity between each related word and the search keyword;

a search matching unit, configured to search and match the search library with the related words, and separately count matching scores of each file in the search library and the related words according to the similarity;

a result output unit, configured to sort files in the search library according to the matching score from high to low, and output a search result according to the sorting result.

Preferably, the model training unit is further configured to perform pre-processing on each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file in a corresponding training sample file; the pre-processing includes data cleaning and extracting a data description;

The word vector training for the search library includes:

Word vector training is performed on the training sample file by word2vec.

Preferably, the search matching unit comprises:

a matching module, configured to perform a search and match on each file in the search library by using the related words, to obtain a matching result of each file and the related words;

a statistic module, configured to use the similarity corresponding to each related word as a cumulative weighting value, and combine it with the matching results to obtain the matching score of each file with the related words.

The retrieval method and system based on word vector similarity of the above technical solution establish a training model corresponding to the search library by performing word vector training on the search library; receive an input search keyword, and obtain, through the training model, related words of the search keyword and a similarity between each related word and the search keyword; search and match the search library with the related words, and separately count the matching score of each file in the search library with the related words according to the similarity; and sort the files in the search library according to the matching scores from high to low, and output the search results according to the sorting result. Firstly, because the training model is trained on the search library itself, it reflects the characteristics of the search library well, which helps improve retrieval accuracy. Secondly, the keywords are expressed in the form of word vectors and their related words are also searched and matched, which increases the ability to search and match related words and thus improves retrieval robustness.

Description of the drawings

FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a retrieval system based on word vector similarity according to an embodiment of the present invention.

Detailed description

The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Embodiments provided by the present invention include a retrieval method embodiment based on word vector similarity, and a corresponding retrieval system embodiment based on word vector similarity. The details are described below separately.

FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention. As shown in FIG. 1, the retrieval method of the present embodiment includes the following steps S1 to S4, detailed as follows:

S1, performing word vector training on the search library, and establishing a training model corresponding to the search library;

To turn a natural language understanding problem into a machine learning problem, the first step is to find a way to represent the language symbols mathematically, for example by expressing each word as a unique vector. "Word vector" is the common Chinese name for "Word Representation" or "Word Embedding".

The word vectors in this embodiment should have the following property: related or similar words are closer together in distance. For example, the distance between "Mike" and "Microphone" is much smaller than the distance between "Mike" and "Weather". The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.
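The two distance measures named above can be illustrated with a minimal Python sketch. The function names and the three-dimensional toy vectors are assumptions made for illustration only; they are not values produced by any trained model.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (made-up values, for illustration only).
mike = [0.9, 0.1, 0.0]
microphone = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(euclidean_distance(mike, microphone), euclidean_distance(mike, weather))
print(cosine_similarity(mike, microphone), cosine_similarity(mike, weather))
```

With these toy values, the related pair is both closer in Euclidean distance and higher in cosine similarity than the unrelated pair.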

Preferably, the word vector may be represented by a Distributed Representation, that is, as a low-dimensional real-valued vector. Such a vector generally has the form [0.792, -0.177, -0.17, 0.109, -0.542, ...], and dimensions of 50 and 100 are the most common.

As a preferred embodiment, before the word vector training is performed on the search library, each file in the search library may be separately preprocessed, and the preprocessed data of each file is stored in a corresponding training sample file.

Preferably, the pre-processing comprises data cleaning and extracting a data description. The data cleaning is mainly used to make the data in the search library consistent, and may specifically include at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying full-width/half-width format. Extracting the data description includes word segmentation with an added user dictionary; specifically, a user dictionary can be added and the text segmented by NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
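The data cleaning step could look roughly like the following Python sketch. The function name clean_text is hypothetical, and the use of Unicode NFKC normalization to fold full-width characters to half-width is an assumption about one possible implementation; the text does not specify how the unification is done. Word segmentation with NLPIR is not shown.

```python
import re
import unicodedata

def clean_text(text):
    """Normalize one document so that equivalent strings compare equal."""
    # NFKC folds full-width letters, digits and ASCII-range punctuation
    # (and the ideographic space) to their half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify case.
    text = text.lower()
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Full-width "C++" followed by an ideographic space, written as escapes.
raw = "\uff23\uff0b\uff0b\u3000 Software  Engineer "
print(clean_text(raw))
```

After cleaning, a keyword typed as "c++" and a resume containing the full-width form would refer to the same token.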

Further, word vector training is performed on the search library based on the training sample file to establish a training model corresponding to the search library. Specifically, word vector training may be performed on the training sample file using word2vec, with the following training settings:

./word2vec -train result_cropus.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min_count 3

Among them, the meaning of each parameter is:

-train indicates the name of the training sample file used for training,

-cbow indicates whether to use the continuous bag-of-words (CBOW) model; 0 means not used, 1 means used,

-size indicates the dimension of the word vectors,

-window indicates the length of the context window,

-negative indicates whether to use the negative sampling method; 0 means not used, 1 means used,

-hs indicates whether to use the hierarchical softmax (HS) method; 0 means not used, 1 means used,

-sample 1e-3 indicates that the sampling threshold is 10^-3; the more frequently a word appears in the training sample, the more aggressively it is down-sampled,

-threads indicates the number of threads to use,

-binary indicates whether the output is a binary file; 0 means not used, 1 means used,

-min_count indicates the minimum frequency threshold, which defaults to 5; a word that appears in the document fewer times than the threshold is discarded.

So far, a training model corresponding to the search library can be obtained.
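As an illustration, the training command quoted above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; the function name build_word2vec_command is hypothetical, and the flag spellings are copied from the text rather than checked against a particular word2vec build.

```python
def build_word2vec_command(train_file, output_file, size=50, window=5,
                           min_count=3, threads=4):
    """Return the argument list for the training settings quoted in the text."""
    return [
        "./word2vec",
        "-train", train_file,       # training sample file
        "-output", output_file,     # where the trained vectors are written
        "-cbow", "0",               # CBOW switched off
        "-size", str(size),         # word vector dimension
        "-window", str(window),     # context window length
        "-negative", "0",           # negative sampling switched off
        "-hs", "1",                 # hierarchical softmax switched on
        "-sample", "1e-3",          # sampling threshold
        "-threads", str(threads),   # number of threads
        "-binary", "1",             # binary output
        "-min_count", str(min_count),  # flag spelling as given in the text
    ]

command = build_word2vec_command("result_cropus.txt", "vectors.bin")
print(" ".join(command))
```

If the word2vec binary is available in the working directory, a list like this could be passed to subprocess.run(command, check=True) to start training.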

S2: receiving an input search keyword, obtaining, by the training model, a related word of the search keyword, and a similarity between each related word and the search keyword;

In this embodiment, the similarity of two word vectors refers to their cosine similarity, which is at most 1 and at least 0. Since the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well. Specifically, the related words and similarities can be generated by the ./distance vectors.bin command, and the generation can be automated with an sh script and an expect script.
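What step S2 computes can be sketched in Python as a nearest-neighbour lookup by cosine similarity. This is not the ./distance tool itself; the function names and the toy word-to-vector table are assumptions for illustration, with made-up values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related_words(keyword, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to keyword."""
    if keyword not in vectors:
        return []  # keyword was not seen during training
    target = vectors[keyword]
    scored = [(word, cosine_similarity(target, vec))
              for word, vec in vectors.items() if word != keyword]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy model: word -> vector (values are made up).
toy_vectors = {
    "program": [0.9, 0.1, 0.0],
    "software": [0.8, 0.3, 0.1],
    "code": [0.7, 0.1, 0.3],
    "weather": [0.0, 0.2, 0.9],
}
neighbours = related_words("program", toy_vectors, top_n=2)
print(neighbours)
```

A keyword that is absent from the model yields an empty list here, so the caller can fall back to matching the keyword alone.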

For example, suppose a C++ software development engineer is to be searched for in a resume database, and the input keywords are C++, software, MFC, and data structure. Based on the training model of the resume database, the following related words and their similarities can be obtained; see the following table for details:

S3, performing search matching on the search library by using the related words, and separately counting matching scores of each file in the search library and the related words according to the similarity;

In this embodiment, the related words obtained in the above steps are used to search and match each file in the search library, yielding the matching results of each file with the related words; the similarity corresponding to each related word is then used as a cumulative weighting value and combined with the matching results to obtain the matching score of each file with the related words.
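A minimal Python sketch of this similarity-weighted scoring follows. It assumes that each related word contributes its similarity once per file when it is found (the text does not say whether repeated hits count more than once), and that the keyword itself is included with similarity 1.0. All names and values are hypothetical.

```python
def matching_score(file_text, related):
    """Sum the similarity of every related word found in file_text."""
    score = 0.0
    for word, similarity in related:
        if word in file_text:  # simple substring hit, counted once per word
            score += similarity
    return score

# Hypothetical related words with similarities (the keyword itself is 1.0).
related = [("program", 1.0), ("software", 0.9), ("code", 0.8)]
files = {
    "resume_a.txt": "software engineer, writes code daily",
    "resume_b.txt": "weather forecaster",
}
scores = {name: matching_score(text, related) for name, text in files.items()}
print(scores)
```

Here resume_a.txt scores even though it never contains the keyword "program", which is the behaviour plain keyword-hit counting lacks.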

S4. Sort the files in the search library according to the matching score from high to low, and output the search result according to the sorting result.

Preferably, a score threshold can be set, so that only the search results whose matching scores are higher than the score threshold are sorted and output in order of matching score from high to low. Further screening the search results with the score threshold makes it easier for the user to view them.
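The thresholding and sorting of step S4 can be sketched as follows. The function name rank_results and the example scores are assumptions for illustration.

```python
def rank_results(scores, score_threshold=0.0):
    """Keep files scoring above the threshold, ordered from high to low."""
    kept = [(name, score) for name, score in scores.items()
            if score > score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept

# Hypothetical matching scores per file.
example_scores = {"a.txt": 1.7, "b.txt": 0.0, "c.txt": 2.4, "d.txt": 0.5}
ranked = rank_results(example_scores, score_threshold=0.4)
print(ranked)
```

Raising score_threshold shortens the output list without changing the relative order of the files that remain.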

Through the retrieval method based on word vector similarity of the above embodiment, a training model corresponding to the search library is established by performing word vector training on the search library; an input search keyword is received, and related words of the search keyword and the similarity between each related word and the search keyword are obtained through the training model; the search library is searched and matched with the related words, and the matching score of each file in the search library with the related words is counted according to the similarity; and the files in the search library are sorted according to the matching scores from high to low, and the search results are output according to the sorting result. Firstly, because the training model is trained on the search library itself, it reflects the characteristics of the search library well, which helps improve retrieval accuracy. Secondly, the keywords are expressed in the form of word vectors and their related words are also searched and matched, which increases the ability to search and match related words and thus improves retrieval robustness.

It should be noted that, for the sake of brevity, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described sequence of actions, because according to the present invention certain steps may be performed in other orders or simultaneously.

An example of a retrieval system based on word vector similarity that can be used to perform the above-described retrieval method is described below. For ease of explanation, the structural schematic diagram of the retrieval system embodiment shows only the parts related to the embodiment of the present invention. Those skilled in the art can understand that the system structure shown in the figure does not constitute a limitation on the system, which may include more or fewer components than those illustrated, combine some components, or use a different arrangement of components.

FIG. 2 is a schematic structural diagram of a retrieval system based on word vector similarity according to an embodiment of the present invention. As shown in FIG. 2, the retrieval system of the present embodiment includes a model training unit 210, a related word generating unit 220, a search matching unit 230, and a result output unit 240, detailed as follows:

The model training unit 210 is configured to perform word vector training on the search library, and establish a training model corresponding to the search library;

The word vectors in this embodiment should have the following property: related or similar words are closer together in distance. For example, the distance between "Mike" and "Microphone" is much smaller than the distance between "Mike" and "Weather". The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.

As a preferred embodiment, the model training unit 210 is further configured to pre-process each file in the search library before performing word vector training on the search library, store the pre-processed data of each file in a corresponding training sample file, and perform word vector training on the search library based on the training sample file. The pre-processing includes data cleaning and extracting a data description. The data cleaning includes at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying full-width/half-width format; extracting the data description includes word segmentation with an added user dictionary, specifically by adding a user dictionary and segmenting with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).

Preferably, word vector training can be performed on the training sample file by word2vec, with the same training settings as given in the method embodiment above.

Among them, the meaning of each parameter is:

-cbow indicates whether to use the continuous bag-of-words (CBOW) model,

-size indicates the dimension of the word vectors,

-window indicates the length of the context window,

-hs indicates whether to use the hierarchical softmax (HS) method; 0 means not used, 1 means used,

-sample 1e-3 indicates that the sampling threshold is 10^-3,

-threads indicates the number of threads to use,

-min_count indicates the minimum frequency threshold, which defaults to 5.

Further, the related word generating unit 220 is configured to receive the input search keyword, and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword;

In this embodiment, the similarity of the two word vectors refers to the cosine similarity, and the highest can be 1, and the lowest can be 0. Since the training model is based on the search library training, the related words obtained based on the training model can well reflect the wording characteristics of the search library.

The search matching unit 230 is configured to perform search matching on the search library by using the related words, and separately calculate matching scores of each file in the search library and the related words according to the similarity;

Preferably, the search matching unit 230 may specifically include: a matching module, configured to search and match each file in the search library using the related words and obtain the matching results of each file with the related words; and a statistic module, configured to use the similarity corresponding to each related word as a cumulative weighting value and combine it with the matching results to obtain the matching score of each file with the related words.

The result output unit 240 is configured to sort files in the search library according to the matching score from high to low, and output a search result according to the sorting result.

In this embodiment, a score threshold may also be set, so that only the search results whose matching scores are higher than the score threshold are sorted and output in order of matching score from high to low. Further screening the search results with the score threshold makes it easier for the user to view them.
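The division into units 210 to 240 described above can be sketched as one Python class with a method per unit. This is only one possible arrangement; the class name, the method names, and the injected train_fn and related_fn callables (stand-ins for word vector training and similarity lookup) are all hypothetical, and the stub model uses made-up values.

```python
class RetrievalSystem:
    """Four cooperating units, mirroring units 210 to 240 of FIG. 2."""

    def __init__(self, train_fn, related_fn):
        self._train_fn = train_fn      # stand-in for word vector training
        self._related_fn = related_fn  # stand-in for the similarity lookup
        self._model = None
        self._library = {}

    def model_training_unit(self, library):
        """Unit 210: train a model on the search library."""
        self._library = dict(library)
        self._model = self._train_fn(self._library)

    def related_word_unit(self, keyword):
        """Unit 220: related words and similarities for a keyword."""
        return self._related_fn(self._model, keyword)

    def search_matching_unit(self, related):
        """Unit 230: similarity-weighted matching score per file."""
        return {name: sum(sim for word, sim in related if word in text)
                for name, text in self._library.items()}

    def result_output_unit(self, scores, score_threshold=0.0):
        """Unit 240: filter by threshold and sort from high to low."""
        kept = [(n, s) for n, s in scores.items() if s > score_threshold]
        return sorted(kept, key=lambda item: item[1], reverse=True)

    def search(self, keyword, score_threshold=0.0):
        related = self.related_word_unit(keyword)
        scores = self.search_matching_unit(related)
        return self.result_output_unit(scores, score_threshold)

# Stub "model": a fixed table of related words (made-up values).
def train_stub(library):
    return {"program": [("program", 1.0), ("software", 0.9)]}

def related_stub(model, keyword):
    return model.get(keyword, [])

system = RetrievalSystem(train_stub, related_stub)
system.model_training_unit({"a.txt": "software developer", "b.txt": "chef"})
results = system.search("program")
print(results)
```

Injecting the training and lookup functions keeps the unit boundaries visible while leaving the actual word2vec integration open.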

It should be noted that, in the implementation of the retrieval system based on word vector similarity of the above example, the information interaction between the modules/units, their execution processes, and the like are based on the same concept as the foregoing method embodiments of the present invention, and bring the same technical effects. For details, refer to the description in the method embodiments of the present invention, which is not repeated here.

In addition, in the implementation of the retrieval system based on word vector similarity of the above example, the logical division into functional modules is merely an example. In actual applications, the above functions may be allocated to different functional modules as required, for example to suit the configuration of the corresponding hardware or the convenience of the software implementation; that is, the internal structure of the retrieval system may be divided into different functional modules to complete all or part of the functions described above.

In addition, in the implementation of the retrieval system based on word vector similarity of the above example, the functional modules may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated modules can be implemented in the form of hardware or in the form of software functional modules.

The integrated modules, if implemented in the form of software functional modules and sold or used as separate products, may be stored in a computer readable storage medium. One of ordinary skill in the art will appreciate that all or part of the steps of the method specified by any embodiment of the present invention can be accomplished by a program to instruct related hardware (personal computer, server, or network device, etc.). The program can be stored in a computer readable storage medium. The program, when executed, may perform all or part of the steps of the method specified in any of the above embodiments. The foregoing storage medium may include any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.

The embodiments described above express only several implementations of the present invention and are not to be construed as limiting the scope of the patent. It should be noted that a number of variations and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention. Therefore, the scope of the invention should be determined by the appended claims.

Claims

A retrieval method based on word vector similarity, characterized in that it comprises:

Performing a word vector training on the search library, and establishing a training model corresponding to the search library;

Receiving an input search keyword, and obtaining, by the training model, a related word of the search keyword, and a similarity between each related word and the search keyword;

Searching and matching the search library with the related words, and separately counting matching scores of each file in the search library and the related words according to the similarity;

Sorting the files in the search library according to the matching scores from high to low, and outputting the search results according to the sorting result.
The retrieval method based on word vector similarity according to claim 1, wherein performing the word vector training on the search library comprises:

Performing pre-processing on each file in the search library, and storing the pre-processed data of each file into a corresponding training sample file; the pre-processing includes data cleaning and extracting data description;

The word vector training for the search library includes:

Word vector training is performed on the search library based on the training sample file.
The retrieval method based on word vector similarity according to claim 2, wherein the data cleaning comprises at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying full-width/half-width format;

The extracting the data description includes segmentation by adding a user dictionary.
The word vector similarity-based retrieval method according to claim 2, wherein the performing word vector training on the retrieval library comprises:

Word vector training is performed on the training sample file by word2vec.
The retrieval method based on word vector similarity according to claim 1, wherein searching and matching the search library with the related words, and separately counting the matching scores of each file in the search library and the related words according to the similarity, comprises:

Searching and matching each file in the search library by using the related words, and obtaining matching results of each file and the related words;

The similarity corresponding to each related word is taken as a cumulative weighting value, and the matching scores of each file and the related words are respectively obtained by combining the matching results.
A retrieval system based on word vector similarity, characterized in that it comprises:

a model training unit, configured to perform word vector training on the search library, and establish a training model corresponding to the search library;

a related word generating unit, configured to receive an input search keyword, and obtain, through the training model, related words of the search keyword and a similarity between each related word and the search keyword;

a search matching unit, configured to search and match the search library with the related words, and separately count matching scores of each file in the search library and the related words according to the similarity;

a result output unit, configured to sort files in the search library according to the matching score from high to low, and output a search result according to the sorting result.
The word vector similarity-based retrieval system according to claim 6, wherein the model training unit is further configured to perform pre-processing on each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file in a corresponding training sample file; the pre-processing includes data cleaning and extracting a data description;

The word vector training for the search library includes:

Word vector training is performed on the search library based on the training sample file.
The word vector similarity-based retrieval system according to claim 7, wherein the data cleaning comprises at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying full-width/half-width format;

The extracting the data description includes segmentation by adding a user dictionary.
The word vector similarity-based retrieval system according to claim 7, wherein the performing word vector training on the retrieval library comprises:

Word vector training is performed on the training sample file by word2vec.
The word vector similarity-based retrieval system according to claim 6, wherein the retrieval matching unit comprises:

a matching module, configured to perform a search and match on each file in the search library by using the related words, to obtain a matching result of each file and the related words;

The statistic module is configured to use the similarity corresponding to each related word as a cumulative weighting value, and combine the matching result to obtain a matching score of each file and the related word respectively.