CN111813930B

CN111813930B - Similar document retrieval method and device

Info

Publication number: CN111813930B
Application number: CN202010543812.6A
Authority: CN
Inventors: 毛红保
Original assignee: Iol Wuhan Information Technology Co ltd
Current assignee: Iol Wuhan Information Technology Co ltd
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2024-02-20
Anticipated expiration: 2040-06-15
Also published as: CN111813930A; WO2021253873A1

Abstract

An embodiment of the invention provides a similar document retrieval method and device. The method comprises: retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set. The method considers the results of both the word frequency search method and the document vectorization search method and combines them through the similarity, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of a retrieval result obtained from a single model is avoided.

Description

Similar document retrieval method and device

Technical Field

The invention relates to the field of natural language analysis, in particular to a similar document retrieval method and device.

Background

Similar document retrieval means that, given a document to be retrieved, the documents most similar to it in content are automatically retrieved from a massive document library. In the field of translation, when a manuscript to be translated is received, documents similar to the manuscript in topic and content need to be retrieved from a historical manuscript library so that a suitable translator can be matched quickly, thereby improving the quality and efficiency of translation.

Conventional document retrieval methods are mainly keyword-based, such as TF-IDF (term frequency-inverse document frequency). Such methods meet the requirements in most cases, but have the defect of ignoring word order. For example, if a document contains many occurrences of a phrase such as "machine learning", the phrase is split into two keywords, "machine" and "learning", for retrieval; if "machine learning" in the document is replaced by "learning machine", the retrieval result is not affected. To solve such problems, deep-learning-based semantic representations of documents, such as the document vectorization model Doc2vec, are applied to document retrieval. The document vectorization model is sensitive to word order and can better represent a document at the semantic level, but semantic inertia may occur in practical applications. For example, suppose the 5 documents that best match "motorcycle production" need to be retrieved, and the document library contains a large number of documents related to "motorcycle sales" and "automobile production". If the semantic representation method is used for retrieval, it is likely that all of the 5 retrieved documents relate to "automobile production". This is because the semantic representation method is more sensitive to the global semantics of a document than to any particular keyword. The user, however, is likely to want the 5 documents to cover both "automobile production" and "motorcycle sales". It can be seen that the retrieval results obtained by current methods are often limited, and accurate retrieval results cannot be obtained.
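The word-order limitation described above can be illustrated with a minimal sketch. It assumes a plain bag-of-words count with whitespace tokenization; the helper name bag_of_words and the sample sentences are hypothetical and are not part of the described method.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, discarding their order."""
    return Counter(text.lower().split())


doc_a = "machine learning improves translation"
doc_b = "learning machine improves translation"

# Both documents yield identical term counts, so a purely count-based
# score has no way to tell them apart.
same_counts = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same_counts)
```

Because the two count tables are equal, any score derived only from term counts assigns the two documents the same similarity to a given query.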

Disclosure of Invention

In order to solve the above problems, the embodiment of the invention provides a similar document retrieval method and device.

In a first aspect, an embodiment of the present invention provides a similar document retrieval method, including: retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.

Further, determining the retrieval result according to the candidate document set includes: selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set; and, for each document in the third document set that also appears in the candidate document set, updating its similarity with its similarity in the candidate document set, and selecting a second preset proportion of documents from the third document set as the retrieval result according to the similarity.

Further, before superimposing the similarities of identical documents in the first document set and the second document set, the method further includes: normalizing the document similarities in the first document set and in the second document set separately.

Further, the numbers of documents in the first document set, the second document set, and the candidate document set are the same.

Further, the word frequency search model is a TF-IDF model.

Further, the document vectorization model is a Doc2vec model.

Further, the first preset ratio is 2/3, and the second preset ratio is 1/2.

In a second aspect, an embodiment of the present invention provides a similar document retrieval device, including: a classification acquisition module, configured to retrieve a first document set and the similarity of each document based on a word frequency search model, and to retrieve a second document set and the similarity of each document based on a document vectorization model; a similarity superposition module, configured to superimpose the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set; and a search result determining module, configured to determine a retrieval result according to the candidate document set.

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the similar document retrieval method of the first aspect of the present invention when executing the program.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the similar document retrieval method of the first aspect of the present invention.

According to the similar document retrieval method and device provided by the embodiments of the invention, the similarities of documents that appear in both the first document set and the second document set are superimposed, and a preset number of documents are selected in descending order of similarity to obtain the candidate document set. The results of the word frequency search method and the document vectorization search method are thus considered together and combined through the similarity, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of a retrieval result obtained from a single model is avoided.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a similar document retrieval method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a method for retrieving similar documents according to another embodiment of the present invention;

FIG. 3 is a block diagram of a similar document retrieval device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a flowchart of a similar document retrieval method provided in an embodiment of the present invention, and as shown in FIG. 1, the embodiment of the present invention provides a similar document retrieval method, including:

101. Retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model.

The word frequency search model generally refers to a class of models that retrieve documents based on the word frequency of keywords, such as the TF-IDF model. The document vectorization model generally refers to a class of models that perform semantic retrieval based on keyword vectors, such as the Doc2vec model and the word2vec model.

In a specific implementation, keyword retrieval is performed for the document to be retrieved based on the word frequency search model, yielding the keyword retrieval result of the document to be retrieved, that is, the first document set, denoted Result_TF-IDF. A semantic vectorized representation of the document to be retrieved is computed and retrieval is performed based on the document vectorization model, yielding the semantic retrieval result of the document to be retrieved, that is, the second document set, denoted Result_Doc2vec. In addition to the retrieved documents, the similarity of each retrieved document is obtained; the similarity represents the degree of similarity between the retrieved document and the document to be retrieved.

102. Superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set, denoted Result_combination.

Since the same document may exist in both the first document set and the second document set, the two similarities of such a document are superimposed, while the similarities of the other documents in the two sets remain unchanged. All documents are then sorted by similarity, and a preset number of them are selected as the candidate document set.
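Step 102 can be sketched as follows. This is a minimal illustration, assuming that each retrieval result is held as a mapping from a document identifier to its similarity and that superimposing means adding the two similarities; the function name superimpose and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def superimpose(
    first: Dict[str, float],
    second: Dict[str, float],
    preset_number: int,
) -> List[Tuple[str, float]]:
    """Add the similarities of documents found in both sets, keep the
    others unchanged, and return the top preset_number documents."""
    combined = dict(first)
    for doc_id, similarity in second.items():
        # A document present in both sets gets the sum of its two scores.
        combined[doc_id] = combined.get(doc_id, 0.0) + similarity
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:preset_number]


# Hypothetical results of the two models (document id -> similarity).
result_tfidf = {"d1": 0.875, "d2": 0.5, "d3": 0.25}
result_doc2vec = {"d2": 0.75, "d4": 0.625, "d5": 0.125}

result_combination = superimpose(result_tfidf, result_doc2vec, preset_number=3)
print(result_combination)
```

In this example only d2 appears in both sets, so its two similarities are added and it moves to the top of the candidate document set even though neither model ranked it first.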

As an alternative embodiment, the numbers of documents in the first document set, the second document set, and the candidate document set are the same; the numbers may be identical or close to one another. For example, the first document set, the second document set, and the candidate document set each contain N documents, which keeps the word-frequency-based retrieval and the document-vectorization-based retrieval balanced.

103. And determining a retrieval result according to the candidate document set.

The candidate document set takes both the word frequency search and the semantic search into account, so determining the final retrieval result according to the candidate document set avoids the limitation of a retrieval result obtained from a single model. For example, a portion of the candidate document set may be selected as the retrieval result, or the retrieval result may be further determined based on the candidate document set, the first document set, and the second document set.

According to the similar document retrieval method of this embodiment, the similarities of documents that appear in both the first document set and the second document set are superimposed, and a preset number of documents are selected in descending order of similarity to obtain the candidate document set. The results of the word frequency search method and the document vectorization search method are thus considered together and combined through the similarity, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of a retrieval result obtained from a single model is avoided.

Based on the content of the above embodiment, as an alternative embodiment, determining the retrieval result according to the candidate document set includes: selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set; and, for each document in the third document set that also appears in the candidate document set, updating its similarity with its similarity in the candidate document set, and selecting a second preset proportion of documents from the third document set as the retrieval result according to the similarity.

FIG. 2 is a flowchart of a similar document retrieval method according to another embodiment of the present invention. As shown in FIG. 2, a first preset proportion of documents is selected from the second document set in descending order of similarity. For example, suppose the first document set, the second document set, and the candidate document set each contain 3N documents. If the first preset proportion is 2/3, the selected third document set contains 2N documents. For each document in the third document set, if the document also exists in the candidate document set, its similarity value in the third document set is updated with its similarity value in the candidate document set; the similarity values of the other documents in the third document set remain unchanged. The updated third document set is then re-sorted by similarity, and a second preset proportion of documents is selected as the retrieval result. For example, if the second preset proportion is 1/2, the top N documents in descending order of similarity, denoted Result_merge, are selected as the final retrieval result.
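The selection, update, and re-ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions: results are held as mappings from document identifier to similarity, set sizes are rounded down when a proportion does not divide evenly, and the function name determine_result and all sample values are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def determine_result(
    second: Dict[str, float],
    candidate: Dict[str, float],
    first_ratio: Fraction = Fraction(2, 3),
    second_ratio: Fraction = Fraction(1, 2),
) -> List[Tuple[str, float]]:
    """Keep the top first_ratio of the semantic results, overwrite their
    similarities with candidate-set values where available, re-rank, and
    return the top second_ratio of what remains."""
    ranked_second = sorted(second.items(), key=lambda kv: kv[1], reverse=True)
    third_size = int(len(ranked_second) * first_ratio)  # rounding down is an assumption
    third = dict(ranked_second[:third_size])
    for doc_id in third:
        if doc_id in candidate:
            third[doc_id] = candidate[doc_id]
    ranked_third = sorted(third.items(), key=lambda kv: kv[1], reverse=True)
    result_size = int(len(ranked_third) * second_ratio)
    return ranked_third[:result_size]


# Hypothetical data with 3N = 6 documents per set, so N = 2.
result_doc2vec = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.6, "d5": 0.5, "d6": 0.4}
result_combination = {"d3": 1.5, "d4": 1.25, "d1": 0.9, "d7": 0.85, "d2": 0.8, "d8": 0.8}

result_merge = determine_result(result_doc2vec, result_combination)
print(result_merge)
```

Here d3 and d4 were ranked third and fourth by the semantic model alone, but their superimposed similarities in the candidate set lift them above d1 and d2, so they form the final N = 2 results.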

According to the similar document retrieval method, the semantic retrieval result of the document vectorization model is taken as the main part, and the semantic retrieval result is adjusted by keyword retrieval, so that semantic inertia can be eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the accuracy of the retrieval result is ensured.

Based on the content of the above embodiment, as an alternative embodiment, before superimposing the similarities of identical documents in the first document set and the second document set, the method further includes: normalizing the document similarities in the first document set and in the second document set separately.

The document similarities in the first document set, obtained by keyword retrieval, and those in the second document set, obtained by semantic retrieval, are each normalized, and the similarities of documents present in both sets are then superimposed. Normalizing the document similarities of the two sets separately avoids the bias that arises when the similarities of the first document set and the second document set are on unbalanced scales.
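A minimal sketch of the normalization step follows. The source does not state which normalization is used; min-max scaling to the range [0, 1] is assumed here purely for illustration, and the function name min_max_normalize and the sample values are hypothetical.

```python
from typing import Dict


def min_max_normalize(similarities: Dict[str, float]) -> Dict[str, float]:
    """Rescale the similarities of one result set to the range [0, 1]."""
    if not similarities:
        return {}
    lowest = min(similarities.values())
    highest = max(similarities.values())
    if highest == lowest:
        # All scores are equal; map them to 1.0 to avoid dividing by zero.
        return {doc_id: 1.0 for doc_id in similarities}
    span = highest - lowest
    return {doc_id: (s - lowest) / span for doc_id, s in similarities.items()}


# Hypothetical raw similarities on different scales.
raw_tfidf = {"d1": 0.5, "d2": 0.25, "d3": 0.0}
raw_doc2vec = {"d2": 0.75, "d4": 0.5, "d5": 0.25}

norm_tfidf = min_max_normalize(raw_tfidf)
norm_doc2vec = min_max_normalize(raw_doc2vec)
print(norm_tfidf, norm_doc2vec)
```

Each set is normalized on its own, so that neither model dominates the sum merely because its raw scores are larger. One consequence of min-max scaling is that the lowest-ranked document of each set receives 0; other schemes, such as dividing by the maximum, would avoid that and may suit some applications better.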

Based on the foregoing embodiment, as an alternative embodiment, the word frequency search model is a TF-IDF model.

TF-IDF is a weighting technique commonly used in information retrieval and data mining. TF is the term frequency, and IDF is the inverse document frequency.

A TF-IDF model of the document library can be trained using the Python-based gensim tool; keyword vectorized representation and retrieval are then performed for the document to be retrieved based on this model to obtain the keyword retrieval result of the document to be retrieved.
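The keyword retrieval step can be illustrated with a small self-contained sketch. The source performs this step with a Python library; the code below is a dependency-free stand-in written with the standard library only, and it does not reproduce that library's API. It assumes raw term counts for TF, log(N / df) for IDF, cosine similarity for ranking, and pre-tokenized documents; all names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(corpus: List[List[str]]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Return one sparse TF-IDF vector per document plus the IDF table."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in corpus]
    return vectors, idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: List[str], corpus: List[List[str]], top_n: int) -> List[Tuple[int, float]]:
    """Rank corpus documents by TF-IDF cosine similarity to the query."""
    vectors, idf = tfidf_vectors(corpus)
    # Query terms unseen in the corpus carry no weight and are skipped.
    query_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


library = [
    ["motorcycle", "sales", "report"],
    ["automobile", "production", "line"],
    ["motorcycle", "production", "plan"],
]
result_tfidf = retrieve(["motorcycle", "production"], library, top_n=2)
print(result_tfidf)
```

The document that contains both query keywords ranks first, which is the keyword-level signal that the method later combines with the semantic retrieval result.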

Based on the foregoing embodiment, as an alternative embodiment, the document vectorization model is a Doc2vec model. Doc2vec is an unsupervised algorithm that produces a vector representation of a text and is an extension of word2vec. The learned vectors can be used to find the similarity between texts by computing distances, can be used for text clustering, and, for labeled data, can be used for text classification by supervised learning, as in classical sentiment analysis.

The Doc2vec model of the document library can be trained using the Python-based gensim tool; semantic vectorized representation and retrieval are then performed for the document to be retrieved based on this model to obtain the semantic retrieval result of the document to be retrieved.

Based on the foregoing embodiments, as an alternative embodiment, the first preset ratio is 2/3, and the second preset ratio is 1/2. The foregoing embodiments have been illustrated and will not be described in detail herein.

FIG. 3 is a block diagram of a similar document retrieval device according to an embodiment of the present invention. As shown in FIG. 3, the similar document retrieval device includes: a classification acquisition module 301, a similarity superposition module 302, and a search result determining module 303. The classification acquisition module 301 is configured to retrieve a first document set and the similarity of each document based on a word frequency search model, and to retrieve a second document set and the similarity of each document based on a document vectorization model; the similarity superposition module 302 is configured to superimpose the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set; the search result determining module 303 is configured to determine a retrieval result according to the candidate document set.

Based on the content of the above embodiment, as an alternative embodiment, the search result determining module 303 is specifically configured to: select, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set; and, for each document in the third document set that also appears in the candidate document set, update its similarity with its similarity in the candidate document set, and select a second preset proportion of documents from the third document set as the retrieval result according to the similarity.

The device embodiments provided by the present invention implement the above method embodiments; for the specific flow and details, refer to the above method embodiments, which are not repeated here.

According to the similar document retrieval device provided by the embodiment of the invention, the similarities of documents that appear in both the first document set and the second document set are superimposed, and a preset number of documents are selected in descending order of similarity to obtain the candidate document set. The results of the word frequency search method and the document vectorization search method are thus considered together and combined through the similarity, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of a retrieval result obtained from a single model is avoided.

FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in FIG. 4, the electronic device may include: a processor 401, a communication interface 402, a memory 403, and a bus 404, wherein the processor 401, the communication interface 402, and the memory 403 communicate with each other through the bus 404. The communication interface 402 may be used for information transfer of the electronic device. The processor 401 may call logic instructions in the memory 403 to perform a method comprising: retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.

Further, the logic instructions in the memory 403 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the above-described method embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

In another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the similar document retrieval method provided in the above embodiments, for example including: retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A similar document retrieval method, comprising:

retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model;

superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set;

determining a retrieval result according to the candidate document set;

wherein determining the retrieval result according to the candidate document set includes:

selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set;

and, for each document in the third document set that also appears in the candidate document set, updating its similarity with its similarity in the candidate document set, and selecting a second preset proportion of documents from the third document set as the retrieval result according to the similarity.

2. The similar document retrieval method according to claim 1, wherein before said superimposing the similarities of identical documents in said first document set and said second document set, the method further comprises:

normalizing the document similarities in the first document set and in the second document set separately.

3. The similar document retrieval method according to claim 1, wherein the numbers of documents in the first document set, the second document set, and the candidate document set are the same.

4. The similar document retrieval method according to claim 1, wherein the word frequency search model is a TF-IDF model.

5. The similar document retrieval method according to claim 1, wherein the document vectorization model is a Doc2vec model.

6. The similar document retrieval method according to claim 1, wherein the first preset proportion is 2/3 and the second preset proportion is 1/2.

7. A similar document retrieval apparatus, comprising:

a classification acquisition module, configured to retrieve a first document set and the similarity of each document based on a word frequency search model, and to retrieve a second document set and the similarity of each document based on a document vectorization model;

a similarity superposition module, configured to superimpose the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set;

the search result determining module is used for determining a search result according to the candidate document set;

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the similar document retrieval method of any one of claims 1 to 6 when executing the program.

9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the similar document retrieval method according to any one of claims 1 to 6.