CN114281977A - Similar document searching method and device based on massive documents - Google Patents


Info

Publication number: CN114281977A
Application number: CN202111473898.0A (filed by Oriental Fortune Information Co ltd)
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 张秀龙
Original and current assignee: Oriental Fortune Information Co ltd
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method comprises: determining the number of effective participles in a target document; determining the recall number of recall words based on a preset similarity threshold and the participle count; acquiring all words contained in a document library to be searched together with the word frequency of each word, and, in order of word frequency from low to high, taking the recall number of lowest-frequency effective participles as the recall words; selecting from the document library the candidate documents that each contain one or more of the recall words; calculating the similarity between the target document and each candidate document; and determining each candidate document whose similarity is greater than or equal to the preset similarity threshold as a similar document corresponding to the target document. In this way, missed recalls of the candidate texts for the subsequent similarity calculation are avoided, the number of those candidate texts is reduced, and the calculation speed is greatly increased.

Description

Similar document searching method and device based on massive documents
Technical Field
The application relates to the technical field of computers, in particular to a method and equipment for searching similar documents based on massive documents.
Background
When computing similarity over a large number of documents, one usually needs to extract the texts with high similarity, for example for deduplication or related-document queries. This is commonly done with an inverted index plus a distance formula (such as the cosine distance), but in scenarios with a huge sample size and high demands on processing speed and concurrency, the traditional method is too slow to meet the requirements, which has motivated various improved methods.
Text similarity calculation generally proceeds in two steps: the first finds all possibly similar documents to form a candidate set, and the second computes the similarity between the target document and every candidate. In use, either the top candidates by similarity are returned, or a similarity threshold is set and the candidates above it are returned. Text search usually returns the top texts, whereas deduplication and high-similarity document search usually keep only the texts above the similarity threshold.
Existing similarity acceleration methods generally speed things up by minimizing the initial recall set (i.e., the candidate set): a small subset of samples is first recalled from the large document set, and exact similarities are then computed only for that subset. How to reduce the initial recall while still recalling every similar document without omission has therefore become an important research topic in the industry.
Disclosure of Invention
An object of the present application is to provide a method and an apparatus for searching similar documents based on massive documents, which reduce the amount of similarity calculation while guaranteeing that no recalls are missed, thereby greatly increasing the calculation speed.
According to one aspect of the application, a method for searching similar documents based on massive documents is provided, wherein the method comprises the following steps:
determining the number of effective participles in a target document;
determining the recall number of recall words based on a preset similarity threshold and the participle count;
acquiring all words contained in a document library to be searched together with the word frequency of each word, and, in order of word frequency from low to high, taking the recall number of the target document's effective participles with the lowest word frequency as the recall words;
selecting at least one candidate document from the document library, wherein each candidate document contains one or more of the recall words;
calculating the similarity between the target document and each candidate document;
and determining each candidate document whose similarity is greater than or equal to the preset similarity threshold as a similar document corresponding to the target document.
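The steps above can be sketched end to end as follows. This is a minimal illustration under assumed names (`find_similar`, a corpus given as `doc_id -> token list`), not the patent's reference implementation; following the completeness derivation in the detailed description, candidates are taken to be documents containing at least one recall word, and the simplified 0/1-vector cosine similarity is used.

```python
from collections import Counter, defaultdict
from math import ceil, sqrt

def find_similar(target_tokens, corpus, t):
    """corpus: doc_id -> list of valid tokens (segmented, stop words removed).
    t: preset similarity threshold. Returns doc_id -> similarity for all
    documents whose similarity is at least t."""
    target_set = set(target_tokens)
    L = len(target_set)                       # step 1: effective participle count
    M = ceil((1 - t * t) * L)                 # step 2: recall count (1 - t^2) * L
    # step 3: document frequency of every word across the library
    df = Counter(w for toks in corpus.values() for w in set(toks))
    # the target's M rarest tokens become the recall words
    recall_words = sorted(target_set, key=lambda w: df[w])[:M]
    # step 4: candidates = documents containing at least one recall word
    inverted = defaultdict(set)
    for doc_id, toks in corpus.items():
        for w in set(toks):
            inverted[w].add(doc_id)
    candidates = set()
    for w in recall_words:
        candidates |= inverted[w]
    # steps 5-6: exact 0/1-vector cosine similarity, keep those >= t
    results = {}
    for doc_id in candidates:
        doc_set = set(corpus[doc_id])
        sim = len(target_set & doc_set) / sqrt(len(target_set) * len(doc_set))
        if sim >= t:
            results[doc_id] = sim
    return results
```

Note that the guarantee holds because any document that contains none of the M rarest target words cannot reach the threshold t, so restricting the exact computation to the union of their posting lists loses nothing.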
Further, in the above method, the determining the number of the effective participles in the target document includes:
acquiring a target document;
performing word segmentation processing on the target document to obtain initial word segmentation in the target document;
performing stop word processing on the initial participles in the target document to obtain effective participles in the target document;
and counting the word segmentation quantity of the effective word segmentation in the target document.
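These preparation steps can be sketched minimally as follows; whitespace splitting stands in for a real Chinese segmenter (e.g. jieba — an assumption, the patent names no tool), and the stop-word set is illustrative.

```python
def valid_tokens(text, stop_words):
    # Toy segmentation: whitespace split; a real Chinese pipeline would use
    # a segmenter. Stop-word filtering keeps only the "effective participles"
    # the later steps operate on.
    return [w for w in text.split() if w not in stop_words]

tokens = valid_tokens("the quick brown fox jumps over the lazy dog", {"the", "over"})
L = len(set(tokens))  # participle count over distinct effective tokens
```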
Further, in the above method, the recall number of recall words is determined from the preset similarity threshold and the participle count by the formula:
(1 - t²) * L,
where t is the preset similarity threshold and L is the participle count.
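The formula evaluates directly; rounding up is an assumption made here so that a fractional result never under-counts (the patent states only the product):

```python
from math import ceil

def recall_count(t, L):
    # M = (1 - t^2) * L recall words; ceil is a conservative choice.
    return ceil((1 - t * t) * L)
```

For t = 0.9 and L = 100 this gives 19 recall words: the higher the threshold, the fewer words need to participate in recall.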
Further, in the above method, the selecting of at least one candidate document from the document library, wherein each candidate document contains one or more of the recall words, includes:
acquiring a document library to be searched and an inverted index formed from the storage positions of each word of the document library in one or more documents of the library;
selecting, based on the inverted index of each word, at least one candidate document from the document library that contains one or more of the recall words.
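This selection can be sketched compactly; the dict-of-sets index and the function names are assumptions for illustration:

```python
from collections import defaultdict

def build_inverted_index(corpus):
    # corpus: doc_id -> iterable of tokens; result maps each word to the
    # set of documents in which it appears.
    index = defaultdict(set)
    for doc_id, tokens in corpus.items():
        for word in set(tokens):
            index[word].add(doc_id)
    return index

def select_candidates(recall_words, index):
    # Union of posting lists: every document containing at least one
    # recall word. Per the derivation, no document at or above the
    # threshold can lie outside this union.
    docs = set()
    for word in recall_words:
        docs |= index.get(word, set())
    return docs
```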
Further, in the above method, the calculating the similarity between the target document and each of the candidate documents includes:
acquiring a forward index of each candidate document, wherein the forward index of each candidate document is used for storing a list of all words contained in the candidate document;
respectively acquiring a text characterization vector corresponding to each candidate document based on the forward index of each candidate document;
and calculating the similarity between the target document and each candidate document respectively based on the obtained text characterization vector corresponding to the target document and the text characterization vector corresponding to each candidate document.
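The similarity step can be sketched with plain term-frequency vectors; the patent mentions TF-IDF weights and a simplified 0/1 form, so the weighting below is an assumption:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(tokens_a, tokens_b):
    # Term-frequency cosine; passing set(...) for both arguments would
    # reproduce the 0/1 simplification used in the derivation.
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```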
According to another aspect of the present application, there is also provided a non-volatile storage medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the similar document searching method based on mass documents as described above.
According to another aspect of the present application, there is also provided a similar document searching apparatus based on a large number of documents, wherein the apparatus includes:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a method for similar document searching based on mass documents, as described above.
Compared with the prior art, the present application first determines the number of effective participles in the target document and the recall number of recall words from a preset similarity threshold and that participle count; it then acquires all words contained in the document library to be searched together with their word frequencies and, in order of word frequency from low to high, takes the recall number of the target document's lowest-frequency effective participles as the recall words, thereby fixing exactly which words the recall words are; it selects from the library the candidate documents that each contain one or more of the recall words; it calculates the similarity between the target document and each candidate document; and it determines every candidate whose similarity is greater than or equal to the preset threshold to be a similar document of the target document. Because the recall words are chosen so that every document above the preset similarity threshold must be recalled, and because all of the recall number of recall words are taken to be the effective participles with the lowest word frequency, missed recalls of the candidate texts for the subsequent similarity calculation are avoided, the number of those candidate texts is reduced, and the calculation speed is greatly increased.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a method for similar document searching based on a large number of documents, in accordance with an aspect of the subject application;
FIG. 2 is a flow chart of a similar document searching method based on massive documents in an actual application scenario according to an aspect of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
For example, prior-art methods for searching similar documents among massive documents may require every word of the current document to participate in recall. Suppose each word can recall 10 distinct documents: if a document has 100 words, then similarity must be computed between the current document and 1000 recalled documents. The prior art therefore commonly extracts keywords and recalls with the keywords only. If 8 words are selected as keywords, only 80 documents are recalled, and computing similarity for just those 80 is only 8% of the original work — a great speed-up. But the remaining 92 words (100 minus 8) do not participate in recall, which carries a serious risk of missed recalls: similar documents may lie among the 920 documents that were never recalled. The fewer keywords are kept, the less computation is needed — but the higher the risk of missed recalls.
In the prior art, the calculation speed is thus improved at the cost of a higher risk of missed recalls. By deriving from the cosine similarity, the present application instead provides a similar-document searching method that finds the relationship between the number of words participating in recall and the similarity threshold while guaranteeing complete recall: to recall every document above a preset similarity threshold, each recalled document must contain one or more of a certain number of recall words, and a document containing none of them cannot reach the threshold. Through this derivation, the number of initial recalls can be greatly reduced while guaranteeing that every document above the preset similarity threshold is among the initially recalled documents, so no recalls are missed.
Continuing the example of the foregoing embodiment, the present application selects only M words (M must be computed from the number of words in the target document) and still guarantees that documents with similarity above the preset threshold are recalled; the first recall then comprises M × 10 documents, and the computation is M% of that of the prior-art method. The key to the problem is therefore to solve for M exactly: a larger M increases the computation, while a smaller M raises the risk of missed recalls. Furthermore, the number of documents recalled by each word differs — some recall many, some few — and because words follow a long-tail distribution, a small number of words appear in a great many documents while a great many words appear in only a few. Recalling with the low-frequency words therefore makes the actual computation far less than M%. In view of this, an aspect of the present application proposes a method for searching similar documents based on massive documents, whose flowchart is shown in Fig. 1; the method includes step S11, step S12, step S13, step S14, step S15 and step S16, specifically as follows:
step S11, determining the number of effective participles in the target document;
step S12, determining the recall number of recall words based on a preset similarity threshold and the participle count;
step S13, acquiring all words contained in a document library to be searched together with the word frequency of each word, and, in order of word frequency from low to high, taking the recall number of the target document's effective participles with the lowest word frequency as the recall words;
step S14, selecting at least one candidate document from the document library to be searched, wherein each candidate document contains one or more of the recall words;
step S15, calculating the similarity between the target document and each candidate document;
step S16, determining each candidate document whose similarity is greater than or equal to the preset similarity threshold as a similar document corresponding to the target document.
Through steps S11 to S16, recall is performed with exactly as many recall words as the derivation requires for complete recall at the preset similarity threshold, and those recall words are chosen as the effective participles with the lowest word frequency. Missed recalls of the candidate texts for the subsequent similarity calculation are thereby avoided — every similar document above the preset threshold is guaranteed to be recalled — while the number of candidate texts for the subsequent similarity calculation is reduced and the calculation speed greatly increased.
For example, first the effective participles in the target document are determined — effective participle 1, effective participle 2, effective participle 3, …, effective participle (L-1), effective participle L — giving the participle count L, a positive integer greater than or equal to 1. Then the number of recall words to be recalled, namely the recall number M (a positive integer greater than or equal to 1), is determined from the preset similarity threshold t and the participle count L. To greatly reduce the subsequent similarity computation while guaranteeing that every candidate document above the preset threshold can be recalled, all words contained in the document library to be searched and their word frequencies are obtained, and the M effective participles with the lowest word frequency are taken as recall words in order of word frequency from low to high. In a preferred embodiment, the target document has 100 effective participles, and the recall number determined from the preset similarity threshold and the participle count is 8; meanwhile, the library to be searched yields 10000 words after word segmentation, whose frequencies in the library are counted and ranked from low to high. If word 3, word 114, word 428, word 1439, word 3556, word 5268, word 7387 and word 8496 are the 8 words with the lowest word frequency among the 10000 words, those 8 words are taken as the recall words, i.e. the recall number M is 8. Next, at least one candidate document is selected from the library to be searched — candidate document 1, candidate document 2, candidate document 3, …, candidate document (P-1), candidate document P, with P a positive integer greater than or equal to 1 — where each selected candidate contains one or more of the 8 lowest-frequency recall words (word 3, word 114, word 428, word 1439, word 3556, word 5268, word 7387, word 8496). Then the similarity between the target document and each candidate document is calculated; finally, every candidate whose calculated similarity is greater than or equal to the preset similarity threshold is determined to be a similar document of the target document. In this way the target document's similar documents can be retrieved quickly and without omission from a massive document library, avoiding missed recalls while reducing the amount of calculation and increasing the calculation speed.
Following the above embodiment of the present application, the step S11 determines the number of the effective participles in the target document, which specifically includes:
acquiring a target document;
performing word segmentation processing on the target document to obtain initial word segmentation in the target document;
performing stop word processing on the initial participles in the target document to obtain effective participles in the target document;
and counting the word segmentation quantity of the effective word segmentation in the target document.
As shown in fig. 2, before querying a massive document library for documents similar to a target document, the target document is first obtained, for example by input, import or upload. Word segmentation is then performed on it to obtain its initial participles. To prevent repeated or useless words from interfering with the subsequent recall, stop-word processing is further applied to all initial participles to obtain the effective participles, whose number is then counted — for example, 100 effective participles in a preferred embodiment. This completes the word segmentation, stop-word processing and participle counting of the target document, in preparation for the subsequent search for its similar documents in the massive library to be searched.
Next to the foregoing embodiment of the present application, in step S12 the recall number of recall words is determined from the preset similarity threshold and the participle count by the formula:
(1 - t²) * L,
where t is the preset similarity threshold and L is the participle count.
In this embodiment, the similarity of the target document is computed over a massive document library under the preset similarity threshold. An inverted table is generally used to find a candidate set whose similarities are then computed one by one; however, over massive documents the corpus recalled from the inverted table is still very large. The method for greatly reducing the number of recalled texts is obtained by derivation from the cosine similarity calculation formula, as follows.
the cosine similarity is calculated as follows:
Figure BDA0003389588350000091
a, B represents two feature sets of text indicating two feature vectors of the same latitude, and having a length of n. In the context of text similarity calculation, the values of a and B are both Term Frequency-Inverse text Frequency index (TF-IDF) values. Here, for the convenience of derivation, the present application may simplify the above formula. Each dimension value of the vector is represented by 0 and 1, where 0 represents no occurrence and 1 represents occurrence, and a vector representing a document can be represented as follows:
[1,0,1,0,1,1,1,1]
the cosine equation can be simplified to the form:
Figure BDA0003389588350000092
wherein lsRepresenting the number of words that co-occur between two documents, where l1Representing the number of words in the target document,/2Representing the number of words in any one document in a vast corpus of documents.
This embodiment derives an upper bound on the similarity between the target document and a document S_2 that contains only α·l_1 of the target's words and none of the remaining (1-α)·l_1 words.

The number of co-occurring words satisfies

l_s ≤ min(α·l_1, l_2).

If α·l_1 < l_2, then

sim = l_s / √(l_1·l_2) ≤ α·l_1 / √(l_1·l_2) < α·l_1 / √(l_1·α·l_1) = √α.

If α·l_1 > l_2, then

sim ≤ l_2 / √(l_1·l_2) = √(l_2 / l_1) < √(α·l_1 / l_1) = √α.

If α·l_1 = l_2, then

sim ≤ α·l_1 / √(l_1·α·l_1) = √α.

Now let the preset similarity threshold be t and set

√α = t, i.e. α = t².

It can be seen that if a candidate document contains none of those (1-α)·l_1 words, the similarity between the candidate document and the target document is bounded above by

√α = t.

Conversely, for the similarity between a candidate document and the target document to exceed

√α = t,

the candidate must contain one or more of the (1-α)·l_1 words, or else the preset similarity threshold t cannot be reached. Here,

1 - α = 1 - t².
Therefore, any document whose similarity exceeds the preset similarity threshold t must contain one or more of (1 - t²)·l_1 designated words of the target document. The derivation above imposes no ordering requirement on these l_1 words, so in step S13 the application ranks the target document's effective participles by their word frequency from low to high and selects the front-most (1 - t²)·l_1 of them as the target document's recall words, i.e. the (1 - t²)·l_1 effective participles with the fewest occurrences, where l_1 equals the participle count L. The recall number of recall words required when searching the massive document library for similar documents of the target is thus determined from the preset similarity threshold t and the participle count L by the formula:
(1 - t²) * L.
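The bound can be checked numerically; the figures below (t = 0.8, l_1 = 100) are illustrative choices, not taken from the patent:

```python
from math import sqrt

t, l1 = 0.8, 100
alpha = t * t                 # alpha = t^2 = 0.64
missing = (1 - alpha) * l1    # 36 designated rare words
# A candidate missing all 36 shares at most alpha*l1 = 64 words with the
# target; its 0/1-cosine similarity is maximised when it contains exactly
# those 64 words and nothing else (l2 = alpha*l1).
shared_max = alpha * l1
best = shared_max / sqrt(l1 * shared_max)   # = sqrt(alpha) = t
```

Any smaller overlap or larger candidate document only lowers the value, so a document containing none of the 36 designated words can never exceed the threshold.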
Then, in step S14, the (1 - t²)·l_1 lowest-frequency recall words can be used to filter the documents of the library to be searched, taking every document that contains one or more of these recall words as a candidate document; there may be one or more candidate documents. In step S15, the similarity between the target document and each candidate document is calculated with the cosine similarity formula, and finally, in step S16, each candidate whose calculated similarity is greater than or equal to the preset similarity threshold is determined to be a similar document corresponding to the target document, completing the search and calculation of the target document's similar documents.
Following the above embodiment of the present application, step S14 — selecting at least one candidate document from the document library to be searched, wherein each candidate document contains one or more of the recall words — specifically includes:
acquiring the document library to be searched and an inverted index formed from the storage positions of each word of the document library in one or more documents of the library;
selecting, based on the inverted index of each word, at least one candidate document from the document library that contains one or more of the recall words.
As shown in FIG. 2, in order to quickly search out one or more candidate documents from the document library to be searched, the library is first obtained together with an inverted index (the inverted table in fig. 2) formed from the storage positions of each word in one or more documents of the library; the inverted index indicates, for each word, the mapping to the documents of the library in which that word appears. In this embodiment, based on the inverted index of each word in the library, the one or more documents that contain one or more of the (1 - t²)·l_1 recall words can be selected as candidate documents, so that candidates for the target document's similar documents are screened quickly from the library by way of the words' inverted index.
Following the above embodiment of the present application, the step S15 of calculating the similarity between the target document and each of the candidate documents includes:
acquiring a forward index of each candidate document, wherein the forward index of each candidate document is used for storing a list of all words contained in the candidate document;
respectively acquiring a text characterization vector corresponding to each candidate document based on the forward index of each candidate document;
and calculating the similarity between the target document and each candidate document respectively based on the obtained text characterization vector corresponding to the target document and the text characterization vector corresponding to each candidate document.
As shown in fig. 2, when calculating the similarity between the target document and each candidate document, the forward index of each candidate document (the forward table in fig. 2) may first be obtained; the forward index of a candidate document indicates the list of all words it contains, that is, the mapping from the candidate document to its words. Based on the forward index of each candidate document, the corresponding text characterization vector is then obtained, which covers all words in the candidate document and their storage positions within it. Finally, the text characterization vector of the target document is obtained, and the similarity between the target document and each candidate document is calculated from the target document's vector and each candidate's vector, so that the similarities are computed and the target document's similar documents can subsequently be screened by the calculated similarity.
Through the embodiments of the application, the amount of similarity calculation can be greatly reduced with no missed recalls, solving the prior-art problem of how to reduce recall without missing recalls, and avoiding the prior-art practice of improving calculation speed by accepting a higher risk of missed recalls. Moreover, by selecting the recall number of recall words with the lowest word frequency to pick the candidate documents, the number of recalled candidate documents is further reduced, so the calculation speed can be greatly improved.
In a practical application scenario, for example, an experiment was performed on a corpus of 1.3 million financial news articles, from which one thousand documents were randomly sampled as target documents for the search. The average number of recalls per document using the prior art was 1,257,596; using the similar document searching method based on massive documents provided in an aspect of the present application, it was 13,000, i.e. about 1.03% of the prior art. The number of recalls is thus reduced by about 98% on average, and the amount of calculation and the calculation time are correspondingly reduced by about 98%, which is a very significant effect.
According to another aspect of the present application, there is also provided a non-volatile storage medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the similar document searching method based on mass documents as described above.
According to another aspect of the present application, there is also provided a similar document searching apparatus based on a large number of documents, wherein the apparatus includes:
one or more processors;
a computer-readable medium storing one or more computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the similar document searching method based on massive documents as described above.
Here, for details of each embodiment in the similar document searching device based on the mass documents, reference may be made to corresponding parts of the embodiment of the similar document searching method based on the mass documents, and details are not described here again.
In summary, the present application determines the number of effective participles in a target document; determines, based on a preset similarity threshold and the number of the participles, the recall number of recall words that need to be recalled; acquires all words contained in a document library to be searched together with the word frequency of each word, and takes, in order of word frequency from low to high, the recall number of effective participles with the lowest word frequencies as the recall words, thereby fixing which specific words the recall words are; selects at least one candidate document from the document library to be searched, each candidate document containing the recall words; calculates the similarity between the target document and each candidate document; and determines each candidate document whose similarity is greater than or equal to the preset similarity threshold as a similar document corresponding to the target document. Because the recall number is chosen so that every document able to reach the preset similarity threshold is recalled, and the recall words are the effective participles with the lowest word frequencies, missed recalls of candidate texts for the subsequent similarity calculation are avoided, the number of candidate texts entering the similarity calculation is reduced, and the calculation speed is greatly improved.
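As a minimal sketch of the summarized pipeline (not the patent's actual implementation): it assumes that "word frequency" means document frequency in the library, that the recall count is floored to an integer, that a candidate need contain at least one recall word, and that cosine similarity is the measure; the names `find_similar` and `cosine_sim` are hypothetical.

```python
import math
from collections import Counter

def cosine_sim(a_tokens, b_tokens):
    """Bag-of-words cosine similarity (assumed measure)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_similar(target_tokens, corpus, t=0.8):
    L = len(target_tokens)                      # number of effective participles
    k = max(1, math.floor((1 - t * t) * L))     # recall number: (1 - t^2) * L
    # document frequency of every word in the library (assumed frequency measure)
    df = Counter(w for toks in corpus.values() for w in set(toks))
    # the k effective participles of the target with the lowest library frequency
    recall_words = sorted(set(target_tokens), key=lambda w: (df[w], w))[:k]
    # recall candidates: documents containing at least one recall word
    candidates = [d for d, toks in corpus.items()
                  if any(w in toks for w in recall_words)]
    sims = {d: cosine_sim(target_tokens, corpus[d]) for d in candidates}
    # keep only candidates at or above the preset similarity threshold
    return {d: s for d, s in sims.items() if s >= t}
```

With a rare word such as "rises" as the sole recall word, only documents sharing that word are ever scored, which is the source of the reduced calculation the summary describes.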
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (7)

1. A similar document searching method based on massive documents, wherein the method comprises the following steps:
determining the number of effective participles in a target document;
determining a recall number of recall words based on a preset similarity threshold and the number of the participles;
acquiring all words contained in a document library to be searched and the word frequency of each word, and taking, in order of word frequency from low to high, the recall number of words with the lowest word frequencies among all the words contained in the document library as the recall words;
selecting at least one candidate document from the document library, wherein each candidate document comprises the recall number of recall terms;
calculating the similarity between the target document and each candidate document;
and determining each candidate document whose similarity is greater than or equal to the preset similarity threshold as a similar document corresponding to the target document.
2. The method of claim 1, wherein the determining the number of effective participles in a target document comprises:
acquiring a target document;
performing word segmentation processing on the target document to obtain initial participles in the target document;
removing stop words from the initial participles in the target document to obtain the effective participles in the target document;
and counting the number of the effective participles in the target document.
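As an illustration only (not part of the claims), the three steps of claim 2 can be sketched as follows; the stop-word list is illustrative, and English regex tokenization stands in for a Chinese word segmenter.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "in"}  # illustrative stop-word list

def effective_participles(text):
    # initial word segmentation (regex tokenization as a stand-in segmenter)
    initial = re.findall(r"[a-z]+", text.lower())
    # stop-word processing yields the effective participles
    valid = [w for w in initial if w not in STOP_WORDS]
    # return the participles together with their count
    return valid, len(valid)
```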
3. The method of claim 1, wherein the recall number of recall words is determined from the preset similarity threshold and the number of the participles by the formula:
(1 - t²) * L,
wherein t is the preset similarity threshold and L is the number of the participles.
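As a worked illustration (not part of the claims): the formula can be evaluated directly, with flooring to an integer as an assumption since the claim gives only the product.

```python
import math

def recall_number(t, L):
    """Recall-word count (1 - t^2) * L from claim 3; flooring is assumed."""
    return math.floor((1 - t * t) * L)
```

For example, with t = 0.75 and L = 16 participles, (1 - 0.5625) * 16 gives 7 recall words; a higher threshold means fewer recall words and thus fewer recalled candidates.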
4. The method of claim 1, wherein the selecting at least one candidate document from the document library, wherein each candidate document contains the recall number of recall words, comprises:
acquiring a document library to be searched and an inverted index formed by the storage positions of each word in the document library in one or more documents in the document library;
selecting at least one candidate document from the document library that each contains the recall number of recall words based on the inverted index of each of the words.
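As an illustration only (not part of the claims), the inverted index of claim 4 and the candidate selection it supports can be sketched as follows. Whether a candidate must contain one recall word or all of them is ambiguous in the claim text; "at least one" (the union of posting lists) is assumed here.

```python
def build_inverted_index(corpus):
    """Inverted index as described above: word -> {doc id: storage positions}."""
    index = {}
    for doc_id, words in corpus.items():
        for pos, w in enumerate(words):
            index.setdefault(w, {}).setdefault(doc_id, []).append(pos)
    return index

def recall_candidates(index, recall_words):
    """Candidate documents: the union of the recall words' posting lists."""
    docs = set()
    for w in recall_words:
        docs |= set(index.get(w, {}))
    return docs
```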
5. The method of claim 1, wherein said calculating a similarity between said target document and each of said candidate documents comprises:
acquiring a forward index of each candidate document, wherein the forward index of each candidate document is used for storing a list of all words contained in the candidate document;
respectively acquiring a text characterization vector corresponding to each candidate document based on the forward index of each candidate document;
and calculating the similarity between the target document and each candidate document based on the text characterization vector corresponding to the target document and the text characterization vector corresponding to each candidate document.
6. A non-transitory storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 5.
7. A similar document searching apparatus based on a large number of documents, wherein the apparatus comprises:
one or more processors;
a computer-readable medium storing one or more computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
CN202111473898.0A 2021-12-02 2021-12-02 Similar document searching method and device based on massive documents Pending CN114281977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111473898.0A CN114281977A (en) 2021-12-02 2021-12-02 Similar document searching method and device based on massive documents

Publications (1)

Publication Number Publication Date
CN114281977A true CN114281977A (en) 2022-04-05

Family

ID=80870900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111473898.0A Pending CN114281977A (en) 2021-12-02 2021-12-02 Similar document searching method and device based on massive documents

Country Status (1)

Country Link
CN (1) CN114281977A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117032722A (en) * 2023-08-18 2023-11-10 上海澜码科技有限公司 Code generation method based on API (application program interface) document

Citations (3)

Publication number Priority date Publication date Assignee Title
JP2004192546A (en) * 2002-12-13 2004-07-08 Nippon Telegr & Teleph Corp <Ntt> Information retrieval method, device, program, and recording medium
CN101826075A (en) * 2009-03-06 2010-09-08 刘金莉 Language model-based sorting algorithm
CN110909540A (en) * 2018-09-14 2020-03-24 阿里巴巴集团控股有限公司 Method and device for identifying new words of short message spam and electronic equipment


Non-Patent Citations (2)

Title
MARC BRYSBAERT: "The Word Frequency Effect in Word Processing: An Updated Review", Current Directions in Psychological Science, vol. 27, no. 1, 13 December 2017 (2017-12-13), pages 45-50 *
XIAO XUE, LU JIANYUN, YU LEI, GONG HENG: "Research on a Feature Selection Algorithm Based on the Lowest Word Frequency CHI", Journal of Southwest University (Natural Science Edition), vol. 37, no. 6, 20 June 2015 (2015-06-20), pages 137-142 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117032722A (en) * 2023-08-18 2023-11-10 上海澜码科技有限公司 Code generation method based on API (application program interface) document
CN117032722B (en) * 2023-08-18 2024-04-26 上海澜码科技有限公司 Code generation method based on API (application program interface) document

Similar Documents

Publication Publication Date Title
CN108241613B (en) Method and equipment for extracting keywords
CN110019218B (en) Data storage and query method and equipment
CN108268617B (en) User intention determining method and device
US20070106405A1 (en) Method and system to provide reference data for identification of digital content
CN110019668A (en) A kind of text searching method and device
CN110019669B (en) Text retrieval method and device
CN113535817B (en) Feature broad table generation and service processing model training method and device
CN112329460A (en) Text topic clustering method, device, equipment and storage medium
CN110826365B (en) Video fingerprint generation method and device
CN106598997B (en) Method and device for calculating text theme attribution degree
CN108241856A (en) Information generation method and equipment
CN111538903B (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN108427667B (en) Legal document segmentation method and device
EP3706014A1 (en) Methods, apparatuses, devices, and storage media for content retrieval
CN114281977A (en) Similar document searching method and device based on massive documents
CN107644033B (en) Method and equipment for querying data in non-relational database
CN110019670A (en) A kind of text searching method and device
US20060004899A1 (en) Method for non-repeating random number generation
CN110889424B (en) Vector index establishing method and device and vector retrieving method and device
CN108984572B (en) Website information pushing method and device
CN113127636B (en) Text clustering cluster center point selection method and device
CN110955845A (en) User interest identification method and device, and search result processing method and device
CN115563268A (en) Text abstract generation method and device, electronic equipment and storage medium
CN114840762A (en) Recommended content determining method and device and electronic equipment
CN110880005B (en) Vector index establishing method and device and vector retrieving method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination