CN117290460A

CN117290460A - Method, system, device and storage medium for calculating similarity of massive texts

Info

Publication number: CN117290460A
Application number: CN202311576057.1A
Authority: CN
Inventors: 孙琦; 魏东晓; 于通
Original assignee: Zhongfu Information Co Ltd
Current assignee: Zhongfu Information Co Ltd
Priority date: 2023-11-24
Filing date: 2023-11-24
Publication date: 2023-12-26

Abstract

The invention provides a method, a system, a device and a storage medium for calculating the similarity of massive texts, and belongs to the technical field of text recognition. The method comprises the following steps: persisting a word bag, loading the word bag and constructing an AC tree for storing document indexes; preprocessing a document to be detected; using important features to exclude documents that are irrelevant to the document to be detected; and retrieving the corresponding documents according to the document indexes, and identifying the similar documents of the document to be detected and their similarity values with a multi-feature fusion similarity calculation method. By jointly considering the literal similarity and the semantic similarity of texts, the method ensures the accuracy of similar-document calculation while allowing similar documents to be retrieved efficiently and quickly from massive texts.

Description

Method, system, device and storage medium for calculating similarity of massive texts

Technical Field

The invention relates to the technical field of text recognition, in particular to a method, a system, a device and a storage medium for calculating similarity of massive texts.

Background

Existing text similarity calculation methods include the following. Fingerprint similarity calculation quickly finds similar documents: documents are converted into fingerprints, mainly by simhash, minhash and similar techniques, and the similarity of different documents is calculated by Hamming distance or similar measures. Edit distance similarity calculation uses the edit distance between two documents to represent their similarity. Traditional semantic feature similarity calculation encodes documents with doc2vec or similar techniques and calculates the similarity of two documents with a suitable distance measure. Large-model semantic similarity calculation loads a pre-trained model to encode the text and calculates the similarity of two documents with a measure such as cosine similarity.

Although each of the above methods has its advantages, none of them achieves both accuracy and speed. Specifically:

Fingerprint similarity calculation is fast and convenient to store, and is currently a commonly used method for mass text similarity calculation. Its disadvantage is insufficient accuracy: hash collisions occur easily, misjudgments happen, and the specific degree of similarity cannot be obtained.

Edit distance similarity calculation is accurate, but slow.

Traditional semantic feature similarity calculation can take semantic information into account, but it is easily affected by initialization, so the quality of its results varies and depends heavily on chance.

Most business systems use edit distance for text similarity analysis, judging similarity by the number of edit operations needed to transform one string into the other. The method is simple and intuitive and handles most text similarity analysis, but it often makes elementary errors: two texts that clearly do not match may be judged similar because their edit distance is small. For example, adding a single negation word to a text reverses the meaning of the sentence while barely changing the edit distance. In addition, specific fields often contain rare words or domain terms whose meaning a semantic similarity calculation method may fail to capture accurately.
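
The negation problem described above can be illustrated with a small sketch. It assumes a plain Levenshtein distance normalized by the longer string length; the function names and the example sentences are hypothetical and are not taken from the patent.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalize the distance into a similarity in [0, 1] (one common convention)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


s1 = "the contract is valid"
s2 = "the contract is not valid"   # opposite meaning, one word added
distance = levenshtein(s1, s2)
similarity = edit_similarity(s1, s2)
```

Here the two sentences differ by four inserted characters, so an edit-distance measure rates them as highly similar even though their meanings are opposite.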

Disclosure of Invention

In view of the above problems, the invention aims to provide a method, a system, a device and a storage medium for calculating mass text similarity that jointly consider the literal similarity and the semantic similarity of texts, ensure the accuracy of similar-document calculation, and retrieve similar documents efficiently and quickly from massive texts.

To achieve this aim, the invention adopts the following technical scheme: a method for calculating the similarity of massive texts, comprising the following steps:

carrying out word bag persistence, loading word bags and constructing an AC tree for storing document indexes;

preprocessing a document to be detected;

eliminating irrelevant documents in the documents to be detected by using important features;

searching corresponding documents according to the document index, and identifying similar documents and similarity values of the documents to be detected by adopting a similarity calculation method of multi-feature fusion.

Further, the performing word bag persistence, loading word bags and constructing an AC tree for storing document indexes includes:

processing the existing document in the file storage system, extracting the characteristic words of the document through the TF-IDF, and storing the characteristic words in a database table;

and loading word bags by reading the database table, and constructing an AC tree for storing the document indexes.

Further, the tree nodes of the AC tree are configured to store feature word indexes, and md5 is used as the feature word index.

Further, the preprocessing of the document to be detected includes:

removing stop words, irrelevant words and special symbols from the document to be detected, and replacing the text numbers and serial numbers in the document to be detected with preset characters.

Further, the excluding irrelevant documents in the documents to be detected by using the important features includes:

carrying out multi-pattern matching on the document to be detected, and obtaining the matched feature words and the corresponding document indexes.

Further, searching for a corresponding document according to the document index, and identifying similar documents and similarity values of the document to be detected by adopting a similarity calculation method of multi-feature fusion, including:

screening out corresponding documents according to the document indexes;

respectively carrying out semantic similarity calculation and literal similarity calculation on the document to be detected and the screened document, and determining the final similarity according to the calculation result;

determining similar documents of the documents to be detected according to the highest value of the final similarity;

and returning the similar documents and the corresponding similarity values.

Further, the performing semantic similarity calculation and literal similarity calculation, and determining the final similarity according to the calculation result, includes:

calculating the semantic similarity between the document to be detected and the screened document according to the formula

$$\mathrm{sim}_{sem}(X, Y) = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert};$$

wherein X is the document to be detected, Y is the screened document, A is the encoding vector of X, and B is the encoding vector of Y;

calculating the literal similarity of the document to be detected and the screened document according to the literal similarity calculation formula;

the literal similarity calculation formula is specifically as follows:

$$\mathrm{sim}_{lit}(X, Y) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right);$$

wherein $|s_1|$ is the length of the document to be detected, $|s_2|$ is the length of the screened document, $m$ is the number of characters matched between the two strings, and $t$ is half the number of transpositions;

calculating the final similarity $\mathrm{sim}(X, Y)$ according to the formula

$$\mathrm{sim}(X, Y) = \alpha \cdot \mathrm{sim}_{sem}(X, Y) + \beta \cdot \mathrm{sim}_{lit}(X, Y);$$

wherein $\alpha$ is the weight of the semantic similarity, $\beta$ is the weight of the literal similarity, and $\alpha + \beta = 1$.

Correspondingly, the invention also discloses a mass text similarity calculation system, which comprises:

the initialization module is configured for carrying out word bag persistence, loading word bags and constructing an AC tree for storing document indexes;

the preprocessing module is configured for preprocessing a document to be detected;

a screening module configured to exclude irrelevant documents among the documents to be detected using the important features;

and the similarity fusion calculation module is configured to search corresponding documents according to the document index, and identify similar documents and similarity values of the documents to be detected by adopting a multi-feature fusion similarity calculation method.

Correspondingly, the invention discloses a mass text similarity calculation device, which comprises:

the memory is used for storing a mass text similarity calculation program;

and the processor is used for realizing the steps of the mass text similarity calculation method when executing the mass text similarity calculation program.

Correspondingly, the invention discloses a readable storage medium, wherein a mass text similarity calculation program is stored on the readable storage medium, and the mass text similarity calculation program realizes the steps of the mass text similarity calculation method according to any one of the above steps when being executed by a processor.

Compared with the prior art, the invention has the following beneficial effects. The invention discloses a method, a system, a device and a storage medium for calculating mass text similarity. By constructing an AC tree and carrying out multi-pattern matching on documents, most irrelevant documents are excluded using important features and a small preselected result set is obtained; this achieves fast screening, copes with large-scale data and retrieves similar documents quickly. A hybrid similarity calculation method is then used, so that both the literal similarity and the semantic similarity of the text are considered when similar documents are identified, which ensures the accuracy of text similarity while achieving fast retrieval.

It can be seen that the present invention has outstanding substantial features and significant advances over the prior art, as well as the benefits of its implementation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method of an embodiment of the present invention.

Fig. 2 is a schematic diagram of an AC tree structure according to an embodiment of the present invention.

Fig. 3 is a system configuration diagram of an embodiment of the present invention.

In the figures: 1, initialization module; 2, preprocessing module; 3, screening module; 4, similarity fusion calculation module.

Detailed Description

In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention provides a method for calculating the similarity of massive texts, which is divided into two stages: an initialization stage and a similar document retrieval stage. The initialization stage needs to run only once, when the service starts; if the word bag is updated during use, the database table and the AC tree are updated at the same time according to the tree structure described in this patent. In the similar document retrieval stage, the document to be detected is first preprocessed, which includes removing stop words, irrelevant words and special symbols and replacing text numbers and serial numbers; the similarity between the document to be detected and the retrieved documents is then calculated.

As described with reference to fig. 1, the method for calculating the similarity of mass texts provided by the invention specifically comprises the following steps:

S1: performing word bag persistence, loading word bags and constructing an AC tree for storing document indexes.

In a specific embodiment, the existing documents in the file storage system are first processed: the feature words of each document are extracted by TF-IDF and stored in a database table. The word bag is thereby persisted in the database table; it consists of the feature words and their word frequencies in the documents. Index initialization is then performed: the word bag is loaded by reading the database table, and an AC tree for storing document indexes is constructed. The structure of the AC tree is shown in fig. 2. The tree nodes store feature word indexes, with md5 used as the index from feature word to word bag. For example, when the feature word "Beijing University" is matched, the indexes of all documents containing that word, [doc1md5, doc2md5, ..., docNmd5], are returned.
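
As a minimal sketch of this initialization step, the following code extracts the top TF-IDF tokens per document and maps each feature word to md5 document indexes. Several things are assumptions rather than details from the patent: whitespace tokenization, the particular TF-IDF weighting, using the md5 of the document name as the document index, an in-memory dictionary standing in for the database table, and all function names.

```python
import hashlib
import math
from collections import Counter, defaultdict


def md5_hex(text: str) -> str:
    """Hex md5 digest, used here as a compact document index (assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def top_feature_words(docs: dict, k: int = 3) -> dict:
    """Return the k highest-scoring TF-IDF tokens of each document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))          # count each token once per document
    features = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores = {w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
                  for w, c in tf.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        features[name] = ranked[:k]
    return features


def build_word_bag(features: dict) -> dict:
    """Map each feature word to the md5 indexes of the documents it was extracted from."""
    bag = defaultdict(list)
    for name, words in features.items():
        for word in words:
            bag[word].append(md5_hex(name))
    return dict(bag)


corpus = {   # hypothetical, already tokenized documents
    "doc1": "beijing university admission notice".split(),
    "doc2": "beijing university library rules".split(),
    "doc3": "city budget report".split(),
}
features = top_feature_words(corpus, k=3)
word_bag = build_word_bag(features)   # stand-in for the persisted database table
```

In this toy corpus the word `beijing` ends up as a feature word of both `doc1` and `doc2`, so its entry in `word_bag` lists two document indexes, which mirrors the [doc1md5, doc2md5, ...] lists described above.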

S2: preprocessing the document to be detected.

In a specific embodiment, preprocessing the document to be detected includes: removing stop words, irrelevant words and special symbols; and replacing text numbers, serial numbers and the like with preset characters.
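
A preprocessing step of this kind might look like the sketch below. The stop-word list, the `<NUM>` placeholder, the number pattern and the function name are all illustrative assumptions; the patent does not specify them, and Chinese text would additionally need a word segmenter.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "is"}   # hypothetical stop-word list
NUMBER_PLACEHOLDER = "<NUM>"                    # hypothetical preset character

# A token is either a number-like run (digits joined by - / .) or a word.
# Anything else (punctuation, special symbols) is skipped by the scan.
TOKEN_RE = re.compile(r"(?P<num>\d+(?:[-/.]\d+)*)|(?P<word>\w+)")


def preprocess(text: str) -> list:
    """Lowercase, drop symbols and stop words, replace numbers by a placeholder."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        if match.lastgroup == "num":
            tokens.append(NUMBER_PLACEHOLDER)
        elif match.group() not in STOP_WORDS:
            tokens.append(match.group())
    return tokens


cleaned = preprocess("Notice No. 2023-118: the budget of the city is approved!")
```

Replacing serial numbers with one shared placeholder keeps two documents that differ only in their numbering from looking dissimilar.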

S3: the important features are used to exclude irrelevant documents in the documents to be detected.

In a specific embodiment, the purpose of this step is to screen documents quickly. Multi-pattern matching is carried out on the document to be detected, and the matched feature words and the corresponding document indexes are obtained, which narrows the detection range.
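
The screening step can be sketched as follows, assuming that the AC tree is an Aho-Corasick automaton (the usual structure for multi-pattern matching; the patent itself only says "AC tree"). The class, the sample word bag and the choice to keep document indexes in a dictionary beside the automaton, rather than inside the tree nodes, are simplifications made for illustration.

```python
from collections import deque


class ACTree:
    """Minimal Aho-Corasick automaton over characters."""

    def __init__(self):
        self.goto = [{}]     # goto[state][char] -> next state
        self.fail = [0]      # failure link of each state
        self.output = [[]]   # feature words recognized at each state

    def add(self, word: str) -> None:
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.output[state].append(word)

    def build(self) -> None:
        """Compute failure links breadth-first; depth-1 states fail to the root."""
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target (proper suffixes).
                self.output[nxt] += self.output[self.fail[nxt]]

    def search(self, text: str) -> set:
        """Return every feature word occurring in text, in one pass."""
        state, found = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found.update(self.output[state])
        return found


word_bag = {   # hypothetical feature word -> document indexes
    "beijing university": ["doc1md5", "doc2md5"],
    "budget report": ["doc3md5"],
    "university": ["doc4md5"],
}
tree = ACTree()
for feature_word in word_bag:
    tree.add(feature_word)
tree.build()

matched = tree.search("notice from beijing university library")
candidates = sorted({doc for word in matched for doc in word_bag[word]})
```

Only the documents in `candidates` go on to the more expensive similarity calculation; every other document in the corpus is skipped.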

S4: searching corresponding documents according to the document index, and identifying similar documents and similarity values of the documents to be detected by adopting a similarity calculation method of multi-feature fusion.

In the specific embodiment, firstly, semantic similarity calculation and literal similarity calculation are respectively carried out on a document to be detected and a screened document, and the final similarity is determined according to the calculation result; then, determining similar documents of the documents to be detected according to the highest value of the final similarity; and finally, returning the similar documents and the corresponding similarity values.

The similarity calculation process of the step is specifically as follows:

1. Semantic similarity calculation:

BERT suffers from anisotropy of its embedding space: high-frequency and low-frequency words are unevenly distributed, so similarity is hard to measure by cosine, dot product and similar measures, and using BERT sentence embeddings directly for text similarity calculation gives relatively poor results. For semantic similarity, the method therefore uses BERT-Flow to encode documents, which converts the anisotropic sentence-embedding distribution into a smooth, isotropic Gaussian distribution. Cosine similarity is used as the measure of document similarity, and the semantic similarity is calculated as follows:

$$\mathrm{sim}_{sem}(X, Y) = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

wherein X is the document to be detected, Y is the screened document, A is the encoding vector of X, and B is the encoding vector of Y.
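
The cosine measure itself is straightforward to sketch. In the code below the short vectors are hypothetical stand-ins for the BERT-Flow encodings A and B; producing real encodings would require a trained model, which is outside the scope of this example.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """cos(A, B) = (A . B) / (|A| * |B|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


A = [1.0, 2.0, 2.0]    # stand-in encoding of the document to be detected
B = [2.0, 4.0, 4.0]    # same direction as A
C = [2.0, -1.0, 0.0]   # orthogonal to A

same_direction = cosine_similarity(A, B)
orthogonal = cosine_similarity(A, C)
```

Because cosine similarity depends only on direction, `A` and `B` score as identical even though their magnitudes differ.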

2. Literal similarity calculation:

Because real-world text is complex and varied, considering semantic similarity alone is not sufficient: in practice both literal similarity and semantic similarity occur. The two kinds of similarity cannot be completely separated, but texts that are literally similar are not necessarily semantically similar. The method therefore combines semantic similarity with literal similarity, and the relative importance of the two in a complex business scenario is balanced by configuring weights. The literal similarity is calculated as follows:

The Jaro distance is calculated, taking the positions of characters in the text into account, to judge whether the two documents are structurally similar. The Jaro distance is based on the number of common characters between the two strings (characters that are the same and occupy corresponding positions in both strings). The literal similarity calculation formula is specifically as follows:

$$\mathrm{sim}_{lit}(X, Y) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$

wherein $|s_1|$ is the length of the document to be detected, $|s_2|$ is the length of the screened document, $m$ is the number of characters matched between the two strings, and $t$ is half the number of transpositions.
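
A Jaro-based literal similarity can be sketched as below. The sketch follows the standard Jaro definition, in which two equal characters count as matching when their positions differ by at most max(|s1|, |s2|) / 2 - 1; that matching window, the handling of empty or non-matching strings, and the function name are assumptions, since the patent text does not spell them out.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0   # number of matching characters
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count
    # the positions where they disagree; t is half of that count.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


swapped = jaro_similarity("MARTHA", "MARHTA")    # one transposed pair
partial = jaro_similarity("DIXON", "DICKSONX")   # four matches, no transposition
```

For `MARTHA` and `MARHTA` all six characters match and one pair is swapped, so m = 6 and t = 1, giving (1 + 1 + 5/6) / 3 = 17/18.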

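3. Calculating the final similarity $\mathrm{sim}(X, Y)$ according to the formula

$$\mathrm{sim}(X, Y) = \alpha \cdot \mathrm{sim}_{sem}(X, Y) + \beta \cdot \mathrm{sim}_{lit}(X, Y)$$

wherein $\alpha$ is the weight of the semantic similarity and $\beta$ is the weight of the literal similarity.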
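
The fusion and the selection of the most similar document can be sketched as a weighted sum followed by a maximum. The sketch assumes that the two weights sum to 1, so only the semantic weight `alpha` is passed and the literal weight is taken as `1 - alpha`; the score values and document indexes are made up for illustration.

```python
def fuse(sem: float, lit: float, alpha: float = 0.6) -> float:
    """Weighted sum of semantic and literal similarity; beta = 1 - alpha (assumed)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sem + (1.0 - alpha) * lit


def most_similar(scores: dict, alpha: float = 0.6) -> tuple:
    """Return (document index, final similarity) for the highest fused score."""
    fused = {doc: fuse(sem, lit, alpha) for doc, (sem, lit) in scores.items()}
    best = max(fused, key=fused.get)
    return best, fused[best]


# Hypothetical (semantic, literal) similarities of the screened candidates.
candidate_scores = {
    "doc1md5": (0.92, 0.40),
    "doc2md5": (0.75, 0.95),
    "doc4md5": (0.30, 0.20),
}
best_doc, best_score = most_similar(candidate_scores, alpha=0.6)
semantic_heavy_doc, _ = most_similar(candidate_scores, alpha=0.9)
```

With `alpha=0.6` the literally close `doc2md5` wins, while `alpha=0.9` favors the semantically close `doc1md5`, which shows how the configured weights shift the result for different business needs.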

The invention thus discloses a mass text similarity calculation method that considers both the literal similarity and the semantic similarity of texts, so that the accuracy of text similarity is ensured while fast retrieval is achieved; corresponding weights are set and can be adjusted to the requirements of the task. In addition, to retrieve similar documents quickly from massive texts, the method extracts document keywords to form a word bag and builds a tree-structured index, which enables fast retrieval over massive data and meets both the accuracy and the speed requirements of the algorithm.

Based on the above embodiment, as shown in fig. 3, the present invention also discloses a mass text similarity calculation system, including: the system comprises an initialization module 1, a preprocessing module 2, a screening module 3 and a similarity fusion calculation module 4.

The initialization module 1 is configured to perform word bag persistence, load word bags and construct an AC tree for storing document indexes.

And the preprocessing module 2 is configured to preprocess the document to be detected.

A screening module 3 configured to exclude irrelevant documents among the documents to be detected using the important features.

And the similarity fusion calculation module 4 is configured to search corresponding documents according to the document index, and identify similar documents and similarity values of the documents to be detected by adopting a multi-feature fusion similarity calculation method.

The specific implementation manner of the massive text similarity calculation system in this embodiment is basically identical to the specific implementation manner of the massive text similarity calculation method, and is not described herein again.

The invention also discloses a mass text similarity calculation device, which comprises a processor and a memory; the steps of the massive text similarity calculation method according to any one of the above are realized when the processor executes the massive text similarity calculation program stored in the memory.

Further, the mass text similarity calculation device in this embodiment may further include:

The input interface is used for acquiring a mass text similarity calculation program imported from outside and storing it in the memory, and for acquiring the instructions and parameters transmitted by external terminal equipment and passing them to the processor, so that the processor can carry out the corresponding processing with them. In this embodiment, the input interface may specifically include, but is not limited to, a USB interface, a serial interface, a voice input interface, a fingerprint input interface, a hard disk reading interface, and the like.

The output interface is used for outputting the data generated by the processor to the terminal equipment connected to it, so that other terminal equipment connected to the output interface can acquire the data generated by the processor. In this embodiment, the output interface may specifically include, but is not limited to, a USB interface, a serial interface, and the like.

The communication unit is used for establishing a remote communication connection between the mass text similarity calculation device and the external server so that the mass text similarity calculation device can mount the image files to the external server. In this embodiment, the communication unit may specifically include, but is not limited to, a remote communication unit based on a wireless communication technology or a wired communication technology.

The keyboard is used for acquiring, in real time, the parameter data or instructions that a user inputs by pressing the keys.

The display is used for showing, in real time, information related to the running mass text similarity calculation process.

The mouse may be used to assist the user in inputting data and to simplify user operations.

Embodiments of the present invention also disclose a readable storage medium, where the readable storage medium includes random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. The readable storage medium stores a mass text similarity calculation program which, when executed by a processor, implements the steps of the mass text similarity calculation method as described in any one of the above.

In summary, the text literal similarity and the semantic similarity are comprehensively considered, so that the accuracy of similar document calculation can be ensured, and meanwhile, similar documents can be effectively and rapidly searched in a large number of texts.

In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed systems, and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each module may exist alone physically, or two or more modules may be integrated in one unit.

Similarly, each processing unit in the embodiments of the present invention may be integrated in one functional module, or each processing unit may exist physically, or two or more processing units may be integrated in one functional module.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The method, the system, the device and the readable storage medium for calculating the similarity of the massive texts provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims

1. The mass text similarity calculation method is characterized by comprising the following steps of:

preprocessing a document to be detected;

2. The method for computing the similarity of mass texts according to claim 1, wherein the performing word bag persistence, loading word bags and constructing an AC tree for storing document indexes comprises:

3. The method for computing the similarity of mass texts according to claim 2, wherein the tree nodes of the AC tree are used for storing feature word indexes, and md5 is used as the feature word index.

4. The method for calculating the similarity of mass texts according to claim 3, wherein the preprocessing of the document to be detected comprises:

5. The method for computing the similarity of mass texts according to claim 4, wherein the step of eliminating irrelevant documents among the documents to be detected by using important features comprises:

6. The method for computing the similarity of massive texts according to claim 5, wherein the searching for corresponding documents according to the document index and identifying similar documents and similarity values of the documents to be detected by adopting a similarity computing method of multi-feature fusion comprises:

screening out corresponding documents according to the document indexes;

and returning the similar documents and the corresponding similarity values.

7. The method for calculating the similarity of mass texts according to claim 6, wherein the steps of performing semantic similarity calculation and literal similarity calculation and determining the final similarity according to the calculation result include:

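The literal similarity calculation formula is specifically as follows:

$$\mathrm{sim}_{lit}(X, Y) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right);$$

wherein $|s_1|$ is the length of the document to be detected, $|s_2|$ is the length of the screened document, $m$ is the number of characters matched between the two strings, and $t$ is half the number of transpositions;

calculating the final similarity $\mathrm{sim}(X, Y)$ according to the formula

$$\mathrm{sim}(X, Y) = \alpha \cdot \mathrm{sim}_{sem}(X, Y) + \beta \cdot \mathrm{sim}_{lit}(X, Y);$$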

8. A mass text similarity computing system, comprising:

9. A mass text similarity calculation device, comprising:

the memory is used for storing a mass text similarity calculation program;

a processor, configured to implement the steps of the mass text similarity calculation method according to any one of claims 1 to 7 when executing the mass text similarity calculation program.

10. A readable storage medium, characterized by: a mass text similarity calculation program stored on the readable storage medium, which when executed by a processor, implements the steps of the mass text similarity calculation method according to any one of claims 1 to 7.