CN117290460A - Method, system, device and storage medium for calculating similarity of massive texts - Google Patents

Method, system, device and storage medium for calculating similarity of massive texts

Publication number
CN117290460A
Authority
CN
China
Prior art keywords
similarity
document
documents
detected
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311576057.1A
Other languages
Chinese (zh)
Inventor
孙琦
魏东晓
于通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongfu Information Co Ltd
Original Assignee
Zhongfu Information Co Ltd
Application filed by Zhongfu Information Co Ltd
Priority to CN202311576057.1A
Publication of CN117290460A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method, a system, a device and a storage medium for calculating similarity of massive texts, and belongs to the technical field of text recognition. The method comprises the following steps: carrying out word bag persistence, loading word bags and constructing an AC tree for storing document indexes; preprocessing a document to be detected; eliminating irrelevant documents in the documents to be detected by using important features; searching corresponding documents according to the document index, and identifying similar documents and similarity values of the documents to be detected by adopting a similarity calculation method of multi-feature fusion. According to the method, the literal similarity and the semantic similarity of the texts are comprehensively considered, the calculation accuracy of the similar documents can be ensured, and meanwhile, the similar documents can be effectively and rapidly searched in a large number of texts.

Description

Method, system, device and storage medium for calculating similarity of massive texts
Technical Field
The invention relates to the technical field of text recognition, in particular to a method, a system, a device and a storage medium for calculating similarity of massive texts.
Background
Existing text similarity calculation methods include the following. Fingerprint similarity calculation finds similar documents quickly: documents are converted into fingerprints, mainly by simhash, minhash and similar techniques, and the similarity of different documents is computed by Hamming distance or a similar measure. Edit distance similarity calculation computes the edit distance between two documents to represent their similarity. Traditional semantic feature similarity calculation encodes documents with doc2vec or similar techniques and computes the similarity of two documents with a suitable distance metric. Large-model semantic similarity calculation loads a pre-trained model to encode the text and computes the similarity of two documents with a metric such as cosine similarity.
Although each of the above methods has its advantages, none of them delivers both accuracy and speed. Specifically:
Fingerprint similarity calculation is fast and its fingerprints are easy to store, which makes it the more common method in current massive text similarity calculation. Its disadvantage is insufficient accuracy: hash values are prone to collision, misjudgments occur, and the specific degree of similarity cannot be obtained.
Edit distance similarity calculation is accurate but slow.
Traditional semantic feature similarity calculation can take semantic information into account, but it is easily affected by initialization, so its results vary between good and bad and depend heavily on chance.
Most businesses use edit distance for text similarity analysis, judging similarity by counting the character-string transformations required. The method is simple and intuitive and handles most text similarity analysis, but it often makes low-level errors by judging two texts similar when they obviously do not match: for example, inserting a negation word into a text yields a small edit distance even though the two sentences have opposite meanings. In addition, specific fields often contain rare words or domain-specific terms whose meaning a semantic similarity calculation method may fail to capture accurately.
Disclosure of Invention
In view of the above problems, the invention aims to provide a method, a system, a device and a storage medium for calculating mass text similarity, which comprehensively consider the literal similarity and the semantic similarity of texts, can ensure the accuracy of similar document calculation, and can search for similar documents effectively and rapidly in mass texts.
To achieve this aim, the invention adopts the following technical scheme: a method for calculating the similarity of massive texts, comprising the following steps:
carrying out word bag persistence, loading word bags and constructing an AC tree for storing document indexes;
preprocessing a document to be detected;
eliminating irrelevant documents in the documents to be detected by using important features;
searching corresponding documents according to the document index, and identifying similar documents and similarity values of the documents to be detected by adopting a similarity calculation method of multi-feature fusion.
Further, the performing word bag persistence, loading word bags and constructing an AC tree for storing document indexes includes:
processing the existing document in the file storage system, extracting the characteristic words of the document through the TF-IDF, and storing the characteristic words in a database table;
and loading word bags by reading the database table, and constructing an AC tree for storing the document indexes.
Further, the tree nodes of the AC tree are configured to store feature word indexes, and md5 is used as the feature word index.
Further, the preprocessing of the document to be detected includes:
and removing stop words, irrelevant words and special symbols from the document to be detected, and replacing the text numbers and serial numbers in the document to be detected with preset characters.
Further, the excluding irrelevant documents in the documents to be detected by using the important features includes:
and carrying out multi-pattern matching on the document to be detected, and obtaining the matched feature words and the corresponding document indexes.
Further, searching for a corresponding document according to the document index, and identifying similar documents and similarity values of the document to be detected by adopting a similarity calculation method of multi-feature fusion, including:
screening out corresponding documents according to the document indexes;
respectively carrying out semantic similarity calculation and literal similarity calculation on the document to be detected and the screened document, and determining the final similarity according to the calculation result;
determining similar documents of the documents to be detected according to the highest value of the final similarity;
and returning the similar documents and the corresponding similarity values.
Further, the performing semantic similarity calculation and literal similarity calculation, and determining the final similarity according to the calculation result, includes:
calculating the semantic similarity $S_{sem}(X, Y)$ between the document to be detected and the screened document according to the formula
$$S_{sem}(X, Y) = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
wherein X is the document to be detected, Y is a screened document, A is the encoding vector of X, and B is the encoding vector of Y;
calculating the literal similarity $S_{lit}(X, Y)$ of the document to be detected and the screened document according to the literal similarity calculation formula;
the literal similarity calculation formula is specifically as follows:
$$S_{lit}(X, Y) = \frac{1}{3}\left(\frac{m}{|X|} + \frac{m}{|Y|} + \frac{m - t}{m}\right)$$
wherein $|X|$ is the length of the document to be detected, $|Y|$ is the length of the screened document, $m$ is the number of characters matched between the two strings, and $t$ is half of the number of transpositions;
calculating the final similarity $S(X, Y)$ according to the formula
$$S(X, Y) = \alpha \, S_{sem}(X, Y) + \beta \, S_{lit}(X, Y)$$
wherein $\alpha$ is the weight of the semantic similarity, $\beta$ is the weight of the literal similarity, and $\alpha + \beta = 1$.
Correspondingly, the invention also discloses a mass text similarity calculation system, which comprises:
the initialization module is configured for carrying out word bag persistence, loading word bags and constructing an AC tree for storing document indexes;
the preprocessing module is configured for preprocessing a document to be detected;
a screening module configured to exclude irrelevant documents among the documents to be detected using the important features;
and the similarity fusion calculation module is configured to search corresponding documents according to the document index, and identify similar documents and similarity values of the documents to be detected by adopting a multi-feature fusion similarity calculation method.
Correspondingly, the invention discloses a mass text similarity calculation device, which comprises:
the memory is used for storing a mass text similarity calculation program;
and the processor is used for realizing the steps of the mass text similarity calculation method when executing the mass text similarity calculation program.
Correspondingly, the invention discloses a readable storage medium, wherein a mass text similarity calculation program is stored on the readable storage medium, and the mass text similarity calculation program realizes the steps of the mass text similarity calculation method according to any one of the above steps when being executed by a processor.
Compared with the prior art, the invention has the following beneficial effects: the invention discloses a method, a system, a device and a storage medium for calculating mass text similarity, which perform multi-pattern matching on documents by constructing an AC tree and use important features to exclude most irrelevant documents, yielding a small preselected result set; this realizes fast screening, copes with large-scale data and finds similar documents quickly. Then, using a hybrid similarity calculation method, both the literal similarity and the semantic similarity of the text are considered when identifying similar documents, so that the accuracy of the text similarity is guaranteed while fast retrieval is achieved.
It can be seen that the present invention has outstanding substantial features and significant advances over the prior art, as well as the benefits of its implementation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of an embodiment of the present invention.
Fig. 2 is a schematic diagram of an AC tree structure according to an embodiment of the present invention.
Fig. 3 is a system configuration diagram of an embodiment of the present invention.
In the figures: 1. initialization module; 2. preprocessing module; 3. screening module; 4. similarity fusion calculation module.
Detailed Description
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a method for calculating the similarity of massive texts, which is divided into two stages: an initialization stage and a similar document retrieval stage. The initialization stage only needs to run once, when the service is started; during use, if the word bag is updated, the database table and the AC tree are updated at the same time according to the tree structure described in this patent. In the similar document retrieval stage, the document to be detected is first preprocessed, which includes removing stop words, irrelevant words and special symbols and replacing text numbers and serial numbers; the similarity between the document to be detected and the retrieved documents is then calculated.
As described with reference to fig. 1, the method for calculating the similarity of mass texts provided by the invention specifically comprises the following steps:
s1: performing word bag persistence, loading word bags and constructing an AC tree for storing document indexes.
In a specific embodiment, the existing documents in the file storage system are first processed: the feature words of each document are extracted by TF-IDF and stored in a database table. The word bag is thus persisted in the database table and consists of the feature words and their word frequencies in the documents. Index initialization is then performed: the word bag is loaded by reading the database table, and an AC tree for storing document indexes is constructed. The structure of the AC tree is shown in fig. 2. The tree nodes store feature word indexes, with md5 used as the feature word (word bag) index; for example, when "Beijing University" is matched, the indexes of all documents containing that word, [doc1md5, doc2md5, ..., docNmd5], are returned.
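The initialization step can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names (`doc_index`, `feature_words`, `build_word_bag`), the smoothed TF-IDF variant, the choice of hashing the joined tokens with md5, and the in-memory dictionary standing in for the database table are all assumptions made for the example. In this sketch a feature word is mapped only to the documents it was extracted from.

```python
import hashlib
import math
from collections import Counter, defaultdict


def doc_index(tokens: list[str]) -> str:
    """Assumed document index: md5 of the space-joined tokens."""
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()


def feature_words(corpus: list[list[str]], k: int = 3) -> list[list[str]]:
    """Return the k highest-scoring TF-IDF tokens of each tokenized document."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per document

    selected = []
    for tokens in corpus:
        term_freq = Counter(tokens)
        scores = {
            word: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in term_freq.items()
        }
        # Sort by score (descending), then alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        selected.append(ranked[:k])
    return selected


def build_word_bag(corpus: list[list[str]], k: int = 3) -> dict[str, list[str]]:
    """Map each feature word to the indexes of the documents it was extracted from."""
    bag = defaultdict(list)
    for tokens, words in zip(corpus, feature_words(corpus, k)):
        for word in words:
            bag[word].append(doc_index(tokens))
    return dict(bag)
```

In a real deployment the returned mapping would be written to the database table and reloaded at service start; the dictionary here only shows the shape of the data.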
S2: preprocessing the document to be detected.
In a specific embodiment, preprocessing the document to be detected includes: removing stop words, irrelevant words and special symbols; and replacing text numbers, serial numbers and the like.
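A minimal sketch of such a preprocessing step is given below. The stop-word set, the `NUM` placeholder and the regular expressions are hypothetical choices for illustration; the patent does not specify the stop-word list, the preset character, or the tokenization used.

```python
import re

# Hypothetical stop-word list and placeholder; neither is given in the source.
STOP_WORDS = {"the", "a", "an", "of", "and"}
NUMBER_PLACEHOLDER = "NUM"


def preprocess(text: str) -> list[str]:
    """Lowercase, replace number-like runs, strip symbols, drop stop words."""
    text = text.lower()
    # Replace digit runs such as "2023-117" or "4" with the preset placeholder.
    text = re.sub(r"\d+(?:[-/.]\d+)*", f" {NUMBER_PLACEHOLDER} ", text)
    # Replace every remaining character that is not a word character or space.
    text = re.sub(r"[^\w\s]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]
```

Replacing numbers with a single placeholder keeps documents that differ only in their serial numbers literally close to each other, which appears to be the intent of the substitution.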
S3: the important features are used to exclude irrelevant documents in the documents to be detected.
In a specific embodiment, the purpose of this step is to screen documents quickly. Multi-pattern matching is performed on the document to be detected to obtain the matched feature words and the corresponding document indexes, thereby narrowing the detection range.
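The multi-pattern matching step can be illustrated with a small Aho-Corasick automaton, assuming that "AC tree" refers to this structure (the source does not expand the abbreviation). The class name `ACAutomaton` and its layout are assumptions; for readability the sketch stores the feature words themselves at the nodes rather than their md5 indexes.

```python
from collections import deque


class ACAutomaton:
    """Aho-Corasick automaton mapping feature words to document indexes."""

    def __init__(self, word_to_docs: dict[str, list[str]]):
        self.word_to_docs = word_to_docs
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # feature words recognized at each state
        for word in word_to_docs:
            if word:
                self._insert(word)
        self._build_failure_links()

    def _insert(self, word: str) -> None:
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self) -> None:
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def match(self, text: str) -> dict[str, list[str]]:
        """Return every feature word found in text with its document indexes."""
        hits = {}
        state = 0
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                hits[word] = self.word_to_docs[word]
        return hits
```

The union of the returned document indexes forms the preselected candidate set; documents that share no feature word with the document to be detected are never passed to the more expensive similarity calculation.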
S4: searching corresponding documents according to the document index, and identifying similar documents and similarity values of the documents to be detected by adopting a similarity calculation method of multi-feature fusion.
In the specific embodiment, firstly, semantic similarity calculation and literal similarity calculation are respectively carried out on a document to be detected and a screened document, and the final similarity is determined according to the calculation result; then, determining similar documents of the documents to be detected according to the highest value of the final similarity; and finally, returning the similar documents and the corresponding similarity values.
The similarity calculation process of the step is specifically as follows:
1. Semantic similarity calculation:
Because BERT suffers from anisotropy of its embedding space, and high-frequency and low-frequency words are unevenly distributed, similarity is difficult to measure by cosine, dot product and similar means; using BERT sentence embeddings directly for text similarity calculation therefore gives relatively poor results. For semantic similarity, the method uses BERT-Flow to encode the documents, which can transform the anisotropic sentence embedding distribution into a smooth, isotropic Gaussian distribution. Cosine similarity is used as the measure of document similarity, and the semantic similarity $S_{sem}(X, Y)$ is calculated as follows:
$$S_{sem}(X, Y) = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
wherein X is the document to be detected, Y is a screened document, A is the encoding vector of X, and B is the encoding vector of Y.
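The cosine measure itself is easy to show in isolation. The sketch below assumes the encoding vectors A and B have already been produced by a sentence encoder such as BERT-Flow, which is not reproduced here; the function name and the zero-vector convention are illustrative choices.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two encoding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("encoding vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

Parallel vectors score 1 and orthogonal vectors score 0, independent of vector length, which is why the measure suits embeddings whose norms vary between documents.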
2. Literal similarity calculation:
Because text in real environments is complex and varied, considering semantic similarity alone is insufficient: in practice both literal similarity and semantic similarity occur. Although the two cannot be completely separated, texts that are literally similar are not necessarily semantically similar. The method therefore combines semantic similarity with literal similarity and balances their relative importance in complex business scenarios through configurable weights. The literal similarity is calculated as follows:
The Jaro distance is calculated, which takes the position of characters in the text into account, to judge whether two documents are structurally similar. The Jaro distance calculation is based on the number of characters common to the two strings (characters whose positions in the two strings are the same). The literal similarity calculation formula is specifically as follows:
$$S_{lit}(X, Y) = \frac{1}{3}\left(\frac{m}{|X|} + \frac{m}{|Y|} + \frac{m - t}{m}\right)$$
wherein $|X|$ is the length of the document to be detected, $|Y|$ is the length of the screened document, $m$ is the number of characters matched between the two strings, and $t$ is half of the number of transpositions;
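A sketch of the Jaro calculation follows. It uses the standard definition of Jaro similarity, in which two characters match when they are equal and lie within a window of max(|X|, |Y|) // 2 - 1 positions of each other, and returns 0 when no characters match; whether the patent uses exactly this window is an assumption.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0  # number of matching characters
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # half of the number of transpositions
    return (m / len1 + m / len2 + (m - t) / m) / 3
```

For the classic pair "MARTHA" and "MARHTA", all six characters match and one pair is transposed (m = 6, t = 1), giving (1 + 1 + 5/6) / 3, roughly 0.944.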
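3. The final similarity $S(X, Y)$ is calculated according to the formula
$$S(X, Y) = \alpha \, S_{sem}(X, Y) + \beta \, S_{lit}(X, Y)$$
wherein $S_{sem}$ and $S_{lit}$ are the semantic similarity and the literal similarity calculated above, $\alpha$ is the weight of the semantic similarity, $\beta$ is the weight of the literal similarity, and $\alpha + \beta = 1$.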
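The weighted fusion and the selection of the most similar document can be sketched as below. The function names, the default weight of 0.5, and the assumption that the two weights sum to 1 (so that the literal weight is 1 - alpha) are illustrative and not confirmed by the source.

```python
def fuse(semantic: float, literal: float, alpha: float = 0.5) -> float:
    """Weighted sum of the two similarities; the literal weight is 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic + (1.0 - alpha) * literal


def most_similar(
    scores: dict[str, tuple[float, float]], alpha: float = 0.5
) -> tuple[str, float]:
    """Pick the candidate with the highest fused similarity.

    scores maps a document index to (semantic similarity, literal similarity).
    """
    if not scores:
        raise ValueError("no candidate documents to compare")
    fused = {doc: fuse(sem, lit, alpha) for doc, (sem, lit) in scores.items()}
    best = max(fused, key=fused.get)
    return best, fused[best]
```

Raising alpha favors documents that mean the same thing even when worded differently, while lowering it favors near-verbatim copies; the appropriate setting depends on the task.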
The invention thus discloses a mass text similarity calculation method that considers both the literal similarity and the semantic similarity of texts, so that the accuracy of text similarity is guaranteed while fast retrieval is achieved; corresponding weights are set and can be adjusted according to task requirements. In addition, to find similar documents quickly in a large number of texts, the patent extracts document keywords to form a word bag and builds a tree-structured index, which enables fast retrieval over massive data and meets the accuracy and speed requirements of the algorithm.
Based on the above embodiment, as shown in fig. 3, the present invention also discloses a mass text similarity calculation system, including: the system comprises an initialization module 1, a preprocessing module 2, a screening module 3 and a similarity fusion calculation module 4.
The initialization module 1 is configured to perform word bag persistence, load word bags and construct an AC tree for storing document indexes.
And the preprocessing module 2 is configured to preprocess the document to be detected.
A screening module 3 configured to exclude irrelevant documents among the documents to be detected using the important features.
And the similarity fusion calculation module 4 is configured to search corresponding documents according to the document index, and identify similar documents and similarity values of the documents to be detected by adopting a multi-feature fusion similarity calculation method.
The specific implementation manner of the massive text similarity calculation system in this embodiment is basically identical to the specific implementation manner of the massive text similarity calculation method, and is not described herein again.
The invention also discloses a mass text similarity calculation device, which comprises a processor and a memory; the steps of the massive text similarity calculation method according to any one of the above are realized when the processor executes the massive text similarity calculation program stored in the memory.
Further, the mass text similarity calculation device in this embodiment may further include:
the input interface is used for acquiring a mass text similarity calculation program imported from the outside, storing the acquired mass text similarity calculation program into the memory, and acquiring various instructions and parameters transmitted by the external terminal equipment and transmitting the various instructions and parameters into the processor so that the processor can develop corresponding processing by utilizing the various instructions and parameters. In this embodiment, the input interface may specifically include, but is not limited to, a USB interface, a serial interface, a voice input interface, a fingerprint input interface, a hard disk reading interface, and the like.
And the output interface is used for outputting various data generated by the processor to the terminal equipment connected with the output interface so that other terminal equipment connected with the output interface can acquire various data generated by the processor. In this embodiment, the output interface may specifically include, but is not limited to, a USB interface, a serial interface, and the like.
And the communication unit is used for establishing remote communication connection between the mass text similarity calculation device and the external server so that the mass text similarity calculation device can mount the image files to the external server. In this embodiment, the communication unit may specifically include, but is not limited to, a remote communication unit based on a wireless communication technology or a wired communication technology.
The keyboard is used to acquire, in real time, the parameter data or instructions that a user enters by pressing keys.
The display is used to show, in real time, information related to the running mass text similarity calculation process.
The mouse can be used to assist the user in inputting data and to simplify user operations.
Embodiments of the present invention also disclose a readable storage medium, where the readable storage medium includes Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art. A readable storage medium stores a mass text similarity calculation program which when executed by a processor implements the steps of the mass text similarity calculation method as described in any one of the above.
In summary, the text literal similarity and the semantic similarity are comprehensively considered, so that the accuracy of similar document calculation can be ensured, and meanwhile, similar documents can be effectively and rapidly searched in a large number of texts.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the method disclosed in the embodiment, since it corresponds to the system disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each module may exist alone physically, or two or more modules may be integrated in one unit.
Similarly, each processing unit in the embodiments of the present invention may be integrated in one functional module, or each processing unit may exist physically, or two or more processing units may be integrated in one functional module.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The method, the system, the device and the readable storage medium for calculating the similarity of the massive texts provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims (10)

1. The mass text similarity calculation method is characterized by comprising the following steps of:
carrying out word bag persistence, loading word bags and constructing an AC tree for storing document indexes;
preprocessing a document to be detected;
eliminating irrelevant documents in the documents to be detected by using important features;
searching corresponding documents according to the document index, and identifying similar documents and similarity values of the documents to be detected by adopting a similarity calculation method of multi-feature fusion.
2. The method for computing the similarity of mass texts according to claim 1, wherein the performing word bag persistence, loading word bags and constructing an AC tree for storing document indexes comprises:
processing the existing document in the file storage system, extracting the characteristic words of the document through the TF-IDF, and storing the characteristic words in a database table;
and loading word bags by reading the database table, and constructing an AC tree for storing the document indexes.
3. The method for computing the similarity of mass texts according to claim 2, wherein the tree nodes of the AC tree are used for storing feature word indexes, and md5 is used as the feature word index.
4. The method for calculating the similarity of mass texts according to claim 3, wherein the preprocessing of the document to be detected comprises:
and removing stop words, irrelevant words and special symbols from the document to be detected, and replacing the text numbers and serial numbers in the document to be detected with preset characters.
5. The method for calculating the similarity of massive texts according to claim 4, wherein excluding, by using important features, documents that are irrelevant to the document to be detected comprises:
performing multi-pattern matching on the document to be detected, and obtaining the matched feature words and the corresponding document indexes.
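The multi-pattern matching of claim 5 can be illustrated with a small Aho-Corasick automaton, which is assumed here to be what "AC tree" refers to. The class `ACAutomaton`, the helper `candidate_docs` and the sample feature words are hypothetical; in this sketch each matched feature word is looked up in a plain dictionary rather than in per-node index entries.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton over characters (illustrative sketch)."""

    def __init__(self):
        self.goto = [{}]    # goto[state][char] -> next state
        self.fail = [0]     # failure link of each state
        self.out = [set()]  # feature words recognised at each state

    def add(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].add(word)

    def build(self):
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit the words that end at the failure target.
                self.out[nxt] |= self.out[self.fail[nxt]]

    def search(self, text):
        state, found = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found |= self.out[state]
        return found


def candidate_docs(automaton, feature_docs, text):
    """Return the ids of documents sharing at least one feature word with text."""
    matched = automaton.search(text)
    return set().union(*(feature_docs[word] for word in matched))


# Illustrative feature words and document ids.
feature_docs = {"he": {"d1"}, "she": {"d2"}, "his": {"d3"}, "hers": {"d1", "d4"}}
ac = ACAutomaton()
for word in feature_docs:
    ac.add(word)
ac.build()
print(sorted(ac.search("ushers")), sorted(candidate_docs(ac, feature_docs, "ushers")))
```

Because the text is scanned once regardless of how many feature words are loaded, only documents that share a feature word with the input reach the more expensive similarity calculation.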
6. The method for calculating the similarity of massive texts according to claim 5, wherein searching for the corresponding documents according to the document indexes and identifying the similar documents of the document to be detected and their similarity values by the multi-feature-fusion similarity calculation method comprises:
screening out the corresponding documents according to the document indexes;
performing semantic similarity calculation and literal similarity calculation, respectively, on the document to be detected and each screened document, and determining a final similarity according to the calculation results;
determining the similar document of the document to be detected according to the highest value of the final similarity;
returning the similar document and the corresponding similarity value.
7. The method for calculating the similarity of massive texts according to claim 6, wherein performing the semantic similarity calculation and the literal similarity calculation and determining the final similarity according to the calculation results comprises:
calculating the semantic similarity $S_{sem}$ between the document to be detected and the screened document according to the formula
$$S_{sem}(X, Y) = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
wherein $X$ is the document to be detected, $Y$ is the screened document, $A$ is the encoding vector of $X$, and $B$ is the encoding vector of $Y$;
calculating the literal similarity $S_{lit}$ between the document to be detected and the screened document according to the literal similarity calculation formula, which is specifically:
$$S_{lit} = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$
wherein $|s_1|$ is the length of the document to be detected, $|s_2|$ is the length of the screened document, $m$ is the number of characters that match between the two strings, and $t$ is half of the number of transpositions;
calculating the final similarity $S$ according to the formula
$$S = \alpha \, S_{sem} + \beta \, S_{lit}$$
wherein $\alpha$ is the weight of the semantic similarity, $\beta$ is the weight of the literal similarity, and $\alpha + \beta = 1$.
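The fusion described in claim 7 can be sketched as follows, assuming that the semantic similarity is the cosine of the two encoding vectors and that the literal similarity is the Jaro similarity, which is what the quantities named in the claim (matched characters and half the number of transpositions) suggest. The function names, the default weight `alpha=0.6` and the sample inputs are hypothetical, and how the encoding vectors are produced is outside this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length encoding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaro_similarity(s1, s2):
    """Jaro similarity of two strings, in the range 0.0 to 1.0."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    seq1 = [c for c, used in zip(s1, used1) if used]
    seq2 = [c for c, used in zip(s2, used2) if used]
    # t is half the number of matched characters that are out of order.
    t = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def final_similarity(vec_x, vec_y, text_x, text_y, alpha=0.6):
    """Weighted sum of semantic and literal similarity, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    return alpha * cosine_similarity(vec_x, vec_y) + beta * jaro_similarity(text_x, text_y)


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
print(final_similarity([1.0, 0.0], [0.0, 1.0], "abc", "abc", alpha=0.5))
```

Combining the two scores lets paraphrased documents (high semantic, lower literal similarity) and near-verbatim copies (high literal similarity) both surface, with the weights deciding which kind of resemblance dominates.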
8. A system for calculating the similarity of massive texts, characterized by comprising:
an initialization module configured to persist a bag of words, load the bag of words, and construct an AC tree for storing document indexes;
a preprocessing module configured to preprocess a document to be detected;
a screening module configured to exclude, by using important features, documents that are irrelevant to the document to be detected;
a similarity fusion calculation module configured to search for the corresponding documents according to the document indexes, and to identify the similar documents of the document to be detected and their similarity values by a multi-feature-fusion similarity calculation method.
9. A device for calculating the similarity of massive texts, characterized by comprising:
a memory for storing a massive text similarity calculation program;
a processor configured to implement, when executing the massive text similarity calculation program, the steps of the method for calculating the similarity of massive texts according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that a massive text similarity calculation program is stored on the readable storage medium, and the program, when executed by a processor, implements the steps of the method for calculating the similarity of massive texts according to any one of claims 1 to 7.
CN202311576057.1A 2023-11-24 2023-11-24 Method, system, device and storage medium for calculating similarity of massive texts Pending CN117290460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311576057.1A CN117290460A (en) 2023-11-24 2023-11-24 Method, system, device and storage medium for calculating similarity of massive texts

Publications (1)

Publication Number Publication Date
CN117290460A true CN117290460A (en) 2023-12-26

Family

ID=89244736

Country Status (1)

Country Link
CN (1) CN117290460A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination