CN109635084B - Real-time rapid duplicate removal method and system for multi-source data document - Google Patents

Publication number
CN109635084B
CN109635084B CN201811456999.5A CN201811456999A CN109635084B CN 109635084 B CN109635084 B CN 109635084B CN 201811456999 A CN201811456999 A CN 201811456999A CN 109635084 B CN109635084 B CN 109635084B
Authority
CN
China
Prior art keywords
document
similar
current
data
current document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811456999.5A
Other languages
Chinese (zh)
Other versions
CN109635084A (en)
Inventor
柴志伟
丑晓慧
许冠宇
宋乐安
许涵洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Deepq Information Technology Co ltd
Ningbo Deepq Information Technology Co ltd
Original Assignee
Shanghai Deepq Information Technology Co ltd
Ningbo Deepq Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Deepq Information Technology Co ltd, Ningbo Deepq Information Technology Co ltd
Priority to CN201811456999.5A
Publication of CN109635084A
Application granted
Publication of CN109635084B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information processing, and particularly relates to a real-time rapid duplicate removal method and system for multi-source data documents. The method comprises the following steps: receiving a current document and filtering it to obtain filtered document data; calculating the characteristic words of the document data through a locality-sensitive hashing algorithm; judging, according to the characteristic words and the document data, whether the current document is similar to a previous document stored in the database; and, if it is not similar, storing the characteristic words and the document data of the current document in the database, and otherwise not storing them. The method and system can perform real-time rapid duplicate removal on similar document data from different sources and avoid repeated storage of similar documents.

Description

Real-time rapid duplicate removal method and system for multi-source data document
Technical Field
The invention belongs to the technical field of information processing, and particularly relates to a real-time and rapid duplicate removal method and system for a multi-source data document.
Background
Document data from different website sources is often forwarded or quoted, so articles with identical content or a high repetition rate exist, and in practical applications such similar articles need to be screened and filtered out. Earlier manual editing methods consume a large amount of labor, and for data that must be pushed in real time, such as news, manual deduplication is far too slow. General deduplication algorithms, meanwhile, occupy a large amount of memory during online calculation and easily cause memory overflow when the data volume is too large; offline calculation solves the memory overflow problem but cannot guarantee timely deduplication.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a real-time quick duplicate removal method and a real-time quick duplicate removal system for multi-source data documents, which can carry out real-time quick duplicate removal processing on similar document data from different sources and avoid repeated storage of the similar documents.
In a first aspect, the invention provides a real-time fast duplicate removal method for a multi-source data document, which comprises the following steps:
receiving a current document and filtering the current document to obtain filtered document data;
calculating the characteristic words of the document data through a locality-sensitive hashing algorithm;
judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data;
and if not, storing the characteristic words and the document data of the current document into the database, otherwise, not storing.
Preferably, the calculating the feature words of the document data by using the locality-sensitive hashing algorithm specifically includes:
segmenting the text content in the document data to obtain a number of words;
counting the weight of each word by a word frequency statistical method;
mapping each word to a hash value by using a hash algorithm;
carrying out weighted calculation on the hash value of each word according to the weight to obtain a weighted digit string;
summing the digit strings of all the words position by position to obtain a final digit string;
converting the final digit string into a 64-bit feature word in 0/1 form.
Preferably, the determining whether the current document is similar to a previous document stored in the database according to the feature words and the document data includes the specific steps of:
calculating the hamming distance between the characteristic words of the current document and the characteristic words of the previous document, wherein if the hamming distance is more than or equal to N, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and obtaining a preliminary similar document;
and calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is larger than M, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained.
Preferably, the determining whether the current document is similar to a previous document stored in the database according to the feature words and the document data further includes:
extracting keywords of the current document and keywords of the second-degree similar document;
when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;
and when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained.
Preferably, the determining whether the current document is similar to a previous document stored in the database according to the feature words and the document data further includes:
and calculating the data value occupancy of the current document and the data value occupancy of the three-degree similar document, wherein if the data value occupancy of the two documents is the same and the data is different, the current document is not similar to the three-degree similar document, and otherwise, the current document is similar to the three-degree similar document.
Preferably, the document data includes article titles, ID numbers, text contents, and data source identifiers, and the storing of the feature words and the document data of the current document in the database includes the specific steps of:
combining the characteristic words and the serial number ID of the current document into a key value of the current document;
and storing the article title, the ID number, the text content, the data source identification and the key value of the current document into a redis database.
Preferably, before the similarity comparison between the current document and the previous document is carried out, the key value of the previous document is extracted from the database, and the characteristic words of the previous document are obtained according to the key value.
In a second aspect, the present invention provides a real-time fast duplicate removal system for a multi-source data document, which is suitable for the real-time fast duplicate removal method for the multi-source data document described in the first aspect, and includes:
the data processing unit is used for receiving the current document and filtering the current document to obtain filtered document data;
the computing unit is used for computing the characteristic words of the document data through a locality-sensitive hashing algorithm;
the similarity judging unit is used for judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data;
and the duplicate removal access unit is used for storing the characteristic words and the document data of the current document into the database if the current document is not similar to the previous document, and otherwise, not storing the characteristic words and the document data.
Preferably, the similarity determination unit is specifically configured to:
calculating the hamming distance between the characteristic words of the current document and the characteristic words of the previous document, wherein if the hamming distance is more than or equal to 3, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and obtaining a preliminary similar document;
calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is more than 500, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained;
extracting keywords of the current document and keywords of the second-degree similar document; when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained; when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;
and calculating the data value occupancy of the current document and the data value occupancy of the three-degree similar document, wherein if the data value occupancy of the two documents is the same and the data is different, the current document is not similar to the three-degree similar document, and otherwise, the current document is similar to the three-degree similar document.
In a third aspect, the present invention provides a computer terminal comprising a processor and a memory coupled to the processor, the memory being configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method according to the first aspect.
According to the technical scheme, the similarity of the current document and the stored previous document is identified and judged, the dissimilar current document is stored, and the similar current document is not stored, so that the real-time rapid duplicate removal processing of the similar documents from different sources is realized, and the repeated storage of the similar documents is avoided.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a schematic flow chart of a real-time fast deduplication method for a multi-source data document in the present embodiment;
FIG. 2 is a schematic structural diagram of a real-time fast deduplication system for multi-source data documents in this embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The first embodiment is as follows:
the embodiment provides a real-time quick duplicate removal method for a multi-source data document, as shown in fig. 1, comprising the following steps:
s1, receiving the current document and filtering the current document to obtain the filtered document data;
s2, calculating the characteristic words of the document data through a locality-sensitive hashing algorithm;
s3, judging whether the current document is similar to the previous document stored in the database according to the characteristic words and the document data;
s4, if not similar, storing the characteristic word and the document data of the current document into the database, otherwise not storing.
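The flow of steps S1 to S4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `deduplicate`, the in-memory list standing in for the database, and the pluggable `fingerprint` and `is_similar` callables are all assumptions introduced here. The document passed in is assumed to be already filtered (S1).

```python
from typing import Callable, List


def deduplicate(current: dict,
                stored: List[dict],
                fingerprint: Callable[[str], int],
                is_similar: Callable[[dict, dict], bool]) -> bool:
    """Return True if the current document was stored, False if it was dropped.

    `stored` is a plain list standing in for the database (an assumption).
    """
    # S2: compute the feature word from the filtered text.
    current = dict(current, feature=fingerprint(current["text"]))
    # S3: compare with every previously stored document.
    for previous in stored:
        if is_similar(current, previous):
            return False  # S4: similar, so the document is not stored
    stored.append(current)  # S4: not similar to any stored document
    return True
```

Keeping the fingerprint and the similarity test as parameters lets the later steps of the description be swapped in without changing the store-or-drop logic.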
In this embodiment, suppose a user enters a website from a circle of friends (a social feed) to read an article, finds the article good, and forwards it. The background receives the forwarded document as the current document, filters it to remove content such as the website header and footer, and obtains the filtered document data, which contains the article-related content, for example: article title, ID number, text content, data source identification, etc.
The feature words are then calculated from the filtered document data. In S2, the feature words of the document data are calculated by a locality-sensitive hashing algorithm, with the following specific steps:
S21, segmenting the text content in the document data to obtain a number of words;
S22, counting the weight of each word by a word frequency statistical method;
S23, mapping each word to a hash value by using a hash algorithm;
S24, carrying out weighted calculation on the hash value of each word according to the weight to obtain a weighted digit string;
S25, summing the digit strings of all words position by position to obtain the final digit string;
S26, converting the final digit string into a 64-bit feature word in 0/1 form.
In this embodiment, the text content is segmented into words, and meaningless words (stop words) are removed. The number of occurrences of each word in the text is counted and divided by the total number of words in the full text to obtain the word frequency, which serves as the weight of the word. Each word is mapped to a hash value by a hash algorithm; for example, the word "robot" is mapped to 11001 (5 bits in this example; the actual implementation uses 64 bits). The hash value of each word is then weighted: if the weight of "robot" is 3, the weighted digit string is 3 3 -3 -3 3 (a 1 in the hash becomes 1x3, a 0 becomes -1x3). After the digit string of each word is computed, the digit strings are summed position by position. For example, if the digit string of "intelligent" is 6 -6 6 6 -6, the digit string obtained by summing the two words "intelligent" and "robot" position by position is:
(3+6) (3-6) (-3+6) (-3+6) (3-6) -> 9 -3 3 3 -3
By analogy, all words in the full text are summed position by position to obtain the final digit string. The final digit string is then converted into a 64-bit feature word in 0/1 form (the rule is 1 for a value greater than 0 and 0 for a value less than 0, giving 10110 in this example).
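The worked example above can be reproduced with a short sketch. The helper names `weighted_sum`, `to_feature_word` and `simhash` are illustrative. The patent does not name the hash function, so taking the first 64 bits of MD5 is an assumption, as is mapping a sum of exactly 0 to bit 0 (the text only covers values greater than or less than 0). The hash 10110 for "intelligent" is inferred from its digit string 6 -6 6 6 -6.

```python
import hashlib
from collections import Counter


def weighted_sum(hashed_words, bits):
    """Sum weighted digit strings position by position.

    hashed_words: iterable of (hash_value, weight) pairs.
    A 1 bit contributes +weight, a 0 bit contributes -weight.
    """
    totals = [0] * bits
    for value, weight in hashed_words:
        for i in range(bits):
            bit = (value >> (bits - 1 - i)) & 1  # leftmost bit first
            totals[i] += weight if bit else -weight
    return totals


def to_feature_word(totals):
    """Convert the final digit string to 0/1 form, packed into an integer."""
    word = 0
    for total in totals:
        # Assumption: a sum of exactly 0 maps to bit 0.
        word = (word << 1) | (1 if total > 0 else 0)
    return word


def simhash(words, bits=64):
    """64-bit feature word for a list of already segmented words."""
    counts = Counter(words)
    total = sum(counts.values())
    hashed = []
    for word, count in counts.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # assumption: MD5, first 64 bits
        hashed.append((value, count / total))      # weight = word frequency
    return to_feature_word(weighted_sum(hashed, bits))
```

With the 5-bit values from the text, `weighted_sum([(0b11001, 3), (0b10110, 6)], 5)` gives `[9, -3, 3, 3, -3]`, and `to_feature_word` turns that into `0b10110`.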
After the characteristic words are obtained, the similarity between the current document and the previous document is identified and judged, wherein in S3, whether the current document is similar to the previous document stored in the database is judged according to the characteristic words and the document data, and the specific steps are as follows:
s31, calculating the Hamming distance between the feature word of the current document and the feature word of the previous document, if the Hamming distance is more than or equal to 3, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and obtaining a preliminary similar document;
s32, calculating word number difference between the text content of the current document and the text content of the preliminary similar document, if the word number difference is larger than 500, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and obtaining a second-degree similar document.
S33, extracting keywords of the current document and keywords of the second-degree similar document;
when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;
and when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained.
S34, calculating the data value occupancy of the current document and the data value occupancy of the three-degree similar document, if the data value occupancy of the two documents is the same and the data is different, the current document is not similar to the three-degree similar document, otherwise, the two documents are similar.
In this embodiment, the similarity judgment proceeds through four steps in sequence: if a step judges the documents not similar, the subsequent steps are skipped; a subsequent step is performed only when the previous step judges the documents similar.
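The early-exit order of the four steps can be sketched as a chain of checks. The name `is_similar` and the representation of each step as a callable taking two documents are assumptions made for illustration.

```python
from typing import Callable, Iterable

Check = Callable[[dict, dict], bool]


def is_similar(current: dict, previous: dict, checks: Iterable[Check]) -> bool:
    """Run the checks in order and stop at the first one that says 'not similar'."""
    # all() short-circuits: later checks are never called after a False result.
    return all(check(current, previous) for check in checks)
```

Ordering the checks from cheapest to most expensive means most non-duplicates are rejected by the first comparison alone.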
The first step judges by the feature words: the key value of the previous document is extracted from the redis database, and the feature word of the previous document is obtained from the key value. The Hamming distance between the feature word of the current document and that of the previous document is calculated. If the Hamming distance is greater than or equal to 3, the current document is not similar to the previous document; if it is less than 3, the current document is similar to the previous document, and a preliminary similar document is obtained. Because similarity judgment based on feature words obtained by the locality-sensitive hashing algorithm has certain defects, further judgments are carried out subsequently to make the result more accurate.
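A minimal sketch of the first step, assuming the feature words are held as Python integers (the function names are illustrative):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two feature words differ."""
    return bin(a ^ b).count("1")


def preliminary_similar(a: int, b: int, threshold: int = 3) -> bool:
    """Similar when the Hamming distance is below the threshold (3 in this embodiment)."""
    return hamming_distance(a, b) < threshold
```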
The second step judges by the word count: the text content of the previous document is extracted from the redis database and the word counts of the two texts are compared. If the word count difference is greater than 500, the current document is not similar to the preliminary similar document; otherwise it is similar, and a second-degree similar document is obtained. Because a long document has many words and may contain a short document, two documents of very different length could otherwise be judged similar; documents whose lengths differ by more than 500 words are therefore judged dissimilar.
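The second step reduces to a single comparison. In this sketch the word count is taken as the character count of the text, which is an assumption (reasonable for Chinese text, where length is usually measured in characters); the function name is illustrative.

```python
def second_degree_similar(text_a: str, text_b: str, max_diff: int = 500) -> bool:
    """Not similar only when the lengths differ by more than max_diff."""
    # Assumption: "word number" is measured as the character count.
    return abs(len(text_a) - len(text_b)) <= max_diff
```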
The third step judges by keywords, which adds a further dimension and improves the accuracy of the judgment. The keywords of each document are extracted with the TextRank algorithm, and two cases are distinguished according to the number of keywords extracted. If either document has 3 or fewer keywords, the two documents are not similar when they share fewer than 2 keywords, and otherwise are similar, giving a third-degree similar document. If both documents have more than 3 keywords, the two documents are not similar when they share fewer than 3 keywords, and otherwise are similar, giving a third-degree similar document.
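The two-case keyword rule can be sketched as follows. Keyword extraction itself (TextRank in the description) is not implemented here; the function simply receives the extracted keyword lists, and its name is illustrative.

```python
def third_degree_similar(keywords_a, keywords_b) -> bool:
    """Apply the shared-keyword thresholds from the third step."""
    a, b = set(keywords_a), set(keywords_b)
    shared = len(a & b)
    if len(a) <= 3 or len(b) <= 3:
        # Few keywords in at least one document: 2 shared keywords suffice.
        return shared >= 2
    # Both documents have more than 3 keywords: require 3 shared keywords.
    return shared >= 3
```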
The fourth step judges by data. Some news items are template articles whose content is highly repetitive but whose data differ, and such documents should not be considered similar. If the data value occupancy of the two documents is the same but the data are different, the two documents are judged dissimilar. The data value occupancy is the number of bytes occupied by the data; for example, stock report articles use the same template every day, but the data differ.
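One possible reading of the fourth step is sketched below. The description does not say how the data are extracted, so treating "data" as the numeric tokens found by a regular expression, and "occupancy" as the total byte length of those tokens, are both assumptions; all names are illustrative.

```python
import re

# Assumption: "data" means integer or decimal numbers in the text.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_data(text: str):
    """Return the numeric tokens of a text, in order of appearance."""
    return NUMBER.findall(text)


def data_occupancy(values) -> int:
    """Number of bytes occupied by the extracted data values."""
    return sum(len(v.encode("utf-8")) for v in values)


def fourth_step_similar(text_a: str, text_b: str) -> bool:
    """Dissimilar only when the occupancy matches but the data differ."""
    data_a, data_b = extract_data(text_a), extract_data(text_b)
    same_size = data_occupancy(data_a) == data_occupancy(data_b)
    return not (same_size and data_a != data_b)
```

For two daily reports such as "Index closed at 3120.55, up 1.20%" and "Index closed at 3098.10, up 0.85%", the extracted numbers occupy the same number of bytes but differ in value, so the sketch judges the documents dissimilar.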
When the current document is not similar to the previous document, the current document is stored; when it is similar, it is not stored. Storing the characteristic words and the document data of the current document in the database comprises the following specific steps:
combining the characteristic words and the serial number ID of the current document into a key value of the current document;
and storing the article title, the ID number, the text content, the data source identification and the key value of the current document into a redis database.
In this embodiment, a redis database is used for data storage; redis is well suited to caching large amounts of data and reads very quickly. The article title, ID number, text content, data source identification, and feature word of the document are stored in redis. To ensure uniqueness, the combination of the feature word and the ID is used as the redis key. When previous documents are retrieved from redis to calculate similarity, the feature word can be obtained by extracting the key alone, which is several times faster than storing the feature word as a redis value.
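The key design can be illustrated with a pair of helpers. The key layout used here (16 hexadecimal digits, a colon, then the ID) is an assumption, since the description only says that the feature word and the ID are combined; a plain dictionary would stand in for redis in a quick experiment.

```python
def make_key(feature_word: int, doc_id: str) -> str:
    """Combine the 64-bit feature word and the document ID into one key."""
    # Assumed layout: zero-padded hex feature word, ":", document ID.
    return f"{feature_word:016x}:{doc_id}"


def feature_from_key(key: str) -> int:
    """Recover the feature word from a key without reading the stored value."""
    return int(key.split(":", 1)[0], 16)
```

With this layout, a client only needs to iterate over the keys (for example with the Redis SCAN command) to run the Hamming distance comparison, and fetches a stored document's text only when a later step requires it.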
In summary, the method of the embodiment identifies and judges the similarity between the current document and the stored previous document, stores the dissimilar current document, and does not store the similar current document, thereby implementing real-time fast deduplication processing on similar documents from different sources, and avoiding repeated storage of the similar documents.
The second embodiment is as follows:
the embodiment provides a real-time fast duplicate removal system for a multi-source data document, which is suitable for the real-time fast duplicate removal method for the multi-source data document described in the first embodiment and, as shown in fig. 2, includes:
the data processing unit is used for receiving the current document and filtering the current document to obtain filtered document data;
the computing unit is used for computing the characteristic words of the document data through a locality-sensitive hashing algorithm;
the similarity judging unit is used for judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data;
and the duplicate removal access unit is used for storing the characteristic words and the document data of the current document into the database if the current document is not similar to the previous document, and otherwise, not storing the characteristic words and the document data.
In this embodiment, suppose a user enters a website from a circle of friends (a social feed) to read an article, finds the article good, and forwards it. The background receives the forwarded document as the current document, filters it to remove content such as the website header and footer, and obtains the filtered document data, which contains the article-related content, for example: article title, ID number, text content, data source identification, etc.
And then calculating the feature words according to the filtered document data, wherein the calculating unit is specifically configured to:
segmenting the text content in the document data to obtain a number of words;
counting the weight of each word by a word frequency statistical method;
mapping each word to a hash value by using a hash algorithm;
carrying out weighted calculation on the hash value of each word according to the weight to obtain a weighted digit string;
summing the digit strings of all the words position by position to obtain a final digit string;
converting the final digit string into a 64-bit feature word in 0/1 form.
In this embodiment, the text content is segmented into words, and meaningless words (stop words) are removed. The number of occurrences of each word in the text is counted and divided by the total number of words in the full text to obtain the word frequency, which serves as the weight of the word. Each word is mapped to a hash value by a hash algorithm; for example, the word "robot" is mapped to 11001 (5 bits in this example; the actual implementation uses 64 bits). The hash value of each word is then weighted: if the weight of "robot" is 3, the weighted digit string is 3 3 -3 -3 3 (a 1 in the hash becomes 1x3, a 0 becomes -1x3). After the digit string of each word is computed, the digit strings are summed position by position. For example, if the digit string of "intelligent" is 6 -6 6 6 -6, the digit string obtained by summing the two words "intelligent" and "robot" position by position is:
(3+6) (3-6) (-3+6) (-3+6) (3-6) -> 9 -3 3 3 -3
By analogy, all words in the full text are summed position by position to obtain the final digit string. The final digit string is then converted into a 64-bit feature word in 0/1 form (the rule is 1 for a value greater than 0 and 0 for a value less than 0, giving 10110 in this example).
After the feature word is obtained, the similarity between the current document and the previous document is identified and judged. The similarity judging unit is specifically configured for:
calculating the Hamming distance between the feature word of the current document and the feature word of the previous document, wherein if the Hamming distance is greater than or equal to 3, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and a preliminary similar document is obtained;
calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is more than 500, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained;
extracting keywords of the current document and keywords of the second-degree similar document; when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained; and when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;
and calculating the data value occupancy of the current document and the data value occupancy of the third-degree similar document, wherein if the data value occupancy of the two documents is the same and the data is different, the current document is not similar to the third-degree similar document, and otherwise, the current document is similar to the third-degree similar document.
In this embodiment, the similarity determination is performed sequentially in four steps. If a step determines that the documents are not similar, the subsequent steps are not required; a subsequent step is performed only when the preceding step determines that the documents are similar.
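By way of a hedged sketch, the sequential determination described above may be expressed as a short-circuiting loop over check functions. The name `is_similar` and the representation of each step as a callable that returns True for "similar" are assumptions of this sketch, not part of the embodiment.

```python
from typing import Callable, Iterable

# Hypothetical type: a step takes (current, previous) and returns True if "similar".
Check = Callable[[object, object], bool]


def is_similar(current: object, previous: object, checks: Iterable[Check]) -> bool:
    """Run the steps in order and stop at the first step that reports 'not similar'."""
    for check in checks:
        if not check(current, previous):
            return False  # later steps are skipped
    return True  # every step reported 'similar'
```

In such a sketch the four steps would be passed in the order feature word, word number, keywords, data, so that the cheaper checks filter candidates before the more expensive ones run.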
The first step judges by the feature word. The key value of the previous document is extracted from the redis database, and the feature word of the previous document is obtained from the key value. The Hamming distance between the feature word of the current document and the feature word of the previous document is then calculated. If the Hamming distance is greater than or equal to 3, the current document is not similar to the previous document. If the Hamming distance is less than 3, the current document is similar to the previous document, and a preliminary similar document is obtained. Because similarity judgment based only on the feature word obtained by the locality-sensitive hash algorithm has certain shortcomings, further judgments are carried out subsequently to make the result more accurate.
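As an illustrative sketch of the first step, assuming the feature words are held as equal-length strings of 0s and 1s (the function names are hypothetical):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length 0/1 strings differ."""
    if len(a) != len(b):
        raise ValueError("feature words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def passes_feature_word_check(current: str, previous: str, threshold: int = 3) -> bool:
    """Return True (similar) when the Hamming distance is below the threshold."""
    return hamming_distance(current, previous) < threshold
```

For example, `hamming_distance("10110", "10011")` is 2, so these two shortened feature words would be treated as preliminarily similar under the threshold of 3.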
The second step judges by the word number. The text content of the previous document is extracted from the redis database, and the word numbers of the text content of the two documents are compared. If the word number difference is more than 500, the current document is not similar to the preliminary similar document; otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained. Because a long document has a large number of words and may contain a short document within it, two documents of very different lengths might otherwise be considered similar; therefore, documents whose lengths differ by more than 500 words are judged to be dissimilar.
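A minimal sketch of the second step follows. It assumes that the word number is the character count returned by `len`, which is one plausible reading for Chinese text; the `count` parameter is a hypothetical hook for substituting a different counting rule.

```python
def passes_length_check(current_text: str, candidate_text: str,
                        max_diff: int = 500, count=len) -> bool:
    """Return True (similar) unless the word numbers differ by more than max_diff."""
    # Assumption: `count` defaults to len, i.e. the number of characters.
    return abs(count(current_text) - count(candidate_text)) <= max_diff
```

A difference of exactly 500 still passes, since the text rejects only differences of more than 500.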
The third step judges by keywords; adding the keyword dimension improves the accuracy of the judgment. Keywords of each document are extracted by using the TextRank algorithm, and two cases are distinguished according to the number of extracted keywords. When the number of keywords of either document is less than or equal to 3, if the number of keywords shared by the two documents is less than 2, the two documents are not similar; otherwise, the two documents are similar, and a third-degree similar document is obtained. When the number of keywords of both documents is larger than 3, if the number of keywords shared by the two documents is less than 3, the two documents are not similar; otherwise, the two documents are similar, and a third-degree similar document is obtained.
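The two cases of the third step may be sketched as below. The sketch assumes that the keywords have already been extracted (the TextRank extraction itself is not shown) and that keywords are compared by exact string equality; the function name is hypothetical.

```python
def passes_keyword_check(current_keywords, candidate_keywords) -> bool:
    """Return True (similar) when enough keywords are shared by both documents."""
    current, candidate = set(current_keywords), set(candidate_keywords)
    shared = len(current & candidate)
    if len(current) <= 3 or len(candidate) <= 3:
        required = 2  # few-keyword case: at least 2 shared keywords
    else:
        required = 3  # many-keyword case: at least 3 shared keywords
    return shared >= required
```

For instance, keyword sets {a, b, c} and {a, b, d, e} share 2 keywords and fall in the few-keyword case, so they pass; sets {a, b, c, d} and {a, b, e, f} also share 2 keywords but fall in the many-keyword case, so they do not.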
The fourth step judges by data. Some news items are template articles whose content is highly repetitive but whose data differ; such documents are not considered similar. If the data value occupancy of the two documents is the same but the data is different, the two documents are judged to be dissimilar. The data value occupancy is the number of bytes occupied by the data. For example, stock report articles use the same template every day, but the data differ.
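One possible reading of the fourth step is sketched below. The embodiment does not state how the data are located in the text, so the regular expression that picks out numeric tokens, the UTF-8 byte count used as the data value occupancy, and all function names are assumptions of this sketch.

```python
import re

# Assumption: "data" means numeric tokens such as 3012.45 or 1.20%.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def data_values(text: str):
    """Return the numeric tokens found in the text, in order of appearance."""
    return NUMBER.findall(text)


def data_occupancy(values) -> int:
    """Return the number of bytes occupied by the data values (UTF-8 encoded)."""
    return sum(len(value.encode("utf-8")) for value in values)


def passes_data_check(current_text: str, candidate_text: str) -> bool:
    """Return False (not similar) when occupancy is equal but the data differ."""
    current, candidate = data_values(current_text), data_values(candidate_text)
    same_occupancy = data_occupancy(current) == data_occupancy(candidate)
    return not (same_occupancy and current != candidate)
```

Under these assumptions, "Index closed at 3012.45, up 1.20%" and "Index closed at 2987.10, up 0.85%" each carry 12 bytes of data with different values, so the pair is judged not similar, which matches the daily stock report case described above.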
When the current document is not similar to the previous document, the current document is stored; if the current document is similar to the previous document, the current document is not stored. Storing the characteristic words and the document data of the current document into the database specifically comprises the following steps:
combining the characteristic words and the serial number ID of the current document into a key value of the current document;
and storing the article title, the ID number, the text content, the data source identification and the key value of the current document into a redis database.
In this embodiment, a redis database is used for data storage. The deduplication access unit stores data into redis or extracts data from redis, and redis is used to cache a large amount of data, so the reading speed is very high. The article title, the ID number, the text content, the data source identification and the feature word of the document are stored in redis. To ensure uniqueness, the combination of the feature word and the ID is used as the Key of redis. When a previous document is extracted from redis for the similarity calculation, the feature word can be obtained simply by extracting the Key, which makes access several times faster than storing the feature word as a Value in redis.
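The key composition described above may be illustrated as follows. In this sketch a plain Python dict stands in for the redis store, and the ":" separator, the field names and the function names are hypothetical; the embodiment only states that the feature word and the ID are combined into the Key.

```python
def make_key(feature_word: str, doc_id: str) -> str:
    """Combine the feature word and the document ID into one unique key."""
    # Assumption: ':' is safe as a separator because the feature word holds only 0s and 1s.
    return f"{feature_word}:{doc_id}"


def feature_word_from_key(key: str) -> str:
    """Recover the feature word from a key without reading the stored value."""
    return key.split(":", 1)[0]


def save_document(store: dict, feature_word: str, document: dict) -> str:
    """Store title, ID, text and source under the combined key; return the key."""
    key = make_key(feature_word, document["id"])
    store[key] = document
    return key
```

Because the feature word is part of the key, a comparison against stored documents only needs to enumerate keys and call `feature_word_from_key`, without fetching the stored title or text.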
In summary, the system of this embodiment performs the identification and determination of the similarity between the current document and the stored previous document, stores the current document that is not similar, and does not store the current document that is similar, thereby implementing the real-time fast deduplication processing on the similar documents from different sources, and avoiding the repeated storage of the similar documents.
Example three:
This embodiment provides a computer terminal, which includes a processor and a memory connected to the processor, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first embodiment.
It should be understood that in the present embodiment, the Processor may be a Central Processing Unit (CPU), and the Processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
The computer terminal of this embodiment executes the method of the first embodiment, and performs the identification and determination of the similarity between the current document and the stored previous document, stores the current document that is not similar, and does not store the current document that is similar, thereby implementing the real-time fast deduplication processing of the similar documents from different sources, and avoiding the repeated storage of the similar documents.
Those of ordinary skill in the art will appreciate that the system elements and method steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the above division of elements is merely a logical division, and other divisions may be realized, for example, multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not executed. The units may or may not be physically separate, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (6)

1. A real-time quick duplicate removal method for multi-source data documents is characterized by comprising the following steps:
receiving a current document and filtering the current document to obtain filtered document data;
calculating the characteristic words of the document data through a locality-sensitive hash algorithm;
judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data;
if not, storing the characteristic words and the document data of the current document into a database, otherwise, not storing;
the method comprises the following steps of judging whether a current document is similar to a previous document stored in a database or not according to the characteristic words and the document data, and specifically comprising the following steps:
calculating the hamming distance between the characteristic words of the current document and the characteristic words of the previous document, wherein if the hamming distance is more than or equal to N, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and obtaining a preliminary similar document;
calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is larger than M, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained;
the judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data further comprises:
extracting keywords of the current document and keywords of the second-degree similar document;
when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;
when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;
the judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data further comprises:
and calculating the data value occupancy of the current document and the data value occupancy of the third-degree similar document, wherein if the data value occupancy of the two documents is the same and the data is different, the current document is not similar to the third-degree similar document, and otherwise, the current document is similar to the third-degree similar document.
2. The method for removing the duplicate of the multi-source data document in real time according to claim 1, wherein the characteristic words of the document data are calculated by a locality-sensitive hash algorithm, and the method comprises the following specific steps:
segmenting the text content in the document data to obtain a plurality of words;
counting the weight of each word by a word frequency statistical method;
mapping each word to a hash value by using a hash algorithm;
carrying out weighted calculation on the hash value of each word according to the weight to obtain a weighted digit string;
summing the digit strings of all the words position by position to obtain a final digit string;
converting the final digit string into a 64-bit feature word in 0/1 form.
3. The method according to claim 1, wherein the document data includes article title, ID number, text content and data source identifier, and the storing of the feature word and document data of the current document into the database comprises the following specific steps:
combining the characteristic words and the serial number ID of the current document into a key value of the current document;
and storing the article title, the ID number, the text content, the data source identification and the key value of the current document into a redis database.
4. The method of claim 3, wherein before the similarity comparison between the current document and the previous document, the key value of the previous document is extracted from the database, and the feature word of the previous document is obtained according to the key value.
5. A real-time fast deduplication system for a multi-source data document, comprising:
the data processing unit is used for receiving the current document and filtering the current document to obtain filtered document data;
the computing unit is used for computing the characteristic words of the document data through a locality-sensitive hash algorithm;
the similarity judging unit is used for judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data;
the duplicate removal access unit is used for storing the characteristic words and the document data of the current document into the database if the current document is not similar to the previous document, and otherwise, the characteristic words and the document data are not stored;
the similarity judgment unit is specifically configured to:
calculating the hamming distance between the characteristic words of the current document and the characteristic words of the previous document, wherein if the hamming distance is more than or equal to 3, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and obtaining a preliminary similar document;
calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is more than 500, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained;
extracting keywords of the current document and keywords of the second-degree similar document; when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained; when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;
and calculating the data value occupancy of the current document and the data value occupancy of the third-degree similar document, wherein if the data value occupancy of the two documents is the same and the data is different, the current document is not similar to the third-degree similar document, and otherwise, the current document is similar to the third-degree similar document.
6. A computer terminal comprising a processor and a memory coupled to the processor, the memory for storing a computer program, the computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method according to any one of claims 1-4.
CN201811456999.5A 2018-11-30 2018-11-30 Real-time rapid duplicate removal method and system for multi-source data document Active CN109635084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811456999.5A CN109635084B (en) 2018-11-30 2018-11-30 Real-time rapid duplicate removal method and system for multi-source data document

Publications (2)

Publication Number Publication Date
CN109635084A CN109635084A (en) 2019-04-16
CN109635084B true CN109635084B (en) 2020-11-24

Family

ID=66070616

Country Status (1)

Country Link
CN (1) CN109635084B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant