CN109635084B

CN109635084B - Real-time rapid duplicate removal method and system for multi-source data document

Info

Publication number: CN109635084B
Application number: CN201811456999.5A
Authority: CN
Inventors: 柴志伟; 丑晓慧; 许冠宇; 宋乐安; 许涵洋
Original assignee: Shanghai Deepq Information Technology Co ltd; Ningbo Deepq Information Technology Co ltd
Current assignee: Shanghai Deepq Information Technology Co ltd; Ningbo Deepq Information Technology Co ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2020-11-24
Anticipated expiration: 2038-11-30
Also published as: CN109635084A

Abstract

The invention belongs to the technical field of information processing, and particularly relates to a real-time rapid duplicate removal method and system for multi-source data documents. The method comprises the following steps: receiving a current document and filtering it to obtain filtered document data; calculating the feature word of the document data through a locality-sensitive hash algorithm; judging, according to the feature word and the document data, whether the current document is similar to a previous document stored in the database; and, if not similar, storing the feature word and the document data of the current document into the database, otherwise not storing them. The method and system can perform real-time rapid duplicate removal on similar document data from different sources and avoid repeated storage of similar documents.

Description

Real-time rapid duplicate removal method and system for multi-source data document

Technical Field

The invention belongs to the technical field of information processing, and particularly relates to a real-time and rapid duplicate removal method and system for a multi-source data document.

Background

Document data from different website sources are often forwarded or quoted, so articles with identical content or a high repetition rate exist. In practical applications, such similar articles need to be screened and filtered. The earlier manual editing approach consumes a large amount of labor, and for news data that must be pushed in real time, manual deduplication is far too slow. General deduplication algorithms, in turn, occupy a large amount of memory during online computation and easily cause memory overflow when the data volume is too large; offline computation can avoid the memory overflow but cannot guarantee timely deduplication.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a real-time quick duplicate removal method and a real-time quick duplicate removal system for multi-source data documents, which can carry out real-time quick duplicate removal processing on similar document data from different sources and avoid repeated storage of the similar documents.

In a first aspect, the invention provides a real-time fast duplicate removal method for a multi-source data document, which comprises the following steps:

receiving a current document and filtering the current document to obtain filtered document data;

calculating the characteristic words of the document data through a locality-sensitive hash algorithm;

judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data;

and if not, storing the characteristic words and the document data of the current document into the database, otherwise, not storing.

Preferably, calculating the feature word of the document data through the locality-sensitive hash algorithm specifically includes:

segmenting the text content in the document data to obtain a number of words;

counting the weight of each word by a word frequency statistical method;

mapping each word to a Hash value by using a Hash algorithm;

weighting the hash value of each word according to its weight to obtain a weighted digit string;

summing the digit strings of all the words position by position to obtain a final digit string;

converting the final digit string into a 64-bit feature word in 0/1 form.
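The steps above can be written compactly as a formula. This is a sketch under assumed notation that does not appear in the original text: $w_i$ is the weight of word $i$, $h_{i,j} \in \{0,1\}$ is bit $j$ of its hash value, $V_j$ is position $j$ of the final digit string, and $f_j$ is bit $j$ of the feature word. The text only specifies the cases "greater than 0" and "less than 0"; mapping a sum of exactly 0 to 0 is an assumption.

```latex
V_j = \sum_{i} w_i \,\bigl(2\,h_{i,j} - 1\bigr),
\qquad
f_j =
\begin{cases}
1, & V_j > 0,\\
0, & \text{otherwise},
\end{cases}
\qquad j = 1, \dots, 64 .
```

A hash bit of 1 therefore contributes $+w_i$ and a hash bit of 0 contributes $-w_i$ to the corresponding position.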

Preferably, the determining whether the current document is similar to a previous document stored in the database according to the feature words and the document data includes the specific steps of:

calculating the hamming distance between the characteristic words of the current document and the characteristic words of the previous document, wherein if the hamming distance is more than or equal to N, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and obtaining a preliminary similar document;

and calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is larger than M, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained.

Preferably, the determining whether the current document is similar to a previous document stored in the database according to the feature words and the document data further includes:

extracting keywords of the current document and keywords of the second-degree similar document;

when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;

and when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained.

and calculating the data value occupancy of the current document and the data value occupancy of the three-degree similar document, wherein if the data value occupancy of the two documents is the same and the data is different, the current document is not similar to the three-degree similar document, and otherwise, the current document is similar to the three-degree similar document.

Preferably, the document data includes article titles, ID numbers, text contents, and data source identifiers, and the storing of the feature words and the document data of the current document in the database includes the specific steps of:

combining the characteristic words and the serial number ID of the current document into a key value of the current document;

and storing the article title, the ID number, the text content, the data source identification and the key value of the current document into a redis database.

Preferably, before the similarity comparison between the current document and the previous document is carried out, the key value of the previous document is extracted from the database, and the characteristic words of the previous document are obtained according to the key value.

In a second aspect, the present invention provides a real-time fast duplicate removal system for a multi-source data document, which is suitable for the real-time fast duplicate removal method for the multi-source data document described in the first aspect, and includes:

the data processing unit is used for receiving the current document and filtering the current document to obtain filtered document data;

the computing unit is used for computing the characteristic words of the document data through a locality-sensitive hash algorithm;

the similarity judging unit is used for judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data;

and the duplicate removal access unit is used for storing the characteristic words and the document data of the current document into the database if the current document is not similar to the previous document, and otherwise, not storing the characteristic words and the document data.

Preferably, the similarity determination unit is specifically configured to:

calculating the hamming distance between the characteristic words of the current document and the characteristic words of the previous document, wherein if the hamming distance is more than or equal to 3, the current document is not similar to the previous document, otherwise, the current document is similar to the previous document, and obtaining a preliminary similar document;

calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is more than 500, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained;

extracting keywords of the current document and keywords of the second-degree similar document; when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained; when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;

In a third aspect, the present invention provides a computer terminal comprising a processor and a memory coupled to the processor, the memory being configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method according to the first aspect.

According to the technical scheme, the similarity of the current document and the stored previous document is identified and judged, the dissimilar current document is stored, and the similar current document is not stored, so that the real-time rapid duplicate removal processing of the similar documents from different sources is realized, and the repeated storage of the similar documents is avoided.

Drawings

In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.

FIG. 1 is a schematic flow chart of a real-time fast deduplication method for a multi-source data document in the present embodiment;

FIG. 2 is a schematic structural diagram of a real-time fast deduplication system for multi-source data documents in this embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

The first embodiment is as follows:

This embodiment provides a real-time rapid duplicate removal method for a multi-source data document, as shown in FIG. 1, comprising the following steps:

S1, receiving the current document and filtering the current document to obtain the filtered document data;

S2, calculating the feature word of the document data through a locality-sensitive hash algorithm;

S3, judging whether the current document is similar to a previous document stored in the database according to the feature word and the document data;

S4, if not similar, storing the feature word and the document data of the current document into the database; otherwise, not storing them.

In this embodiment, suppose a user opens a website from a circle of friends (a social feed) to read an article, finds the article worthwhile, and forwards it. The background system receives the forwarded current document and filters it, removing content such as the website header and the website footer, to obtain the filtered document data. The document data includes the content related to the article, for example: article title, ID number, text content, data source identification, etc.

The feature word is then calculated from the filtered document data. In S2, the feature word of the document data is calculated through the locality-sensitive hash algorithm, and the specific steps are as follows:

S21, segmenting the text content in the document data to obtain a number of words;

S22, counting the weight of each word by a word frequency statistical method;

S23, mapping each word to a Hash value by using a Hash algorithm;

S24, weighting the hash value of each word according to its weight to obtain a weighted digit string;

S25, summing the digit strings of all the words position by position to obtain the final digit string;

S26, converting the final digit string into a 64-bit feature word in 0/1 form.

In this embodiment, the text content is segmented into words, and meaningless words (stop words) are removed. The number of occurrences of each word in the text is counted and divided by the total number of words in the full text to obtain the word frequency, which serves as the weight of the word. Each word is mapped to a Hash value using a Hash algorithm; for example, the word "robot" is mapped to 11001 (5 bits in this example; 64 bits are used in practice). The Hash value of each word is then weighted according to its weight: a hash bit of 1 contributes +weight and a hash bit of 0 contributes -weight. If the weight of "robot" is 3, its weighted digit string is 3 3 -3 -3 3. After the digit string of each word is computed, the digit strings are summed position by position. For example, if the digit string of "intelligent" is 6 -6 6 6 -6, the digit string obtained by summing the two words of "intelligent robot" position by position is:

(3+6) (3-6) (-3+6) (-3+6) (3-6) -> 9 -3 3 3 -3

By analogy, all the words in the full text are summed position by position to obtain the final digit string. The final digit string is then converted into a 64-bit feature word in 0/1 form (a position greater than 0 becomes 1 and a position less than 0 becomes 0; the example above yields 10110).
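The calculation described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, MD5 is an assumed choice of hash (the text does not name one), and a position sum of exactly 0 is mapped to 0 as an assumption. The toy run reuses the 5-bit example, taking 11001 for "robot" from the text and inferring 10110 for "intelligent" from its digit string 6 -6 6 6 -6.

```python
import hashlib
from collections import Counter


def word_hash(word, bits=64):
    """Map a word to a `bits`-bit integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def term_frequency_weights(words):
    """Weight of each word = occurrences / total number of words."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}


def feature_word(weights, bits=64, hash_fn=word_hash):
    """Compute the 0/1 feature word from a {word: weight} mapping."""
    totals = [0.0] * bits
    for word, weight in weights.items():
        h = hash_fn(word, bits)
        for pos in range(bits):
            # Bit `pos`, counted from the most significant bit.
            bit = (h >> (bits - 1 - pos)) & 1
            # A 1 bit adds the weight, a 0 bit subtracts it.
            totals[pos] += weight if bit else -weight
    # Positive sums become 1; everything else becomes 0 (assumption for 0).
    return "".join("1" if t > 0 else "0" for t in totals)


# Reproduce the 5-bit worked example with fixed toy hash values.
toy_hashes = {"robot": 0b11001, "intelligent": 0b10110}
toy = feature_word(
    {"robot": 3, "intelligent": 6},
    bits=5,
    hash_fn=lambda word, bits: toy_hashes[word],
)
print(toy)  # prints 10110
```

For a real document, `feature_word(term_frequency_weights(words))` would return a 64-character string of 0s and 1s, where `words` is the segmented text with stop words removed.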

After the characteristic words are obtained, the similarity between the current document and the previous document is identified and judged, wherein in S3, whether the current document is similar to the previous document stored in the database is judged according to the characteristic words and the document data, and the specific steps are as follows:

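S31, calculating the Hamming distance between the feature word of the current document and the feature word of the previous document; if the Hamming distance is greater than or equal to 3, the current document is not similar to the previous document; otherwise, the current document is similar to the previous document, and a preliminary similar document is obtained;

S32, calculating the word number difference between the text content of the current document and the text content of the preliminary similar document; if the word number difference is greater than 500, the current document is not similar to the preliminary similar document; otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained;

S33, extracting keywords of the current document and keywords of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise it is similar, and a third-degree similar document is obtained; when the numbers of keywords of both documents are greater than 3, if the number of identical keywords is less than 3, the current document is not similar to the second-degree similar document, otherwise it is similar, and a third-degree similar document is obtained;

S34, calculating the data value occupancy of the current document and the data value occupancy of the third-degree similar document; if the data value occupancy of the two documents is the same but the data are different, the current document is not similar to the third-degree similar document; otherwise, the two documents are similar.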

In this embodiment, the similarity judgment is performed sequentially in four steps. If a step judges the documents to be not similar, the subsequent steps are not performed; the next step is performed only when the previous step judges the documents to be similar.
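The sequential, stop-at-first-failure behaviour can be sketched as a chain of checks. The function and variable names below are hypothetical, and the two toy checks only record that they were called; they stand in for the four judgments of this embodiment.

```python
from typing import Callable, Sequence

# A check takes (current, previous) and returns True when they look similar.
Check = Callable[[object, object], bool]


def is_similar(current, previous, checks: Sequence[Check]) -> bool:
    """Return True only if every check judges the two documents similar.

    all() stops at the first check that returns False, so later checks
    are skipped once the documents have been judged not similar.
    """
    return all(check(current, previous) for check in checks)


# Toy usage: record which checks actually ran.
calls = []


def passes(a, b):
    calls.append("passes")
    return True


def fails(a, b):
    calls.append("fails")
    return False


result = is_similar("doc A", "doc B", [passes, fails, passes])
print(result, calls)  # prints False ['passes', 'fails']
```

The third check never runs because the second one already judged the documents not similar.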

The first step judges by the feature word. The key value of the previous document is extracted from the redis database, and the feature word of the previous document is obtained from the key value. The Hamming distance between the feature word of the current document and the feature word of the previous document is calculated. If the Hamming distance is greater than or equal to 3, the current document is not similar to the previous document. If the Hamming distance is less than 3, the current document is similar to the previous document, and a preliminary similar document is obtained. Because similarity judgment based on the feature word obtained by the locality-sensitive hash algorithm has certain shortcomings, further judgments are carried out subsequently to make the result more accurate.
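The Hamming distance test of this first step can be sketched as follows. Representing feature words as equal-length 0/1 strings is an assumption, and the function names are hypothetical; the threshold of 3 comes from the text.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length 0/1 strings differ."""
    if len(a) != len(b):
        raise ValueError("feature words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_distance_int(a: int, b: int) -> int:
    """Same distance for feature words held as integers (XOR, count 1 bits)."""
    return bin(a ^ b).count("1")


def passes_feature_check(a: str, b: str, n: int = 3) -> bool:
    """Similar when the distance is less than n; not similar when >= n."""
    return hamming_distance(a, b) < n


print(hamming_distance("10110", "10011"))      # prints 2
print(passes_feature_check("10110", "10011"))  # prints True
print(passes_feature_check("10110", "01010"))  # prints False (distance is 3)
```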

The second step judges by the word count. The text content of the previous document is extracted from the redis database, and the word counts of the text contents of the two documents are compared. If the word number difference is greater than 500, the current document is not similar to the preliminary similar document; otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained. Because a long document contains many words and may contain a short document within it, two documents of very different lengths could otherwise be judged similar; therefore, documents whose lengths differ by more than 500 words are judged to be dissimilar.
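A minimal sketch of the word count test follows. How words are counted is not specified in the text; the sketch simply takes the length of whatever sequence it is given (a list of segmented words, or a string if characters are to be counted), which is an assumption, as is the function name.

```python
def passes_length_check(words_a, words_b, m: int = 500) -> bool:
    """Similar unless the word counts differ by more than m."""
    return abs(len(words_a) - len(words_b)) <= m


long_doc = ["w"] * 1200
short_doc = ["w"] * 700
print(passes_length_check(long_doc, short_doc))          # prints True (difference is 500)
print(passes_length_check(long_doc + ["w"], short_doc))  # prints False (difference is 501)
```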

The third step judges by keywords; adding the keyword dimension improves the accuracy of the judgment. Keywords of each document are extracted with the TextRank algorithm, and two cases are distinguished according to the number of extracted keywords. When either document has 3 or fewer keywords: if the two documents share fewer than 2 identical keywords, they are not similar; otherwise they are similar, and a third-degree similar document is obtained. When both documents have more than 3 keywords: if the two documents share fewer than 3 identical keywords, they are not similar; otherwise they are similar, and a third-degree similar document is obtained.
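The two-case keyword rule can be sketched as below. Keyword extraction itself (TextRank in the text) is not shown; the sketch assumes the keywords of each document are already available as collections of strings, and the function name is hypothetical.

```python
def passes_keyword_check(keywords_a, keywords_b) -> bool:
    """Apply the shared-keyword thresholds described in the text."""
    a, b = set(keywords_a), set(keywords_b)
    shared = len(a & b)
    if len(a) <= 3 or len(b) <= 3:
        # Few keywords in at least one document: need at least 2 in common.
        return shared >= 2
    # Both documents have more than 3 keywords: need at least 3 in common.
    return shared >= 3


print(passes_keyword_check(["a", "b", "c"], ["a", "b", "x", "y", "z"]))  # prints True
print(passes_keyword_check(["a", "b", "c", "d"], ["a", "b", "x", "y"]))  # prints False
```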

The fourth step judges by data. Some news data are template articles whose content is highly repetitive but whose data differ; such documents are not considered similar. If the data value occupancy of the two documents is the same but the data are different, the two documents are judged to be dissimilar. The data value occupancy is the number of bytes occupied by the data; stock report articles, for example, use the same template every day but contain different data.
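One possible reading of this data check is sketched below. The text does not say how the data are located in a document, so treating decimal numbers found by a regular expression as "the data" is purely an assumption, as are all function names and the sample sentences.

```python
import re

# Assumption: "data" means decimal numbers such as 3120.55 or 12.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def data_values(text: str):
    """Return the numeric values of a text, in order, as strings."""
    return NUMBER.findall(text)


def data_occupancy(text: str) -> int:
    """Total number of bytes occupied by the numeric values."""
    return sum(len(v.encode("utf-8")) for v in data_values(text))


def passes_data_check(text_a: str, text_b: str) -> bool:
    """Not similar when occupancy is equal but the values differ."""
    same_occupancy = data_occupancy(text_a) == data_occupancy(text_b)
    different_data = data_values(text_a) != data_values(text_b)
    return not (same_occupancy and different_data)


monday = "The index closed at 3120.55, up 1.20% on the day."
tuesday = "The index closed at 3098.12, up 0.85% on the day."
print(passes_data_check(monday, tuesday))  # prints False
print(passes_data_check(monday, monday))   # prints True
```

The two sample reports share a template and the same number of data bytes, but their figures differ, so they are judged not similar and both would be kept.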

When the current document is not similar to any previous document, the current document is stored; if it is similar to a previous document, it is not stored. Storing the feature word and the document data of the current document into the database proceeds as follows:

In this embodiment, a redis database is used for data storage; redis is suited to caching a large amount of data, and its read speed is very high. The article title, the ID number, the text content, the data source identification and the feature word of the document are stored in redis. To ensure uniqueness, the combination of the feature word and the ID is used as the Key of redis. When a previous document is fetched from redis for the similarity calculation, the feature word can be obtained by extracting only the Key, which is several times faster than storing the feature word as a Value in redis.
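The key scheme can be illustrated without a redis server. The sketch below uses a plain Python dict as a stand-in for redis; the ":" separator, the field names and the function names are assumptions that do not come from the text.

```python
def make_key(feature_word: str, doc_id: str) -> str:
    """Combine the feature word and the document ID into one unique key."""
    return f"{feature_word}:{doc_id}"


def feature_from_key(key: str) -> str:
    """Recover the feature word from a key without reading the value."""
    # The feature word contains only 0 and 1, so the first ":" ends it.
    return key.split(":", 1)[0]


def save_document(store: dict, feature_word: str, doc: dict) -> None:
    """Store the document fields under the combined key."""
    store[make_key(feature_word, doc["id"])] = doc


def stored_feature_words(store: dict):
    """List the feature words of all stored documents from the keys alone."""
    return [feature_from_key(key) for key in store]


store = {}  # stand-in for the redis database
save_document(store, "10110", {
    "id": "42",
    "title": "Sample title",
    "text": "Sample text content",
    "source": "site-a",
})
print(stored_feature_words(store))  # prints ['10110']
```

With a real redis server the keys would be enumerated by a key-iteration command such as SCAN, so the first similarity step would not need to read any stored values.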

In summary, the method of the embodiment identifies and judges the similarity between the current document and the stored previous document, stores the dissimilar current document, and does not store the similar current document, thereby implementing real-time fast deduplication processing on similar documents from different sources, and avoiding repeated storage of the similar documents.

The second embodiment is as follows:

This embodiment provides a real-time rapid duplicate removal system for a multi-source data document, which is suitable for the real-time rapid duplicate removal method for the multi-source data document described in the first embodiment. As shown in FIG. 2, the system includes:

And then calculating the feature words according to the filtered document data, wherein the calculating unit is specifically configured to:

counting the weight of each word by a word frequency statistical method;

mapping each word to a Hash value by using a Hash algorithm;

(3+6) (3-6) (-3+6) (-3+6) (3-6)->9 -3 3 3 -3

After the feature words are obtained, identifying and judging the similarity between the current document and the previous document, wherein the similarity judging unit is specifically configured to:

and calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is more than 500, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained.

Extracting keywords of the current document and keywords of the second-degree similar document; when the number of the keywords of the current document or the number of the keywords of the second-degree similar document is less than or equal to 3, if the number of the same keywords is less than 2, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained; and when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained.

When the current document is not similar to any previous document, the current document is stored; if it is similar to a previous document, it is not stored. Storing the feature word and the document data of the current document into the database specifically proceeds as follows:

In this embodiment, a redis database is used for data storage, and the deduplication access unit stores data into redis or extracts data from redis; redis is suited to caching a large amount of data, and its read speed is very high. The article title, the ID number, the text content, the data source identification and the feature word of the document are stored in redis. To ensure uniqueness, the combination of the feature word and the ID is used as the Key of redis. When a previous document is fetched from redis for the similarity calculation, the feature word can be obtained by extracting only the Key, which is several times faster than storing the feature word as a Value in redis.

In summary, the system of this embodiment performs the identification and determination of the similarity between the current document and the stored previous document, stores the current document that is not similar, and does not store the current document that is similar, thereby implementing the real-time fast deduplication processing on the similar documents from different sources, and avoiding the repeated storage of the similar documents.

The third embodiment is as follows:

This embodiment provides a computer terminal, which includes a processor and a memory connected to the processor, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first embodiment.

It should be understood that in the present embodiment, the Processor may be a Central Processing Unit (CPU), and the Processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.

The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. The portion of memory may also include non-volatile random access memory. For example, the memory may also store device type information.

The computer terminal of this embodiment executes the method of the first embodiment, and performs the identification and determination of the similarity between the current document and the stored previous document, stores the current document that is not similar, and does not store the current document that is similar, thereby implementing the real-time fast deduplication processing of the similar documents from different sources, and avoiding the repeated storage of the similar documents.

Those of ordinary skill in the art will appreciate that the system elements and method steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the above division of elements is merely a logical division, and other divisions may be realized, for example, multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not executed. The units may or may not be physically separate, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims

1. A real-time quick duplicate removal method for multi-source data documents is characterized by comprising the following steps:

if not, storing the characteristic words and the document data of the current document into a database, otherwise, not storing;

the method comprises the following steps of judging whether a current document is similar to a previous document stored in a database or not according to the characteristic words and the document data, and specifically comprising the following steps:

calculating the word number difference between the text content of the current document and the text content of the preliminary similar document, wherein if the word number difference is larger than M, the current document is not similar to the preliminary similar document, otherwise, the current document is similar to the preliminary similar document, and a second-degree similar document is obtained;

the judging whether the current document is similar to the previous document stored in the database or not according to the characteristic words and the document data further comprises:

when the number of the keywords of the current document and the number of the keywords of the second-degree similar document are both larger than 3, if the number of the same keywords is smaller than 3, the current document is not similar to the second-degree similar document, otherwise, the current document is similar to the second-degree similar document, and a third-degree similar document is obtained;

2. The method for removing the duplicate of the multi-source data document in real time according to claim 1, wherein the characteristic words of the document data are calculated by a locality-sensitive hash algorithm, and the method comprises the following specific steps:

counting the weight of each word by a word frequency statistical method;

mapping each word to a Hash value by using a Hash algorithm;

3. The method according to claim 1, wherein the document data includes article title, ID number, text content and data source identifier, and the storing of the feature word and document data of the current document into the database comprises the following specific steps:

4. The method of claim 3, wherein before the similarity comparison between the current document and the previous document, the key value of the previous document is extracted from the database, and the feature word of the previous document is obtained according to the key value.

5. A real-time fast deduplication system for a multi-source data document, comprising:

the duplicate removal access unit is used for storing the characteristic words and the document data of the current document into the database if the current document is not similar to the previous document, and otherwise, the characteristic words and the document data are not stored;

the similarity judgment unit is specifically configured to:

6. A computer terminal comprising a processor and a memory coupled to the processor, the memory for storing a computer program, the computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method according to any one of claims 1-4.