CN110909019A - Big data duplicate checking method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN110909019A
CN110909019A
Authority
CN
China
Prior art keywords
data
group
value
checked
similarity
Prior art date
Legal status
Granted
Application number
CN201911115294.1A
Other languages
Chinese (zh)
Other versions
CN110909019B (en)
Inventor
林必毅
熊俊杰
宋梦培
朱吉山
袁爱钧
李颖
杨瑞
李靖
Current Assignee
Hunan Saiji Smart City Construction Management Co Ltd
Original Assignee
Hunan Saiji Smart City Construction Management Co Ltd
Priority date
Filing date
Publication date
Application filed by Hunan Saiji Smart City Construction Management Co Ltd
Priority to CN201911115294.1A
Publication of CN110909019A
Application granted
Publication of CN110909019B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • G06F16/244Grouping and aggregation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/248Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a big data duplicate checking method and device, computer equipment and a storage medium. The method comprises: acquiring data to be checked for duplication; preprocessing the data to obtain a Simhash value; constructing groups according to the bits of the Simhash value that need to be confirmed as duplicated, and classifying the data into the groups to obtain the number of data in each group; when the number of data in the groups meets a preset condition, removing the group with the largest amount of data to obtain target groups; calculating a similarity comparison value for the data in the target groups to obtain a similarity value; judging whether the similarity value exceeds a preset threshold; if not, performing data copying and group subdivision on the removed group with the largest amount of data to obtain the number of data in each subdivided group, and updating the number of data in the groups accordingly; and when the number of data in the groups does not meet a termination condition, returning to the judgment of whether the number of data in the groups meets the preset condition. The invention has the advantages of a small data processing amount and high calculation efficiency.

Description

Big data duplicate checking method and device, computer equipment and storage medium
Technical Field
The invention relates to a data processing method, in particular to a big data duplicate checking method, a big data duplicate checking device, computer equipment and a storage medium.
Background
The big data era has arrived, and big data processing technology is becoming more and more important. A considerable part of the mass data stored in databases is duplicated, and duplicate data not only slows data analysis and processing but also affects accuracy to a certain extent, so data duplicate checking is necessary work. The Simhash algorithm is currently a main method for finding duplicate data; it is the fingerprint generation algorithm presented in the paper "Detecting Near-Duplicates for Web Crawling" published by Google in 2007 and is applied in the webpage deduplication work of the Google search engine. In brief, the Simhash algorithm reduces the dimensionality of a text to generate a Simhash value, that is, a fingerprint, and compares the Hamming distance between the Simhash values of two texts to judge their similarity.
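As a quick illustration of that comparison step, the Python sketch below (not drawn from the patent text itself) computes the Hamming distance between two fingerprints; the 32-bit width and the cut-off of 3 differing bits are illustrative assumptions only.

def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions in which two Simhash fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def is_near_duplicate(fp_a: int, fp_b: int, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates when their fingerprints differ in
    at most max_distance bits (3 is a commonly used cut-off, not a value
    fixed by this document)."""
    return hamming_distance(fp_a, fp_b) <= max_distance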
However, with traditional Simhash retrieval, when one piece of data is read it must be compared against all stored data to find entries sharing an identical 8-bit segment before similarity calculation is performed. The number of comparisons is huge when a large amount of data is retrieved, and if the Simhash values are unevenly distributed the running time becomes very long, so the traditional approach suffers from a large data processing amount, long calculation time and low calculation efficiency.
Therefore, it is necessary to design a new method that achieves a small data processing amount, short calculation time and high calculation efficiency in the data duplicate checking process.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a big data duplicate checking method, a big data duplicate checking device, computer equipment and a storage medium.
In order to achieve the purpose, the invention adopts the following technical scheme: the big data duplicate checking method comprises the following steps:
acquiring data to be checked;
preprocessing the data needing to be checked to obtain a Simhash value;
performing group construction according to a numerical value needing to be confirmed to be repeated in the Simhash value to obtain a group, and classifying the data needing to be checked into the group to obtain the quantity of the data in the group;
judging whether the quantity of the data in the group meets a preset condition or not;
if the number of the data in the group meets a preset condition, rejecting the group with the largest number of the data in the group to obtain a target group;
calculating a similarity comparison value for the data in the target group to obtain a similarity value;
judging whether the similarity value exceeds a preset threshold value or not;
if the similarity value does not exceed a preset threshold value, performing data copying and group subdivision on the removed group with the largest amount of data to obtain the number of data in each subdivided group, and updating the number of data in the groups according to the number of data in each subdivided group;
judging whether the quantity of the data in the group meets a termination condition;
and if the number of the data in the group does not meet the termination condition, returning to the judgment of whether the number of the data in the group meets the preset condition or not.
The further technical scheme is as follows: after judging whether the similarity value exceeds a preset threshold value, the method further comprises:
and if the similarity value exceeds a preset threshold value, outputting a notification that the data to be checked contains duplicates to a terminal for display.
The further technical scheme is as follows: after the step of judging whether the number of the data in the group meets the termination condition, the method further comprises the following steps:
and if the number of the data in the group meets the termination condition, outputting a notice that the data needing to be checked are not repeated to a terminal for displaying.
The further technical scheme is as follows: the preprocessing the data to be checked to obtain the Simhash value comprises the following steps:
performing word segmentation on data needing to be checked to obtain single data;
acquiring a characteristic value of single data;
and carrying out hash value calculation on the characteristic value of the single data to obtain a Simhash value.
The further technical scheme is as follows: the preset condition comprises that the variance of the data quantity in the group is larger than a variance threshold value or the percentage of the total occupied by the data quantity in the group exceeds a percentage threshold value.
The further technical scheme is as follows: the data copying and group subdivision processing of the packet with the largest number of removed data in the group is performed to obtain the number of subdivided data in each group, and the number of data in the group is updated according to the number of subdivided data in each group, and the method comprises the following steps:
making multiple data copies for the packet with the largest quantity of data in the removed group to obtain a copied packet;
subdividing the copied groups to obtain the number of data in each subdivided group, and updating the number of data in each subdivided group according to the number of data in each subdivided group;
wherein the copied packets are subdivided into k packets,
k = 2^⌊n/i⌋ for each of the first i−1 copies, and k = 2^(n−(i−1)·⌊n/i⌋) for the last copy,
n denotes the total number of bits and i denotes the number of copies of the data.
The further technical scheme is as follows: the termination condition comprises that the data percentage of any packet is not more than 50%, or the number of bits of the copied residual data in the packet with the largest number of data in the rejected group is less than 8, and the percentage of the data amount in the group of the packet with the largest number of data in the group after the group subdivision of the copied packet is performed to the packet with the largest number of data in the group to the data amount of the data to be checked is less than a certain set percentage, or the number of bits of the copied residual data in the packet with the largest number of data in the rejected group is not more than 3.
The invention also provides a big data duplicate checking device, which comprises:
the data acquisition unit is used for acquiring data needing to be checked for duplication;
the preprocessing unit is used for preprocessing the data needing to be checked for duplication to obtain a Simhash value;
the group construction unit is used for carrying out group construction according to the numerical value needing to be confirmed to be repeated in the Simhash value to obtain a group, and classifying the data needing to be checked to be repeated into the group to obtain the quantity of the data in the group;
the quantity judging unit is used for judging whether the quantity of the data in the group meets a preset condition or not;
the rejecting unit is used for rejecting the group with the largest number of data in the group to obtain a target group if the number of the data in the group meets a preset condition;
the similarity calculation unit is used for calculating a similarity comparison value for the data in the target group to obtain a similarity value;
the similarity judging unit is used for judging whether the similarity value exceeds a preset threshold value or not;
the group subdivision unit is used for copying and subdividing the data of the group with the largest quantity of the rejected data in the group to obtain the subdivided data quantity in each group and updating the data quantity in the group according to the subdivided data quantity in each group if the similarity value does not exceed a preset threshold value;
a termination judgment unit configured to judge whether the number of the data in the group satisfies a termination condition; and if the number of the data in the group does not meet the termination condition, returning to the judgment of whether the number of the data in the group meets the preset condition or not.
The invention also provides computer equipment which comprises a memory and a processor, wherein the memory is stored with a computer program, and the processor realizes the method when executing the computer program.
The invention also provides a storage medium storing a computer program which, when executed by a processor, is operable to carry out the method as described above.
Compared with the prior art, the invention has the following beneficial effects: the Simhash value is obtained first, group construction is then carried out, and data copying follows; during data copying, the groups holding an excessive amount of data are removed when the distribution is uneven, similarity calculation is performed on the remaining groups, and when the similarity value of the remaining groups does not exceed a preset threshold, another retrieval operation is performed on the removed groups. The stage of comparing whether data are duplicated is skipped and similarity is calculated directly, achieving a small data processing amount, short calculation time and high calculation efficiency in the data duplicate checking process.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a big data duplicate checking method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a big data duplicate checking method according to an embodiment of the present invention;
FIG. 3 is a sub-flow diagram of a big data duplicate checking method according to an embodiment of the present invention;
fig. 4 is a sub-flow diagram of a big data duplicate checking method according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a big data duplicate checking device provided by an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a preprocessing unit of a big data duplicate checking device provided by an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a group subdivision unit of a big data duplication checking device provided by an embodiment of the present invention;
FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a big data duplicate checking method according to an embodiment of the present invention. Fig. 2 is a schematic flowchart of a big data duplicate checking method according to an embodiment of the present invention. The big data duplicate checking method is applied to the server. The server performs data interaction with the terminal, acquires original data needing to be subjected to duplicate checking from the terminal, performs duplicate checking processing on the original data by combining with an improved Simhash algorithm to obtain a duplicate checking result, and sends the duplicate checking result to the terminal for displaying.
Fig. 2 is a schematic flow chart of a big data duplicate checking method provided by an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S220.
And S110, acquiring data needing to be checked for duplication.
In this embodiment, the data to be checked for duplication refers to the original data from the terminal that requires duplicate checking, generally documents input by a user through the terminal or documents obtained by the terminal from platforms over a network, for example papers, documents and patent documents input by the user.
And S120, preprocessing the data needing to be checked to obtain a Simhash value.
In this embodiment, the Simhash value refers to the fingerprint of the data to be checked for duplication and is the parameter used to measure similarity.
In an embodiment, referring to fig. 3, the step S120 may include steps S121 to S123.
S121, performing word segmentation and division on data needing to be subjected to duplicate checking to obtain single data;
and S122, acquiring characteristic values of the single data.
Keywords are extracted from the data to be checked for duplication, which includes word segmentation and weight calculation; n keyword-and-weight pairs are extracted, for example several (feature, weight) pairs, denoted feature_weight_pairs = [fw1, fw2, ..., fwn], where fwn = (feature_n, weight_n).
In this embodiment, the single data refers to a keyword, and the feature value of the single data refers to a weight corresponding to the keyword.
And S123, carrying out hash value calculation on the characteristic value of the single data to obtain a Simhash value.
Hash calculation is performed on the feature value of each keyword and weight in feature_weight_pairs to obtain the Simhash value.
And determining the Simhash value of each single datum through the hash value of the characteristic value of each single datum, wherein the finally obtained Simhash value is a 32-bit binary string.
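To make the preprocessing concrete, here is a minimal Python sketch of the standard Simhash construction from (feature, weight) pairs; the 32-bit width follows the statement above, while the use of MD5 as the per-feature hash is an illustrative assumption rather than a detail taken from this document.

import hashlib

FINGERPRINT_BITS = 32  # the embodiment above yields a 32-bit binary string


def simhash(feature_weight_pairs):
    """Build a Simhash fingerprint from (feature, weight) pairs such as the
    feature_weight_pairs = [fw1, ..., fwn] described above."""
    vector = [0.0] * FINGERPRINT_BITS
    for feature, weight in feature_weight_pairs:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")   # 32 bits of the feature hash
        for bit in range(FINGERPRINT_BITS):
            # Weighted vote of each feature on each bit position.
            vector[bit] += weight if (h >> bit) & 1 else -weight
    # Dimension reduction: positive components become 1, the rest 0.
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if vector[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

For example, simhash([("duplicate", 3.0), ("checking", 2.0)]) returns an integer whose binary form is the 32-bit fingerprint.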
S130, group construction is carried out according to numerical values needing to be confirmed to be repeated in the Simhash value to obtain groups, and the data needing to be checked to be repeated are classified into the groups to obtain the number of the data in the groups.
In the present embodiment, the number of data in a group refers to the total amount of data contained in a single group.
First, the 8 bits of the Simhash value that need to be confirmed as duplicated are used to construct 2^8 = 256 groups, which can be built in sequence; after construction, all the data to be checked are classified into the 256 groups. In this way the stage of comparing whether data are duplicated can be skipped during retrieval and similarity can be calculated directly, which saves a large amount of time.
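A sketch of this grouping step, assuming (purely for illustration) that the 8 bits used for confirmation are the top 8 bits of the 32-bit fingerprint:

from collections import defaultdict


def build_groups(fingerprints, segment_shift=24, segment_bits=8):
    """Classify 32-bit fingerprints into 2**segment_bits groups keyed by one
    8-bit segment; which 8 bits are used is an assumption here (bits 24-31)."""
    groups = defaultdict(list)
    mask = (1 << segment_bits) - 1
    for fp in fingerprints:
        groups[(fp >> segment_shift) & mask].append(fp)
    return groups


# "The number of data in the group" is then simply:
# group_counts = {key: len(members) for key, members in groups.items()}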
And S140, judging whether the number of the data in the group meets a preset condition.
In this embodiment, the preset condition includes that the variance of the number of data in the group is greater than a variance threshold or that the percentage of the total number occupied by the number of data in the group exceeds a percentage threshold.
For any data-set copy of the data to be checked, the amount of data in the 256 groups is inspected; if the distribution is uneven, the one or two groups with the largest amount of data are rejected. There are two ways to judge unevenness: (1) calculate the variance of the data amounts of the 256 groups and, if the variance is larger than a variance threshold, reject the one or two groups containing the most data; (2) directly reject any group whose share of the total data exceeds a certain percentage threshold.
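The two unevenness criteria might look as follows in Python; the variance and percentage thresholds are free parameters that the text leaves to the implementer.

from statistics import pvariance


def groups_to_reject(group_counts, variance_threshold, share_threshold):
    """Return the keys of the one or two groups to reject, or an empty list
    when the distribution is judged even.  group_counts maps a group key to
    the number of items it holds; both thresholds are tunable assumptions."""
    counts = list(group_counts.values())
    total = sum(counts)
    largest_first = sorted(group_counts, key=group_counts.get, reverse=True)
    # Criterion (1): the variance of the per-group counts is too large.
    if pvariance(counts) > variance_threshold:
        return largest_first[:2]
    # Criterion (2): some group directly holds too large a share of the data.
    oversized = [k for k in largest_first
                 if group_counts[k] / total > share_threshold]
    return oversized[:2]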
S150, if the number of the data in the group does not meet the preset condition, all the groups are taken as target groups, and S170 is executed;
s160, if the number of the data in the group meets a preset condition, eliminating the group with the largest number of the data in the group to obtain a target group;
the conventional Simhash algorithm has a long running time if the Simhash values are not uniformly distributed. In the embodiment, the number of data in each group after the group sorting is checked, and if the distribution is not uniform, another 24-bit search operation is performed on one or more groups with an excessive number.
In this embodiment, the target packet refers to a packet left after removing a packet that does not satisfy a preset condition, that is, one or two packets having the largest number of data in the group.
And S170, calculating a similarity comparison value for the data in the target grouping to obtain a similarity value.
In the present embodiment, the similarity value refers to a result of similarity calculation performed on data in the target packet.
A similarity comparison is calculated for the data remaining after the group is removed from the data set; if the similarity is greater than a certain value, the data are determined to be duplicated, and this is taken as the result for the data set. The similarity dup is defined as follows:
dup(A, B) = |tok(A) ∩ tok(B)| / |tok(A) ∪ tok(B)|
where tok(A) and tok(B) respectively denote the word-group sets A and B of the two pieces of data after pre-processing word segmentation. The similarity is defined as the ratio of the number of word groups shared by the two sets to the total number of word groups they contain; repetition of a word group within a single set is not counted as a repetition.
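Reading that verbal definition as a ratio over the two token sets (with within-set repeats ignored), a minimal implementation could be:

def dup_similarity(tokens_a, tokens_b):
    """Similarity of two pieces of data after word segmentation: the number
    of word groups shared by both sets divided by the total number of
    distinct word groups; repeats inside a single set are not counted."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# dup_similarity(["big", "data", "dedup"], ["big", "data", "index"]) == 0.5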
And S180, judging whether the similarity value exceeds a preset threshold value.
S190, if the similarity value exceeds a preset threshold value, outputting a notification that the data needing to be checked are repeated to a terminal for displaying;
in this embodiment, the preset threshold refers to a similarity threshold, and when the similarity value exceeds the preset threshold, it indicates that there is duplication in the current two sets, and indicates that there is duplication in the data to be checked.
S200, if the similarity value does not exceed the preset threshold, performing data copying and group subdivision on the removed group with the largest amount of data to obtain the number of data in each subdivided group, and updating the number of data in the groups according to the number of data in each subdivided group.
In this embodiment, when there is no duplicate data in the target groups, it is necessary to check whether there are duplicates in the originally excluded group.
In an embodiment, referring to fig. 4, the step S200 may include steps S201 to S202.
S201, making multiple data copies of the removed group with the largest amount of data to obtain copied packets.
The removed group is grouped and copied again; the copied packets are the packets obtained through the multiple data copies.
Four data copies are made of the remaining bits of the removed group (the bits already matched are identical and need not participate in the calculation); each copy is divided into k groups again, and the retrieval operation is performed again to improve the accuracy of the overall duplicate check.
S202, carrying out group subdivision on the copied groups to obtain the number of data in each subdivided group, and updating the number of data in each subdivided group according to the number of data in each subdivided group.
Wherein the copied packets are subdivided into k packets,
k = 2^⌊n/i⌋ for each of the first i−1 copies, and k = 2^(n−(i−1)·⌊n/i⌋) for the last copy,
n denotes the total number of bits and i denotes the number of copies of the data.
S210, judging whether the number of the data in the group meets a termination condition;
if the number of the data in the group does not satisfy the termination condition, the step S140 is returned to.
For example, if copies are to be made of 18 bits of data, copy 1 is divided into 16 groups according to the first 4 bits, copy 2 into 16 groups according to bits 5-8, copy 3 into 16 groups according to bits 9-12, and copy 4 into 64 groups according to the last 6 bits. The process then returns to step S140 for similarity calculation.
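The subdivision of the remaining bits into i copies, matching the 18-bit / 4-copy example above, could be sketched as follows (the function and parameter names are illustrative only):

from collections import defaultdict


def subdivide_copies(remaining_fps, n_bits, n_copies=4):
    """Make n_copies copies of the remaining n_bits-wide values; copy j is
    grouped by its own bit segment.  With n_bits=18 and n_copies=4 the
    segments are 4, 4, 4 and 6 bits, i.e. 16, 16, 16 and 64 groups, as in
    the example above."""
    seg = n_bits // n_copies              # bits per segment for copies 1..i-1
    copies, offset = [], 0
    for j in range(n_copies):
        width = seg if j < n_copies - 1 else n_bits - offset  # last copy takes the rest
        mask = (1 << width) - 1
        groups = defaultdict(list)
        for fp in remaining_fps:
            groups[(fp >> (n_bits - offset - width)) & mask].append(fp)
        copies.append(groups)
        offset += width
    return copies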
In this embodiment, the termination condition comprises any one of the following: no single group holds more than 50% of the data; or the number of remaining bits of the copied data of the rejected group is less than 8 and the share of the data to be checked held by the largest group after the group subdivision of the copied packets is below a certain set percentage; or the number of remaining bits of the copied data of the rejected group is not more than 3.
Each data copy has three termination conditions, and the loop ends as soon as any one of them is satisfied. First, no single group's share of the data exceeds 50%. Second, the number of remaining bits of the data copy is less than 8 and the largest group's share of the data to be checked is below a certain percentage, where the remaining bits are the total bits of the copy minus the bits already used for grouping. This condition is set because once fewer than 8 bits remain and another round is needed, at least one of the next 4 copies would have only 2 groups and would inevitably need yet another round, and so on; such repeated rounds have limited value and cause the number of threads to expand rapidly, hurting the overall efficiency of the algorithm, so if the absolute amount of data is acceptable, the calculation is performed directly without removing a group that holds more than 50% of the data. Third, when the second condition is not met and no more than 3 bits remain, clearly only the Hamming distances between all data within one group can be calculated.
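Put together, the three termination conditions could be checked like this; share_limit mirrors the 50% figure, while small_share_limit stands in for the unspecified "certain percentage":

def should_terminate(group_counts, remaining_bits, total_to_check,
                     share_limit=0.5, small_share_limit=0.1):
    """Return True when any of the three termination conditions holds.
    small_share_limit is an assumed placeholder for the set percentage."""
    counts = list(group_counts.values())
    largest = max(counts, default=0)
    # 1. No single group holds more than share_limit of the data in this copy.
    if largest <= share_limit * sum(counts):
        return True
    # 2. Fewer than 8 bits remain and the largest group is already a small
    #    share of all the data being checked for duplicates.
    if remaining_bits < 8 and largest / total_to_check < small_share_limit:
        return True
    # 3. No more than 3 bits remain: only Hamming distances can still be computed.
    return remaining_bits <= 3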
And S220, if the number of the data in the group meets the termination condition, outputting a notice that the data needing to be checked are not repeated to a terminal for displaying.
When neither the target groups nor the removed group yields a similarity value exceeding the preset threshold and the number of data in the current groups meets the termination condition, the duplicate check ends and a notification that the data to be checked contains no duplicates is output to the terminal for display.
According to the big data duplicate checking method, the Simhash value is obtained first, group construction is then carried out, and data copying follows; during data copying, the groups holding an excessive amount of data are removed when the distribution is uneven, similarity calculation is performed on the remaining groups, and when the similarity value of the remaining groups does not exceed the preset threshold, another retrieval operation is performed on the removed groups. The stage of comparing whether data are duplicated is skipped and similarity is calculated directly, achieving a small data processing amount, short calculation time and high calculation efficiency in the data duplicate checking process.
Fig. 5 is a schematic block diagram of a big data duplication checking apparatus 300 according to an embodiment of the present invention. As shown in fig. 5, the present invention also provides a big data duplicate checking device 300 corresponding to the big data duplicate checking method. The big data duplication checking apparatus 300 includes a unit for performing the big data duplication checking method, and the apparatus may be configured in a server.
Specifically, referring to fig. 5, the big data duplication checking apparatus 300 includes a data acquisition unit 301, a preprocessing unit 302, a group construction unit 303, a number judgment unit 304, a target grouping formation unit 312, a rejection unit 305, a similarity calculation unit 306, a similarity judgment unit 307, a first transmission unit 308, a group subdivision unit 309, a termination judgment unit 310, and a second transmission unit 311.
A data acquiring unit 301, configured to acquire data to be checked for duplication; a preprocessing unit 302, configured to preprocess the data to be duplicate checked to obtain a Simhash value; a group construction unit 303, configured to perform group construction according to a value that needs to be determined to be repeated within the Simhash value to obtain a group, and classify the data that needs to be determined to be repeated into the group to obtain the number of data within the group; a quantity judgment unit 304, configured to judge whether the quantity of the data in the group meets a preset condition; a target grouping forming unit 312, configured to take all the groups as target groupings if the number of data in the group does not meet a preset condition; a removing unit 305, configured to remove, if the number of the group internal data meets a preset condition, a group with the largest number of the group internal data to obtain a target group; a similarity calculation unit 306, configured to calculate a similarity comparison value for data in the target packet to obtain a similarity value; a similarity determination unit 307, configured to determine whether the similarity value exceeds a preset threshold; a first sending unit 308, configured to output, if the similarity value exceeds a preset threshold, a notification that duplicate data needs to be checked to a terminal for display; a group subdivision unit 309, configured to copy and subdivide the data of the group with the largest number of removed data in the group if the similarity value does not exceed a preset threshold, to obtain the number of subdivided data in each group, and update the number of subdivided data in each group with the number of subdivided data in each group; a termination judging unit 310, configured to judge whether the number of data in the group meets a termination condition; if the number of the data in the group does not meet the termination condition, returning to the judgment of whether the number of the data in the group meets the preset condition or not; a second sending unit 311, configured to output, if the number of the data in the group meets a termination condition, a notification that there is no duplication of the data to be checked to the terminal for displaying.
In one embodiment, as shown in fig. 6, the preprocessing unit 302 includes a dividing subunit 3021, a feature value obtaining subunit 3022, and a hash calculation subunit 3023.
A dividing subunit 3021, configured to perform word segmentation on data to be duplicate checked to obtain single data; a characteristic value acquisition subunit 3022 configured to acquire a characteristic value of individual data; a hash calculation subunit 3023, configured to perform hash value calculation on the feature value of the single data to obtain a Simhash value.
In one embodiment, as shown in FIG. 7, the group subdivision unit 309 includes a copy sub-unit 3091 and a subdivision update sub-unit 3092.
A copying subunit 3091, configured to copy data for multiple times in the packet with the largest number of data in the removed group, so as to obtain a copied packet; a subdivision updating subunit 3092, configured to perform group subdivision on the copied packets to obtain the number of data in each subdivided group, and update the number of data in each group with the number of data in each subdivided group; wherein the copied packets are subdivided into k packets,
k = 2^⌊n/i⌋ for each of the first i−1 copies,
k = 2^(n−(i−1)·⌊n/i⌋) for the last copy,
n denotes the total number of bits and i denotes the number of copies of the data.
It should be noted that, as can be clearly understood by those skilled in the art, the detailed implementation process of the big data duplicate checking device 300 and each unit may refer to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, no further description is provided herein.
The big data duplication checking apparatus 300 may be implemented in the form of a computer program that can run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 8, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer programs 5032 include program instructions that, when executed, cause the processor 502 to perform a big data deduplication method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute a big data duplicate checking method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 8 is a block diagram of only a portion of the configuration relevant to the present teachings and does not constitute a limitation on the computer device 500 to which the present teachings may be applied, and that a particular computer device 500 may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring data to be checked; preprocessing the data needing to be checked to obtain a Simhash value; performing group construction according to a numerical value needing to be confirmed to be repeated in the Simhash value to obtain a group, and classifying the data needing to be checked into the group to obtain the quantity of the data in the group; judging whether the quantity of the data in the group meets a preset condition or not; if the number of the data in the group meets a preset condition, rejecting the group with the largest number of the data in the group to obtain a target group; calculating a similarity comparison value for the data in the target group to obtain a similarity value; judging whether the similarity value exceeds a preset threshold value or not; if the similarity value does not exceed a preset threshold value, performing data copying and group subdivision on the group with the largest number of removed data in the group to obtain the number of subdivided data in each group, and updating the number of data in the group according to the number of subdivided data in each group; judging whether the quantity of the data in the group meets a termination condition; and if the number of the data in the group does not meet the termination condition, returning to the judgment of whether the number of the data in the group meets the preset condition or not.
Wherein the preset condition includes that the variance of the number of data in the group is greater than a variance threshold or that the percentage of the total number occupied by the number of data in the group exceeds a percentage threshold.
The termination condition comprises any one of the following: no single group holds more than 50% of the data; or the number of remaining bits of the copied data of the rejected group is less than 8 and the share of the data to be checked held by the largest group after the group subdivision of the copied packets is below a certain set percentage; or the number of remaining bits of the copied data of the rejected group is not more than 3.
In an embodiment, after implementing the step of determining whether the similarity value exceeds the preset threshold, the processor 502 further implements the following steps:
and if the similarity value exceeds a preset threshold value, outputting a notification that the data to be checked contains duplicates to a terminal for display.
In an embodiment, after the step of determining whether the amount of data in the group meets the termination condition, the processor 502 further performs the following steps:
and if the number of the data in the group meets the termination condition, outputting a notice that the data needing to be checked are not repeated to a terminal for displaying.
In an embodiment, when the processor 502 implements the step of preprocessing the data to be checked to obtain the Simhash value, the following steps are specifically implemented:
performing word segmentation on data needing to be checked to obtain single data; acquiring a characteristic value of single data; and carrying out hash value calculation on the characteristic value of the single data to obtain a Simhash value.
In an embodiment, when implementing the steps of performing data copying and group subdivision on the packet with the largest number of removed data in the group to obtain the number of subdivided data in each group, and updating the number of data in the group by the number of subdivided data in each group, the processor 502 specifically implements the following steps:
making multiple data copies for the packet with the largest quantity of data in the removed group to obtain a copied packet; subdividing the copied groups to obtain the number of data in each subdivided group, and updating the number of data in each subdivided group according to the number of data in each subdivided group; wherein the copied packets are subdivided into k packets,
k = 2^⌊n/i⌋ for each of the first i−1 copies, and k = 2^(n−(i−1)·⌊n/i⌋) for the last copy,
n denotes the total number of bits and i denotes the number of copies of the data.
It should be understood that, in the embodiment of the present Application, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field-Programmable Gate arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware. The computer program includes program instructions, and the computer program may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, wherein the computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring data to be checked; preprocessing the data needing to be checked to obtain a Simhash value; performing group construction according to a numerical value needing to be confirmed to be repeated in the Simhash value to obtain a group, and classifying the data needing to be checked into the group to obtain the quantity of the data in the group; judging whether the quantity of the data in the group meets a preset condition or not; if the number of the data in the group meets a preset condition, rejecting the group with the largest number of the data in the group to obtain a target group; calculating a similarity comparison value for the data in the target group to obtain a similarity value; judging whether the similarity value exceeds a preset threshold value or not; if the similarity value does not exceed a preset threshold value, performing data copying and group subdivision on the group with the largest number of removed data in the group to obtain the number of subdivided data in each group, and updating the number of data in the group according to the number of subdivided data in each group; judging whether the quantity of the data in the group meets a termination condition; and if the number of the data in the group does not meet the termination condition, returning to the judgment of whether the number of the data in the group meets the preset condition or not.
Wherein the preset condition includes that the variance of the number of data in the group is greater than a variance threshold or that the percentage of the total number occupied by the number of data in the group exceeds a percentage threshold.
The termination condition comprises any one of the following: no single group holds more than 50% of the data; or the number of remaining bits of the copied data of the rejected group is less than 8 and the share of the data to be checked held by the largest group after the group subdivision of the copied packets is below a certain set percentage; or the number of remaining bits of the copied data of the rejected group is not more than 3.
In an embodiment, after the step of determining whether the similarity value exceeds the preset threshold value is implemented by executing the computer program, the processor further implements the following steps:
and if the similarity value exceeds a preset threshold value, outputting a notification that the data to be checked contains duplicates to a terminal for display.
In an embodiment, after the step of determining whether the amount of data in the group meets the termination condition is implemented by the processor executing the computer program, the following steps are further implemented:
and if the number of the data in the group meets the termination condition, outputting a notice that the data needing to be checked are not repeated to a terminal for displaying.
In an embodiment, when the processor executes the computer program to implement the step of preprocessing the data to be checked to obtain the Simhash value, the following steps are specifically implemented:
performing word segmentation on data needing to be checked to obtain single data; acquiring a characteristic value of single data; and carrying out hash value calculation on the characteristic value of the single data to obtain a Simhash value.
In an embodiment, when the processor executes the computer program to implement the steps of copying data and subdividing the group with the largest number of removed groups of data in the group to obtain the number of subdivided groups of data in each group, and updating the number of data in each group with the number of subdivided groups of data in each group, the following steps are specifically implemented:
making multiple data copies for the packet with the largest quantity of data in the removed group to obtain a copied packet; subdividing the copied groups to obtain the number of data in each subdivided group, and updating the number of data in each subdivided group according to the number of data in each subdivided group; wherein the copied packets are subdivided into k packets,
k = 2^⌊n/i⌋ for each of the first i−1 copies, and k = 2^(n−(i−1)·⌊n/i⌋) for the last copy,
n denotes the total number of bits and i denotes the number of copies of the data.
The storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, which can store various computer readable storage media.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be merged, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. The big data duplicate checking method is characterized by comprising the following steps:
acquiring data to be checked;
preprocessing the data needing to be checked to obtain a Simhash value;
performing group construction according to a numerical value needing to be confirmed to be repeated in the Simhash value to obtain a group, and classifying the data needing to be checked into the group to obtain the quantity of the data in the group;
judging whether the quantity of the data in the group meets a preset condition or not;
if the number of the data in the group meets a preset condition, rejecting the group with the largest number of the data in the group to obtain a target group;
calculating a similarity comparison value for the data in the target group to obtain a similarity value;
judging whether the similarity value exceeds a preset threshold value or not;
if the similarity value does not exceed a preset threshold value, performing data copying and group subdivision on the group with the largest number of removed data in the group to obtain the number of subdivided data in each group, and updating the number of data in the group according to the number of subdivided data in each group;
judging whether the quantity of the data in the group meets a termination condition;
and if the number of the data in the group does not meet the termination condition, returning to the judgment of whether the number of the data in the group meets the preset condition or not.
2. The big data duplicate checking method according to claim 1, wherein after determining whether the similarity value exceeds a preset threshold, the method further comprises:
and if the similarity value exceeds a preset threshold value, outputting a notification that the data to be checked contains duplicates to a terminal for display.
3. The big data duplication checking method according to claim 1, wherein after determining whether the number of the data in the group meets a termination condition, the method further includes:
and if the number of the data in the group meets the termination condition, outputting a notice that the data needing to be checked are not repeated to a terminal for displaying.
4. The big data duplicate checking method according to claim 1, wherein the preprocessing the data to be duplicated to obtain the Simhash value comprises:
performing word segmentation on data needing to be checked to obtain single data;
acquiring a characteristic value of single data;
and carrying out hash value calculation on the characteristic value of the single data to obtain a Simhash value.
5. The big data duplication checking method of claim 1 wherein the predetermined condition includes a variance of the number of data in the group being greater than a variance threshold or a percentage of the total number occupied by the number of data in the group exceeding a percentage threshold.
6. The big data duplicate checking method according to claim 1, wherein the data copying and group subdivision processing on the packet with the largest number of removed data in the group to obtain the number of subdivided data in each group, and updating the number of data in the group with the number of subdivided data in each group comprises:
making multiple data copies for the packet with the largest quantity of data in the removed group to obtain a copied packet;
subdividing the copied groups to obtain the number of data in each subdivided group, and updating the number of data in each subdivided group according to the number of data in each subdivided group;
wherein the copied packets are subdivided into k packets,
k = 2^⌊n/i⌋ for each of the first i−1 copies, and k = 2^(n−(i−1)·⌊n/i⌋) for the last copy,
n denotes the total number of bits and i denotes the number of copies of the data.
7. The big data duplicate checking method according to claim 1, wherein the termination condition comprises any one of the following: no single group holds more than 50% of the data; or the number of remaining bits of the copied data of the rejected group is less than 8 and the share of the data to be checked held by the largest group after the group subdivision of the copied packets is below a certain set percentage; or the number of remaining bits of the copied data of the rejected group is not more than 3.
8. Big data duplicate checking device, its characterized in that includes:
the data acquisition unit is used for acquiring data needing to be checked for duplication;
the preprocessing unit is used for preprocessing the data needing to be checked for duplication to obtain a Simhash value;
the group construction unit is used for carrying out group construction according to the numerical value needing to be confirmed to be repeated in the Simhash value to obtain a group, and classifying the data needing to be checked to be repeated into the group to obtain the quantity of the data in the group;
the quantity judging unit is used for judging whether the quantity of the data in the group meets a preset condition or not;
the rejecting unit is used for rejecting the group with the largest number of data in the group to obtain a target group if the number of the data in the group meets a preset condition;
the similarity calculation unit is used for calculating a similarity comparison value for the data in the target group to obtain a similarity value;
the similarity judging unit is used for judging whether the similarity value exceeds a preset threshold value or not;
the group subdivision unit is used for copying and subdividing the data of the group with the largest quantity of the rejected data in the group to obtain the subdivided data quantity in each group and updating the data quantity in the group according to the subdivided data quantity in each group if the similarity value does not exceed a preset threshold value;
a termination judgment unit configured to judge whether the number of the data in the group satisfies a termination condition; and if the number of the data in the group does not meet the termination condition, returning to the judgment of whether the number of the data in the group meets the preset condition or not.
9. A computer device, characterized in that the computer device comprises a memory, on which a computer program is stored, and a processor, which when executing the computer program implements the method according to any of claims 1 to 7.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN201911115294.1A 2019-11-14 2019-11-14 Big data duplicate checking method and device, computer equipment and storage medium Active CN110909019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911115294.1A CN110909019B (en) 2019-11-14 2019-11-14 Big data duplicate checking method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911115294.1A CN110909019B (en) 2019-11-14 2019-11-14 Big data duplicate checking method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110909019A true CN110909019A (en) 2020-03-24
CN110909019B CN110909019B (en) 2022-04-08

Family

ID=69817374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911115294.1A Active CN110909019B (en) 2019-11-14 2019-11-14 Big data duplicate checking method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110909019B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103189867A (en) * 2012-10-30 2013-07-03 华为技术有限公司 Duplicated data search method and equipment
US20180191764A1 (en) * 2017-01-04 2018-07-05 Synack, Inc. Automatic webpage change detection
CN108573045A (en) * 2018-04-18 2018-09-25 同方知网数字出版技术股份有限公司 A kind of alignment matrix similarity retrieval method based on multistage fingerprint
CN109271614A (en) * 2018-10-30 2019-01-25 中译语通科技股份有限公司 A kind of data duplicate checking method
CN110309446A (en) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 The quick De-weight method of content of text, device, computer equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103189867A (en) * 2012-10-30 2013-07-03 华为技术有限公司 Duplicated data search method and equipment
US20180191764A1 (en) * 2017-01-04 2018-07-05 Synack, Inc. Automatic webpage change detection
CN108573045A (en) * 2018-04-18 2018-09-25 同方知网数字出版技术股份有限公司 A kind of alignment matrix similarity retrieval method based on multistage fingerprint
CN109271614A (en) * 2018-10-30 2019-01-25 中译语通科技股份有限公司 A kind of data duplicate checking method
CN110309446A (en) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 The quick De-weight method of content of text, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Fan: "Research and Application of Text Similarity Detection Technology Based on Fingerprint Retrieval", China Master's Theses Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN110909019B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN109189991B (en) Duplicate video identification method, device, terminal and computer readable storage medium
US11086912B2 (en) Automatic questioning and answering processing method and automatic questioning and answering system
CN109241274B (en) Text clustering method and device
CN111145737B (en) Voice test method and device and electronic equipment
WO2020215667A1 (en) Text content quick duplicate removal method and apparatus, computer device, and storage medium
KR101508260B1 (en) Summary generation apparatus and method reflecting document feature
CN108304371B (en) Method and device for mining hot content, computer equipment and storage medium
CN108897842A (en) Computer readable storage medium and computer system
CN110321466B (en) Securities information duplicate checking method and system based on semantic analysis
CN107180093A (en) Information search method and device and ageing inquiry word recognition method and device
CN107832444B (en) Event discovery method and device based on search log
CN110532388B (en) Text clustering method, equipment and storage medium
CN110837555A (en) Method, equipment and storage medium for removing duplicate and screening of massive texts
CN112162977A (en) MES-oriented massive data redundancy removing method and system
CN117216239A (en) Text deduplication method, text deduplication device, computer equipment and storage medium
CN113656575B (en) Training data generation method and device, electronic equipment and readable medium
CN107133321B (en) Method and device for analyzing search characteristics of page
CN110909019B (en) Big data duplicate checking method and device, computer equipment and storage medium
CN113821630A (en) Data clustering method and device
CN115952332A (en) Core search phrase determining method based on co-occurrence word frequency
CN111026921A (en) Graph-based incidence relation obtaining method and device and computer equipment
CN116028873A (en) Multi-class server fault prediction method based on support vector machine
CN113609247A (en) Big data text duplicate removal technology based on improved Simhash algorithm
CN114418114A (en) Operator fusion method and device, terminal equipment and storage medium
US20160371331A1 (en) Computer-implemented method of performing a search using signatures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant