CN115293126A - Method and device for removing duplicate of large-scale text data, electronic equipment and storage medium - Google Patents

Method and device for removing duplicate of large-scale text data, electronic equipment and storage medium

Info

Publication number
CN115293126A
Authority
CN
China
Prior art keywords
data
hash
deduplicated
deduplication
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210700368.3A
Other languages
Chinese (zh)
Inventor
孙羽菲
申峻宇
王昊天
李东闻
张玉志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202210700368.3A priority Critical patent/CN115293126A/en
Publication of CN115293126A publication Critical patent/CN115293126A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a method and a device for deduplicating large-scale text data, an electronic device and a storage medium, relating to the field of data processing. The main technical scheme comprises the following steps: dividing first data to be deduplicated into at least two data segments, each data segment containing at least two pieces of data; within the first data segment, executing a preset hash algorithm on each piece of data to obtain at least one hash block; performing deduplication calculation on at least two pieces of data in the first data segment to obtain second data to be deduplicated; comparing the hash blocks of the second data to be deduplicated with the hash blocks in a preset reference database in sequence; and performing a secondary deduplication calculation according to the similarity of the comparison results, then continuing the deduplication calculation on the remaining second data segment of the first data to be deduplicated. Compared with the prior art, the embodiments of the disclosure split large-scale data into small segments and deduplicate each segment in turn, so that the scale of text data that can be deduplicated is no longer bounded by the device memory.

Description

Method and device for removing duplicate of large-scale text data, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method and an apparatus for deduplicating large-scale text data, an electronic device, and a storage medium.
Background
Massive, large-scale text data inevitably contains passages whose characters or meanings repeat one another, so deduplication is an important preprocessing step for such data. During text deduplication, completely identical text data must be removed, and text data that is similar but not completely identical must also be taken into account.
At present, one method for deduplicating text data processes all data in a single pass: all data to be deduplicated is read at once, similarity is calculated between every pair of items, and deduplication is performed according to the results.
Although this method achieves text deduplication to some extent, it has a problem: processing all data at once requires reading all data to be deduplicated into memory in a single step. Because the memory of the computing device is limited, especially when the data to be deduplicated is large in scale, the device memory cannot bear the load, which places a severe limit on the scale of data that can be deduplicated.
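To make the memory limitation concrete, the following is a minimal sketch of the all-at-once approach described above, with character shingles and Jaccard similarity standing in for the unspecified similarity calculation (both are illustrative choices, not the method of this disclosure); every signature is held in memory at once and every pair of documents is compared.

```python
def shingles(text, k=5):
    """Character k-shingles of a document (k and the shingling scheme are assumptions)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def naive_dedup(docs, threshold=0.8):
    """docs: list of strings, all held in memory at once; O(n^2) pairwise comparisons."""
    sigs = [shingles(d) for d in docs]          # every signature kept in RAM
    kept = []                                   # indices of retained documents
    for i in range(len(docs)):
        if all(jaccard(sigs[i], sigs[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]
```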
Disclosure of Invention
The disclosure provides a method and a device for deduplicating large-scale text data, an electronic device and a storage medium. The method mainly aims to solve the problem that the scale of data that can be deduplicated by existing text deduplication techniques is limited by the device memory. The embodiments of the disclosure allow the deduplication scale of text data to break through the device memory limitation.
According to a first aspect of the present disclosure, there is provided a method for deduplication of large-scale text data, comprising:
dividing first data to be deduplicated into at least two data segments, wherein each data segment comprises at least two data;
in the first data segment, respectively executing a preset hash algorithm aiming at single data to obtain at least one hash block;
performing duplicate removal calculation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be subjected to duplicate removal;
comparing the hash blocks in the second data to be deduplicated with hash blocks in a preset reference database in sequence;
and performing secondary deduplication calculation according to the similarity of the comparison result, and continuously performing deduplication calculation in the remaining second data segment in the first data to be deduplicated.
Optionally, in the first data segment, executing the preset hash algorithm on each piece of data respectively to obtain at least one hash block includes:
in the first data segment, respectively executing a preset signature calculation on each piece of data to obtain at least one signature block;
and calculating the at least one signature block in sequence by using a preset hash algorithm to obtain at least one hash block.
Optionally, the performing deduplication calculation on data corresponding to at least two hash chunks in the first data segment includes:
if the hash values of the hash blocks at any same position between two pieces of data in the first data segment are the same, determining that the two pieces of data are first possibly repeated data;
performing similarity calculation on the first possibly repeated data in the first data segment based on a preset similarity calculation method;
if the similarity between the first possibly repeated data exceeds a first preset similarity threshold, discarding either one of the first possibly repeated data;
if the similarity between the first possibly repeated data does not exceed the first preset similarity threshold, retaining the first possibly repeated data.
Optionally, the performing of the secondary duplicate removal calculation according to the similarity of the comparison result includes:
if the preset reference database is empty, storing second data to be deduplicated into the preset reference database;
if the preset reference database is not empty, judging whether hash values of hash blocks contained in first data in the second data to be deduplicated are the same as hash values of hash blocks contained in any second data in the preset reference database, judging whether hash block positions of the two pieces of data with the same hash values are also the same, and if the hash values of the two pieces of data at any same position are the same, determining that the first data and the second data are second data which can be repeated;
calculating the similarity between the first data and the second data based on a preset text similarity algorithm;
if the similarity exceeds a second preset similarity threshold, discarding the first data in the second data to be deduplicated;
and if the similarity does not exceed the second preset similarity threshold, reserving the first data in the second data to be deduplicated.
Optionally, the method further includes:
and updating the preset reference database based on the second data to be deduplicated after the deduplication processing.
According to a second aspect of the present disclosure, there is provided an apparatus for deduplicating large-scale text data, including:
the dividing unit is used for dividing the first data to be deduplicated into at least two data segments, and each data segment comprises at least two data;
the first computing unit is used for executing a preset hash algorithm on the single data in the first data segment respectively to obtain at least one hash block;
the second computing unit is used for performing duplicate removal computation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be subjected to duplicate removal;
the comparison unit is used for sequentially comparing the hash blocks in the second data to be deduplicated with the hash blocks in a preset reference database;
and the third calculating unit is used for performing secondary deduplication calculation according to the comparison result similarity of the comparison unit and continuously executing deduplication calculation in the remaining second data segment in the first data to be deduplicated.
Optionally, the first computing unit includes:
the first calculation module is used for executing preset signature calculation on single data in the first data segment to obtain at least one signature block;
and the second calculation module is used for calculating the at least one signature block by using a preset hash algorithm in sequence to obtain at least one hash block.
Optionally, the second computing unit includes:
the determining module is used for determining that the two data are first data which can be repeated when the hash values of the hash blocks at any identical position in the two data in the first data segment are identical;
a calculation module, configured to perform similarity calculation on first possibly repeated data in the first data segment based on a preset similarity calculation method;
a discarding module configured to discard any of the first potentially duplicated data when a similarity between the first potentially duplicated data exceeds a first preset similarity threshold;
a retention module that retains the first potentially duplicated data when a similarity between the first potentially duplicated data does not exceed the first preset similarity threshold.
Optionally, the third computing unit includes:
the storage module is used for storing the second data to be deduplicated into the preset reference database when the preset reference database is empty;
a determining module, configured to determine whether a hash value of a hash block included in first data in the second data to be deduplicated is the same as a hash value of a hash block included in any second data in the preset reference database, determine whether hash block positions of the two pieces of data having the same hash value are also the same, and determine that the first data and the second data are second data that may be repeated when the hash values of the two pieces of data having the same hash block at any same position are the same;
the calculation module is used for calculating the similarity between the first data and the second data based on a preset text similarity algorithm;
the discarding module is used for discarding the first data in the second data to be deduplicated when the similarity exceeds a second preset similarity threshold;
and the retaining module is used for retaining the first data in the second data to be deduplicated when the similarity does not exceed the second preset similarity threshold.
Optionally, the apparatus further comprises:
and the updating unit is used for updating the preset reference database based on the second data to be deduplicated after the deduplication processing.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the aforementioned first aspect.
According to a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as set forth in the preceding first aspect.
According to the method and device for deduplicating large-scale text data, the electronic device and the storage medium, the first data to be deduplicated is divided into at least two data segments, each data segment containing at least two pieces of data; within the first data segment, a preset hash algorithm is executed on each piece of data to obtain at least one hash block; deduplication calculation is performed on the data corresponding to at least two hash blocks in the first data segment to obtain second data to be deduplicated; the hash blocks of the second data to be deduplicated are compared in sequence with the hash blocks in a preset reference database; and a secondary deduplication calculation is performed according to the similarity of the comparison results, after which the deduplication calculation continues with the remaining second data segment of the first data to be deduplicated. Compared with the related art, the embodiments of the disclosure first perform local deduplication based on a hash algorithm and then achieve overall deduplication of the data through the database, which improves deduplication efficiency to a certain extent while retaining a good deduplication effect. When deduplicating large-scale data, the large-scale data is first split into small segments, and each segment is then deduplicated in turn: a first deduplication of the segment is completed based on the hash algorithm to obtain the second data to be deduplicated, the hash blocks of the second data to be deduplicated are compared with the hash blocks of the data in the preset reference database, and a second deduplication is completed based on the similarity of the comparison results. By deduplicating large-scale data in batches, the embodiments of the disclosure allow the scale of text data deduplication to break through the device memory limitation.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a method for removing duplicate text data in a large scale according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an overall deduplication process for large-scale text data according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a large-scale local deduplication process for text data according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a database-based deduplication process for large-scale text data according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a large-scale de-duplication device for text data according to the present disclosure;
FIG. 6 is a schematic diagram of another apparatus for removing duplicate text data in large scale according to the present disclosure;
FIG. 7 shows a schematic block diagram of an example electronic device 300 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A method, an apparatus, an electronic device, and a storage medium for deduplication of large-scale text data of the embodiments of the present disclosure are described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for removing duplicate text data in a large scale according to an embodiment of the present disclosure. As shown in fig. 1, the method comprises the following steps:
step 101, dividing the first data to be deduplicated into at least two data segments, wherein each data segment contains at least two data.
The first data to be deduplicated is the complete data set that has not yet been deduplicated. To make the data easier to process, in the embodiment of the present disclosure the first data to be deduplicated is divided into at least two data segments before processing, each data segment containing at least two pieces of data. When the total amount of first data to be deduplicated is large in scale, deduplication is performed segment by segment: the large-scale first data to be deduplicated is split into small segments that are fed into the computing device one after another for deduplication, so that large-scale data can be deduplicated even when the computing resources of the device are limited.
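A minimal sketch of this segmentation step, assuming one document per line of a text file (the file layout, function name and segment size are illustrative choices, not requirements of the disclosure):

```python
from itertools import islice

def read_segments(path, segment_size=100_000):
    """Yield successive segments of at most `segment_size` documents, so that
    only one segment needs to be held in memory at any time."""
    with open(path, encoding="utf-8") as f:
        while True:
            segment = [line.rstrip("\n") for line in islice(f, segment_size)]
            if not segment:
                break
            yield segment
```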
And 102, respectively executing a preset hash algorithm on the single data in the first data segment to obtain at least one hash block.
The first data segment is one of the at least two data segments obtained from the first data to be deduplicated in step 101. As described above, the first data segment contains at least two pieces of data, and the preset hash algorithm is executed on each piece of data in turn to obtain at least one hash block for each piece of data, where a hash block is a hash code identifier.
In order to make a preliminary judgment of possibly repeated data within the first data segment, the embodiment of the present disclosure provides one possible implementation: the hash blocks of every two pieces of data in the first data segment are compared with each other; when a hash block of one piece of data has the same value as a hash block of the other piece of data and the two blocks are at the same position, the two pieces of data are preliminarily judged to be possibly repeated, and the final determination is then made based on the calculated similarity. In this way, the preliminary judgment of possibly repeated data in the first data segment is achieved.
Step 103, performing deduplication calculation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be deduplicated.
And performing deduplication calculation on the data corresponding to the at least two hash blocks in the first data segment, namely performing similarity calculation on the data in the first data segment based on a text similarity calculation method.
In order to reach a final determination for the data preliminarily judged in step 102 to be possibly repeated, the embodiment of the present disclosure provides one possible implementation: similarity calculation is performed on the two pieces of data judged in step 102 to be possibly repeated; the resulting similarity is compared with a first preset similarity threshold to decide whether the two pieces of data are duplicates; if they are, one of the two pieces of data is deleted at random, and the remaining data constitutes the second data to be deduplicated. This finalizes the determination for the data preliminarily judged in step 102 and completes the deduplication within the first data segment.
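A minimal sketch of this within-segment (local) deduplication, combining steps 102 and 103: hash_blocks(text), which returns the ordered hash blocks of a document, and similarity(a, b) are assumed helpers (one possible hash_blocks is sketched later in this description, and the Jaccard similarity from the earlier sketch can serve as the similarity); grouping documents by (position, block value) is an implementation choice, not something fixed by the disclosure.

```python
from collections import defaultdict

def local_dedup(segment, hash_blocks, similarity, threshold=0.8):
    """First deduplication within one segment.
    Two documents become candidates when they share a hash block value at the
    same position; a candidate is discarded only when the verified similarity
    exceeds `threshold`."""
    blocks = [hash_blocks(doc) for doc in segment]
    buckets = defaultdict(list)                     # (position, block value) -> doc indices
    for i, doc_blocks in enumerate(blocks):
        for pos, value in enumerate(doc_blocks):
            buckets[(pos, value)].append(i)

    dropped = set()
    for indices in buckets.values():
        for a in range(len(indices)):
            if indices[a] in dropped:
                continue
            for b in range(a + 1, len(indices)):
                i, j = indices[a], indices[b]
                if j in dropped:
                    continue
                if similarity(segment[i], segment[j]) > threshold:
                    dropped.add(j)                  # keep the first of the pair
    # the second data to be deduplicated: surviving documents and their hash blocks
    survivors = [k for k in range(len(segment)) if k not in dropped]
    return [segment[k] for k in survivors], [blocks[k] for k in survivors]
```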
And step 104, comparing the hash blocks in the second data to be deduplicated with hash blocks in a preset reference database in sequence.
In order to preliminarily determine the repeated data possibly existing in the second data to be deduplicated and the preset reference database, the embodiment of the disclosure compares the second data to be deduplicated obtained after the first data segment is deduplicated for the first time with the data in the preset reference database.
The hash blocks corresponding to a piece of data in the second data to be deduplicated are compared in sequence with the hash blocks corresponding to the data in the preset reference database. When the hash values of two corresponding hash blocks are the same and the blocks are at the same position, the two pieces of data are preliminarily judged to be possibly repeated, and the final determination is then made based on the calculated similarity. In this way, the preliminary judgment of possibly repeated data between the second data to be deduplicated and the preset reference database is achieved.
And 105, performing secondary deduplication calculation according to the similarity of the comparison result, and continuously performing deduplication calculation in the remaining second data segment in the first data to be deduplicated.
In order to reach a final determination for the data preliminarily judged in step 104 to be possibly repeated, the embodiment of the present disclosure provides one possible implementation: when a piece of data in the second data to be deduplicated and a piece of data in the preset reference database are judged in step 104 to be possibly repeated, their similarity is calculated based on a text similarity algorithm and compared with a second preset similarity threshold to decide whether the two pieces of data are duplicates. If they are duplicates, the piece of data in the second data to be deduplicated is deleted; if they are not, the piece of data in the second data to be deduplicated is stored into the database. This finalizes the determination for the data preliminarily judged in step 104, completes the deduplication of the second data to be deduplicated, and thereby completes the deduplication of the first data segment.
And the second data segment is subjected to the same deduplication operation as the first data segment, so that deduplication of all the data segments is completed, and therefore complete deduplication of large-scale data is realized.
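A minimal sketch of the database-based comparison and second deduplication of steps 104 and 105: the preset reference database is modelled here as an in-memory dict keyed by (position, block value), whereas an actual deployment would use a real database as the disclosure describes; the function and field names are illustrative assumptions.

```python
def database_dedup(batch, batch_blocks, ref_db, similarity, threshold=0.8):
    """Second deduplication of one segment against the preset reference database.
    `ref_db` is modelled as a plain dict with
        ref_db["index"]: (position, block value) -> list of stored record ids
        ref_db["docs"]:  record id -> document text
    Returns the documents of `batch` that are not duplicates of anything already
    stored, together with their hash blocks."""
    kept = []
    for doc, blocks in zip(batch, batch_blocks):
        candidates = set()
        for pos, value in enumerate(blocks):        # same block value at the same position
            candidates.update(ref_db["index"].get((pos, value), ()))
        if any(similarity(doc, ref_db["docs"][rid]) > threshold for rid in candidates):
            continue                                # verified duplicate: discard
        kept.append((doc, blocks))                  # retained; stored by the update step
    return kept
```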
In order to more systematically show the overall deduplication process of the embodiment of the present disclosure, fig. 2 is a schematic diagram of an overall deduplication process of large-scale text data provided by the embodiment of the present disclosure, as shown in fig. 2, the embodiment of the present disclosure adopts a deduplication scheme that performs local deduplication based on a hash algorithm first, and then implements overall deduplication of data through a database, and for clarity and conciseness, specific processes are not repeated here.
The method for removing the duplication of the large-scale text data divides the first data to be removed into at least two data segments, wherein each data segment comprises at least two data; in the first data segment, respectively executing a preset hash algorithm aiming at single data to obtain at least one hash block; performing duplicate removal calculation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be subjected to duplicate removal; comparing the hash blocks in the second data to be deduplicated with hash blocks in a preset reference database in sequence; and performing secondary deduplication calculation according to the similarity of the comparison result, and continuously performing deduplication calculation in the remaining second data segment in the first data to be deduplicated. Compared with the prior art, the method and the device have the advantages that local deduplication is performed on the basis of the Hash algorithm, overall deduplication of the data is achieved through the database, deduplication execution efficiency is improved to a certain extent, and meanwhile a better deduplication effect is achieved. By carrying out batch deduplication operation on large-scale data, the embodiment of the disclosure further realizes text data deduplication scale breaking through the device memory limitation.
As a refinement of the embodiment of the present disclosure, when step 102 is executed, namely executing the preset hash algorithm on each piece of data in the first data segment to obtain at least one hash block, the following implementation may be adopted, for example: in the first data segment, executing a preset signature calculation on each piece of data to obtain at least one signature block; and calculating the at least one signature block with a preset hash algorithm to obtain at least one hash block.
To make the above implementation easier to understand, the process of obtaining at least one hash block is described in detail here. For each piece of data in the first data segment, a hash signature is first calculated; the signature is then split into blocks, so that each piece of data corresponds to at least one signature block; finally, each signature block is calculated with the preset hash algorithm, which is a hash function, to obtain the corresponding hash block. For example: a piece of data is processed to obtain its hash signature, the hash signature is partitioned to obtain at least one signature block, and each signature block is calculated with the hash function to obtain the corresponding hash block. The embodiments of the present disclosure do not limit the specific algorithms used.
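The disclosure does not fix the signature or hash functions, so the sketch below uses a MinHash over character shingles as the preset signature calculation and then splits the signature into fixed-size signature blocks, hashing each block; MinHash, the shingle length, the signature length, the block size and the helper names are all assumptions made for illustration.

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_len=5):
    """One possible preset signature: a MinHash over character shingles
    (the disclosure only requires some fixed-length hash signature per document)."""
    shingle_set = {text[i:i + shingle_len]
                   for i in range(max(len(text) - shingle_len + 1, 1))}
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingle_set))
    return signature

def hash_blocks(text, num_hashes=64, block_size=8):
    """Split the signature into fixed-size signature blocks and hash each block,
    giving the per-position hash blocks compared during deduplication."""
    sig = minhash_signature(text, num_hashes)
    return [hash(tuple(sig[i:i + block_size]))      # one hash value per signature block
            for i in range(0, num_hashes, block_size)]
```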
As a refinement of the above embodiment, when performing the deduplication calculation on the data corresponding to at least two hash blocks in the first data segment in step 103, the following implementation may be adopted, for example but without limitation: if the hash values of the hash blocks at any same position between two pieces of data in the first data segment are the same, determining that the two pieces of data are first possibly repeated data; performing similarity calculation on the first possibly repeated data in the first data segment based on a preset similarity calculation method; if the similarity between the first possibly repeated data exceeds a first preset similarity threshold, discarding either one of the first possibly repeated data; and if the similarity does not exceed the first preset similarity threshold, retaining the first possibly repeated data.
Before performing the similarity calculation, the embodiment of the present disclosure may make a preliminary judgment, based on the hash blocks, of whether two pieces of data are possibly repeated. For example: if a hash block of one piece of data in the first data segment is the same as the hash block at the same position of another piece of data, the two pieces of data are preliminarily judged to be possibly repeated, and the data preliminarily judged to be repeated is then verified individually, the verification being a similarity calculation on that data. After the preliminary judgment is finished, the similarity of each pair of data judged to be possibly repeated is calculated and compared with the first preset similarity threshold: if the similarity of the two pieces of data is higher than the threshold, they are determined to be duplicates and one of them is deleted at random; if the similarity is lower than the threshold, they are determined not to be duplicates and both are retained, thereby completing the local deduplication. The embodiments of the present disclosure do not limit the preliminary judgment method or the final determination method for repeated data.
In order to more intuitively illustrate the process of local deduplication of large-scale text data according to the embodiment of the present disclosure, fig. 3 is a schematic diagram of a local deduplication process of large-scale text data according to the embodiment of the present disclosure, as shown in fig. 3.
As a refinement of the above embodiment, when performing the secondary deduplication calculation according to the similarity of the comparison result in step 105, the following implementation manners may be adopted, but are not limited to, for example: if the preset reference database is empty, storing second data to be deduplicated into the preset reference database; if the preset reference database is not empty, judging whether hash values of hash blocks contained in first data in the second data to be deduplicated are the same as hash values of hash blocks contained in any second data in the preset reference database, judging whether hash block positions of the two pieces of data with the same hash values are also the same, and if the hash values of the two pieces of data in any same position are the same, determining that the first data and the second data are second data which can be repeated; calculating the similarity between the first data and the second data based on a preset text similarity algorithm; if the similarity exceeds a second preset similarity threshold, discarding the first data in the second data to be deduplicated; and if the similarity does not exceed the second preset similarity threshold, reserving the first data in the second data to be deduplicated.
In order to more intuitively demonstrate the database-based deduplication process of the large-scale text data according to the embodiment of the present disclosure, fig. 4 is a schematic diagram of a database-based deduplication process of the large-scale text data according to the embodiment of the present disclosure, as shown in fig. 4.
As a refinement of the embodiment of the present disclosure, the method for deduplicating large-scale text data further includes updating the preset reference database based on the second data to be deduplicated after the deduplication processing. For example: the hash blocks corresponding to the data determined not to be repeated are stored into the preset reference database, thereby completing the update of the preset reference database.
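Under the same dict-based model of the reference database used in the earlier sketches, the update step, together with a driver loop tying the sketches together over all segments, might look as follows; the file name, the similarity choice and all helper names are placeholders carried over from the earlier sketches, not part of the disclosure.

```python
def update_reference_db(ref_db, kept_records):
    """Register each retained document and its hash blocks in the reference
    database so that later segments are deduplicated against it as well."""
    for doc, blocks in kept_records:
        rid = len(ref_db["docs"])
        ref_db["docs"][rid] = doc
        for pos, value in enumerate(blocks):
            ref_db["index"].setdefault((pos, value), []).append(rid)

# Putting the sketches together over all segments of the first data to be deduplicated:
similarity = lambda a, b: jaccard(shingles(a), shingles(b))   # assumed similarity choice
ref_db = {"index": {}, "docs": {}}
deduplicated = []
for segment in read_segments("corpus.txt"):                   # placeholder file name
    docs, blocks = local_dedup(segment, hash_blocks, similarity)
    kept = database_dedup(docs, blocks, ref_db, similarity)
    update_reference_db(ref_db, kept)
    deduplicated.extend(doc for doc, _ in kept)
```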
To sum up, the embodiment of the present disclosure can achieve the following effects:
1. When deduplicating large-scale data, the large-scale data is first split into small segments, and each segment is then deduplicated in turn: the first deduplication of a segment is completed based on a hash algorithm to obtain the second data to be deduplicated, the hash blocks of the second data to be deduplicated are compared with the hash blocks of the data in the preset reference database, and the second deduplication is completed based on the similarity of the comparison results. By deduplicating large-scale data in batches, the embodiments of the disclosure allow the scale of text data deduplication to break through the device memory limitation.
2. The local-first, then-global deduplication scheme reduces the time-consuming comparisons between the data of each batch and the database, which speeds up deduplication and improves execution efficiency while still maintaining a good deduplication effect.
3. And performing duplicate removal calculation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be subjected to duplicate removal. The determination of whether the data is the repeated data is realized, and the local deduplication of the data is completed.
4. And comparing the hash blocks in the second data to be deduplicated with the hash blocks in the preset reference database in sequence, so as to realize the preliminary judgment of the possible repeated data in the second data to be deduplicated and the preset reference database.
5. And performing secondary deduplication calculation according to the similarity of the comparison result, and continuously performing deduplication calculation in the remaining second data segment in the first data to be deduplicated. The determination of whether the data is the repeated data is realized, and the final deduplication of the data is completed.
6. And updating the preset reference database by the second data to be deduplicated after the deduplication processing. The updating of the data in the preset reference database is realized, so that the data in the preset reference database is the latest data when the duplication of the data to be duplicated is eliminated based on the preset reference database, and the duplication elimination quality of the large-scale text data is ensured.
Corresponding to the above method for deduplicating large-scale text data, the present disclosure also provides a device for deduplicating large-scale text data. Since the device embodiment of the present disclosure corresponds to the method embodiment described above, details not disclosed in the device embodiment can be found in the method embodiment and are not repeated here.
Fig. 5 is a schematic structural diagram of a large-scale text data deduplication device provided by the present disclosure.
The present invention further provides a device for removing duplicate of large-scale text data, as shown in fig. 5, including:
a dividing unit 21, configured to divide the first to-be-deduplicated data into at least two data segments, where each data segment includes at least two pieces of data;
the first computing unit 22 is configured to execute a preset hash algorithm on each piece of data in the first data segment to obtain at least one hash partition;
a second calculating unit 23, configured to perform deduplication calculation on data corresponding to at least two hash chunks in the first data segment to obtain second to-be-deduplicated data;
a comparison unit 24, configured to sequentially compare the hash blocks in the second data to be deduplicated with the hash blocks in a preset reference database;
and the third calculating unit 25 is configured to perform secondary deduplication calculation according to the comparison result similarity of the comparing unit, and continue to perform deduplication calculation in the remaining second data segment in the first data to be deduplicated.
The large-scale text data deduplication device divides first data to be deduplicated into at least two data segments, wherein each data segment comprises at least two data; in the first data segment, respectively executing a preset hash algorithm aiming at single data to obtain at least one hash block; performing duplicate removal calculation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be subjected to duplicate removal; comparing the hash blocks in the second data to be deduplicated with hash blocks in a preset reference database in sequence; and performing secondary deduplication calculation according to the similarity of the comparison result, and continuously performing deduplication calculation in the remaining second data segment in the first data to be deduplicated. Compared with the prior art, the method and the device have the advantages that local deduplication is performed on the basis of the Hash algorithm, overall deduplication of the data is achieved through the database, deduplication execution efficiency is improved to a certain extent, and meanwhile a better deduplication effect is achieved. By carrying out batch deduplication operation on large-scale data, the embodiment of the disclosure further realizes text data deduplication scale breaking through the device memory limitation.
Further, in a possible implementation manner of this embodiment, as shown in fig. 6, fig. 6 is a schematic structural diagram of another apparatus for removing duplicate text data in a large scale according to the present disclosure, where the first calculating unit 22 further includes:
a first calculating module 221, configured to perform preset signature calculation on each piece of data in the first data segment, to obtain at least one signature block;
the second calculating module 222 is configured to calculate the at least one signature block in sequence using a preset hash algorithm, so as to obtain at least one hash block.
Further, in a possible implementation manner of this embodiment, as shown in fig. 6, the second calculating unit 23 further includes:
the determining module 231 is configured to determine that two pieces of data are first possibly repeated data when the hash values of their hash blocks at any same position in the first data segment are the same;
a calculating module 232, configured to perform similarity calculation on first possibly repeated data in the first data segment based on a preset similarity algorithm;
a discarding module 233, configured to discard any data of the first potentially repeated data when the similarity between the first potentially repeated data exceeds a first preset similarity threshold;
a retention module 234 to retain the first potentially duplicated data when the similarity between the first potentially duplicated data does not exceed the first preset similarity threshold.
Further, in a possible implementation manner of this embodiment, as shown in fig. 6, the third calculating unit 25 further includes:
a storing module 251, configured to store the second data to be deduplicated into the preset reference database when the preset reference database is empty;
a determining module 252, configured to, when the preset reference database is not empty, determine whether hash values of hash blocks included in first data in the second data to be deduplicated are the same as hash values of hash blocks included in any second data in the preset reference database, determine whether hash block positions of the two pieces of data that have the same hash values are also the same, and determine that the first data and the second data are second data that may be repeated when hash values of the two pieces of data that have the same hash blocks at any same position are the same;
a calculating module 253, configured to calculate a similarity between the first data and the second data based on a preset text similarity algorithm;
a discarding module 254, configured to discard the first data in the second to-be-deduplicated data when the similarity exceeds a second preset similarity threshold;
the retaining module 255 retains the first data in the second data to be deduplicated when the similarity does not exceed the second preset similarity threshold.
Further, in a possible implementation manner of this embodiment, as shown in fig. 6, the apparatus further includes:
an updating unit 26, configured to update the preset reference database based on the second data to be deduplicated after the deduplication processing.
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus of this embodiment, and the principle is the same, and this embodiment is not limited.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 shows a schematic block diagram of an example electronic device 300 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 300 includes a computing unit 301 that can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 302 or a computer program loaded from a storage unit 308 into a RAM (Random Access Memory) 303. In the RAM 303, various programs and data necessary for the operation of the device 300 can also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An I/O (Input/Output) interface 305 is also connected to the bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 301 executes the methods and processes described above, such as the method for deduplicating large-scale text data. For example, in some embodiments, the method for deduplicating large-scale text data may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the aforementioned method for deduplicating large-scale text data in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems on Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server can be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that artificial intelligence is a discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it covers both hardware and software technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. A method for deduplication of large-scale text data, comprising:
dividing first data to be deduplicated into at least two data segments, wherein each data segment comprises at least two data;
in the first data segment, respectively executing a preset hash algorithm aiming at single data to obtain at least one hash block;
performing duplicate removal calculation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be subjected to duplicate removal;
comparing the hash blocks in the second data to be deduplicated with hash blocks in a preset reference database in sequence;
and performing secondary deduplication calculation according to the similarity of the comparison result, and continuously performing deduplication calculation in the remaining second data segment in the first data to be deduplicated.
2. The deduplication method according to claim 1, wherein within the first data segment, performing a preset hash algorithm on the individual data respectively to obtain at least one hash block comprises:
in the first data segment, respectively executing preset signature calculation on single data to obtain at least one signature block;
and calculating the at least one signature block by using a preset hash algorithm in sequence to obtain at least one hash block.
3. The deduplication method of claim 1, wherein performing deduplication calculations on data corresponding to at least two hash chunks within the first data segment comprises:
if the hash values of the hash blocks at any same position between two pieces of data in the first data segment are the same, determining that the two pieces of data are first possibly repeated data;
performing similarity calculation on the first possibly repeated data in the first data segment based on a preset similarity calculation method;
if the similarity between the first possibly repeated data exceeds a first preset similarity threshold, discarding either one of the first possibly repeated data;
if the similarity between the first possibly repeated data does not exceed the first preset similarity threshold, retaining the first possibly repeated data.
4. The method according to claim 1, wherein the performing the second deduplication calculation according to the similarity of the comparison results comprises:
if the preset reference database is empty, storing second data to be deduplicated into the preset reference database;
if the preset reference database is not empty, judging whether hash values of hash blocks contained in first data in the second data to be deduplicated are the same as hash values of hash blocks contained in any second data in the preset reference database, judging whether hash block positions of the two pieces of data with the same hash values are also the same, and if the hash values of the two pieces of data at any same position are the same, determining that the first data and the second data are second data which can be repeated;
calculating the similarity between the first data and the second data based on a preset text similarity algorithm;
if the similarity exceeds a second preset similarity threshold, discarding the first data in the second data to be deduplicated;
and if the similarity does not exceed the second preset similarity threshold, reserving the first data in the second data to be deduplicated.
5. The deduplication method of claim 4, wherein the method further comprises:
and updating the preset reference database based on the second data to be deduplicated after the deduplication processing.
6. A device for removing duplicate text data on a large scale, comprising:
the dividing unit is used for dividing the first data to be deduplicated into at least two data segments, and each data segment comprises at least two data;
the first computing unit is used for executing a preset hash algorithm on the single data in the first data segment to obtain at least one hash block;
the second computing unit is used for performing duplicate removal computation on data corresponding to at least two hash blocks in the first data segment to obtain second data to be subjected to duplicate removal;
the comparison unit is used for sequentially comparing the hash blocks in the second data to be deduplicated with the hash blocks in a preset reference database;
and the third calculating unit is used for performing secondary deduplication calculation according to the comparison result similarity of the comparison unit and continuously executing deduplication calculation in the remaining second data segment in the first data to be deduplicated.
7. The de-duplication device of claim 6 wherein the device further comprises:
and the updating unit is used for updating the preset reference database based on the second data to be deduplicated after the deduplication processing.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
9. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
10. A computer program product, characterized in that it comprises a computer program which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN202210700368.3A 2022-06-20 2022-06-20 Method and device for removing duplicate of large-scale text data, electronic equipment and storage medium Pending CN115293126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210700368.3A CN115293126A (en) 2022-06-20 2022-06-20 Method and device for removing duplicate of large-scale text data, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210700368.3A CN115293126A (en) 2022-06-20 2022-06-20 Method and device for removing duplicate of large-scale text data, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115293126A true CN115293126A (en) 2022-11-04

Family

ID=83820384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210700368.3A Pending CN115293126A (en) 2022-06-20 2022-06-20 Method and device for removing duplicate of large-scale text data, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115293126A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372933A (en) * 2023-12-06 2024-01-09 南京智绘星图信息科技有限公司 Image redundancy removing method and device and electronic equipment
CN117372933B (en) * 2023-12-06 2024-02-20 南京智绘星图信息科技有限公司 Image redundancy removing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination