CN114281989A - Data deduplication method and device based on text similarity, storage medium and server - Google Patents


Info

Publication number
CN114281989A
CN114281989A (application CN202111516164.6A)
Authority
CN
China
Prior art keywords
index, text data, data, secondary index, resident
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111516164.6A
Other languages
Chinese (zh)
Inventor
蒋溢
蔡可成
熊安萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202111516164.6A priority Critical patent/CN114281989A/en
Publication of CN114281989A publication Critical patent/CN114281989A/en
Pending legal-status Critical Current

Abstract

The invention relates to the field of big data processing, and in particular to a data deduplication method and device based on text similarity, a storage medium and a server. The method comprises: preprocessing initial text data and incremental text data separately; extracting a partial sample from the initial text data, classifying the sample with a hierarchical clustering algorithm, and determining a primary index; dividing all initial text data hash values into the classes produced by the hierarchical clustering algorithm with a k-means clustering algorithm, constructing a secondary index, and pointing the address of the secondary index to the primary index; and deduplicating the initial text data by retrieving the secondary index, selecting the top-ranked secondary indexes as resident indexes, and deduplicating the incremental text data by combining the resident index, the secondary index and the primary index. The invention improves the memory hit rate, effectively reduces disk I/O operations, greatly reduces the system resources occupied by deduplication tasks, and effectively reduces the storage cost of enterprise data.

Description

Data deduplication method and device based on text similarity, storage medium and server
Technical Field
The invention relates to the field of big data processing, in particular to a data deduplication method and device based on text similarity, a storage medium and a server.
Background
The success of the Internet of Things and social media presents cloud platform service providers with great challenges: data centers must process and analyze massive data every day to mine its potential economic value. While performing this analysis, they must also meet huge data storage requirements. The distributed storage schemes introduced by cloud computing have become an efficient and reliable way to solve the data storage problem; however, the sharp growth of cloud data drastically increases the cloud load, and the cost of managing and storing data, data center floor space and energy consumption become increasingly serious problems.
Statistically, up to 60% of global data is duplicated, and in backup and archival storage systems duplicate data can reach 80%-90%. Data deduplication is therefore widely used in cloud storage backup and archiving systems. It is a specialized data reduction technique for disk-based backup storage systems that aims to eliminate redundant data in a data set and reduce the investment cloud service providers must make in infrastructure and hardware. However, existing improved deduplication methods, such as the Adaptive Similar Clustering Algorithm (ASCA) and Aggregation Binning, focus on improving the deduplication rate and accuracy while ignoring system efficiency in practical application scenarios. Such compute-intensive deduplication schemes often occupy a large amount of the system's CPU and I/O resources and seriously degrade its read-write performance and other service performance.
Disclosure of Invention
In order to solve the technical problem that compute-intensive data deduplication schemes in actual production often occupy a large amount of a system's CPU (central processing unit) resources and I/O (input/output) resources and seriously affect its read-write performance and other service performance, the invention provides a data deduplication method and device based on text similarity, a storage medium and a server.
In a first aspect of the present invention, the present invention provides a data deduplication method based on text similarity, where the method includes:
preprocessing the initial text data and the incremental text data to respectively obtain digital fingerprints and hash values of the initial text data and the incremental text data;
extracting partial samples from the initial text data, classifying partial initial text data hash values by adopting a hierarchical clustering algorithm, determining K similar set centers, and taking the similar set center hash values and similar set center marks as first-level indexes;
dividing all initial text data hash values into corresponding classifications of K similarity set centers by adopting a K-means clustering algorithm, taking the initial text data hash values, digital fingerprints and storage paths as main keys, taking the main keys and the reference times as secondary indexes, and pointing the addresses of the secondary indexes to the primary indexes;
performing deduplication processing on initial text data with the same primary key by retrieving the secondary index, and increasing the data reference count; counting the total reference count of each secondary index record, sorting the secondary indexes in descending order of reference count, and selecting the top K/4 secondary indexes as resident indexes;
and constructing a bloom filter, filling it with the initial text hash values in the resident indexes, and, with the aid of the bloom filter, performing deduplication processing on the incremental text data by using the digital fingerprint and hash value of the incremental text data in combination with the resident index, the secondary index and the primary index.
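For illustration only, the index records implied by the above steps might be sketched in Python as follows; the field names and types are assumptions inferred from this description, not definitions taken from the patent.

from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class PrimaryIndexEntry:
    center_simhash: int          # hash value of the similarity set center
    similarity_set_id: str       # randomly assigned similarity set mark (ID)

# Primary key of a secondary-index record: (simhash, MD5 fingerprint, storage path).
PrimaryKey = Tuple[int, str, str]

@dataclass
class SecondaryIndexEntry:
    key: PrimaryKey              # hash value + digital fingerprint + storage path
    ref_count: int               # number of references to the stored data
    similarity_set_id: str       # "address" pointing back to the primary index entry

# The secondary index maps each primary key to its record.
SecondaryIndex = Dict[PrimaryKey, SecondaryIndexEntry]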
In a second aspect of the present invention, the present invention further provides a data deduplication device based on text similarity, including:
the data processing unit is used for preprocessing the initial text data and the incremental text data to respectively obtain the digital fingerprints and the hash values of the initial text data and the incremental text data;
the first-level index unit is used for extracting partial samples from the initial text data, classifying partial initial text data hash values by adopting a hierarchical clustering algorithm, determining K similar set centers, and taking the similar set center hash values and similar set center marks as first-level indexes;
the secondary index unit is used for dividing all initial text data hash values into classifications corresponding to K similarity set centers by adopting a K-means clustering algorithm, taking the initial text data hash values, the digital fingerprints and the storage paths as main keys, taking the main keys and the reference times as secondary indexes, and pointing the addresses of the secondary indexes to the primary indexes;
the first duplicate removal unit is used for carrying out duplicate removal processing on initial text data with the same main key by retrieving the secondary index and increasing the number of data references;
the common index unit is used for counting the total number of the reference times of each secondary index data, arranging the secondary indexes according to the sequence of the reference times from large to small, and selecting the first K/4 secondary index as a resident index;
the bloom filter is used for storing the initial text hash value in the resident index and providing retrieval;
and the second deduplication unit is used for performing deduplication processing on the incremental text data by using the digital fingerprint and the hash value of the incremental text data in the bloom filter and combining the resident index, the secondary index and the primary index.
In a third aspect of the present invention, the present invention further provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the data deduplication method based on text similarity according to the first aspect of the present invention.
In a fourth aspect of the invention, the invention also provides a server comprising a processor and a memory; the memory is configured to store one or more computer programs which are loaded by the processor to execute the data deduplication method based on text similarity according to the first aspect of the invention; the processor is configured to run each of the computer programs.
The invention has the following advantages and beneficial effects:
The data deduplication method and device based on text similarity, the storage medium and the server combine hierarchical clustering with K-means clustering on the initial mass of data files, making full use of the advantages of both clustering methods to adaptively create similarity sets and primary keys, balancing deduplication efficiency against deduplication rate, and reasonably constructing a multilevel index that provides the basis for subsequent incremental deduplication tasks. During deduplication of incremental files, an adaptive memory replacement algorithm improves the memory hit rate, effectively reduces disk I/O operations, greatly reduces the system resources occupied by deduplication tasks, and effectively reduces the storage cost of enterprise data.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a diagram of a system architecture for deduplication of mass data in an embodiment of the present invention;
FIG. 2 is a flowchart of a data deduplication method based on text similarity according to an embodiment of the present invention;
FIG. 3 is a diagram of a multi-level index structure in an embodiment of the present invention;
FIG. 4 is a flow chart of another data deduplication method in an embodiment of the present invention;
FIG. 5 is a flow chart of initial text data deduplication in an embodiment of the present invention;
FIG. 6 is a flowchart illustrating incremental text data deduplication, in an embodiment of the present invention;
FIG. 7 is a block diagram of a data deduplication apparatus based on text similarity according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a system architecture diagram for deduplicating mass data in an embodiment of the present invention. As shown in fig. 1, in the cloud platform only a system administrator has permission to perform a deduplication operation on disk data. The cloud platform comprises a plurality of subsystems; the present invention focuses on the deduplication system, so the other application subsystems are not its concern. Resources requested by users are all stored on background disks through a file system: on the one hand, files stored on the disks are input into the deduplication system, and on the other hand, the deduplication system deduplicates the files and writes them back to disk, thereby completing deduplication of the disk files. Because the cloud platform contains many service systems, in order to reduce the impact on service-system servers the administrator selects a suitable time node to perform the deduplication operation on the data persisted in the cloud platform's background.
Fig. 2 is a flowchart of a data deduplication method based on text similarity in an embodiment of the present invention, and as shown in fig. 2, the method includes:
101. preprocessing the initial text data and the incremental text data to respectively obtain digital fingerprints and hash values of the initial text data and the incremental text data;
In the embodiment of the present invention, the initial text data and the incremental text data need to be preprocessed. The initial text data refers mainly to backup data that has accumulated over a long period without deduplication, while the incremental text data refers mainly to the backup data newly added each day. The initial text data is therefore massive, whereas the incremental text data may be only a fraction of its size and has a certain correlation with it.
In the embodiment of the invention, the preprocessing mainly consists of hashing each text with the locality-sensitive hashing algorithm SimHash. SimHash has a limitation: with a very small probability, different texts produce the same hash value. It is therefore also necessary to generate a unique digital fingerprint for each text with the message digest algorithm MD5; this fingerprint serves as a data identifier for distinguishing different texts.
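As a rough illustration of this preprocessing step, the Python sketch below computes a 64-bit SimHash and an MD5 fingerprint for each text; the whitespace tokenizer and the 64-bit width are assumptions, since the patent only specifies that SimHash values and MD5 digital fingerprints are produced.

import hashlib

def md5_fingerprint(text: str) -> str:
    # Unique digital fingerprint used to tell apart texts whose SimHash collides.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def simhash64(text: str) -> int:
    # Locality-sensitive 64-bit SimHash over whitespace-separated tokens.
    weights = [0] * 64
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def preprocess(text: str) -> tuple[int, str]:
    # Returns (hash value, digital fingerprint) for one text.
    return simhash64(text), md5_fingerprint(text)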
102. Extracting partial samples from the initial text data, classifying partial initial text data hash values by adopting a hierarchical clustering algorithm, determining K similar set centers, and taking the similar set center hash values and similar set center marks as first-level indexes;
In the embodiment of the invention, since the initial text data is massive, directly dividing all of it into K clusters would consume a large amount of computing resources. To reduce this waste, the invention first extracts a portion of sample data from all the initial text data; the extraction ratio can be determined by actual conditions. After extraction, a hierarchical clustering algorithm classifies the extracted initial text data hash values into K clusters, and a central hash value is selected from each cluster to serve as the cluster center, i.e. the similarity set center. In the final stage of hierarchical clustering, a unique ID is randomly assigned to each similarity set center as its similarity set class mark; the class mark and the similarity set center simhash value form the two attribute fields of the initial first-level index structure, as shown in fig. 3.
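A minimal sketch of this first-level index construction under stated assumptions: the sampled SimHash values are clustered agglomeratively by Hamming distance with SciPy, and each cluster's medoid is taken as the similarity set center (average linkage and medoid selection are assumptions; the patent only requires K classes and a center hash value per class).

import uuid
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def bits(h: int, width: int = 64) -> np.ndarray:
    # Unpack a SimHash value into a bit vector so Hamming distance can be used.
    return np.array([(h >> i) & 1 for i in range(width)], dtype=bool)

def build_primary_index(sample_hashes: list[int], K: int) -> list[tuple[int, str]]:
    X = np.array([bits(h) for h in sample_hashes])
    D = pdist(X, metric="hamming")                     # pairwise Hamming distances
    labels = fcluster(linkage(D, method="average"), t=K, criterion="maxclust")
    square = squareform(D)
    primary_index = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        medoid = members[square[np.ix_(members, members)].sum(axis=1).argmin()]
        # (similarity set center hash value, randomly assigned similarity set ID)
        primary_index.append((sample_hashes[medoid], uuid.uuid4().hex))
    return primary_index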
In some embodiments of the present invention, the primary index also needs to be initialized, where the initialization can remove irrelevant information from the primary index.
103. Dividing all initial text data hash values into corresponding classifications of K similarity set centers by adopting a K-means clustering algorithm, taking the initial text data hash values, digital fingerprints and storage paths as main keys, taking the main keys and the reference times as secondary indexes, and pointing the addresses of the secondary indexes to the primary indexes;
in the embodiment of the invention, the similar set center, the K value and all initial text data hash values are used as the input of a K-means clustering algorithm, and all initial text data hash values are respectively divided into the corresponding K similar set centers.
Through these two rounds of clustering, each initial text data item is assigned to a corresponding similarity set center, so every initial text data hash value belongs to exactly one similarity set center. The cluster containing a given hash value can therefore be located through the primary index formed by the similarity set centers. To retrieve a specific piece of initial text data, the classification result of the k-means clustering algorithm is used: the classified initial text data hash value, digital fingerprint and storage path form a unique primary key, and the primary key together with the reference count forms the attribute fields of the secondary index, as shown in fig. 3. At the same time, the address of the secondary index points to the primary index, so the corresponding text data can be found directly through the secondary index; when the secondary index of a text is not known, the corresponding secondary index can be found via the primary index, and the retrieval is then completed with the secondary index.
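The secondary-index construction might look like the sketch below, where a single nearest-center assignment by Hamming distance stands in for the k-means step and plain dicts are used instead of the dataclasses sketched earlier; the field names remain assumptions.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def nearest_center(simhash: int, primary_index: list[tuple[int, str]]) -> str:
    # Locate the similarity set whose center is closest in Hamming distance.
    center_hash, set_id = min(primary_index, key=lambda e: hamming(simhash, e[0]))
    return set_id

def add_secondary_record(secondary_index: dict, simhash: int, md5: str,
                         path: str, primary_index: list[tuple[int, str]]) -> None:
    key = (simhash, md5, path)                         # unique primary key
    secondary_index[key] = {
        "ref_count": 1,                                # data reference count
        "similarity_set_id": nearest_center(simhash, primary_index),  # link to primary index
    }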
In some embodiments of the present invention, the secondary index also needs to be initialized, and the massive initial text data is deduplicated while the secondary index is initialized. In this process, whether duplicate data exists is determined by retrieving the primary keys of the existing secondary index: if a duplicate primary key exists, only the data-reference-count attribute of the secondary index record corresponding to that primary key is updated, the duplicate data is not written to disk, and the information of the unique copy is returned as the metadata; if no duplicate primary key exists, the data is written to disk and a new secondary index record is added.
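The initialization-time deduplication branch described above can be sketched as follows, reusing add_secondary_record from the previous sketch; store_to_disk is a hypothetical stand-in for the backend disk write.

def store_to_disk(path: str, data: bytes) -> None:
    with open(path, "wb") as f:                        # stand-in for the backend store
        f.write(data)

def dedup_initial(secondary_index: dict, simhash: int, md5: str, path: str,
                  data: bytes, primary_index: list[tuple[int, str]]) -> dict:
    key = (simhash, md5, path)
    if key in secondary_index:                         # duplicate primary key: no disk write
        secondary_index[key]["ref_count"] += 1
        return {"path": path, "duplicate": True}       # metadata of the unique copy
    store_to_disk(path, data)                          # unique data: write to disk
    add_secondary_record(secondary_index, simhash, md5, path, primary_index)
    return {"path": path, "duplicate": False}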
104. Performing deduplication processing on initial text data with the same primary key by retrieving the secondary index, and increasing the data reference count; counting the total reference count of each secondary index record, sorting the secondary indexes in descending order of reference count, and selecting the top K/4 secondary indexes as resident indexes;
In the embodiment of the present invention, it is further necessary to count the total reference count of each secondary index record, sort the secondary indexes by that count and, according to the actual production environment of the system, place the top K/4 secondary indexes in memory as resident indexes; of course, a different number of secondary indexes may also be selected as resident indexes and placed in memory.
In the preferred embodiment of the present invention, a resident index table is maintained for all resident indexes, as shown in fig. 3. The table records, for each resident index, the similarity set center corresponding to the secondary index, the similarity set center of the secondary index most closely associated with it, and the replacement factor; the columns for the associated similarity set center and the replacement factor are initially marked as null. In the subsequent process the replacement factor provides the reference basis for the memory replacement policy: according to the recalculated replacement factors, a resident index whose replacement factor is smaller than that of a secondary index is swapped out and replaced by that secondary index, and the resident index table is updated.
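A sketch of the resident-index selection and of the resident index table with its initially null columns; the concrete structure is an assumption inferred from this description and fig. 3.

def select_resident_index(secondary_index: dict, K: int) -> tuple[dict, dict]:
    ranked = sorted(secondary_index.items(),
                    key=lambda kv: kv[1]["ref_count"], reverse=True)
    resident = dict(ranked[: max(1, K // 4)])          # top K/4 records kept in memory
    resident_table = {
        key: {
            "similarity_set_id": rec["similarity_set_id"],
            "associated_set_id": None,                 # filled in during incremental dedup
            "swap_factor": None,                       # replacement factor, initially null
        }
        for key, rec in resident.items()
    }
    return resident, resident_table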
105. Constructing a bloom filter, filling it with the initial text hash values in the resident indexes, and, with the aid of the bloom filter, performing deduplication processing on the incremental text data by using the digital fingerprint and hash value of the incremental text data in combination with the resident index, the secondary index and the primary index.
In the embodiment of the invention, the bloom filter is constructed with Redis, its size is adjusted automatically according to the actual production environment of the system, and it is filled with the initial text data hash value attribute of the resident (frequently used) index.
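The patent builds the bloom filter with Redis and sizes it to the production environment; as a self-contained stand-in, the pure-Python bloom filter below is filled with the SimHash values of the resident index (bit-array size and hash count are arbitrary assumptions).

import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 1 << 20, k_hashes: int = 7):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: int):
        for i in range(self.k):                        # k derived hash positions
            d = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: int) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: int) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def fill_bloom(resident: dict) -> BloomFilter:
    # Fill the filter with the simhash attribute of every resident-index record.
    bf = BloomFilter()
    for (simhash, _md5, _path) in resident:
        bf.add(simhash)
    return bf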
For the deduplication process of the incremental text data, the following is performed:
if the hash value of the incremental text data exists in the bloom filter, performing deduplication processing on the incremental text data with the same main key by retrieving the resident index, and updating the number of references in the resident index; storing incremental text data without the same primary key into a disk, and adding a new secondary index record;
if the hash value of the incremental text data does not exist in the bloom filter, calculating the Hamming distance between the hash value and the center of the similarity set in the primary index, finding out the similarity set where the incremental text data is located through the minimum Hamming distance, performing deduplication processing on the incremental text data with the same main key by retrieving the corresponding secondary index according to the address of the secondary index stored in the primary index, and updating the reference times in the secondary index; and storing the incremental text data without the same primary key into a magnetic disk, and adding a secondary index record.
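The two branches above can be combined into one sketch, reusing the helpers from the earlier sketches (hamming, nearest_center, add_secondary_record, store_to_disk); the return values are illustrative assumptions, and the replacement-factor update of the resident index table is omitted here.

def dedup_incremental(simhash: int, md5: str, path: str, data: bytes,
                      bloom, resident: dict, secondary_index: dict,
                      primary_index: list) -> dict:
    key = (simhash, md5, path)
    if simhash in bloom and key in resident:           # hit the in-memory resident index
        resident[key]["ref_count"] += 1
        return {"path": path, "duplicate": True}
    # Miss (or bloom false positive): in the patent the similarity set is located via
    # the primary index by minimum Hamming distance; with this flat dict the lookup
    # is direct, and add_secondary_record recomputes the nearest center when needed.
    if key in secondary_index:
        secondary_index[key]["ref_count"] += 1         # duplicate found via secondary index
        return {"path": path, "duplicate": True}
    store_to_disk(path, data)                          # unique data: write and index it
    add_secondary_record(secondary_index, simhash, md5, path, primary_index)
    return {"path": path, "duplicate": False}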
Because deduplication of the incremental text data involves replacement of the resident (frequently used) indexes, the replacement factors of the secondary indexes involved in the incremental text data must be calculated: according to the recalculated replacement factors, a resident index whose replacement factor is smaller than that of a secondary index is swapped out in favor of that secondary index, and the resident index table is updated.
The replacement factor is calculated as:
Swap(X) = F(X) + Find(Swap(Y)·R(Y,X))
[the closed-form expression for the memory retention factor F(X) appears only as an equation image in the original filing]
wherein Swap(X) is the replacement factor of the secondary index X; F(X) is the memory retention factor of the secondary index X under the current task, and F(Y) is the memory retention factor of the resident index Y; R(Y,X) is the correlation coefficient of the resident index Y and the secondary index X; Find denotes a traversal of the resident index table; α is a first weight factor that keeps the memory retention factor stable and reasonable and avoids excessively large gaps; p1 is the participation probability of the secondary index X, i.e. the proportion of all data files under the current task in which X participates; DXi is the number of times the secondary index X is used by file i, Dni is the total number of data blocks of file i, and n is the number of deduplicated files in which the secondary index X participates under the current task; f(X) is the memory retention factor of index X calculated in the last task.
The correlation coefficient is calculated as:
R(X,Y) = Ω·max(r(X,Y1), r(X,Y2), …, r(X,YJ)) ± (1−Ω)·R'(X,Y')
wherein max selects the largest of the association factors between the secondary index X and the other indexes; R'(X,Y') is the association factor between the secondary index X and the secondary index Y' most closely associated with it in the last task, the sign being positive when Y equals Y' and negative otherwise; Ω is a second weight factor that mitigates subsequent memory jitter; r(X,Yj) is the association factor between the secondary index X and another secondary index Yj, expressed as:
[the expression for r(X,Yj) appears only as an equation image in the original filing]
wherein j = 1, 2, …, J, and J is the number of secondary indexes associated with the secondary index X; p2 is the conditional probability that, given that the secondary index X participates in a file under the current task, the index Yj also participates; M is the total number of files under the current task; count(X,Yj) is a function that counts how many times the secondary index X record and the Yj record occur together.
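The replacement-factor bookkeeping might be sketched as follows. The closed forms of F(X) and r(X,Yj) appear only as equation images in the filing, so the bodies of memory_retention_factor and association_factor are assumptions built from the surrounding variable definitions; only swap_factor and correlation_coefficient follow the formulas quoted in the text, and reading Find as a maximum over the resident index table is likewise an assumption.

def memory_retention_factor(alpha: float, p1: float,
                            dx: list[int], dn: list[int], prev_f: float) -> float:
    # Assumed form of F(X): per-file usage ratios DXi/Dni averaged over the n files,
    # weighted by the participation probability p1 and blended with last task's f(X).
    usage = sum(dxi / dni for dxi, dni in zip(dx, dn)) / max(1, len(dx))
    return alpha * p1 * usage + (1 - alpha) * prev_f

def association_factor(count_xy: int, m_files: int) -> float:
    # Assumed form of r(X, Yj): co-occurrence count normalised by the M files of the
    # current task, standing in for the conditional probability p2.
    return count_xy / max(1, m_files)

def correlation_coefficient(omega: float, r_list: list[float],
                            r_prev: float, same_partner: bool) -> float:
    # R(X,Y) = Ω·max(r(X,Y1), …, r(X,YJ)) ± (1−Ω)·R'(X,Y'); the sign is positive only
    # if Y is still the most closely associated index Y' of the last task.
    sign = 1 if same_partner else -1
    return omega * max(r_list) + sign * (1 - omega) * r_prev

def swap_factor(f_x: float, resident_swaps: list[float],
                correlations: list[float]) -> float:
    # Swap(X) = F(X) + Find(Swap(Y)·R(Y,X)); Find traverses the resident index table,
    # taken here as the maximum of the accumulated terms (an assumption).
    return f_x + max((s * r for s, r in zip(resident_swaps, correlations)), default=0.0)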
Fig. 4 is a flowchart of a data deduplication method according to an embodiment. As shown in fig. 4, in the embodiment of the present invention a system administrator first selects a suitable deduplication time node; if no time node meets the set condition, the current deduplication task is abandoned. If a time node meeting the set condition exists, either the initial-data deduplication operation or the incremental-data deduplication operation is selected according to whether processed index records already exist in memory.
Fig. 5 is a flow chart of the initial text data deduplication process in the embodiment of the present invention, and as shown in fig. 5, the process includes:
If the initial-data deduplication operation is selected, all initial text data hash values must first be input and all initial text data preprocessed. Equidistant random sampling is performed on all the text data, and the hash values of the extracted samples undergo a simple initial classification by an agglomerative hierarchical clustering algorithm to obtain K major classes; a central hash value is selected from each class as the cluster center, i.e. the similarity set center. In the final stage of hierarchical clustering, a unique ID is randomly allocated to each similarity set center as its similarity set class mark, and the class mark and the similarity set center simhash value form the two attribute fields of the initial first-level index structure. The similarity set centers, the value K and all initial text data hash values are then taken as the input of a K-means clustering algorithm for the second clustering pass, during which the secondary index creation task and the initial text data deduplication task are completed simultaneously: the hash value, md5 value and storage path of each initial text form its unique primary key, which together with the data reference count forms the attribute fields of the secondary index. Finally, the total reference count of each secondary index record is counted, the secondary indexes are sorted by that count, and, according to the actual production environment of the system, the top K/4 secondary indexes are placed in memory as resident indexes.
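Tying the earlier sketches together, an initial-pass driver could look like the following; the sampling strategy (random.sample rather than equidistant random sampling) and the in-memory file layout are simplifying assumptions.

import random

def initial_dedup(texts: dict[str, str], K: int, sample_ratio: float = 0.1):
    # Preprocess every initial text into (simhash, md5 fingerprint).
    items = {path: preprocess(t) for path, t in texts.items()}
    hashes = [h for h, _ in items.values()]
    # Draw a sample of hash values for the hierarchical clustering stage.
    k = min(len(hashes), max(K, int(len(hashes) * sample_ratio)))
    primary_index = build_primary_index(random.sample(hashes, k), K)
    # Classify, deduplicate and index every initial text.
    secondary_index: dict = {}
    for path, (simhash, md5) in items.items():
        dedup_initial(secondary_index, simhash, md5, path,
                      texts[path].encode("utf-8"), primary_index)
    # Rank by reference count, keep the top K/4 in memory and fill the bloom filter.
    resident, resident_table = select_resident_index(secondary_index, K)
    return primary_index, secondary_index, resident, resident_table, fill_bloom(resident)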
Fig. 6 is a flowchart illustrating incremental text data deduplication in an embodiment of the present invention, where as shown in fig. 6, the process includes:
if the incremental text data deduplication operation is selected, preprocessing the incremental text data under the current task; calculating a simhash value and an md5 value of the incremental text data, and searching whether the simhash value of the incremental text data exists in a bloom filter;
If the simhash value of the incremental text data does not exist in the bloom filter, the Hamming distances between that simhash value and all the similarity set centers in the primary index are calculated, the cluster containing the incremental text data is found via the minimum Hamming distance, and a new secondary index record is added. During this process, if a large amount of data is absent from the resident (frequently used) index, the resident index must be replaced: the replacement factors of the secondary indexes related to the incremental text data are recalculated, and the resident index and the resident index table are updated accordingly.
If the simhash value of the incremental text data does exist in the bloom filter, a further check against the resident index is still required, because of an inherent limitation of bloom filters: a value reported as present may not actually exist (false positives are possible).
If the simhash value of the incremental text data also exists in the resident index, updating a secondary index record related to the incremental text data, simultaneously not storing the incremental text data, and returning corresponding metadata information; finally, the resident index and the resident index table are updated by recalculating the replacement factor of the secondary index related to the incremental text data;
if the simhash value of the incremental text data does not exist in the resident index, calculating the Hamming distance between the simhash value of the incremental text data and the centers of all similar sets in the primary index, finding the cluster where the incremental text data is located by using the minimum Hamming distance, and inquiring the corresponding secondary index. If the simhash value of the incremental text data does not exist in the secondary index, adding a new record for the secondary index; the data corresponding to the simhash value of the incremental text data is non-repeated data, needs to be stored in a disk, and returns corresponding metadata information; if the simhash value of the incremental text data exists in the secondary index, updating the reference frequency field of the de-duplicated data recorded corresponding to the secondary index, and because the data corresponding to the simhash value of the incremental text data is duplicated data, the incremental text data is not stored, and corresponding metadata information is returned; finally, the resident index and the resident index table are updated by recalculating the replacement factor of the secondary index related to the incremental text data.
In this embodiment, it is necessary to further determine whether the hash value of the incremental text data exists in the resident index, and perform the next determination according to whether the hash value exists in the resident index, so as to avoid the defect of the bloom filter and ensure the accuracy and validity of the retrieval process.
Fig. 7 is a structural diagram of a data deduplication device based on text similarity according to an embodiment of the present invention. As shown in fig. 7, the data deduplication device 200 includes:
a data processing unit 201, configured to pre-process the initial text data and the incremental text data, and obtain a digital fingerprint and a hash value of the initial text data and the incremental text data, respectively;
a first-level index unit 202, configured to extract a part of samples from the initial text data, classify hash values of the part of initial text data by using a hierarchical clustering algorithm, determine K similar set centers, and use the hash values of the similar set centers and similar set center marks as first-level indexes;
the secondary index unit 203 is used for dividing all the initial text data hash values into classifications corresponding to K similarity set centers by adopting a K-means clustering algorithm, taking the initial text data hash values, the digital fingerprints and the storage paths as main keys, taking the main keys and the reference times as secondary indexes, and pointing the addresses of the secondary indexes to the primary indexes;
a first deduplication unit 204, configured to perform deduplication processing on initial text data having the same primary key by retrieving a secondary index, and increase the number of data references;
the common index unit 205 is configured to count the total number of reference times of each piece of secondary index data, arrange the secondary indexes in an order from a large reference time to a small reference time, and select the first K/4 secondary index as a resident index;
a bloom filter 206 for storing the initial text hash value in the resident index and providing a search;
and a second deduplication unit 207, configured to perform deduplication processing on the incremental text data in the bloom filter by using the digital fingerprint and the hash value of the incremental text data, in combination with the resident index, the secondary index, and the primary index.
Embodiments of the present invention also provide a server, which may have relatively large differences due to different configurations or performances, and may include one or more Central Processing Units (CPUs) (e.g., one or more processors) and a memory, and one or more storage media (e.g., one or more mass storage devices) for storing applications or data. The memory and storage medium may be, among other things, transient or persistent storage. The program stored on the storage medium may include one or more modules, each of which may include a series of instruction operations for the server. Still further, the central processor may be configured to communicate with the storage medium and execute a series of instruction operations in the storage medium on the server.
Specifically, the application program stored in the storage medium includes an application program for data deduplication, and the program may include the data processing unit 201, the primary index unit 202, the secondary index unit 203, the first deduplication unit 204, the common index unit 205, the bloom filter 206, and the second deduplication unit 207 in the data deduplication device 200, which is not described herein again. Further, the central processor 20 may be configured to communicate with the storage medium, and perform a series of operations corresponding to the application program for deduplication of data stored in the storage medium on the server.
The server may also include one or more power supplies, one or more wired or wireless network interfaces, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of a server, may implement:
preprocessing the initial text data and the incremental text data to respectively obtain digital fingerprints and hash values of the initial text data and the incremental text data;
extracting partial samples from the initial text data, classifying partial initial text data hash values by adopting a hierarchical clustering algorithm, determining K similar set centers, and taking the similar set center hash values and similar set center marks as first-level indexes;
dividing all initial text data hash values into corresponding classifications of K similarity set centers by adopting a K-means clustering algorithm, taking the initial text data hash values, digital fingerprints and storage paths as main keys, taking the main keys and the reference times as secondary indexes, and pointing the addresses of the secondary indexes to the primary indexes;
performing deduplication processing on initial text data with the same main key by retrieving a secondary index, and increasing the number of data references; counting the total number of the reference times of each secondary index data, arranging the secondary indexes according to the sequence of the reference times from large to small, and selecting the first K/4 secondary index as a resident index;
and constructing a bloom filter, filling the bloom filter with an initial text hash value in the resident index, and in the bloom filter, performing deduplication processing on the incremental text data by using the digital fingerprint and the hash value of the incremental text data and combining the resident index, the secondary index and the primary index.
In the description of the present invention, it is to be understood that the terms "coaxial", "bottom", "one end", "top", "middle", "other end", "upper", "one side", "top", "inner", "outer", "front", "center", "both ends", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "disposed," "connected," "fixed," "rotated," and the like are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; the terms may be directly connected or indirectly connected through an intermediate, and may be communication between two elements or interaction relationship between two elements, unless otherwise specifically limited, and the specific meaning of the terms in the present invention will be understood by those skilled in the art according to specific situations.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A data deduplication method based on text similarity is characterized by comprising the following steps:
preprocessing the initial text data and the incremental text data to respectively obtain digital fingerprints and hash values of the initial text data and the incremental text data;
extracting partial samples from the initial text data, classifying partial initial text data hash values by adopting a hierarchical clustering algorithm, determining K similar set centers, and taking the similar set center hash values and similar set center marks as first-level indexes;
dividing all initial text data hash values into corresponding classifications of K similarity set centers by adopting a K-means clustering algorithm, taking the initial text data hash values, digital fingerprints and storage paths as main keys, taking the main keys and the reference times as secondary indexes, and pointing the addresses of the secondary indexes to the primary indexes;
performing deduplication processing on initial text data with the same main key by retrieving a secondary index, and increasing the number of data references; counting the total number of the reference times of each secondary index data, arranging the secondary indexes according to the sequence of the reference times from large to small, and selecting the first K/4 secondary index as a resident index;
and constructing a bloom filter, filling the bloom filter with an initial text hash value in the resident index, and in the bloom filter, performing deduplication processing on the incremental text data by using the digital fingerprint and the hash value of the incremental text data and combining the resident index, the secondary index and the primary index.
2. The data deduplication method based on text similarity according to claim 1, wherein obtaining the digital fingerprints and hash values of the initial text data and the incremental text data comprises calculating the digital fingerprint of each of the initial text data and the incremental text data by using the message digest algorithm MD5; and calculating the hash value by using the locality-sensitive hashing algorithm SimHash.
3. The data deduplication method based on text similarity according to claim 1, wherein the selecting of the top K/4 secondary index as the resident index further includes constructing a resident index table from all the resident indexes, where the resident index table is used to record a similarity set center corresponding to the secondary index, that is, a similarity set center corresponding to a similarity set center attribute in the primary index and a similarity set center corresponding to the secondary index associated therewith, and a substitution factor, and marking the similarity set center corresponding to the associated index and an attribute column where the substitution factor is located as a null value.
4. The data deduplication method based on text similarity according to claim 1, wherein the deduplication processing on the incremental text data by combining the resident index, the secondary index and the primary index includes:
if the hash value of the incremental text data exists in the bloom filter, performing deduplication processing on the incremental text data with the same main key by retrieving the resident index, and updating the number of references in the resident index; storing incremental text data without the same primary key into a disk, and adding a new secondary index record; if the hash value of the incremental text data does not exist in the bloom filter, calculating the Hamming distance between the hash value and the center of the similarity set in the primary index, finding out the similarity set where the incremental text data is located through the minimum Hamming distance, performing deduplication processing on the incremental text data with the same main key by retrieving the corresponding secondary index according to the address of the secondary index stored in the primary index, and updating the reference times in the secondary index; and storing the incremental text data without the same primary key into a magnetic disk, and adding a secondary index record.
5. The method as claimed in claim 4, further comprising calculating a replacement factor of the secondary index related to the incremental text data after the incremental text data is deduplicated, replacing the resident index with the replacement factor smaller than the secondary index according to the recalculated replacement factor, and updating the resident index table by replacing the secondary index with the replacement factor larger than the resident index.
6. The data deduplication method based on text similarity according to claim 5, wherein the calculation formula of the replacement factor is represented as:
Swap(X)=F(X)+Find(Swap(Y)R(Y,X)),
[the closed-form expression for the memory retention factor F(X) appears only as an equation image in the original filing]
wherein Swap(X) is the replacement factor of the secondary index X; F(X) is the memory retention factor of the secondary index X under the current task, and F(Y) is the memory retention factor of the resident index Y; R(Y,X) is the correlation coefficient of the resident index Y and the secondary index X; Find denotes a traversal of the resident index table, α is a first weight factor, and p1 is the participation probability of the secondary index X; DXi is the number of times the secondary index X is used by file i, Dni is the total number of data blocks of file i, and n is the number of deduplicated files in which the secondary index X participates under the current task; f(X) is the memory retention factor of the secondary index X calculated in the last task.
7. The data deduplication method based on text similarity according to claim 6, wherein the calculation formula of the correlation coefficient is represented as:
R(X,Y) = Ω·max(r(X,Y1), r(X,Y2), …, r(X,YJ)) ± (1−Ω)·R'(X,Y')
wherein, R '(X, Y') is an association factor of a secondary index Y 'most closely associated with the secondary index X in the last task, and when Y is equal to Y', it is positive, otherwise it is negative, and Ω represents a second weighting factor; r (X, Yj) represents the association factor of the secondary index X and other secondary indexes Yj, and the expression is as follows:
[the expression for r(X,Yj) appears only as an equation image in the original filing]
wherein j = 1, 2, …, J, and J is the number of secondary indexes associated with the secondary index X; p2 is the conditional probability that, given that the secondary index X participates in a file under the current task, the secondary index Yj also participates; M is the total number of files under the current task; count(X,Yj) is a function that counts how many times the secondary index X record and the secondary index Yj record occur together.
8. A data deduplication device based on text similarity, comprising:
the data processing unit is used for preprocessing the initial text data and the incremental text data to respectively obtain the digital fingerprints and the hash values of the initial text data and the incremental text data;
the first-level index unit is used for extracting partial samples from the initial text data, classifying partial initial text data hash values by adopting a hierarchical clustering algorithm, determining K similar set centers, and taking the similar set center hash values and similar set center marks as first-level indexes;
the secondary index unit is used for dividing all initial text data hash values into classifications corresponding to K similarity set centers by adopting a K-means clustering algorithm, taking the initial text data hash values, the digital fingerprints and the storage paths as main keys, taking the main keys and the reference times as secondary indexes, and pointing the addresses of the secondary indexes to the primary indexes;
the first duplicate removal unit is used for carrying out duplicate removal processing on initial text data with the same main key by retrieving the secondary index and increasing the number of data references;
the common index unit is used for counting the total number of the reference times of each secondary index data, arranging the secondary indexes according to the sequence of the reference times from large to small, and selecting the first K/4 secondary index as a resident index;
the bloom filter is used for storing the initial text hash value in the resident index and providing retrieval;
and the second deduplication unit is used for performing deduplication processing on the incremental text data by using the digital fingerprint and the hash value of the incremental text data in the bloom filter and combining the resident index, the secondary index and the primary index.
9. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the data deduplication method based on text similarity according to any one of claims 1 to 7.
10. A server, comprising a processor and a memory; the memory is used for storing a plurality of computer programs, and the computer programs are used for being loaded by the processor and executing the data deduplication method based on the text similarity according to any one of claims 1 to 7; the processor is configured to implement each of the plurality of computer programs.
CN202111516164.6A 2021-12-06 2021-12-06 Data deduplication method and device based on text similarity, storage medium and server Pending CN114281989A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111516164.6A CN114281989A (en) 2021-12-06 2021-12-06 Data deduplication method and device based on text similarity, storage medium and server

Publications (1)

Publication Number Publication Date
CN114281989A true CN114281989A (en) 2022-04-05

Family

ID=80872258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111516164.6A Pending CN114281989A (en) 2021-12-06 2021-12-06 Data deduplication method and device based on text similarity, storage medium and server

Country Status (1)

Country Link
CN (1) CN114281989A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383334A (en) * 2023-06-05 2023-07-04 长沙丹渥智能科技有限公司 Method, device, computer equipment and medium for removing duplicate report
CN116383334B (en) * 2023-06-05 2023-08-08 长沙丹渥智能科技有限公司 Method, device, computer equipment and medium for removing duplicate report


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination