CN115292248B - Data cleaning method, system and equipment based on multiple data versions

Info

Publication number: CN115292248B
Application number: CN202211204960.0A
Authority: CN
Inventors: 王敏; 张雷; 李本学
Original assignee: Zhongfu Safety Technology Co Ltd
Current assignee: Zhongfu Safety Technology Co Ltd
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2023-01-03
Anticipated expiration: 2042-09-30
Also published as: CN115292248A

Abstract

The application discloses a data cleaning method, system and device based on multiple data versions. It mainly relates to the technical field of data cleaning and addresses the low efficiency with which existing approaches acquire the valid version metadata and the total number of data blocks. The method comprises the following steps: when version metadata is generated, adding the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file; acquiring an effective version time threshold and determining the index information at which the effective version time threshold is located; determining the valid version metadata; determining the valid file data blocks; creating, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure; and synchronously traversing the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid. The method improves the efficiency of acquiring the valid version metadata and the total number of data blocks.

Description

Data cleaning method, system and equipment based on multiple data versions

Technical Field

The present application relates to the field of data cleaning technologies, and in particular, to a method, a system, and an apparatus for cleaning data based on multiple data versions.

Background

Data cleansing refers to the process of correcting and deleting inaccurate data records from a database or data table. Data cleansing includes identifying and replacing incomplete, inaccurate, irrelevant, or problematic data and records.

At present, data is mainly cleaned as follows. The data in a storage server is divided into three parts: version metadata, file metadata and file block data. The relationship among the three types of data is: (I) file block data is obtained by splitting file data according to a preset size, and each block carries a block data identifier; (II) file metadata stores the list of block data identifiers corresponding to the file data; and (III) version metadata stores a group of file metadata update information. A cleaning device scans the whole disk space to obtain the total number of data blocks and then constructs a Bloom filter; the invalid version metadata, that is, the version metadata below the valid version number threshold, is then precisely cleaned by means of the Bloom filter.
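
The relationship among the three kinds of data can be illustrated with a minimal sketch. All class and field names below are hypothetical and chosen only for illustration; in particular, modelling the "file metadata update information" held by a version as a plain list of file metadata records is an assumption, not something the text specifies.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class FileMetadata:
    # Hypothetical model: one file and the ordered identifiers of its blocks.
    file_id: str
    block_ids: List[str]


@dataclass
class VersionMetadata:
    # Hypothetical model: one version and the file metadata it records.
    version_id: int
    update_time: int
    files: List[FileMetadata] = field(default_factory=list)


def referenced_block_ids(versions: Iterable[VersionMetadata]) -> Set[str]:
    """Collect every block identifier reachable from the given versions."""
    seen: Set[str] = set()
    for version in versions:
        for file_meta in version.files:
            seen.update(file_meta.block_ids)
    return seen
```

Under this model, the blocks that must be kept are those reachable from the valid versions; any block on disk that no valid version references is a candidate for cleaning.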

However, when an existing cleaning device cleans data, it must scan the entire disk, parse the metadata of every version, and compare it with the valid version number threshold in order to obtain the valid version metadata, which is inefficient. In addition, before the Bloom filter can be constructed, the cleaning device must scan the data of the whole disk to obtain the total number of data blocks, which is also inefficient.

Disclosure of Invention

In view of the above-mentioned deficiencies of the prior art, the present invention provides a method, a system and a device for data cleaning based on multiple data versions, so as to solve the above-mentioned technical problems.

In a first aspect, the present application provides a data cleaning method based on multiple data versions, including: when version metadata is generated, adding the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file, wherein the preset index file is divided into a plurality of preset batch files according to a preset segmentation time period, and the pieces of index information within the same preset batch file are arranged in order of update time; acquiring an effective version time threshold and determining the index information at which the effective version time threshold is located in a preset batch file of the preset index file; determining that the version metadata corresponding to all index information after the update time point of that index information is valid version metadata; determining valid file data blocks based on the valid version metadata; creating, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure; and synchronously traversing the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid.

Further, the method further comprises: creating a preset batch file in a preset index file according to a preset batch interval; and the filenames of the preset batch files carry time information.

Further, acquiring the effective version time threshold and determining the index information at which it is located in a preset batch file of the preset index file specifically includes: acquiring the effective version time threshold through a preset acquisition interface; determining a batch file to be detected from the plurality of preset batch files based on the time information carried by the preset batch files and the effective version time threshold; and determining the version metadata corresponding to the effective version time threshold based on the update time of each piece of version metadata in the batch file to be detected.

Further, creating a plurality of Bloom filters having a dynamic linked list structure according to the valid file data blocks and the preset size of the Bloom filter specifically includes: initializing a Bloom filter having a dynamic linked list structure, presetting the size of the Bloom filter as n and the HASH value as k, and pre-creating a Bloom filter node at the head of the dynamic linked list; and, after the valid file data blocks are scanned according to the valid version metadata, writing the IDs of the valid file data blocks into the Bloom filter and counting them, and when the count is greater than n, constructing a new Bloom filter and adding it to the dynamic linked list.

In a second aspect, the present application provides a data cleaning system based on multiple data versions, the system comprising: an adding module, configured to add, when version metadata is generated, the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file, wherein the preset index file is divided into a plurality of preset batch files according to a preset segmentation time period, and the pieces of index information within the same preset batch file are arranged in order of update time; a determining module, configured to acquire an effective version time threshold and determine the index information at which the effective version time threshold is located in a preset batch file of the preset index file, and to determine that the version metadata corresponding to all index information after the update time point of that index information is valid version metadata; and a traversal module, configured to determine valid file data blocks based on the valid version metadata, to create, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure, and to synchronously traverse the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid.

Further, the determining module comprises a determining unit, configured to acquire the effective version time threshold through a preset acquisition interface; to determine a batch file to be detected from the plurality of preset batch files based on the time information carried by the preset batch files and the effective version time threshold; and to determine the version metadata corresponding to the effective version time threshold based on the update time of each piece of version metadata in the batch file to be detected.

Further, the traversal module comprises an adding unit, configured to initialize a Bloom filter having a dynamic linked list structure, preset the size of the Bloom filter as n and the HASH value as k, and pre-create a Bloom filter node at the head of the dynamic linked list; and, after the valid file data blocks are scanned according to the valid version metadata, to write the IDs of the valid file data blocks into the Bloom filter and count them, and when the count is greater than n, to construct a new Bloom filter and add it to the dynamic linked list.

In a third aspect, the present application provides a data cleaning device based on multiple data versions, the device comprising: a processor; and a memory having executable code stored thereon, the executable code, when executed, causing the processor to perform any of the above data cleaning methods based on multiple data versions.

As can be appreciated by those skilled in the art, the present invention has at least the following beneficial effects:

(1) Index files are generated for the version metadata in order of update time and are divided into batch files by time. When version metadata is looked up, the batch index file is first located according to the effective version time threshold and then parsed, and the version metadata closest to the threshold is located by update time, which improves the efficiency of locating valid versions.

(2) The Bloom filters are organized as a dynamic linked list. While the valid file data blocks are being searched, the number of Bloom filters is expanded dynamically according to the number of valid file data blocks, and the values are stored in linked-list form. When the validity of file data blocks is checked, the query can be executed on the multi-node Bloom filters in parallel. This saves one full scan of the disk and improves the efficiency of the validity check.

Drawings

Some embodiments of the present disclosure are described below with reference to the accompanying drawings, in which:

Fig. 1 is a flowchart of a data cleaning method based on multiple data versions according to an embodiment of the present application.

Fig. 2 is a schematic diagram of an internal structure of index information provided in an embodiment of the present application.

Fig. 3 is a schematic diagram of an internal structure of a preset index file according to an embodiment of the present application.

Fig. 4 is a schematic diagram of internal structures of a plurality of bloom filters having a dynamic linked list structure according to an embodiment of the present application.

Fig. 5 is a schematic diagram of an internal structure of a data cleansing system based on multiple data versions according to an embodiment of the present application.

Fig. 6 is a schematic diagram of an internal structure of a data cleansing device based on multiple data versions according to an embodiment of the present application.

Detailed Description

It should be understood by those skilled in the art that the embodiments described below are only preferred embodiments of the present disclosure, and do not mean that the present disclosure can be implemented only by the preferred embodiments, which are merely for explaining the technical principles of the present disclosure and are not intended to limit the scope of the present disclosure. All other embodiments that can be derived by one of ordinary skill in the art from the preferred embodiments provided by the disclosure without undue experimentation will still fall within the scope of the disclosure.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.

The technical solutions proposed in the embodiments of the present application are described in detail below with reference to the accompanying drawings.

An embodiment of the present application provides a data cleaning method based on multiple data versions. As shown in Fig. 1, the method mainly includes the following steps:

Step 110, when version metadata is generated, adding the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file.

Note that the index information includes a version metadata ID and an update time (see, for example, Fig. 2). The preset index file is divided into a plurality of preset batch files according to the preset segmentation time period, and the pieces of index information within the same preset batch file are arranged in order of update time (as shown, for example, in Fig. 3). Consequently, once any piece of index information is located, the index storage area for valid version metadata and the index storage area for invalid version metadata can both be found quickly.

To find any version's metadata quickly among the batch files, a marker can be set in the file names in advance to distinguish the files, so that the preset batch file is located first and the version metadata is then located precisely within it. As an example, the present application may create the preset batch files in the preset index file at a preset batch interval, with the file name of each preset batch file carrying time information.
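
Step 110 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the fixed-width binary record layout, the `index_<start>.idx` naming scheme, the one-day default interval, and the function names are all hypothetical, since the text only requires that each entry hold a version metadata ID and an update time and that the batch file name carry time information.

```python
import os
import struct

# Assumed record layout: (update time in seconds, version metadata ID),
# both as unsigned 64-bit little-endian integers.
RECORD = struct.Struct("<QQ")


def batch_file_name(update_ts: int, batch_seconds: int) -> str:
    """Name a batch file after the start of the time window it covers."""
    start = update_ts - update_ts % batch_seconds
    return f"index_{start:012d}.idx"


def append_index(index_dir: str, version_id: int, update_ts: int,
                 batch_seconds: int = 86400) -> str:
    """Append one index entry to the batch file covering update_ts."""
    path = os.path.join(index_dir, batch_file_name(update_ts, batch_seconds))
    with open(path, "ab") as f:
        f.write(RECORD.pack(update_ts, version_id))
    return path


def read_batch(path: str):
    """Return the (update_ts, version_id) entries of one batch file."""
    with open(path, "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, off)
            for off in range(0, len(data), RECORD.size)]
```

Because entries are appended at the moment the version metadata is generated, each batch file stays ordered by update time as long as update times do not decrease, which is what later allows a threshold to be located without parsing every version.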

Step 120, acquiring an effective version time threshold and determining the index information at which the effective version time threshold is located in a preset batch file of the preset index file; and determining that the version metadata corresponding to all index information after the update time point of that index information is valid version metadata.

It should be noted that, according to step 110, the pieces of index information within the same preset batch file are arranged in order of update time. Therefore, once the index information corresponding to the effective version time threshold is located, all index information after its update time point can be identified directly as corresponding to valid version metadata.

Acquiring the effective version time threshold and determining the index information at which it is located in a preset batch file of the preset index file may specifically be: acquiring the effective version time threshold through a preset acquisition interface; determining a batch file to be detected from the plurality of preset batch files based on the time information carried by the preset batch files and the effective version time threshold; and determining the version metadata corresponding to the effective version time threshold based on the update time of each piece of version metadata in the batch file to be detected.
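
The lookup in step 120 can be sketched with two binary searches, one over the batch start times and one inside the selected batch. This is a sketch under assumptions: the batch files are represented as an in-memory dictionary for brevity, the function name is hypothetical, and an entry whose update time equals the threshold is treated as valid, which the text does not state explicitly.

```python
from bisect import bisect_left, bisect_right


def valid_version_ids(batches, threshold_ts):
    """Return the IDs of versions updated at or after threshold_ts.

    batches maps a batch start time (taken from the batch file name) to a
    list of (update_ts, version_id) tuples sorted by update_ts.
    """
    starts = sorted(batches)
    # Batch to be detected: the last batch that starts at or before the threshold.
    i = bisect_right(starts, threshold_ts) - 1
    valid = []
    if i >= 0:
        entries = batches[starts[i]]
        # First entry whose update time is not earlier than the threshold.
        pos = bisect_left(entries, (threshold_ts,))
        valid.extend(version_id for _, version_id in entries[pos:])
    # Every later batch lies entirely after the threshold.
    for start in starts[i + 1:]:
        valid.extend(version_id for _, version_id in batches[start])
    return valid
```

Only one batch is searched entry by entry; earlier batches are skipped and later batches are accepted wholesale, which is where the saving over parsing every version's metadata comes from.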

Step 130, determining valid file data blocks based on the valid version metadata; creating, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure; and synchronously traversing the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid.

Creating the plurality of Bloom filters having a dynamic linked list structure according to the valid file data blocks and the preset size of the Bloom filter may specifically be: initializing a Bloom filter having a dynamic linked list structure, presetting the size of the Bloom filter as n and the HASH value as k, and pre-creating a Bloom filter node at the head of the dynamic linked list; and, after the valid file data blocks are scanned according to the valid version metadata, writing the IDs of the valid file data blocks into the Bloom filter and counting them, and when the count is greater than n, constructing a new Bloom filter and adding it to the dynamic linked list (as shown, for example, in Fig. 4).
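
A minimal sketch of such a linked list of Bloom filters follows. Several points are assumptions made for illustration: k is interpreted as the number of hash functions, n as the number of IDs a node accepts before a new node is created, each node is given 10 bits per expected ID, and the k positions are derived from a SHA-256 digest by double hashing. The class names are hypothetical.

```python
import hashlib


class BloomNode:
    """One fixed-size Bloom filter; also a node of the linked list."""

    def __init__(self, n: int, k: int, bits_per_item: int = 10):
        self.n, self.k = n, k
        self.m = max(8, n * bits_per_item)      # number of bits (assumed sizing)
        self.bits = bytearray((self.m + 7) // 8)
        self.count = 0                          # IDs written so far
        self.next = None

    def _positions(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)
        self.count += 1

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))


class BloomChain:
    """Linked list of Bloom filters that grows as block IDs are written."""

    def __init__(self, n: int, k: int):
        self.n, self.k = n, k
        self.head = self.tail = BloomNode(n, k)  # node pre-created at the head

    def add(self, block_id: bytes) -> None:
        # Open a new node when this write would push the tail's count past n.
        if self.tail.count >= self.n:
            node = BloomNode(self.n, self.k)
            self.tail.next = node
            self.tail = node
        self.tail.add(block_id)

    def nodes(self):
        node = self.head
        while node is not None:
            yield node
            node = node.next

    def __contains__(self, block_id: bytes) -> bool:
        return any(block_id in node for node in self.nodes())
```

Growing the chain on demand means the total number of blocks does not have to be known before the first filter is built, so the preliminary full-disk count described in the background is not needed.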

The disk data blocks are traversed synchronously through the plurality of Bloom filters: when any one Bloom filter returns data for a file data block, that block is valid data; otherwise it is invalid data.
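
The concurrent check can be sketched as one task per filter node, with the per-node hits merged afterwards. This is an illustrative sketch only: the function name is hypothetical, a thread pool is one possible way to run the nodes in parallel, and the filters may be any objects that support the `in` operator (plain Python sets are enough to try it out).

```python
from concurrent.futures import ThreadPoolExecutor


def classify_blocks(disk_block_ids, filters, workers: int = 4):
    """Split disk block IDs into (valid set, invalid list).

    A block is valid if at least one filter reports it as present.
    """
    disk_block_ids = list(disk_block_ids)

    def hits(single_filter):
        # One task per linked-list node: scan all disk block IDs against it.
        return {b for b in disk_block_ids if b in single_filter}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        valid = set().union(*pool.map(hits, filters))
    invalid = [b for b in disk_block_ids if b not in valid]
    return valid, invalid
```

Because a Bloom filter can return false positives but never false negatives, a block reported as invalid is certainly unreferenced, while a small fraction of unreferenced blocks may be retained as if they were valid.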

Based on the above description, storing the version metadata index in segmented preset batch files reduces the time needed to locate and query valid version metadata and to parse version metadata files, which improves the efficiency of retrieving valid version metadata. Using Bloom filters organized as a dynamic linked list saves the full-disk scan otherwise needed to obtain the total number of file data blocks and allows the validity of data blocks to be checked concurrently, which improves the efficiency of cleaning redundant data.

In addition, Fig. 5 shows a data cleaning system based on multiple data versions according to an embodiment of the present application. As shown in Fig. 5, the system mainly includes:

an adding module 210, configured to add, when version metadata is generated, the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file, wherein the preset index file is divided into a plurality of preset batch files according to a preset segmentation time period, and the pieces of index information within the same preset batch file are arranged in order of update time;

a determining module 220, configured to acquire an effective version time threshold and determine the index information at which the effective version time threshold is located in a preset batch file of the preset index file, and to determine that the version metadata corresponding to all index information after the update time point of that index information is valid version metadata;

the determining module 220 further comprises a determining unit 221, configured to acquire the effective version time threshold through a preset acquisition interface; to determine a batch file to be detected from the plurality of preset batch files based on the time information carried by the preset batch files and the effective version time threshold; and to determine the version metadata corresponding to the effective version time threshold based on the update time of each piece of version metadata in the batch file to be detected;

a traversal module 230, configured to determine valid file data blocks based on the valid version metadata; to create, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure; and to synchronously traverse the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid;

the traversal module 230 comprises an adding unit 231, configured to initialize a Bloom filter having a dynamic linked list structure, preset the size of the Bloom filter as n and the HASH value as k, and pre-create a Bloom filter node at the head of the dynamic linked list; and, after the valid file data blocks are scanned according to the valid version metadata, to write the IDs of the valid file data blocks into the Bloom filter and count them, and when the count is greater than n, to construct a new Bloom filter and add it to the dynamic linked list.

In addition, an embodiment of the present application provides a data cleaning device based on multiple data versions, as shown in Fig. 6. Executable instructions are stored on the device and, when executed, implement the data cleaning method based on multiple data versions described above. Specifically, the server sends an execution instruction to the memory through the bus; when the memory receives the execution instruction, it sends an execution signal to the processor through the bus so as to activate the processor.

It should be noted that the processor is configured to: add, when version metadata is generated, the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file, wherein the preset index file is divided into a plurality of preset batch files according to a preset segmentation time period, and the pieces of index information within the same preset batch file are arranged in order of update time; acquire an effective version time threshold and determine the index information at which the effective version time threshold is located in a preset batch file of the preset index file; determine that the version metadata corresponding to all index information after the update time point of that index information is valid version metadata; determine valid file data blocks based on the valid version metadata; create, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure; and synchronously traverse the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid.

So far, the technical solutions of the present disclosure have been described in connection with the foregoing embodiments, but it is easily understood by those skilled in the art that the scope of the present disclosure is not limited to only these specific embodiments. The technical solutions in the above embodiments can be split and combined, and equivalent changes or substitutions can be made on related technical features by those skilled in the art without departing from the technical principles of the present disclosure, and any changes, equivalents, improvements, etc. made within the technical concept and/or technical principles of the present disclosure will fall within the protection scope of the present disclosure.

Claims

1. A data cleaning method based on multiple data versions, the method comprising:

when version metadata is generated, adding the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file; wherein the preset index file is divided into a plurality of preset batch files according to a preset segmentation time period, and the pieces of index information within the same preset batch file are arranged in order of update time;

acquiring an effective version time threshold and determining the index information at which the effective version time threshold is located in a preset batch file of the preset index file; determining that the version metadata corresponding to all index information after the update time point of that index information is valid version metadata;

determining valid file data blocks based on the valid version metadata; creating, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure; and synchronously traversing the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid.

2. The data cleaning method based on multiple data versions according to claim 1, further comprising:

creating the preset batch files in the preset index file according to a preset batch interval, wherein the file name of each preset batch file carries time information.

3. The data cleaning method based on multiple data versions according to claim 2, wherein acquiring the effective version time threshold and determining the index information at which the effective version time threshold is located in a preset batch file of the preset index file specifically comprises:

acquiring the effective version time threshold through a preset acquisition interface;

determining a batch file to be detected from the plurality of preset batch files based on the time information carried by the preset batch files and the effective version time threshold;

and determining the version metadata corresponding to the effective version time threshold based on the update time of each piece of version metadata in the batch file to be detected.

4. The data cleaning method based on multiple data versions according to claim 1, wherein creating a plurality of Bloom filters having a dynamic linked list structure according to the valid file data blocks and the preset size of the Bloom filter specifically comprises:

initializing a Bloom filter having a dynamic linked list structure, presetting the size of the Bloom filter as n and the HASH value as k, and pre-creating a Bloom filter node at the head of the dynamic linked list;

and, after the valid file data blocks are scanned according to the valid version metadata, writing the IDs of the valid file data blocks into the Bloom filter and counting them, and when the count is greater than n, constructing a new Bloom filter and adding it to the dynamic linked list.

5. A data cleaning system based on multiple data versions, the system comprising:

an adding module, configured to add, when version metadata is generated, the version metadata ID and the update time of the version metadata, as version metadata index information, to a preset batch file of a preset index file; wherein the preset index file is divided into a plurality of preset batch files according to a preset segmentation time period, and the pieces of index information within the same preset batch file are arranged in order of update time;

a determining module, configured to acquire an effective version time threshold and determine the index information at which the effective version time threshold is located in a preset batch file of the preset index file, and to determine that the version metadata corresponding to all index information after the update time point of that index information is valid version metadata;

a traversal module, configured to determine valid file data blocks based on the valid version metadata; to create, according to the valid file data blocks and the preset size of the Bloom filter, a plurality of Bloom filters having a dynamic linked list structure; and to synchronously traverse the disk data blocks through the plurality of Bloom filters to determine whether the file data blocks are valid.

6. The data cleaning system based on multiple data versions according to claim 5, wherein the determining module further comprises a determining unit;

the determining unit is configured to acquire the effective version time threshold through a preset acquisition interface; to determine a batch file to be detected from the plurality of preset batch files based on the time information carried by the preset batch files and the effective version time threshold; and to determine the version metadata corresponding to the effective version time threshold based on the update time of each piece of version metadata in the batch file to be detected.

7. The data cleaning system based on multiple data versions according to claim 5, wherein the traversal module comprises an adding unit;

the adding unit is configured to initialize a Bloom filter having a dynamic linked list structure, preset the size of the Bloom filter as n and the HASH value as k, and pre-create a Bloom filter node at the head of the dynamic linked list; and, after the valid file data blocks are scanned according to the valid version metadata, to write the IDs of the valid file data blocks into the Bloom filter and count them, and when the count is greater than n, to construct a new Bloom filter and add it to the dynamic linked list.

8. A data cleaning device based on multiple data versions, the device comprising:

a processor;

and a memory having executable code stored thereon, which, when executed, causes the processor to perform the data cleaning method based on multiple data versions as claimed in any one of claims 1-4.