CN116774918A

CN116774918A - Data cleaning method, device, equipment and storage medium

Info

Publication number: CN116774918A
Application number: CN202310063853.9A
Authority: CN
Inventors: 毛梦依; 李婉悦; 赵万龙
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2023-01-12
Filing date: 2023-01-12
Publication date: 2023-09-19

Abstract

The application discloses a data cleaning method, device, equipment and storage medium. The method comprises: determining the logical blocks corresponding to each partition of a storage medium to be cleaned and the target cleaning strategy corresponding to each logical block; and performing hierarchical cleaning on the logical blocks based on the target cleaning strategy of each logical block until the storage medium to be cleaned has been cleaned; wherein the target cleaning strategy is one of a first strategy for full cleaning, a second strategy for partial cleaning, and a third strategy for not cleaning data. In this way, each logical block of the partitions of the storage medium to be cleaned can be cleaned hierarchically based on the first, second and third strategies; that is, a targeted cleaning strategy is provided for logical blocks with different functions, which improves data cleaning efficiency while maintaining data security and improves the user experience.

Description

Data cleaning method, device, equipment and storage medium

Technical Field

The present application relates to the field of data cleaning technologies, and in particular, to a data cleaning method, device, equipment, and storage medium.

Background

Computing power networks, which involve collaboration among cloud computing, edge computing, sea computing and the like, are becoming the mainstream of today's computing technology. Bare metal servers are an important component of the infrastructure layer of the computing power network, and more and more users rent bare metal servers over the network for computing power services. After a user has rented a bare metal server, user data such as user files and usage traces remain on it to varying degrees.

In the prior art, a disk is cleaned either by full cleaning or by cleaning only the metadata information of the disk. Full cleaning repeatedly overwrites all areas of the disk to reset the data and thereby erase it, but cleaning the entire disk in this way takes a long time and is inefficient. Cleaning only the metadata information of the disk is faster, but the file data stored on the disk is not actually destroyed; part of the residual data can be recovered by means such as partition table data recovery, which poses a security risk to user data.

Disclosure of Invention

In view of the above, the embodiments of the present application provide a data cleaning method, apparatus, device, and storage medium, which aim to improve the cleaning efficiency of data and the security of data cleaning.

The technical scheme of the embodiment of the application is realized as follows:

in a first aspect, an embodiment of the present application provides a data cleaning method, where the method includes:

determining a logic block corresponding to each partition of the storage medium to be cleaned and a corresponding target cleaning strategy;

hierarchical cleaning is carried out on the logic blocks based on a target cleaning strategy of each logic block until the storage medium to be cleaned is cleaned;

the target cleaning strategy is one of a first strategy for full cleaning, a second strategy for partial cleaning and a third strategy for not cleaning data.

In the above scheme, the determining the logic block corresponding to each partition of the storage medium to be cleaned and the corresponding target cleaning policy includes:

determining the starting address and the type of each corresponding logic block based on the block descriptor of each partition of the storage medium to be cleaned;

determining the target cleaning strategy based on the type of the logic block;

accordingly, hierarchical cleaning of the logical blocks based on a target cleaning policy for each logical block includes:

and cleaning the logic block based on the target cleaning strategy and the starting address of the logic block.

In the above solution, the hierarchical cleaning of the logic blocks based on the target cleaning policy of each logic block includes:

if the type of the logic block is an i-node bitmap or a logic block bitmap, not cleaning the i-node bitmap or the logic block bitmap based on the third strategy;

if the type of the logic block is a logic block, performing partial cleaning based on the second strategy and the starting address of the logic block;

and if the type of the logical block is a type other than the i-node bitmap, the logical block bitmap and the logical block, performing full cleaning based on the first strategy and the starting address of the corresponding logical block.

In the above solution, the types of the logical block include a directory block and a data block, the second strategy indicates performing full cleaning on the directory block and cleaning the data block based on a preset cleaning proportion, and the performing partial cleaning based on the second strategy and the starting address of the logical block includes:

determining a type of the logical block based on the block descriptor;

if the type of the logic block is a directory block, performing full cleaning based on the starting address of the directory block;

and if the type of the logical block is a data block, cleaning the target data block of the same file based on the starting address according to a preset cleaning proportion and a file identification.

In the above scheme, the cleaning the target data block corresponding to the same file based on the start address according to the preset cleaning proportion and the file identifier includes:

determining a start address of at least one data block belonging to the same file based on the hash table and the file identification; the hash table is used for representing the starting address of at least one data block corresponding to each file identifier;

dividing the starting address of at least one data block of the same file based on a partitioning rule, and determining a target data block corresponding to the same file; the target data block corresponding to the same file comprises each block after the same file is divided and the starting address corresponding to each block;

and cleaning the target data block corresponding to the same file based on the preset cleaning proportion and the starting address.

In the above scheme, the cleaning the target data block corresponding to the same file based on the preset cleaning proportion and the starting address includes:

determining the cleaning priority of each block based on the starting address of each block in the target data block and the preset cleaning priority;

and cleaning the target data block corresponding to the same file based on the cleaning priority of each block, the preset cleaning proportion and the starting address.

In the above solution, the partitioning rule is a content variable-length partitioning rule, and the dividing the starting address of at least one data block of the same file based on the partitioning rule and determining a target data block corresponding to the same file includes:

dividing the starting address of at least one data block of the same file based on the content variable-length partitioning rule, and determining a target data block corresponding to the same file; wherein the sizes of the blocks in the target data block approximate a normal distribution.

In the above scheme, the method further comprises:

determining a start address of at least one data block corresponding to each same file based on directory information in the directory block and each file identifier; the catalog information comprises a starting address of a data block corresponding to each file identifier;

generating a hash value corresponding to each file identifier based on each file identifier and a hash algorithm;

and generating a hash table based on the hash value and the start address of the data block corresponding to each same file.

In the above solution, before the performing the partial cleaning based on the second policy and the start address of the logical block, the method further includes:

acquiring a cleaning grade parameter;

and determining the cleaning proportion based on the cleaning grade parameter.

In the above scheme, the method further comprises:

responding to an unsubscribe request, and starting data cleaning of the storage medium to be cleaned;

after the storage medium to be cleaned is determined to be cleaned, switching the storage medium to be cleaned to a first state capable of being ordered;

and in the data cleaning process, the storage medium to be cleaned is in a second state which cannot be ordered.

In the above scheme, the method further comprises:

obtaining a partition table of the storage medium to be cleaned, wherein the partition table comprises a globally unique identifier (GUID) partition table or a master boot record (MBR) partition table;

and performing full-scale cleaning on the partition table based on the first strategy.

In a second aspect, an embodiment of the present application provides a data cleaning apparatus, including:

the determining module is used for determining a logic block corresponding to each partition of the storage medium to be cleaned and a corresponding target cleaning strategy;

the cleaning module is used for carrying out hierarchical cleaning on the logic blocks based on a target cleaning strategy of each logic block until the storage medium to be cleaned is cleaned; the target cleaning strategy is one of a first strategy for full cleaning, a second strategy for partial cleaning and a third strategy for not cleaning data.

In a third aspect, an embodiment of the present application provides a data cleaning apparatus, including: a processor and a memory for storing a computer program capable of running on the processor, wherein,

the processor is configured to execute the steps of the method according to the first aspect when the computer program is run.

In a fourth aspect, embodiments of the present application provide a computer storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the method of the first aspect.

According to the technical scheme provided by the embodiments of the application, the logical blocks corresponding to each partition of the storage medium to be cleaned and the corresponding target cleaning strategies are determined; hierarchical cleaning is performed on the logical blocks based on the target cleaning strategy of each logical block until the storage medium to be cleaned has been cleaned; and the target cleaning strategy is one of a first strategy for full cleaning, a second strategy for partial cleaning, and a third strategy for not cleaning data. In this way, each logical block of the partitions of the storage medium to be cleaned can be cleaned hierarchically based on the first, second and third strategies; that is, a targeted cleaning strategy is provided for logical blocks with different functions, which improves data cleaning efficiency while maintaining data security and improves the user experience.

Drawings

Fig. 1 is a schematic flow chart of a data cleaning method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a related art data cleansing method in an application example of the present application;

FIG. 3 is a schematic diagram illustrating a method for cleaning data in an application example of the present application;

FIG. 4 is a schematic diagram illustrating a method for cleaning blocks according to an embodiment of the present application;

FIG. 5 is a schematic diagram of creating a LAB table in an example application of the present application;

FIG. 6 is a schematic diagram of the partitioning and cleaning of file data blocks in an application example of the present application;

FIG. 7 is a schematic diagram illustrating a data cleaning apparatus according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a data cleaning device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the accompanying drawings and examples.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

The embodiment of the application provides a data cleaning method which can be applied to data cleaning equipment, as shown in fig. 1, and comprises the following steps:

Step 110: and determining a logic block corresponding to each partition of the storage medium to be cleaned and a corresponding target cleaning strategy.

Here, the storage medium refers to a carrier that stores data, including but not limited to at least one of the following: magnetic disk, floppy disk, optical disk, DVD, hard disk, flash memory, USB flash drive, CF card, SD card, MMC card, SM card, Memory Stick, xD card, and the like. Storage media, whether solid-state or mechanical hard drives, are typically partitioned, and the main purpose of partitioning is to facilitate data management. Data or software can be classified by attributes such as type, frequency of use and purpose and stored in different partitions, which helps maintain overall performance.

Illustratively, taking a disk as an example, there are two disk partition formats: 1. The MBR (Master Boot Record) partition format. The MBR is a special boot sector located at the beginning of the disk; it supports disks of up to 2 TB and up to 4 primary partitions. 2. The GPT (GUID Partition Table) partition format, where GUID stands for globally unique identifier. It is a disk organization scheme used with UEFI (Unified Extensible Firmware Interface) booting. Unlike MBR, GPT can support 128 partitions, and the disk capacity it supports far exceeds current needs.
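As a non-limiting illustration, the two partition formats can be told apart by their well-known on-disk signatures: an MBR sector ends with the bytes 0x55 0xAA at offsets 510 and 511, and a GPT header at LBA 1 begins with the ASCII string "EFI PART". The following Python sketch assumes a 512-byte logical sector and operates on an in-memory image; the names `detect_partition_table` and `make_image` are hypothetical helpers introduced only for this example and are not part of the application.

```python
import io

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def detect_partition_table(disk) -> str:
    """Return 'GPT', 'MBR' or 'unknown' for a seekable binary stream."""
    disk.seek(0)
    first = disk.read(SECTOR_SIZE)   # LBA 0: MBR or protective MBR
    second = disk.read(SECTOR_SIZE)  # LBA 1: GPT header, if present
    # A GPT header starts with the 8-byte ASCII signature "EFI PART".
    if second[:8] == b"EFI PART":
        return "GPT"
    # An MBR ends with the boot signature 0x55 0xAA at offsets 510-511.
    if len(first) == SECTOR_SIZE and first[510:512] == b"\x55\xaa":
        return "MBR"
    return "unknown"


def make_image(kind: str) -> io.BytesIO:
    """Build a tiny in-memory disk image for demonstration only."""
    image = bytearray(SECTOR_SIZE * 4)
    if kind in ("MBR", "GPT"):
        image[510:512] = b"\x55\xaa"  # GPT disks also carry a protective MBR
    if kind == "GPT":
        image[SECTOR_SIZE:SECTOR_SIZE + 8] = b"EFI PART"
    return io.BytesIO(bytes(image))


print(detect_partition_table(make_image("GPT")))
print(detect_partition_table(make_image("MBR")))
```

GPT is checked first because a GPT disk also contains a protective MBR in its first sector, so testing the MBR signature alone would misclassify it.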

A disk contains partitions, and each partition contains its own file system. The disk is the storage medium; a partition is a structure built on the disk that makes one disk look like several; and a file system is a system built on a partition, implemented by data persisted on the disk. Its key purpose is to provide wear leveling of the storage medium, and with a file system data can be stored in the form of files. A common file system consists of six parts: the boot block, the super block, the i-node bitmap, the logical block bitmap, the i-nodes and the logical blocks.

The logical blocks herein include at least one of the above boot block, super block, i-node bitmap, logical block bitmap, i-nodes and logical blocks. The target cleaning strategy corresponding to each logical block is determined based on the type of that logical block.

Step 120: hierarchical cleaning is carried out on the logic blocks based on a target cleaning strategy of each logic block until the storage medium to be cleaned is cleaned; the target cleaning strategy is one of a first strategy for full cleaning, a second strategy for partial cleaning and a third strategy for not cleaning data.

After the target cleaning strategy of each logical block is determined, the logical blocks are cleaned hierarchically until the storage medium to be cleaned has been cleaned. The target cleaning strategy is one of a first strategy for full cleaning, a second strategy for partial cleaning, and a third strategy for not cleaning data. Blocks with different functions are thus each cleaned with a targeted cleaning strategy.

Therefore, each logic block of the storage medium partition to be cleaned can be cleaned in a grading manner based on the first strategy, the second strategy and the third strategy, namely, a targeted cleaning strategy is provided for the logic blocks with different functions of the storage medium to be cleaned, so that the data cleaning efficiency is improved, the data safety is considered, and the user experience is improved.

In some embodiments, the determining the logical block corresponding to each partition of the storage medium to be cleaned and the corresponding target cleaning policy includes:

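determining the starting address and the type of each corresponding logical block based on the block descriptor of each partition of the storage medium to be cleaned;

determining the target cleaning strategy based on the type of the logical block;

accordingly, the hierarchical cleaning of the logical blocks based on the target cleaning strategy of each logical block includes: cleaning the logical block based on the target cleaning strategy and the starting address of the logical block.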

Here, each logical block has a corresponding block descriptor that records the overall information of that logical block. The starting address and type of each logical block can be obtained by reading its block descriptor. Illustratively, reading the block descriptor reveals the first logical block address of the disk at which the boot block is located.

After the type and starting address of each logical block are determined, the target cleaning strategy of the logical block is determined according to its type, and the logical block is then cleaned hierarchically based on the target cleaning strategy and the starting address.

In some embodiments, when the logical block is cleaned based on the target cleaning strategy and the starting address of the logical block, it may be cleaned by overwriting it with generated random numbers. That is, after the target cleaning strategy is selected for each logical block, the logical blocks may be cleaned by random-number overwriting; this may be applied to all logical blocks or only to some of them. Overwriting with random numbers allows data to be overwritten more quickly and securely.
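As a non-limiting illustration, random-number overwriting of one region can be sketched as follows. The function name `overwrite_with_random`, the chunk size and the use of an in-memory buffer in place of a real block device are assumptions made for this example; the application does not specify the random source, and `os.urandom` is used here only because it is available in the Python standard library.

```python
import io
import os


def overwrite_with_random(device, start: int, length: int, chunk: int = 4096) -> int:
    """Overwrite `length` bytes at offset `start` with random bytes.

    `device` is any seekable, writable binary stream. Returns the number
    of bytes written.
    """
    device.seek(start)
    written = 0
    while written < length:
        n = min(chunk, length - written)  # last chunk may be shorter
        device.write(os.urandom(n))
        written += n
    return written


# Demonstration on an in-memory buffer standing in for a disk region.
buf = io.BytesIO(b"A" * 100)
overwrite_with_random(buf, start=10, length=50, chunk=16)
data = buf.getvalue()
print(data[:10], data[60:70])  # bytes outside the region are untouched
```

Writing in fixed-size chunks keeps memory use bounded regardless of the size of the region being cleaned.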

In some embodiments, the hierarchical cleaning of the logical blocks based on the target cleaning policy for each of the logical blocks includes:

Here, after the type of each logical block is determined, a different target cleaning strategy is configured for each logical block according to its function. Taking a disk as an example, the logical blocks in the disk include the boot block, the super block, the i-node bitmap, the logical block bitmap, the i-nodes and the logical blocks. The i-node bitmap records the usage of all i-nodes and the logical block bitmap records the usage of all logical blocks; these two blocks store little information and are not cleaned. Therefore, if the type of the logical block is an i-node bitmap or a logical block bitmap, it is not cleaned, based on the third strategy. The logical blocks hold more information, and more complex information, to be cleaned; therefore, if the type is a logical block, partial cleaning is performed based on the second strategy and the starting address of the logical block. If the type is other than the i-node bitmap, the logical block bitmap and the logical block, for example a boot block or a super block, full cleaning may be performed based on the first strategy and the starting address of the corresponding logical block.
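As a non-limiting illustration, the mapping from block type to target cleaning strategy described above can be sketched as a lookup table. The type names, the `Policy` enumeration and the representation of a block descriptor as a `(block_type, start_address)` pair are assumptions made for this example. The i-node entry follows the rule that any type other than the two bitmaps and the logical block is fully cleaned.

```python
from enum import Enum


class Policy(Enum):
    FULL = 1     # first strategy: full cleaning
    PARTIAL = 2  # second strategy: partial cleaning
    SKIP = 3     # third strategy: data is not cleaned


# Hypothetical type names for the six parts of a common file system.
BLOCK_POLICY = {
    "boot_block": Policy.FULL,
    "super_block": Policy.FULL,
    "inode_bitmap": Policy.SKIP,
    "block_bitmap": Policy.SKIP,
    "inode": Policy.FULL,
    "logical_block": Policy.PARTIAL,
}


def plan_cleaning(descriptors):
    """Turn (block_type, start_address) pairs into (start_address, policy) work items.

    Blocks whose policy is SKIP are left out of the plan entirely.
    """
    plan = []
    for block_type, start in descriptors:
        policy = BLOCK_POLICY[block_type]
        if policy is not Policy.SKIP:
            plan.append((start, policy))
    return plan


print(plan_cleaning([("boot_block", 0), ("inode_bitmap", 2), ("logical_block", 10)]))
```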

In some embodiments, the types of the logical block include a directory block and a data block, the second strategy indicates performing full cleaning on the directory block and cleaning the data block based on a preset cleaning proportion, and the performing partial cleaning based on the second strategy and the starting address of the logical block includes:

determining a type of the logical block based on the block descriptor;

Here, logical blocks are divided into directory blocks and data blocks. The second strategy, that is, partial cleaning of the logical blocks, indicates performing full cleaning on the directory blocks and cleaning the data blocks based on a preset cleaning proportion. Illustratively, whether a logical block is a directory block or a data block is determined based on the block descriptor. If the logical block is a directory block, full cleaning is performed based on the starting address of the directory block; if the logical block is a data block, the target data block of the same file is cleaned based on the starting address according to the preset cleaning proportion and the file identification.

Here, files are stored in data blocks. A file identification identifies a file, and each file has a unique file identification so that the file can be located easily. The data blocks having the same file identification, that is, the target data block of the same file, are determined based on the file identification and are cleaned based on their starting addresses according to the preset cleaning proportion.

In some embodiments, the cleaning the target data block corresponding to the same file based on the start address according to the preset cleaning proportion and the file identifier includes:

Here, the hash table characterizes the starting address of at least one data block corresponding to each file identification. A file is stored in at least one data block, so one file identification corresponds to at least one data block, and the hash table stores the starting address of each such data block. Based on a file identification, the data blocks belonging to the same file and their starting addresses can therefore be looked up in the hash table.

After the starting addresses of the data blocks corresponding to the same file are found, the starting address of at least one data block of the same file is divided based on a partitioning rule to obtain the target data block corresponding to the same file. The target data block corresponding to the same file comprises each block obtained by dividing the file and the starting address corresponding to each block.

In some embodiments, the cleaning the target data block corresponding to the same file based on the preset cleaning proportion and the starting address includes:

Here, the target data block of the same file consists of the divided blocks and their starting addresses. When the target data block is cleaned, the cleaning priority of each block may be determined based on a preset cleaning priority, and the target data block corresponding to the same file is then cleaned based on the cleaning priority of each block, the preset cleaning proportion and the starting address. For example, the cleaning priorities may be assigned according to the order of the blocks' starting addresses; alternatively, the initial block, that is, the block corresponding to the first logical address in the target data block, may be cleaned first and the other blocks cleaned uniformly. For example, suppose the target data block A corresponding to file a is divided, in order of logical block address, into three blocks: block 1, block 2 and block 3. The cleaning priority of target data block A may then be divided into three levels in the order 1, 2, 3, or into two levels in which block 1 is cleaned first and blocks 2 and 3 are then cleaned at the same time, that is, uniformly.
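As a non-limiting illustration, one way of combining the cleaning priority with the preset cleaning proportion is sketched below. The sketch assumes that a lower starting address means a higher priority, as in the three-level example above, and that the proportion is applied to the number of blocks and rounded up; neither detail is fixed by the application, and the name `select_chunks` is hypothetical.

```python
import math


def select_chunks(chunks, ratio: float):
    """Pick the blocks of one file to clean, highest priority first.

    `chunks` is a list of (start_address, length) pairs; `ratio` is the
    cleaning proportion between 0 and 1.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    ordered = sorted(chunks)  # assumption: lower start address = higher priority
    count = math.ceil(len(ordered) * ratio)  # assumption: round up
    return ordered[:count]


# Three blocks of one file, cleaned at a proportion of 50%.
print(select_chunks([(300, 10), (100, 10), (200, 10)], 0.5))
```

Rounding up means that any non-zero proportion cleans at least the first block of a file.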

In some embodiments, the partitioning rule is a content variable-length partitioning rule, and the dividing the starting address of at least one data block of the same file based on the partitioning rule and determining the target data block corresponding to the same file includes:

Here, under the content variable-length partitioning rule, the data blocks storing a file are first summed to obtain the total storage space of the file, and the length of each block to be divided is then constrained to lie between a specified minimum and a specified maximum. The minimum is the size in bytes of the block corresponding to one logical block address, and the maximum is 10% of the storage space of the file; if the file is so small that the maximum would be less than the minimum, the maximum and the minimum take the same value. Following the Rabin fingerprint approach, a data fingerprint is computed over a sliding window of fixed length; when the fingerprint in the sliding window matches a preset value, that position is taken as a block boundary, and the process is repeated until the whole file has been divided. After division in this way, a file is split into variable-length blocks whose sizes tend toward a normal distribution.
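As a non-limiting illustration, the variable-length division can be sketched as follows. For brevity the sketch uses a simple polynomial rolling hash as a stand-in for a true Rabin fingerprint; the window length, the hash base and modulus, and the bit mask used as the "preset value" are assumptions made for this example, and the names `size_limits` and `chunk_boundaries` are hypothetical.

```python
def size_limits(file_size: int, block_size: int):
    """Minimum is one block; maximum is 10% of the file, never below the minimum."""
    min_size = block_size
    max_size = max(file_size // 10, min_size)
    return min_size, max_size


def chunk_boundaries(data: bytes, min_size: int, max_size: int,
                     window: int = 16, mask: int = 0x3F) -> list:
    """Return the end offsets of content-defined blocks of `data`."""
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    boundaries, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= window:
            # Slide the window: drop the oldest byte of the current block.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = i - start + 1
        # Cut when the fingerprint matches the preset pattern (and the block
        # is long enough), or when the block reaches the maximum length.
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            boundaries.append(i + 1)
            start, h = i + 1, 0
    if start < len(data):
        boundaries.append(len(data))  # trailing block, possibly short
    return boundaries


data = bytes((i * 31 + 7) % 251 for i in range(5000))
lo, hi = size_limits(len(data), 64)
print(lo, hi, len(chunk_boundaries(data, lo, hi)))
```

Because boundaries depend on content rather than on fixed offsets, every block except possibly the last has a length between the minimum and the maximum.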

In some embodiments, the method further comprises:

Here, the directory block stores the index information and file directory information of all files, including the file i-node, the file type and the file name. The file name serves as the file identification, and the address of the data block corresponding to each file name can be obtained by querying the i-node in the directory block; since the same file may correspond to more than one data block, it may have more than one data block address. Thus, the starting address of at least one data block corresponding to each file identification is found in the directory block based on the file identification, that is, the file name. For example, if the file names are a and b, searching the directory block yields the starting addresses of data block 1 and data block 3 for file identification a, and the starting addresses of data block 2 and data block 4 for file identification b. A hash value corresponding to each file identification is then generated based on the file identification and a hash algorithm. Finally, based on the hash values and the starting addresses of the data blocks corresponding to each file, a hash table is generated that characterizes the starting address of at least one data block corresponding to each file identification.
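As a non-limiting illustration, building and querying such a hash table can be sketched as follows, using the example of files a and b above. The application does not name a hash algorithm; SHA-256 from the Python standard library is an assumption made for this example, as are the names `file_key`, `build_hash_table` and `lookup` and the representation of directory information as `(file_name, start_address)` pairs.

```python
import hashlib
from collections import defaultdict


def file_key(file_name: str) -> str:
    """Hash value of a file identification (SHA-256 is an assumption)."""
    return hashlib.sha256(file_name.encode("utf-8")).hexdigest()


def build_hash_table(directory_entries):
    """Group data block start addresses by the hash of their file name.

    `directory_entries` is an iterable of (file_name, start_address) pairs
    standing in for the directory information read from the directory block.
    """
    table = defaultdict(list)
    for name, start in directory_entries:
        table[file_key(name)].append(start)
    return dict(table)


def lookup(table, file_name: str):
    """Return all start addresses recorded for `file_name` (empty if unknown)."""
    return table.get(file_key(file_name), [])


# File a occupies data blocks 1 and 3; file b occupies data blocks 2 and 4.
table = build_hash_table([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
print(lookup(table, "a"), lookup(table, "b"))
```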

In some embodiments, before the partial cleaning based on the second strategy and the starting address of the logical block, the method further comprises:

acquiring a cleaning grade parameter;

and determining the cleaning proportion based on the cleaning grade parameter.

Here, the cleaning proportion may also be derived from a cleaning level parameter. Before the partial cleaning based on the second strategy and the starting address of the logical block, the cleaning level parameter is acquired and the cleaning proportion is determined from it; for example, a cleaning level parameter of 5 indicates a cleaning proportion of 50%.
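As a non-limiting illustration, the only mapping stated above is that level 5 corresponds to 50%. The sketch below assumes a linear scale from 0 to 10, which is consistent with that example but is not specified by the application; the name `cleaning_ratio` and the `max_level` parameter are hypothetical.

```python
def cleaning_ratio(level: int, max_level: int = 10) -> float:
    """Map a cleaning level parameter to a cleaning proportion.

    Assumes a linear scale in which `max_level` means 100%.
    """
    if not 0 <= level <= max_level:
        raise ValueError("level out of range")
    return level / max_level


print(cleaning_ratio(5))
```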

In some embodiments, the method further comprises:

Here, data cleaning of the storage medium to be cleaned may be started in response to a user's unsubscribe request. That is, the cleaning process is triggered automatically once the storage medium to be cleaned has been unsubscribed, which avoids taking up time in the ordering process that is started by an order request. After it is determined that the storage medium to be cleaned has been cleaned, it is switched to a first state in which it can be ordered; during data cleaning, the storage medium to be cleaned is in a second state in which it cannot be ordered.
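As a non-limiting illustration, the two states and the unsubscribe-triggered cleaning can be sketched as a small state holder. The class and function names are hypothetical, and the choice to leave the medium unorderable when cleaning fails is an assumption made for this example.

```python
from enum import Enum


class OrderState(Enum):
    ORDERABLE = "first state: can be ordered"
    NOT_ORDERABLE = "second state: cannot be ordered"


class StorageMedium:
    def __init__(self, name: str):
        self.name = name
        self.state = OrderState.NOT_ORDERABLE  # currently rented by a user


def handle_unsubscribe(medium: StorageMedium, clean) -> None:
    """Clean `medium` in response to an unsubscribe request, then release it.

    `clean` is a callable that performs the actual data cleaning.
    """
    medium.state = OrderState.NOT_ORDERABLE  # held back while cleaning runs
    clean(medium)  # if this raises, the medium stays unorderable (assumption)
    medium.state = OrderState.ORDERABLE


observed = []
disk = StorageMedium("disk0")
handle_unsubscribe(disk, lambda m: observed.append(m.state))
print(observed[0].name, disk.state.name)
```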

In some embodiments, the method further comprises:

Depending on the partition format, the partition table of the storage medium to be cleaned is generally an MBR (Master Boot Record) partition table or a GUID (globally unique identifier) partition table, and the partition table is cleaned using the first strategy of full cleaning. The GUID partition table (GPT for short; a disk using it is called a GPT disk) is a newer standard for disk partition table structure. GPT provides a more flexible partitioning mechanism than the commonly used MBR partitioning scheme, and uses a 16-byte GUID value to identify the partition type, which makes partition type conflicts less likely.

Embodiments of the present application will be described in further detail below with reference to application examples. In the application example, the storage medium to be cleaned is a bare metal disk, and based on the bare metal disk, the application example provides an automatic bare metal server disk data cleaning method.

To ensure the security of user data, the prior-art scheme cleans the metadata of the disk before the bare metal server deploys the user image. Specifically, as shown in FIG. 2, the bare metal management platform sends a deployment request, which is dispatched to a bare metal server through a scheduling strategy, and an in-memory operating system is started using the Preboot eXecution Environment (PXE). The proxy service of the in-memory operating system starts a disk metadata cleaning program, which first erases the signature of each disk and each partition, including information such as device name, file system type, ID and label, and then deletes the partition table of the whole disk, which contains information such as the sector distribution and size of each partition. After the disk metadata has been cleaned, the damaged local disk file system of the bare metal server can no longer be read or written normally. The proxy service of the in-memory operating system then downloads the user image from the remote image service and writes it to the local disk of the bare metal server to deploy the new instance.

Because the existing scheme cleans residual user data during the instance deployment stage, the existing disk cleaning strategy prioritizes cleaning speed in order to keep instance deployment timely, and therefore cleans only the metadata information of the disk. However, the file data stored on the disk is not actually destroyed, and part of the residual data can be recovered by means such as partition table data recovery, which poses a security risk to user data.

In addition, the general disk cleaning mode in the industry is to rewrite all areas of the disk repeatedly for a plurality of times to reset the data so as to achieve the purpose of data erasure, but the mode takes longer time for cleaning the whole area of the disk.

From the above, the existing bare metal server disk cleaning has the following disadvantages:

(1) When the user initiates an 'order' request on the bare metal management platform, the ordering flow of the bare metal server begins. The server is only truly available to the user once the user image has been completely loaded and the operating system has started successfully from the local disk. The existing disk cleaning process runs in the instance deployment stage, so it takes up part of the ordering time, and a failure of the disk cleaning process causes the order to fail. Longer ordering times and order failures degrade the user experience;

(2) Cleaning only the disk metadata erases the disk signatures and the partition table of the file system, but does not destroy the data in the storage blocks that actually hold the files. Before the disk is formatted, residual user data may therefore be restored with existing techniques, such as restoring the partition table information of the file system. This approach risks exposing user data and offers no guarantee of its security. Conventional full-disk bit resetting, on the other hand, is inefficient and time-consuming.

Therefore, the application example provides a method for cleaning the disk data of the bare metal server, which can ensure that the user data cannot be completely recovered through the prior art and improves the efficiency of cleaning the disk. And after the bare metal server finishes the unsubscribing, the disk cleaning process is automatically triggered, so that the time occupation of the ordering process is avoided. According to the disk data cleaning scheme, the partitions, the file systems and the files are cleaned in a grading manner, so that the data security is ensured, and meanwhile, the data cleaning efficiency is improved compared with the traditional disk whole-area data coverage.

The application example provides a system and a flow chart for cleaning a bare metal disk, as shown in fig. 3. A bare metal disk is currently used by creating partitions on the physical disk. With MBR partitioning, for example, at most 4 primary partitions can be created, whereas the GPT partition form does not have this 4-partition limit. Fig. 3 illustrates 4 partitions, each of which may have its own file system. The on-disk layout of a common file system consists of six parts: a boot block, a superblock, an inode bitmap, a logical block bitmap, inodes, and logical blocks. These parts of the file system are the logical blocks corresponding to each partition of the storage medium to be cleaned. The function of each part is described next.

(1) Boot block: the boot block is used for booting the operating system and typically contains data only on the root file system; on other file systems it is empty. The boot block is located at the first logical block address of the disk.

(2) Super block: the superblock of the disk stores the number of file system data blocks, the number of inodes and the idle condition record.

(3) i node bitmap: the usage of all inodes is recorded.

(4) Logical block bitmap: the usage of all logical blocks is recorded.

(5) i node: the inode records meta information of the file, including the size of the file, rights information, and index data block pointers. This part stores such important information as the logical block to which the file data corresponds.

(6) Logic block: the logical blocks are divided into directory blocks and data blocks. A directory block stores file directory information, including file inodes, file types, and file names. The directory block thus holds the index information of all files and of their recorded data blocks.

(7) Partition table: the MBR partition table immediately follows the boot block and is located at the second logical block address of the disk; the GPT partition table exists at multiple locations on the disk, in a region defined by the GPT header, and typically occupies sectors LBA (Logical Block Address) 2 to LBA 33, with each partition occupying 128 bytes in a sector, called a partition table entry. The GPT header, at logical block address LBA 1, contains a pointer to the partition table.
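As a quick check of the LBA 2 to LBA 33 range quoted above, the following sketch computes the extent of the GPT partition entry array. It assumes 512-byte sectors and an array of 128 entries, neither of which is stated in the text; the constant and function names are illustrative only.

```python
SECTOR_SIZE = 512      # assumption: 512-byte logical sectors
ENTRY_SIZE = 128       # bytes per partition table entry (from the text)
NUM_ENTRIES = 128      # assumption: size of the entry array
FIRST_ENTRY_LBA = 2    # the entry array follows the GPT header at LBA 1


def gpt_entry_array_lbas(sector_size=SECTOR_SIZE, entry_size=ENTRY_SIZE,
                         num_entries=NUM_ENTRIES, first_lba=FIRST_ENTRY_LBA):
    """Return the (first, last) LBA occupied by the partition entry array."""
    total_bytes = entry_size * num_entries
    sectors = -(-total_bytes // sector_size)  # ceiling division
    return first_lba, first_lba + sectors - 1


print(gpt_entry_array_lbas())
```

Under these assumptions the array is 16384 bytes, that is 32 sectors, which matches the range given in the description. With larger sectors the same array would span fewer LBAs.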

Based on the above structure, in the application example of the present application, each logic block has its corresponding target cleaning policy, where the target cleaning policy includes one of a first policy for full cleaning, a second policy for partial cleaning, and a third policy for not performing data cleaning, and the detailed allocation is as follows:

(1) For the i-node bitmap and the logical block bitmap, no cleaning is performed, in accordance with the third strategy.

The i-node bitmap and the logical block bitmap store little information, so they need not be cleaned.

(2) For the logical block, then, a partial clean is performed based on the second policy and the start address of the logical block. The detailed cleaning process is described in more detail below.

(3) For the blocks other than the inode bitmap, the logical block bitmap, and the logical blocks, i.e., the boot block, the superblock, and the inode, full cleaning is performed based on the first policy; that is, the inode blocks, the boot block, and the superblock shown in fig. 3 are cleaned in full (100%).
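The allocation above amounts to a lookup from block type to strategy. A minimal sketch follows; the enum, the string keys, and the function name are hypothetical labels chosen for illustration, and the partition table entry reflects the full cleaning that the description applies to it in a later step.

```python
from enum import Enum


class Policy(Enum):
    FULL = 1      # first strategy: overwrite the whole block
    PARTIAL = 2   # second strategy: overwrite part of the block
    SKIP = 3      # third strategy: no data cleaning


# Hypothetical type labels; the description identifies types via block descriptors.
POLICY_BY_BLOCK_TYPE = {
    "boot_block": Policy.FULL,
    "superblock": Policy.FULL,
    "inode": Policy.FULL,
    "partition_table": Policy.FULL,
    "logical_block": Policy.PARTIAL,
    "inode_bitmap": Policy.SKIP,
    "logical_block_bitmap": Policy.SKIP,
}


def target_policy(block_type: str) -> Policy:
    """Look up the target cleaning strategy for a block type."""
    return POLICY_BY_BLOCK_TYPE[block_type]
```

Keeping the mapping in one table makes it easy to see at a glance which regions are skipped and which are overwritten.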

As shown in FIG. 3, the cleaning scheme of the bare metal magnetic disk comprises the following steps:

1. The bare metal management platform initiates an 'unsubscribe' request, which then automatically triggers the 'cleaning' process.

After finishing resource clearing, the bare metal server side completes the 'unsubscribe' process and then automatically triggers the 'cleaning' process. The bare metal server is now in the 'cleaning' state and cannot be ordered, so the ordering process of users is not affected. That is, data cleaning of the storage medium to be cleaned is started in response to an unsubscribe request, and during data cleaning the storage medium to be cleaned is in a second state in which it cannot be ordered.

2. Determine the logical blocks corresponding to each partition of the storage medium to be cleaned and the corresponding target cleaning strategies, and clean the logical blocks hierarchically based on the target cleaning strategy of each logical block until the storage medium to be cleaned has been cleaned. A disk data cleaning program is started and passed a cleaning level parameter. After the cleaning process is triggered, the bare metal server boots the in-memory operating system using the pre-boot execution environment, and the proxy service in the in-memory operating system starts the disk data cleaning program and passes it the cleaning level parameter. Illustratively, the cleaning level parameter here is 5, which represents overwriting 50% of the data blocks of each file. In addition, when the bare metal disk is cleaned, the logical blocks of the disk can all be cleaned by overwriting them with generated random numbers.
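The description gives only one data point for the cleaning level parameter: level 5 means 50% of the data blocks of each file. The sketch below assumes, purely for illustration, that levels run from 0 to 10 and map linearly to a percentage; both the range and the linearity are assumptions, and the function name is hypothetical.

```python
def cleaning_percent(level: int) -> int:
    """Map a cleaning level to the percentage of data blocks to overwrite.

    Assumption: levels 0..10 map linearly to 0%..100%. The source only
    states that level 5 corresponds to 50%.
    """
    if not 0 <= level <= 10:
        raise ValueError("cleaning level must be between 0 and 10")
    return level * 10


print(cleaning_percent(5))
```

Returning an integer percentage rather than a floating-point ratio avoids rounding surprises when the value is later multiplied by a block count.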

Taking partition 1 of the bare metal magnetic disk in fig. 3 as an example, a specific cleaning step of a logic block corresponding to partition 1 is described, as shown in fig. 4, including the following steps:

Step 401: all the disks of the bare metal server are identified, and their partition information is recorded.

The bare metal management platform determines the partition form of each bare metal disk, that is, whether it uses the GPT or the MBR partition form.

Step 402: the starting address and type of each logical block is determined based on the block descriptor.

Identifying the start address and type of each logical block corresponding to each partition based on the block descriptor, including: the start logical block address of the logical blocks such as boot block, superblock, inode, and logical block (including directory block and data block). The types of logical blocks include: boot blocks, superblocks, inodes, and logic blocks. The target policy includes one of a first policy for full scrubbing, a second policy for partial scrubbing, and a third policy for no data scrubbing.

Step 403: a LAB hash table is created.

The logical block is divided into a directory block and a data block, and before cleaning the data block, an LBA hash table representing a start address of at least one data block corresponding to each file name needs to be established.

The directory blocks and the index information of all files recorded in them are identified, and an index is established that records, in order, all data blocks associated with each file. As shown in fig. 3, the directory block stores the directory information of a file, including the file inode, file type, and file name. Illustratively, the file name "a.txt" corresponds to inode 1, and the logical address of the data block corresponding to the file name "a.txt" is stored in inode 1. All entries of file type "directory" in the directory block information are traversed to find the entries of file type "file"; the inode of each such file is queried to obtain the logical block addresses of its data blocks; and a hash table is established to store the logical block addresses of the data blocks of entries of type "file", with the hash value taken from the file name. Entries of file type "directory" are therefore not recorded in the LBA hash table, and the LBA hash table is complete once all directories of the directory block have been traversed (as shown in fig. 5).

The file names in the directory block in fig. 5 are a, b, c, e and f, which correspond to inodes 3, 4, 5, 6, 7, and 8. The logical addresses of the data blocks of each file are queried from the inode corresponding to each file name. This yields the LBA hash table on the right side of fig. 5, which characterizes the logical block address of at least one data block corresponding to the same file. The hash table on the left side of fig. 5 is organized as in the related art; although it also records the file to which each data block belongs, a comparison of the two sides shows that it cannot clearly represent the data blocks that correspond to file a or their logical start addresses.
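The traversal described above can be sketched with in-memory stand-ins for the on-disk structures. The `DirEntry` type, the `directories` and `inode_lbas` dictionaries, and the function name are hypothetical; a Python `dict` serves as the hash table. The sketch keys the table by full path rather than by bare file name so that equally named files in different directories do not collide, which is an assumption that goes beyond the text.

```python
from typing import Dict, List, NamedTuple


class DirEntry(NamedTuple):
    inode: int
    ftype: str   # "file" or "directory"
    name: str


def build_lba_table(root_inode: int,
                    directories: Dict[int, List[DirEntry]],
                    inode_lbas: Dict[int, List[int]]) -> Dict[str, List[int]]:
    """Map each regular file to the ordered LBAs of its data blocks."""
    table: Dict[str, List[int]] = {}
    stack = [(root_inode, "")]
    while stack:
        dir_inode, prefix = stack.pop()
        for entry in directories.get(dir_inode, []):
            path = f"{prefix}/{entry.name}"
            if entry.ftype == "directory":
                stack.append((entry.inode, path))   # descend, but do not record
            else:
                table[path] = list(inode_lbas[entry.inode])
    return table


directories = {
    2: [DirEntry(3, "file", "a.txt"), DirEntry(4, "directory", "docs")],
    4: [DirEntry(5, "file", "b.txt")],
}
inode_lbas = {3: [100, 101, 102], 5: [200]}
table = build_lba_table(2, directories, inode_lbas)
```

As in the description, directory entries are used only for traversal and never appear as keys in the resulting table.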

Step 404: the target data blocks are cleaned up for each file based on the content variable length blocks and the cleaning level parameters.

After the LBA hash table is established, it is necessary to divide blocks on the basis of the content variable length rule for each file. Dividing the starting address of at least one data block of the same file based on a partitioning rule, and determining a target data block corresponding to the same file; the target data block corresponding to the same file comprises each block and the starting address corresponding to each block after the same file is divided. The chunking rule here is a content variable length chunking rule.

As shown in fig. 6, taking file a as an example, the data blocks stored for the file are first summed to obtain the total storage space of the file, and the length of each divided block is then constrained to lie between a specified minimum and a specified maximum. The minimum is the block size corresponding to one logical block address, a bytes; the maximum is 10% of the storage space of the file. If the file is so small that this maximum is smaller than the minimum, the maximum is set equal to the minimum. Following the Rabin fingerprint principle, a data fingerprint is computed over a fixed-length sliding window win; when the fingerprint in the sliding window matches a preset value, that position is a block boundary, and the process is repeated until the whole file has been divided. After this division, file a is split into variable-length blocks whose sizes tend toward a normal distribution (as shown in fig. 6); here file a is divided into 9 blocks in order of logical block address. File a, divided into 9 blocks, is the target data block.
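The content variable length rule can be illustrated with a short sketch. It is not a true Rabin fingerprint: for brevity it substitutes a simple polynomial rolling hash over the sliding window, and the window length, `mask`, and `target` values are arbitrary assumptions. Only the size bounds (one logical block as the minimum, 10% of the file as the maximum) come from the description; boundaries here are byte offsets and are not aligned to logical blocks.

```python
from typing import List, Tuple

BASE = 257
MOD = (1 << 31) - 1


def chunk_bounds(file_size: int, block_size: int) -> Tuple[int, int]:
    """Minimum is one logical block; maximum is 10% of the file, never below the minimum."""
    min_size = block_size
    max_size = max(file_size // 10, min_size)
    return min_size, max_size


def cdc_chunks(data: bytes, block_size: int, window: int = 16,
               mask: int = 0x3F, target: int = 0) -> List[Tuple[int, int]]:
    """Split data into (offset, length) chunks at content-defined boundaries."""
    min_size, max_size = chunk_bounds(len(data), block_size)
    top = pow(BASE, window - 1, MOD)      # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        length = i + 1 - start
        hit = i + 1 >= window and (h & mask) == target
        if length >= max_size or (length >= min_size and hit):
            chunks.append((start, length))
            start = i + 1
    if start < len(data):
        chunks.append((start, len(data) - start))    # trailing remainder
    return chunks
```

Because a boundary depends only on the bytes inside the window, every chunk except possibly the last is at least the minimum size, and no chunk exceeds the maximum.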

After determining the target data block, determining the cleaning grade of the target data block; and determining the cleaning priority of each block based on the starting address of each block in the target data block and the preset cleaning priority.

The cleaning priority can be assigned in order of the start address of each block, or the head block, that is, the block corresponding to the starting logical address in the target data block, can be cleaned preferentially while the other blocks are cleaned uniformly. Illustratively, as shown in fig. 6, the cleaning level by block position tends toward an exponential distribution, ordered by the start address of each block of the file; that is, the cleaning priority decreases continuously from block 1 to block 9. Alternatively, as indicated by the dashed part of file a in fig. 6, the head block is cleaned preferentially and the other blocks are cleaned uniformly. Whichever option is chosen, the block at the head position of the target data block, i.e., of the divided file a, is cleaned first. After the cleaning priority is determined, the at least one data block corresponding to each file can be cleaned based on the cleaning level parameter.
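The two priority options can be sketched as two selection functions over a list of `(start_address, length)` blocks. The function names are hypothetical, the percentage is rounded up to a whole number of blocks as an assumption, and "cleaned uniformly" is interpreted here as picking evenly spaced blocks after the head block, which is one possible reading of the text.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start_address, length)


def _count(n_blocks: int, percent: int) -> int:
    """Number of blocks to clean, rounded up (assumption)."""
    return (n_blocks * percent + 99) // 100


def select_by_address(blocks: List[Block], percent: int) -> List[Block]:
    """Option 1: priority falls with start address, so take the lowest addresses."""
    ordered = sorted(blocks)
    return ordered[:_count(len(ordered), percent)]


def select_head_then_uniform(blocks: List[Block], percent: int) -> List[Block]:
    """Option 2: always take the head block, then evenly spaced others."""
    ordered = sorted(blocks)
    count = _count(len(ordered), percent)
    if count == 0:
        return []
    rest, need = ordered[1:], count - 1
    picked = [ordered[0]]
    picked += [rest[k * len(rest) // need] for k in range(need)]
    return picked


blocks = [(i * 10, 10) for i in range(9)]   # nine blocks, as in the file a example
```

With nine blocks and 50%, both functions select five blocks and both include the head block; they differ only in how the remaining four are spread across the file.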

A logical block is either a directory block or a data block. If the type of the logical block is data block, 50% of the data blocks of each file, together with the file identifier, are cleaned according to cleaning level parameter 5, and the target data blocks of the same file are cleaned based on their start addresses. The file identifier is the file name. As shown in fig. 3, for the file name "a.txt" and level parameter 5, 50% of the accumulated total capacity of data block 1 + data block 2 + data block 3 occupied by "a.txt" is cleaned. More generally, the data blocks can be cleaned based on a preset proportion, which can be set according to the needs of the user.

After the cleaning level and the cleaning level parameter are established, blocks of the file are selected according to the cleaning level, and the target data blocks are cleaned based on the cleaning level parameter. Taking one file at a time, the file data blocks recorded in the LBA hash table are divided into a number of irregular blocks by the content-based variable-length blocking method, 50% of the blocks are selected, and random numbers are generated to overwrite the selected logical block addresses, which completes the cleaning of the file's data blocks. Because blocks are cleaned selectively according to the strategy, the integrity of the data is destroyed while cleaning is more efficient than the traditional approach of overwriting every data block. In short, each file is divided into variable-length blocks based on its content, the data blocks corresponding to some of the blocks are selected according to the cleaning level, and random numbers are generated to overwrite the disk sectors of those data blocks so that the data cannot be recovered.
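The overwrite step itself can be sketched on an in-memory buffer standing in for the disk. The function name and the `bytearray` stand-in are assumptions for illustration; a real implementation would write to the block device and flush the writes, which is outside this sketch.

```python
import os
from typing import Iterable, Tuple


def overwrite_chunks(device: bytearray, chunks: Iterable[Tuple[int, int]]) -> int:
    """Overwrite each selected (offset, length) range with random bytes.

    Returns the number of bytes overwritten. Ranges that were not
    selected are left untouched.
    """
    written = 0
    for offset, length in chunks:
        device[offset:offset + length] = os.urandom(length)
        written += length
    return written


disk = bytearray(100)                        # zero-filled stand-in for a file's blocks
written = overwrite_chunks(disk, [(0, 10), (50, 20)])
```

Only 30 of the 100 bytes are rewritten here, which is the source of the efficiency gain over a full overwrite, while the damaged ranges still break the file's integrity.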

Step 405: and performing full-scale cleaning on the directory blocks.

The type of the logical block is determined based on the block descriptor. If the type is directory block, full cleaning is performed based on the start address of the directory block. Since the directory block stores the index information of all files, the block descriptor is read to learn that the directory block occupies logical block addresses E to F of the disk, and random numbers are generated to overwrite logical block addresses E to F in full, (F-E) x a bytes in total. That is, the directory blocks shown in fig. 3 are cleaned in full (100%).

Step 406: the boot block, superblock, and inode are cleaned based on a first policy.

For the boot block, the block descriptor is read to learn that the boot block is located at the first logical block address of the disk. Taking a logical block address of size a bytes as an example, the a bytes of that logical block are overwritten in full with generated random numbers.

For the superblock, the block descriptor is read to learn that the superblock occupies logical block addresses A to B of the disk, and random numbers are generated to overwrite logical block addresses A to B in full, (B-A) x a bytes in total.

For the inode, the block descriptor is read to learn that the inode blocks occupy logical block addresses C to D of the disk, and random numbers are generated to overwrite logical block addresses C to D in full, (D-C) x a bytes in total. That is, the boot block, superblock, and inode shown in fig. 3 are cleaned in full (100%).
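Full cleaning of a region such as the superblock (addresses A to B) or the inode blocks (C to D) reduces to overwriting a half-open LBA range, which gives the (B-A) x a byte count used above. A minimal sketch, using `io.BytesIO` as a stand-in for the block device; the function name and the 1 MiB write size are illustrative assumptions.

```python
import io
import os


def overwrite_lba_range(dev, first_lba: int, end_lba: int, block_size: int) -> int:
    """Overwrite LBAs [first_lba, end_lba) with random bytes.

    Writes (end_lba - first_lba) * block_size bytes and returns that count.
    """
    dev.seek(first_lba * block_size)
    remaining = (end_lba - first_lba) * block_size
    written = 0
    while remaining > 0:
        n = min(remaining, 1 << 20)          # write in pieces of at most 1 MiB
        dev.write(os.urandom(n))
        remaining -= n
        written += n
    return written


dev = io.BytesIO(bytes(8 * 512))             # eight zero-filled 512-byte blocks
count = overwrite_lba_range(dev, 2, 5, 512)  # e.g. A = 2, B = 5, a = 512
```

In this example the call overwrites (5-2) x 512 = 1536 bytes and leaves the blocks before LBA 2 and from LBA 5 onward untouched.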

Step 407: steps 402-405 are repeated until all partitions have been cleaned, and the disk partition table is deleted.

The partition table of the whole disk is deleted, including information such as the sector distribution and size of each partition. The partition table of the bare metal disk is cleaned in full based on the first strategy. Specifically, the MBR partition table immediately follows the boot block, so the second logical block address of the disk is erased with random numbers; the GPT partition table exists at multiple locations on the disk, so after the GPT header is identified, the following 33 logical block addresses are overwritten in full with random numbers.

3. After completing the disk cleaning operation, the agent program restores the bare metal server to an orderable state.

That is, after the storage medium to be cleaned is determined to have been cleaned, it is switched to a first state in which it can be ordered.
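The state handling around the three steps above can be sketched as a small state machine. The class, method, and state names are hypothetical; the `IN_USE` state and the choice to leave the server in the cleaning state when cleaning raises an error are assumptions not stated in the description.

```python
from enum import Enum
from typing import Callable


class ServerState(Enum):
    IN_USE = "in_use"         # assumption: state while rented by a user
    CLEANING = "cleaning"     # second state: cannot be ordered
    ORDERABLE = "orderable"   # first state: can be ordered


class BareMetalServer:
    def __init__(self) -> None:
        self.state = ServerState.IN_USE

    def can_order(self) -> bool:
        return self.state is ServerState.ORDERABLE

    def handle_unsubscribe(self, clean_disks: Callable[[], None]) -> None:
        """Unsubscribing triggers cleaning; the server is orderable only afterwards."""
        self.state = ServerState.CLEANING
        clean_disks()                  # if this raises, the server stays in CLEANING
        self.state = ServerState.ORDERABLE


server = BareMetalServer()
seen_during_cleaning = []
server.handle_unsubscribe(lambda: seen_during_cleaning.append(server.can_order()))
```

Because cleaning runs entirely between unsubscribing and the return to the orderable state, none of its duration is charged to a later order.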

The application example improves the disk data cleaning scheme for bare metal servers. Compared with traditional global overwriting of disk data, the technique clearly improves data cleaning efficiency; compared with the original disk metadata cleaning, it clearly improves data security.

In conclusion, the technique is both efficient and secure: it protects user data on the bare metal server, shortens the ordering process, and improves the user experience overall. Since public cloud bare metal vendors provide a resource cleaning function, the technique has broad market application prospects.

In order to implement the method according to the embodiment of the present application, the embodiment of the present application further provides a data cleaning device. The data cleaning device corresponds to the data cleaning method described above, and each step of the method embodiment is fully applicable to the device embodiment. As shown in fig. 7, the data cleaning device 700 includes a determining module 710 and a cleaning module 720. The determining module 710 is configured to determine a logic block corresponding to each partition of the storage medium to be cleaned and a corresponding target cleaning policy; the cleaning module 720 is configured to perform hierarchical cleaning on the logical blocks based on the target cleaning policy of each logical block until the storage medium to be cleaned is cleaned; the target cleaning strategy is one of a first strategy for full cleaning, a second strategy for partial cleaning and a third strategy for not cleaning data.

In some embodiments, the determining module 710 is further configured to determine a start address and a type of each corresponding logical block based on the block descriptor of each partition of the storage medium to be cleaned;

determining the target cleaning strategy based on the type of the logic block;

the scrubbing module 720 is further configured to scrub the logical block based on the target scrubbing policy and a start address of the logical block.

In some embodiments, the cleaning module 720 is further configured to, if the type of the logical block is an inode bitmap or a logical block bitmap, not clean the inode bitmap or the logical block bitmap based on the third policy;

In some embodiments, the types of the logical blocks include directory blocks and data blocks, the second policy indicates that the directory blocks are cleaned in full and the data blocks are cleaned based on a preset cleaning ratio, and the determining module 710 is further configured to determine the types of the logical blocks based on the block descriptors; the cleaning module 720 is further configured to perform full cleaning based on a start address of the directory block if the type of the logical block is the directory block;

In some embodiments, the determining module 710 is further configured to determine a starting address of at least one data block belonging to the same file based on the hash table and the file identification; the hash table is used for representing the starting address of at least one data block corresponding to each file identifier; dividing the starting address of at least one data block of the same file based on a partitioning rule, and determining a target data block corresponding to the same file; the target data block corresponding to the same file comprises each block after the same file is divided and the starting address corresponding to each block; the cleaning module 720 is further configured to clean the target data block corresponding to the same file based on the start address based on the preset cleaning ratio.

In some embodiments, the determining module 710 is further configured to determine a cleaning priority of each block based on the starting address of each block in the target data block and a preset cleaning priority; the cleaning module 720 is further configured to clean the target data block corresponding to the same file based on the start address based on the cleaning priority of each block and the preset cleaning proportion.

In some embodiments, the partitioning rule is a content variable length partitioning rule, and the determining module 710 is further configured to divide a start address of at least one data block of the same file based on the content variable length partitioning rule, and determine a target data block corresponding to the same file; wherein each block size in the target data block is close to normal distribution.

In some embodiments, the determining module 710 is further configured to determine a start address of at least one data block corresponding to each of the same files based on directory information in the directory blocks and each of the file identifications; the catalog information comprises a starting address of a data block corresponding to each file identifier;

the data cleaning device further includes a generating module 730, configured to generate a hash value corresponding to each file identifier based on each file identifier and a hash algorithm;

In some embodiments, the data cleaning apparatus further includes an obtaining module 740 configured to obtain a cleaning level parameter; the determination module 710 is also configured to determine the cleaning ratio based on the cleaning level parameter.

In some embodiments, the data cleaning device further includes a starting module 750 and a switching module 760, where the starting module 750 is configured to start data cleaning of the storage medium to be cleaned in response to a unsubscribe request; the switching module 760 is configured to switch the storage medium to be cleaned to a first state that can be ordered after determining that the storage medium to be cleaned is cleaned;

In some embodiments, the obtaining module 740 is further configured to obtain a partition table of the storage medium to be cleaned, where the partition table includes a globally unique identification partition table or a master boot record partition table;

the cleaning module 720 is further configured to perform full cleaning on the partition table based on the first policy.

In practical applications, the determining module 710, the cleaning module 720, the generating module 730, the obtaining module 740, the starting module 750 and the switching module 760 may be implemented by a processor in the data cleaning device. Of course, the processor needs to run a computer program in memory to implement its functions.

It should be noted that: in the data cleaning device provided in the above embodiment, only the division of each program module is used for illustration, and in practical application, the processing allocation may be performed by different program modules according to needs, that is, the internal structure of the device is divided into different program modules, so as to complete all or part of the processing described above. In addition, the data cleaning device and the data cleaning method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the data cleaning device and the data cleaning method are detailed in the method embodiments and are not described herein again.

Based on the hardware implementation of the above program modules, and in order to implement the method of the embodiment of the present application, the embodiment of the present application further provides a data cleaning device. Fig. 8 shows only an exemplary structure of the data cleaning device rather than the entire structure; part or all of the structure shown in fig. 8 may be implemented as needed.

As shown in fig. 8, a data cleaning device 800 provided in an embodiment of the present application includes at least one processor 801, a memory 802, a user interface 803, and at least one network interface 804. The various components of the data cleaning device 800 are coupled together by a bus system 805. It is appreciated that the bus system 805 is used to enable connection and communication between these components. In addition to a data bus, the bus system 805 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 805 in fig. 8.

The user interface 803 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.

The memory 802 in embodiments of the present application is used to store various types of data to support the operation of the data cleansing device. Examples of such data include: any computer program for operating on a data cleaning device.

The data cleaning method disclosed in the embodiment of the application can be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the data cleansing method may be performed by integrated logic circuits of hardware in the processor 801 or instructions in the form of software. The processor 801 may be a general purpose processor, a digital signal processor (DSP, digital Signal Processor), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 801 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiment of the application can be directly embodied in the hardware of the decoding processor or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium, such as a memory 802, where the processor 801 reads information from the memory 802 and, in combination with its hardware, performs the steps of the data cleansing method provided by the embodiments of the present application.

In an exemplary embodiment, the data cleansing device may be implemented by one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, programmable Logic Device), complex programmable logic devices (CPLD, complex Programmable Logic Device), field programmable gate arrays (FPGA, field Programmable Gate Array), general purpose processors, controllers, microcontrollers (MCU, micro Controller Unit), microprocessors (Microprocessor), or other electronic components for performing the aforementioned methods.

It is to be appreciated that memory 802 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. Wherein the nonvolatile Memory may be Read Only Memory (ROM), programmable Read Only Memory (PROM, programmable Read-Only Memory), erasable programmable Read Only Memory (EPROM, erasable Programmable Read-Only Memory), electrically erasable programmable Read Only Memory (EEPROM, electrically Erasable Programmable Read-Only Memory), magnetic random access Memory (FRAM, ferromagnetic random access Memory), flash Memory (Flash Memory), magnetic surface Memory, optical disk, or compact disk Read Only Memory (CD-ROM, compact Disc Read-Only Memory); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be random access memory (RAM, random Access Memory), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM, static Random Access Memory), synchronous static random access memory (SSRAM, synchronous Static Random Access Memory), dynamic random access memory (DRAM, dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (ddr SDRAM, double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, enhanced Synchronous Dynamic Random Access Memory), synchronous link dynamic random access memory (SLDRAM, syncLink Dynamic Random Access Memory), direct memory bus random access memory (DRRAM, direct Rambus Random Access Memory). The memory described by embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.

In an exemplary embodiment, the present application also provides a storage medium, i.e. a computer storage medium, which may be specifically a computer readable storage medium, for example, including a memory 802 storing a computer program, where the computer program may be executed by the processor 801 of the data cleaning device to perform the steps described in the method according to the embodiment of the present application. The computer readable storage medium may be ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM.

It should be noted that: "first," "second," etc. are used to distinguish similar objects and not necessarily to describe a particular order or sequence.

In addition, the embodiments of the present application may be arbitrarily combined without any collision.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive falls within the scope of the present application. Therefore, the protection scope of the present application is subject to the protection scope of the claims.

Claims

1. A data cleaning method, the method comprising:

2. The method of claim 1, wherein determining the logical block corresponding to each partition of the storage medium to be cleaned and the corresponding target cleaning policy comprises:

determining the target cleaning policy based on the type of the logical block;

3. The method of claim 2, wherein the hierarchical cleaning of the logical blocks based on a target cleaning policy for each of the logical blocks comprises:

4. The method of claim 3, wherein the type of the logical block includes a directory block and a data block, wherein the second policy indicates full cleaning of the directory block and partial cleaning of the data block based on a preset cleaning ratio, and wherein the partial cleaning based on the second policy and a start address of the logical block comprises:

determining a type of the logical block based on the block descriptor;
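By way of illustration only, and without limiting the claims, the second policy recited in claim 4 (full cleaning of directory blocks, partial cleaning of data blocks at a preset cleaning ratio) might be sketched as follows. The `LogicalBlock` structure, the descriptor values `"dir"` and `"data"`, the use of zero-filling as the cleaning operation, and the choice to clean the lowest-addressed data blocks first are all hypothetical assumptions made for this sketch; the per-file selection of target data blocks described in the dependent claims is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class LogicalBlock:
    """Hypothetical in-memory stand-in for a logical block."""
    descriptor: str       # assumed block descriptor: "dir" or "data"
    start_address: int    # start address of the block
    payload: bytearray    # block contents


def block_type(block: LogicalBlock) -> str:
    """Determine the block type from its (assumed) block descriptor."""
    return "directory" if block.descriptor == "dir" else "data"


def zero_fill(block: LogicalBlock) -> None:
    """Model 'cleaning' as overwriting the payload with zero bytes."""
    block.payload[:] = bytes(len(block.payload))


def partial_clean(blocks: list[LogicalBlock], ratio: float) -> list[int]:
    """Clean every directory block and a `ratio` share of the data blocks.

    Returns the start addresses of the blocks that were cleaned.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")

    directory = [b for b in blocks if block_type(b) == "directory"]
    data = sorted(
        (b for b in blocks if block_type(b) == "data"),
        key=lambda b: b.start_address,
    )

    # Directory blocks are always cleaned in full.
    for b in directory:
        zero_fill(b)

    # Assumption: clean the lowest-addressed data blocks, rounding up.
    targets = data[: math.ceil(ratio * len(data))]
    for b in targets:
        zero_fill(b)

    return [b.start_address for b in directory] + [b.start_address for b in targets]
```

In this sketch a ratio of 1.0 degenerates to full cleaning of every block, while a ratio of 0.0 cleans only the directory blocks.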

5. The method of claim 4, wherein the cleaning the target data block corresponding to the same file based on the start address according to the preset cleaning ratio and the file identifier comprises:

6. The method of claim 5, wherein the cleaning of the target data block corresponding to the same file based on the start address according to the preset cleaning ratio comprises:

7. The method of claim 5, wherein the partitioning rule is a content variable length partitioning rule, the partitioning the start address of at least one data block of the same file based on the partitioning rule, and determining the target data block corresponding to the same file includes:

8. The method of claim 5, wherein the method further comprises:

9. The method of claim 4, wherein prior to the partial cleaning based on the second policy and the start address of the logical block, the method further comprises:

acquiring a cleaning grade parameter;

and determining the preset cleaning ratio based on the cleaning grade parameter.
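By way of illustration only, the determination recited in claim 9 could be as simple as a table lookup from the cleaning grade parameter to a ratio. The grade values and ratios below are invented for this sketch and are not taken from the application; `GRADE_TO_RATIO` and `cleaning_ratio` are hypothetical names.

```python
# Hypothetical mapping from cleaning grade to the share of data blocks cleaned.
# The application does not state these values; they are placeholders.
GRADE_TO_RATIO = {
    1: 0.25,  # low grade: clean a quarter of the data blocks
    2: 0.50,  # medium grade: clean half of the data blocks
    3: 1.00,  # high grade: clean every data block
}


def cleaning_ratio(grade: int) -> float:
    """Return the cleaning ratio for a given cleaning grade parameter."""
    try:
        return GRADE_TO_RATIO[grade]
    except KeyError:
        raise ValueError(f"unknown cleaning grade: {grade!r}") from None
```

A lookup table keeps the grade-to-ratio relationship explicit and easy to adjust; a formula would serve equally well if the grades were continuous.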

10. The method according to claim 1, wherein the method further comprises:

11. The method according to claim 1, wherein the method further comprises:

12. A data cleaning device, comprising:

13. A data cleaning apparatus, comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,

the processor being adapted to perform the steps of the method of any of claims 1 to 11 when the computer program is run.

14. A computer storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method of any of claims 1 to 11.