CN111753518A - Autonomous file consistency checking method - Google Patents

Autonomous file consistency checking method

Info

Publication number
CN111753518A
Authority
CN
China
Prior art keywords
file
taking
files
log2
bytes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010806690.5A
Other languages
Chinese (zh)
Other versions
CN111753518B (en)
Inventor
张玉启
任伟
王传安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chaoshu Software Technology Co ltd
Original Assignee
Shenzhen Chaoshu Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Chaoshu Software Technology Co ltd filed Critical Shenzhen Chaoshu Software Technology Co ltd
Priority to CN202010806690.5A
Publication of CN111753518A
Application granted
Publication of CN111753518B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F 16/162 Delete operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An autonomous file consistency checking method that verifies consistency using a file attribute judgment method and a user-defined check function judgment method; if the source and target files are inconsistent, the file is retransmitted and re-checked until they are consistent. The file attribute judgment method compares the file name, file length, and last modification time of the source and target files. The user-defined check function judgment method judges whether the files are consistent by comparing the lengths of the source and target files, taking the length modulo, and comparing base-2 logarithms of sampled bytes. Either the file attribute judgment method or the user-defined check function judgment method can be applied according to file size, and the file-size threshold is configurable.

Description

Autonomous file consistency checking method
Technical Field
The invention relates to the technical field of big data and new generation information, in particular to an autonomous file consistency checking method.
Background
With the advance of 5G, big data, the industrial internet, the mobile internet, the digital economy, and digital industrialization, data volumes keep growing. Data is a fundamental resource and an important factor of productivity.
In the course of this development, many kinds of data are generated, but data can be classified into three categories.
Structured data: structured data is data that can be represented and stored in two-dimensional form in a relational database. Its general characteristics are: data is organized in rows, one row represents one entity, and every row has the same attributes. Because the storage and arrangement of structured data are very regular, operations such as querying and modification are straightforward.
Semi-structured data: semi-structured data is a form of structured data that does not conform to the data model of a relational database or other data tables, but contains tags that separate semantic elements and organize records and fields into a hierarchy. It is therefore also called a self-describing structure. Entities belonging to the same class of semi-structured data may have different attributes, and even when they are grouped together, the order of these attributes is not important. Common semi-structured formats are logs, CSV, XML, and JSON.
Unstructured data: as the name implies, data with no fixed structure. Documents, pictures, video/audio, e-mails, contracts, and so on are unstructured data. Such data is generally stored as a whole, usually in a binary format.
According to statistics from international authorities, China's data volume is expected to grow roughly 14-fold from 2015 to 2025. From 2018 to 2025, China's data is expected to grow at an average annual rate of about 30%, roughly 3 percentage points above the world average, reaching 48.6 ZB by 2025, or about 27.8% of the world total. Surveys indicate that structured data accounts for only about 9% of stored information, while unstructured data accounts for about 91%.
A characteristic of unstructured data is that most files are compound (container) files, and their growth is not linear: whenever the data changes, the file structure is broken down and rebuilt. For example, for a file generated by Word or a presentation generated by PowerPoint, a change of even a single byte alters the structure of the whole file, so the entire file is effectively rewritten.
Consistency and availability of data are important. Backup is error-prone (large files are especially prone to errors during backup), so a consistency check must be performed. If the file backed up at the target end is inconsistent with the original file at the source end, the backup is generally unusable (except for some binary formats such as MP3 or video, which may tolerate partial corruption). Since a backup exists to restore the original file when it is lost, an unusable backup cannot be used even after recovery, which is equivalent to data loss.
Therefore, when unstructured data is backed up, it must be determined whether the files at the source end and the target end are consistent; if they are not, the file must be retransmitted and re-checked until they are consistent.
Analysis and comparison with the closest prior art.
At present, the closest prior-art solutions are: application publication No. CN110096483A, entitled "A duplicate file detection method, terminal and server" (hereinafter "comparison document 1"), and publication No. CN104408111B, entitled "A method and apparatus for deleting duplicate data" (hereinafter "comparison document 2").
Comparison document 1 relates to a duplicate file detection method, terminal, and server, and comparison document 2 relates to a method and device for deleting duplicate data; both differ greatly from the present application.
Comparison document 1 provides a duplicate file detection method, terminal, and server. In that method, when a file needs to be uploaded to a server, the terminal obtains the size of the file, determines which target value interval the size falls into, computes a hash of the file using the hash-calculation mode corresponding to that interval, and sends information containing the hash to the server. The server determines from this information whether the file is a duplicate and returns a response indicating whether the file is a duplicate or not.
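For illustration only, a minimal Python sketch of the kind of size-interval hash selection that comparison document 1 describes; the 64 MB cut-off, the choice of MD5, and the block-sampling mode for large files are assumptions made for this sketch, not details taken from that document:

import hashlib
import os

def file_hash(path: str) -> str:
    """Compute a hash using a mode chosen by the file's size interval.

    The 64 MB cut-off and the "sample three 1 MB blocks" mode for large files
    are illustrative placeholders only.
    """
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        if size < 64 * 1024 * 1024:
            # Small interval: hash the whole file.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        else:
            # Large interval: hash the head, middle, and tail blocks plus the size.
            for offset in (0, size // 2, max(size - (1 << 20), 0)):
                f.seek(offset)
                h.update(f.read(1 << 20))
            h.update(str(size).encode())
    return h.hexdigest()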
The method of the present application differs from comparison document 1. Comparison document 1 uses the industry-standard approach of computing a file hash, whereas the present application uses a file attribute judgment method and a custom check function judgment method. The custom check function judgment method judges file consistency by comparing the lengths of the source and target files, taking the length modulo, and comparing base-2 logarithms of sampled bytes, rather than computing a file hash. In addition, the choice between the file attribute judgment method and the custom check function judgment method is configurable, and the file-size threshold can be set.
Comparison document 2 deletes duplicate data based on the OpenStack Object Storage system. The proxy node contains a deduplication middleware module and the storage node contains a deduplication service process module. The deduplication middleware module maintains a deduplication hash ring, each node of which is the root of a red-black tree. The deduplication service process module sends a fingerprint file to the deduplication middleware module; after locating the root node of the red-black tree, the deduplication middleware module judges whether a duplicate file exists. If a duplicate exists, one copy is kept in the storage node, the other duplicates are deleted, and a redirection file pointing to the retained copy is stored at each location where a duplicate was originally stored. If there is no duplicate, the virtual node partition value contained in each fingerprint file and the MD5 value of the file content are inserted into a child node of the red-black tree.
Comparison document 2 thus uses methods common in the industry (a hash ring, node partition values, and MD5 values) to determine whether files are duplicated. The present application, by contrast, uses a file attribute judgment method and a custom check function judgment method; the custom check function judgment method judges file consistency by comparing the lengths of the source and target files, taking the length modulo, and comparing base-2 logarithms of sampled bytes, rather than computing a file hash. In addition, the choice between the two methods is configurable, and the file-size threshold can be set.
Disclosure of Invention
In order to solve the above technical problems, the object of the invention is: to check the consistency of unstructured data using a "file attribute judgment method" and a "custom check function judgment method"; if the source and target files are inconsistent, the file is retransmitted and re-checked until they are consistent.
The file attribute judgment method compares the file name, file length, and last modification date of the source and target files. If all three are identical, the files are judged consistent; if any one of them differs, the files are judged inconsistent.
The custom check function judgment method first obtains the lengths Ls and Ld of the source and target files; if Ls and Ld are not equal, the files are judged inconsistent. If Ls and Ld are equal, the file length is taken modulo, bytes are sampled at positions derived from the result, and their base-2 logarithms are compared, thereby judging whether the files are consistent.
Moreover, the choice between the file attribute judgment method and the custom check function judgment method is configurable, and the file-size threshold can be set; the default is 500 MB.
Drawings
FIG. 1 is a flowchart of a file attribute determination method.
FIG. 2 is a flow chart of a custom check function determination method.
FIG. 3 is a flowchart of two file consistency check methods.
Detailed Description
Fig. 1 is a flowchart of the file attribute judgment method. After the file transmission finishes, the file name, file length, and last modification time of the source and target files are compared; the files at the two ends are judged consistent only if all three are the same. If any one item (file name, file length, or last modification time) differs, the files are judged inconsistent.
If the source and target files are inconsistent, the target file is deleted, and the file is retransmitted from the source and checked again until the files are consistent.
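As an illustration, a minimal Python sketch of the file attribute judgment described above, assuming both files are reachable through the local file system; the function name attributes_match and the truncation of the modification time to whole seconds are choices made for this sketch, not details from the patent:

import os

def attributes_match(src_path: str, dst_path: str) -> bool:
    """Judge consistency by file name, file length, and last-modification time."""
    src_stat, dst_stat = os.stat(src_path), os.stat(dst_path)
    same_name = os.path.basename(src_path) == os.path.basename(dst_path)
    same_size = src_stat.st_size == dst_stat.st_size
    # Comparing mtimes assumes the copy preserves the timestamp at the target end.
    same_mtime = int(src_stat.st_mtime) == int(dst_stat.st_mtime)
    return same_name and same_size and same_mtime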
Fig. 2 is a flowchart of the custom check function judgment method. First, the lengths Ls and Ld of the source and target files are obtained.
If Ls is not equal to Ld, the source and target files are judged inconsistent; if Ls is equal to Ld, proceed to the next step.
Ls and Ld are each divided by 8; if the division is not exact, the remainder is discarded directly, i.e., Ls′ = mod(Ls, 8) and Ld′ = mod(Ld, 8).
Then 8 bytes are taken starting at position Ls′×7, denoted Ls′-8; these bytes are converted to a decimal number and its base-2 logarithm is taken, denoted log2(Ls′-8). Likewise, 8 bytes are taken starting at position Ld′×7, denoted Ld′-8, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-8).
8 bytes are taken starting at position Ls′×5, denoted Ls′-6, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ls′-6); 8 bytes are taken starting at position Ld′×5, denoted Ld′-6, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-6).
8 bytes are taken starting at position Ls′×3, denoted Ls′-4, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ls′-4); 8 bytes are taken starting at position Ld′×3, denoted Ld′-4, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-4).
8 bytes are taken starting at position Ls′×1, denoted Ls′-2, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ls′-2); 8 bytes are taken starting at position Ld′×1, denoted Ld′-2, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-2).
The source and target files are judged to be consistent only if log2(Ls′-8) = log2(Ld′-8), log2(Ls′-6) = log2(Ld′-6), log2(Ls′-4) = log2(Ld′-4), and log2(Ls′-2) = log2(Ld′-2).
If any one of these is not equal, the files are judged inconsistent.
If the source and target files are inconsistent, the target file is deleted, and the file is retransmitted from the source and checked again until the files are consistent.
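As an illustration, a minimal Python sketch of the custom check function described above. The specification says to divide the length by 8 and discard any remainder while writing the result as mod(L, 8); this sketch follows the verbal description and uses integer division, and the +1 guard against taking the logarithm of zero is an added safeguard. Both are assumptions rather than details fixed by the patent:

import math
import os

def sample_log2_values(path: str) -> list:
    """Sample 8 bytes at positions L'*7, L'*5, L'*3, L'*1 and return their base-2 logarithms.

    L' is computed by integer division of the length by 8 (an interpretation;
    the patent's formula writes mod(L, 8)).
    """
    length = os.path.getsize(path)
    l_prime = length // 8
    logs = []
    with open(path, "rb") as f:
        for k in (7, 5, 3, 1):
            f.seek(l_prime * k)
            chunk = f.read(8)                     # the 8 sampled bytes (L'-8, L'-6, L'-4, L'-2)
            value = int.from_bytes(chunk, "big")  # convert the bytes to a decimal number
            # log2(0) is undefined; the +1 guard is an added safeguard, not in the patent.
            logs.append(math.log2(value + 1))
    return logs

def custom_check(src_path: str, dst_path: str) -> bool:
    """Custom check function: equal lengths, then equal sampled log2 values."""
    if os.path.getsize(src_path) != os.path.getsize(dst_path):
        return False  # Ls != Ld: judged inconsistent
    return sample_log2_values(src_path) == sample_log2_values(dst_path)

Since the sampled positions depend only on the (equal) length, the same four offsets are read at both ends, so the comparison is well defined whichever reading of the division step is adopted.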
Fig. 3 is a flowchart combining the two file consistency check methods; the aim is to reduce the time needed to judge file consistency. Specifically, files are classified and a different check method is applied to each class:
the transmitted content is divided into large files and small files, and the threshold between them can be set by the user; the preferred default is 500 MB. If the user's files are mostly large (for example, high-definition video), the threshold can be raised, for example to 1 GB or 2 GB; if the user's files are mostly small (for example, office documents), the threshold can be lowered, for example to 100 MB or 200 MB.
Different check methods are applied to large and small files: the preferred method for large files is the file attribute judgment method, and the preferred method for small files is the custom check function judgment method.
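As an illustration, a minimal Python sketch of this size-based dispatch together with the retransmit-until-consistent loop; attributes_match and custom_check refer to the sketches above, and the transfer callback is a placeholder for whatever copy mechanism the backup system uses, not something defined in the patent:

import os

DEFAULT_THRESHOLD = 500 * 1024 * 1024  # 500 MB default, configurable per the patent

def check_consistency(src_path: str, dst_path: str,
                      threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Choose the check method by file size: file attributes for large files,
    the custom check function for small files."""
    if os.path.getsize(src_path) >= threshold:
        return attributes_match(src_path, dst_path)
    return custom_check(src_path, dst_path)

def backup_with_retry(src_path: str, dst_path: str, transfer) -> None:
    """Retransmit and re-check until the files are consistent.

    `transfer` is a caller-supplied copy function (for example shutil.copy2);
    it is an illustrative placeholder."""
    transfer(src_path, dst_path)
    while not check_consistency(src_path, dst_path):
        os.remove(dst_path)           # delete the inconsistent target file
        transfer(src_path, dst_path)  # retransmit from the source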
Finally, it should be emphasized that the detailed description of the embodiments given above with reference to the drawings is not limiting; those skilled in the art can make various modifications or alterations without departing from the spirit and scope of the claims of the present application. Other embodiments obtained by those skilled in the art according to the technical solutions of the present disclosure also fall within its scope of protection.

Claims (3)

1. An autonomous file consistency checking method, characterized in that two methods are used to check file consistency, specifically:
a file attribute judgment method: comparing the file name, file length, and last modification time of the source and target files; if all three are identical, the files are judged consistent; if any one of them differs, the files are judged inconsistent;
a custom check function judgment method, which specifically comprises:
first, obtaining the lengths Ls and Ld of the source and target files;
if Ls is not equal to Ld, judging the files inconsistent;
if Ls is equal to Ld, dividing Ls and Ld by 8 respectively; if the division is not exact, the remainder is discarded directly, i.e., Ls′ = mod(Ls, 8) and Ld′ = mod(Ld, 8);
then taking 8 bytes starting at position Ls′×7, denoted Ls′-8, converting these bytes to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-8); taking 8 bytes starting at position Ld′×7, denoted Ld′-8, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-8);
taking 8 bytes starting at position Ls′×5, denoted Ls′-6, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-6); taking 8 bytes starting at position Ld′×5, denoted Ld′-6, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-6);
taking 8 bytes starting at position Ls′×3, denoted Ls′-4, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-4); taking 8 bytes starting at position Ld′×3, denoted Ld′-4, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-4);
taking 8 bytes starting at position Ls′×1, denoted Ls′-2, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-2); taking 8 bytes starting at position Ld′×1, denoted Ld′-2, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-2);
the source and target files are judged consistent only when log2(Ls′-8) = log2(Ld′-8), log2(Ls′-6) = log2(Ld′-6), log2(Ls′-4) = log2(Ld′-4), and log2(Ls′-2) = log2(Ld′-2); if any one of these is not equal, the files are judged inconsistent.
2. An autonomous file consistency checking method, characterized in that if the source and target files are inconsistent, the target file is deleted, and the file is retransmitted from the source and checked again until the source and target files are consistent.
3. An autonomous file consistency checking method, characterized in that files are classified and checked by class to reduce judgment time, specifically:
the transmitted content is divided into large files and small files, and the threshold between them can be set by the user, the preferred value being 500 MB;
for large files, the preferred check method is the file attribute judgment method;
for small files, the preferred check method is the custom check function judgment method.
CN202010806690.5A (priority and filing date 2020-08-12) Autonomous file consistency checking method; active; granted as CN111753518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806690.5A CN111753518B (en) 2020-08-12 2020-08-12 Autonomous file consistency checking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010806690.5A CN111753518B (en) 2020-08-12 2020-08-12 Autonomous file consistency checking method

Publications (2)

Publication Number Publication Date
CN111753518A (this publication) 2020-10-09
CN111753518B 2021-03-12

Family

ID=72713386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806690.5A Active CN111753518B (en) 2020-08-12 2020-08-12 Autonomous file consistency checking method

Country Status (1)

Country Link
CN (1) CN111753518B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905547A (en) * 2021-03-25 2021-06-04 深圳潮数软件科技有限公司 Large file de-duplication and re-orientation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1770051A (en) * 2004-11-04 2006-05-10 华为技术有限公司 File safety detection method
CN1784677A (en) * 2004-03-31 2006-06-07 微软公司 System and method for a consistency check of a database backup
CN106547911A (en) * 2016-11-25 2017-03-29 长城计算机软件与系统有限公司 A kind of access method and system of mass small documents
EP3242243A1 (en) * 2016-05-03 2017-11-08 Safran Identity & Security Method for backing up and restoring data of a secure element
CN110457278A (en) * 2018-05-07 2019-11-15 百度在线网络技术(北京)有限公司 A kind of document copying method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551818B (en) * 2009-04-14 2011-04-06 北京红旗中文贰仟软件技术有限公司 A unidirectional multi-mapping file matching method
CN103488952B (en) * 2013-09-24 2017-01-18 华为技术有限公司 File integrity verification method and file processor
CN104346454B (en) * 2014-10-30 2017-12-05 上海新炬网络技术有限公司 Data consistency verification method based on oracle database
CN105739971B (en) * 2016-01-20 2019-03-08 网易(杭州)网络有限公司 Verify generation, application method and the device of file
CN106845278A (en) * 2016-12-26 2017-06-13 武汉斗鱼网络科技有限公司 A kind of file verification method and system
CN110781028B (en) * 2018-07-30 2023-04-11 阿里巴巴集团控股有限公司 Data backup method, data recovery method, data backup device, data recovery device and computing equipment
CN110457953B (en) * 2019-07-26 2021-08-10 中国银行股份有限公司 Method and device for detecting integrity of file

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1784677A (en) * 2004-03-31 2006-06-07 微软公司 System and method for a consistency check of a database backup
CN1770051A (en) * 2004-11-04 2006-05-10 华为技术有限公司 File safety detection method
EP3242243A1 (en) * 2016-05-03 2017-11-08 Safran Identity & Security Method for backing up and restoring data of a secure element
CN106547911A (en) * 2016-11-25 2017-03-29 长城计算机软件与系统有限公司 A kind of access method and system of mass small documents
CN110457278A (en) * 2018-05-07 2019-11-15 百度在线网络技术(北京)有限公司 A kind of document copying method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905547A (en) * 2021-03-25 2021-06-04 深圳潮数软件科技有限公司 Large file de-duplication and re-orientation method

Also Published As

Publication number Publication date
CN111753518B (en) 2021-03-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant