CN111753518A - Autonomous file consistency checking method - Google Patents

Autonomous file consistency checking method

Info

Publication number
CN111753518A
Authority
CN
China
Prior art keywords
file
taking
files
log2
bytes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010806690.5A
Other languages
Chinese (zh)
Other versions
CN111753518B (en)
Inventor
张玉启
任伟
王传安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chaoshu Software Technology Co ltd
Original Assignee
Shenzhen Chaoshu Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Chaoshu Software Technology Co ltd filed Critical Shenzhen Chaoshu Software Technology Co ltd
Priority to CN202010806690.5A
Publication of CN111753518A
Application granted
Publication of CN111753518B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F 16/162 Delete operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An autonomous file consistency checking method that verifies consistency using a file attribute judgment method and a user-defined check function judgment method; if the source and target files are inconsistent, the file is retransmitted and re-checked until they are consistent. The file attribute judgment method compares the file name, file length, and last modification time of the source and target files. The user-defined check function judgment method judges whether the files are consistent by comparing the lengths of the source and target files, taking the length modulo, and comparing base-2 logarithms of sampled bytes. Either the file attribute judgment method or the user-defined check function judgment method can be applied according to file size, and the file-size threshold is configurable.

Description

Autonomous file consistency checking method
Technical Field
The invention relates to the technical field of big data and new generation information, in particular to an autonomous file consistency checking method.
Background
With the advance of 5G, big data, the industrial internet, the mobile internet, the digital economy, and digital industrialization, data volumes keep growing. Data is a fundamental resource and an important factor of productivity.
In the course of this development, many kinds of data are generated, but data can be classified into three categories.
Structured data: structured data is data that can be represented and stored in two-dimensional form in a relational database. Its general characteristics are: data is organized in rows, one row represents one entity, and every row has the same attributes. Because the storage and arrangement of structured data are very regular, operations such as querying and modification are straightforward.
Semi-structured data: semi-structured data is a form of structured data that does not conform to the data model of a relational database or other data tables, but contains tags that separate semantic elements and organize records and fields into a hierarchy. It is therefore also called a self-describing structure. Entities belonging to the same class of semi-structured data may have different attributes, and even when they are grouped together, the order of these attributes is not important. Common semi-structured formats are logs, CSV, XML, and JSON.
Unstructured data: as the name implies, data with no fixed structure. Documents, pictures, video/audio, e-mails, contracts, and so on are unstructured data. Such data is generally stored as a whole, usually in a binary format.
According to statistics from international authorities, China's data volume is expected to grow roughly 14-fold from 2015 to 2025. From 2018 to 2025, China's data is expected to grow at an average annual rate of about 30%, roughly 3 percentage points above the world average, reaching 48.6 ZB by 2025, or about 27.8% of the world total. Surveys indicate that structured data accounts for only about 9% of stored information, while unstructured data accounts for about 91%.
A characteristic of unstructured data is that most files are compound (container) files, and their growth is not linear: whenever the data changes, the file structure is broken down and rebuilt. For example, for a file generated by Word or a presentation generated by PowerPoint, a change of even a single byte alters the structure of the whole file, so the entire file is effectively rewritten.
Consistency and availability of data are important. Backup is error-prone (large files are especially prone to errors during backup), so a consistency check must be performed. If the file backed up at the target end is inconsistent with the original file at the source end, the backup is generally unusable (except for some binary formats such as MP3 or video, which may tolerate partial corruption). Since a backup exists to restore the original file when it is lost, an unusable backup cannot be used even after recovery, which is equivalent to data loss.
Therefore, when unstructured data is backed up, it must be determined whether the files at the source end and the target end are consistent; if they are not, the file must be retransmitted and re-checked until they are consistent.
Analysis and comparison with the closest prior art.
At present, the closest prior-art solutions are: application publication No. CN110096483A, entitled "A duplicate file detection method, terminal and server" (hereinafter "comparison document 1"), and publication No. CN104408111B, entitled "A method and apparatus for deleting duplicate data" (hereinafter "comparison document 2").
Comparison document 1 relates to a duplicate file detection method, terminal, and server, and comparison document 2 relates to a method and device for deleting duplicate data; both differ greatly from the present application.
Comparison document 1 provides a duplicate file detection method, terminal, and server. In that method, when a file needs to be uploaded to a server, the terminal obtains the size of the file, determines which target value interval the size falls into, computes a hash of the file using the hash-calculation mode corresponding to that interval, and sends information containing the hash to the server. The server determines from this information whether the file is a duplicate and returns a response indicating whether the file is a duplicate or not.
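For illustration only, a minimal Python sketch of the kind of size-interval hash selection that comparison document 1 describes; the 64 MB cut-off, the choice of MD5, and the block-sampling mode for large files are assumptions made for this sketch, not details taken from that document:

import hashlib
import os

def file_hash(path: str) -> str:
    """Compute a hash using a mode chosen by the file's size interval.

    The 64 MB cut-off and the "sample three 1 MB blocks" mode for large files
    are illustrative placeholders only.
    """
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        if size < 64 * 1024 * 1024:
            # Small interval: hash the whole file.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        else:
            # Large interval: hash the head, middle, and tail blocks plus the size.
            for offset in (0, size // 2, max(size - (1 << 20), 0)):
                f.seek(offset)
                h.update(f.read(1 << 20))
            h.update(str(size).encode())
    return h.hexdigest()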
The method of the present application differs from comparison document 1. Comparison document 1 uses the industry-standard approach of computing a file hash, whereas the present application uses a file attribute judgment method and a custom check function judgment method. The custom check function judgment method judges file consistency by comparing the lengths of the source and target files, taking the length modulo, and comparing base-2 logarithms of sampled bytes, rather than computing a file hash. In addition, the choice between the file attribute judgment method and the custom check function judgment method is configurable, and the file-size threshold can be set.
Comparison document 2 deletes duplicate data based on the OpenStack Object Storage system. The proxy node contains a deduplication middleware module and the storage node contains a deduplication service process module. The deduplication middleware module maintains a deduplication hash ring, each node of which is the root of a red-black tree. The deduplication service process module sends a fingerprint file to the deduplication middleware module; after locating the root node of the red-black tree, the deduplication middleware module judges whether a duplicate file exists. If a duplicate exists, one copy is kept in the storage node, the other duplicates are deleted, and a redirection file pointing to the retained copy is stored at each location where a duplicate was originally stored. If there is no duplicate, the virtual node partition value contained in each fingerprint file and the MD5 value of the file content are inserted into a child node of the red-black tree.
Comparison document 2 thus uses methods common in the industry (a hash ring, node partition values, and MD5 values) to determine whether files are duplicated. The present application, by contrast, uses a file attribute judgment method and a custom check function judgment method; the custom check function judgment method judges file consistency by comparing the lengths of the source and target files, taking the length modulo, and comparing base-2 logarithms of sampled bytes, rather than computing a file hash. In addition, the choice between the two methods is configurable, and the file-size threshold can be set.
Disclosure of Invention
In order to solve the above technical problems, the object of the invention is: to check the consistency of unstructured data using a "file attribute judgment method" and a "custom check function judgment method"; if the source and target files are inconsistent, the file is retransmitted and re-checked until they are consistent.
The file attribute judgment method compares the file name, file length, and last modification date of the source and target files. If all three are identical, the files are judged consistent; if any one of them differs, the files are judged inconsistent.
The custom check function judgment method first obtains the lengths Ls and Ld of the source and target files; if Ls and Ld are not equal, the files are judged inconsistent. If Ls and Ld are equal, the file length is taken modulo, bytes are sampled at positions derived from the result, and their base-2 logarithms are compared, thereby judging whether the files are consistent.
Moreover, the choice between the file attribute judgment method and the custom check function judgment method is configurable, and the file-size threshold can be set; the default is 500 MB.
Drawings
FIG. 1 is a flowchart of a file attribute determination method.
FIG. 2 is a flow chart of a custom check function determination method.
FIG. 3 is a flowchart of two file consistency check methods.
Detailed Description
Fig. 1 is a flowchart of the file attribute judgment method. After the file transmission finishes, the file name, file length, and last modification time of the source and target files are compared; the files at the two ends are judged consistent only if all three are the same. If any one item (file name, file length, or last modification time) differs, the files are judged inconsistent.
If the source and target files are inconsistent, the target file is deleted, and the file is retransmitted from the source and checked again until the files are consistent.
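As an illustration, a minimal Python sketch of the file attribute judgment described above, assuming both files are reachable through the local file system; the function name attributes_match and the truncation of the modification time to whole seconds are choices made for this sketch, not details from the patent:

import os

def attributes_match(src_path: str, dst_path: str) -> bool:
    """Judge consistency by file name, file length, and last-modification time."""
    src_stat, dst_stat = os.stat(src_path), os.stat(dst_path)
    same_name = os.path.basename(src_path) == os.path.basename(dst_path)
    same_size = src_stat.st_size == dst_stat.st_size
    # Comparing mtimes assumes the copy preserves the timestamp at the target end.
    same_mtime = int(src_stat.st_mtime) == int(dst_stat.st_mtime)
    return same_name and same_size and same_mtime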
Fig. 2 is a flowchart of the custom check function judgment method. First, the lengths Ls and Ld of the source and target files are obtained.
If Ls is not equal to Ld, the source and target files are judged inconsistent; if Ls is equal to Ld, proceed to the next step.
Ls and Ld are each divided by 8; if the division is not exact, the remainder is discarded directly, i.e., Ls′ = mod(Ls, 8) and Ld′ = mod(Ld, 8).
Then 8 bytes are taken starting at position Ls′×7, denoted Ls′-8; these bytes are converted to a decimal number and its base-2 logarithm is taken, denoted log2(Ls′-8). Likewise, 8 bytes are taken starting at position Ld′×7, denoted Ld′-8, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-8).
8 bytes are taken starting at position Ls′×5, denoted Ls′-6, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ls′-6); 8 bytes are taken starting at position Ld′×5, denoted Ld′-6, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-6).
8 bytes are taken starting at position Ls′×3, denoted Ls′-4, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ls′-4); 8 bytes are taken starting at position Ld′×3, denoted Ld′-4, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-4).
8 bytes are taken starting at position Ls′×1, denoted Ls′-2, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ls′-2); 8 bytes are taken starting at position Ld′×1, denoted Ld′-2, converted to a decimal number, and its base-2 logarithm taken, denoted log2(Ld′-2).
The source and target files are judged to be consistent only if log2(Ls′-8) = log2(Ld′-8), log2(Ls′-6) = log2(Ld′-6), log2(Ls′-4) = log2(Ld′-4), and log2(Ls′-2) = log2(Ld′-2).
If any one of these is not equal, the files are judged inconsistent.
If the source and target files are inconsistent, the target file is deleted, and the file is retransmitted from the source and checked again until the files are consistent.
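As an illustration, a minimal Python sketch of the custom check function described above. The specification says to divide the length by 8 and discard any remainder while writing the result as mod(L, 8); this sketch follows the verbal description and uses integer division, and the +1 guard against taking the logarithm of zero is an added safeguard. Both are assumptions rather than details fixed by the patent:

import math
import os

def sample_log2_values(path: str) -> list:
    """Sample 8 bytes at positions L'*7, L'*5, L'*3, L'*1 and return their base-2 logarithms.

    L' is computed by integer division of the length by 8 (an interpretation;
    the patent's formula writes mod(L, 8)).
    """
    length = os.path.getsize(path)
    l_prime = length // 8
    logs = []
    with open(path, "rb") as f:
        for k in (7, 5, 3, 1):
            f.seek(l_prime * k)
            chunk = f.read(8)                     # the 8 sampled bytes (L'-8, L'-6, L'-4, L'-2)
            value = int.from_bytes(chunk, "big")  # convert the bytes to a decimal number
            # log2(0) is undefined; the +1 guard is an added safeguard, not in the patent.
            logs.append(math.log2(value + 1))
    return logs

def custom_check(src_path: str, dst_path: str) -> bool:
    """Custom check function: equal lengths, then equal sampled log2 values."""
    if os.path.getsize(src_path) != os.path.getsize(dst_path):
        return False  # Ls != Ld: judged inconsistent
    return sample_log2_values(src_path) == sample_log2_values(dst_path)

Since the sampled positions depend only on the (equal) length, the same four offsets are read at both ends, so the comparison is well defined whichever reading of the division step is adopted.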
Fig. 3 is a flowchart combining the two file consistency check methods; the aim is to reduce the time needed to judge file consistency. Specifically, files are classified and a different check method is applied to each class:
the transmitted content is divided into large files and small files, and the threshold between them can be set by the user; the preferred default is 500 MB. If the user's files are mostly large (for example, high-definition video), the threshold can be raised, for example to 1 GB or 2 GB; if the user's files are mostly small (for example, office documents), the threshold can be lowered, for example to 100 MB or 200 MB.
Different check methods are applied to large and small files: the preferred method for large files is the file attribute judgment method, and the preferred method for small files is the custom check function judgment method.
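As an illustration, a minimal Python sketch of this size-based dispatch together with the retransmit-until-consistent loop; attributes_match and custom_check refer to the sketches above, and the transfer callback is a placeholder for whatever copy mechanism the backup system uses, not something defined in the patent:

import os

DEFAULT_THRESHOLD = 500 * 1024 * 1024  # 500 MB default, configurable per the patent

def check_consistency(src_path: str, dst_path: str,
                      threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Choose the check method by file size: file attributes for large files,
    the custom check function for small files."""
    if os.path.getsize(src_path) >= threshold:
        return attributes_match(src_path, dst_path)
    return custom_check(src_path, dst_path)

def backup_with_retry(src_path: str, dst_path: str, transfer) -> None:
    """Retransmit and re-check until the files are consistent.

    `transfer` is a caller-supplied copy function (for example shutil.copy2);
    it is an illustrative placeholder."""
    transfer(src_path, dst_path)
    while not check_consistency(src_path, dst_path):
        os.remove(dst_path)           # delete the inconsistent target file
        transfer(src_path, dst_path)  # retransmit from the source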
Finally, it should be emphasized that the detailed description of the embodiments given above with reference to the drawings is not limiting; those skilled in the art can make various modifications or alterations without departing from the spirit and scope of the claims of the present application. Other embodiments obtained by those skilled in the art according to the technical solutions of the present disclosure also fall within its scope of protection.

Claims (3)

1. An autonomous file consistency checking method, characterized in that two methods are used to check file consistency, specifically:
a file attribute judgment method: comparing the file name, file length, and last modification time of the source and target files; if all three are identical, the files are judged consistent; if any one of them differs, the files are judged inconsistent;
a custom check function judgment method, which specifically comprises:
first, obtaining the lengths Ls and Ld of the source and target files;
if Ls is not equal to Ld, judging the files inconsistent;
if Ls is equal to Ld, dividing Ls and Ld by 8 respectively; if the division is not exact, the remainder is discarded directly, i.e., Ls′ = mod(Ls, 8) and Ld′ = mod(Ld, 8);
then taking 8 bytes starting at position Ls′×7, denoted Ls′-8, converting these bytes to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-8); taking 8 bytes starting at position Ld′×7, denoted Ld′-8, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-8);
taking 8 bytes starting at position Ls′×5, denoted Ls′-6, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-6); taking 8 bytes starting at position Ld′×5, denoted Ld′-6, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-6);
taking 8 bytes starting at position Ls′×3, denoted Ls′-4, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-4); taking 8 bytes starting at position Ld′×3, denoted Ld′-4, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-4);
taking 8 bytes starting at position Ls′×1, denoted Ls′-2, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ls′-2); taking 8 bytes starting at position Ld′×1, denoted Ld′-2, converting them to a decimal number and taking its base-2 logarithm, denoted log2(Ld′-2);
the source and target files are judged consistent only when log2(Ls′-8) = log2(Ld′-8), log2(Ls′-6) = log2(Ld′-6), log2(Ls′-4) = log2(Ld′-4), and log2(Ls′-2) = log2(Ld′-2); if any one of these is not equal, the files are judged inconsistent.
2. An autonomous file consistency checking method, characterized in that if the source and target files are inconsistent, the target file is deleted, and the file is retransmitted from the source and checked again until the source and target files are consistent.
3. An autonomous file consistency checking method, characterized in that files are classified and checked by class to reduce judgment time, specifically:
the transmitted content is divided into large files and small files, and the threshold between them can be set by the user, the preferred value being 500 MB;
for large files, the preferred check method is the file attribute judgment method;
for small files, the preferred check method is the custom check function judgment method.
CN202010806690.5A (priority and filing date 2020-08-12) Autonomous file consistency checking method; active; granted as CN111753518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806690.5A CN111753518B (en) 2020-08-12 2020-08-12 Autonomous file consistency checking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010806690.5A CN111753518B (en) 2020-08-12 2020-08-12 Autonomous file consistency checking method

Publications (2)

Publication Number Publication Date
CN111753518A (this publication) 2020-10-09
CN111753518B 2021-03-12

Family

ID=72713386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806690.5A Active CN111753518B (en) 2020-08-12 2020-08-12 Autonomous file consistency checking method

Country Status (1)

Country Link
CN (1) CN111753518B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905547A (en) * 2021-03-25 2021-06-04 深圳潮数软件科技有限公司 Large file de-duplication and re-orientation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1770051A (en) * 2004-11-04 2006-05-10 华为技术有限公司 File safety detection method
CN1784677A (en) * 2004-03-31 2006-06-07 微软公司 System and method for a consistency check of a database backup
CN106547911A (en) * 2016-11-25 2017-03-29 长城计算机软件与系统有限公司 A kind of access method and system of mass small documents
EP3242243A1 (en) * 2016-05-03 2017-11-08 Safran Identity & Security Method for backing up and restoring data of a secure element
CN110457278A (en) * 2018-05-07 2019-11-15 百度在线网络技术(北京)有限公司 A kind of document copying method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551818B (en) * 2009-04-14 2011-04-06 北京红旗中文贰仟软件技术有限公司 A unidirectional multi-mapping file matching method
CN103488952B (en) * 2013-09-24 2017-01-18 华为技术有限公司 File integrity verification method and file processor
CN104346454B (en) * 2014-10-30 2017-12-05 上海新炬网络技术有限公司 Data consistency verification method based on oracle database
CN105739971B (en) * 2016-01-20 2019-03-08 网易(杭州)网络有限公司 Verify generation, application method and the device of file
CN106845278A (en) * 2016-12-26 2017-06-13 武汉斗鱼网络科技有限公司 A kind of file verification method and system
CN110781028B (en) * 2018-07-30 2023-04-11 阿里巴巴集团控股有限公司 Data backup method, data recovery method, data backup device, data recovery device and computing equipment
CN110457953B (en) * 2019-07-26 2021-08-10 中国银行股份有限公司 Method and device for detecting integrity of file

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1784677A (en) * 2004-03-31 2006-06-07 微软公司 System and method for a consistency check of a database backup
CN1770051A (en) * 2004-11-04 2006-05-10 华为技术有限公司 File safety detection method
EP3242243A1 (en) * 2016-05-03 2017-11-08 Safran Identity & Security Method for backing up and restoring data of a secure element
CN106547911A (en) * 2016-11-25 2017-03-29 长城计算机软件与系统有限公司 A kind of access method and system of mass small documents
CN110457278A (en) * 2018-05-07 2019-11-15 百度在线网络技术(北京)有限公司 A kind of document copying method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905547A (en) * 2021-03-25 2021-06-04 深圳潮数软件科技有限公司 Large file de-duplication and re-orientation method

Also Published As

Publication number Publication date
CN111753518B (en) 2021-03-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant