CN107632781B - Method for rapidly checking consistency of distributed storage multi-copy and storage structure

Publication number
CN107632781B
Authority
CN
China
Prior art keywords
hash value
storage
data segment
expired
flag bit
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710748653.1A
Other languages
Chinese (zh)
Other versions
CN107632781A (en)
Inventor
陈仲涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lianyungang Technology Co ltd
Original Assignee
Cloudsoar Networks Inc
Priority date
2017-08-28
Filing date
2017-08-28
Publication date
2020-05-05
Application filed by Cloudsoar Networks Inc
Priority to CN201710748653.1A
Publication of CN107632781A
Application granted
Publication of CN107632781B
Expired - Fee Related

Abstract

The invention discloses a method for rapidly checking the consistency of multiple copies in distributed storage, and a storage structure, using a processing architecture of a control host and a storage host. The method comprises the following steps: dividing a stored file uniformly into a plurality of data segments in advance, each data segment having its own first hash value and a flag bit indicating whether that first hash value has expired; when a write request is received, determining the affected flag bits from the offset and length of the write request and setting them to expired; and screening out the expired flag bits, updating the first hash values corresponding to them, and then calculating a second hash value of the whole file from the first hash values of all data segments. By dividing a large file into data segments, calculating a hash value per segment, and deriving the hash value of the whole file from the segment hash values, the invention avoids reading the data of the whole file, improves the speed of consistency checking, and reduces bandwidth consumption on the storage host.

Description

Method for rapidly checking consistency of distributed storage multi-copy and storage structure
Technical Field
The invention relates to the technical field of information storage, in particular to a method for rapidly checking consistency of distributed storage multiple copies and a storage structure.
Background
With the advent of the information age, the global data volume is on an explosive growth trend. Improving the reliability of storage systems and guaranteeing the availability of data have become important research points for enterprises. In the existing distributed storage system, the reliability, the availability, the performance and the expandability of the system are mostly improved by a multi-copy technology. However, the distributed storage systems communicate through the network, the instability of the network easily causes inconsistency of backend data, and the distributed storage systems generally contain a large number of server hosts and disks, and the probability of hardware damage is high.
If the consistency of the copies cannot be quickly detected, the data integrity and high availability of the distributed storage system are greatly reduced. The existing consistency checking method mainly calculates the hash value of the file and compares whether the hash values of a plurality of duplicate files are consistent to judge whether the data of the files are consistent.
However, for a large file, calculating the hash value consumes a large amount of CPU and storage host bandwidth, which seriously affects the performance of the system. And inconsistent positions of files are often fewer, but the content of the whole file needs to be read for calculating the hash value of the file, so that huge resource waste is caused.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The present invention provides a method and a storage structure for rapidly checking the consistency of multiple copies in distributed storage, with the aim of improving consistency checking speed, reducing bandwidth consumption on the storage host, and accelerating data verification.
The technical scheme adopted by the invention for solving the technical problem is as follows:
A method for rapidly checking the consistency of multiple copies in distributed storage, wherein the distributed storage adopts a processing architecture of a control host and a storage host, and the method comprises the following steps:
A. dividing a stored file uniformly into a plurality of data segments in advance, wherein each data segment has its own first hash value and a flag bit indicating whether that first hash value has expired;
B. when a write request is received, determining the corresponding flag bits from the offset and length of the write request, and setting those flag bits to expired;
C. screening out the expired flag bits, updating the first hash values corresponding to those flag bits, and then calculating a second hash value of the whole file from the first hash value of each data segment.
In the method, the first hash value and the flag bit are saved in an additional new file.
In the method, during initialization both the first hash value and the flag bit are set to 0, and the first hash value corresponding to an unwritten data segment is set to 0.
In the method, step A specifically comprises:
A1. dividing a stored file into a plurality of data segments in advance, wherein the size of each data segment is 4 MB, and performing initialization;
A2. providing each data segment with its own first hash value and a flag bit indicating whether that first hash value has expired.
In the method, step B specifically comprises:
B1. when a write request is received, determining the corresponding flag bits from the offset and length of the write request;
B2. setting each such flag bit from 0 to 1, indicating that the corresponding first hash value has expired.
In the method, step C specifically comprises:
C1. screening out the expired flag bits, and calculating a new first hash value for each data segment whose flag bit is expired;
C2. judging whether the flag bit was set to expired while the new first hash value was being calculated; if so, executing step C1 again, and if not, writing the new first hash value to the storage host;
C3. calculating a second hash value of the whole file from the first hash value of each data segment.
In the method, step C2 is specifically: initializing the flag bit to 0 in the memory of the control host, and determining whether the flag bit was set to 1 while the new first hash value was being calculated; if so, executing step C1 again, and otherwise writing the new first hash value to the storage host.
A storage structure, wherein the storage structure adopts a processing architecture of a control host and a storage host;
the control host is built with a virtual disk and is used for managing the life cycle of the virtual disk and completing the functions of receiving, caching and forwarding data;
the storage host consists of a plurality of storage media and is used for storing redundant data;
the storage structure stores a computer program, and the computer program realizes the steps of the method for rapidly checking consistency of distributed storage multiple copies when being executed by a control host.
The invention has the beneficial effects that: the invention provides a method for rapidly checking consistency of distributed storage multiple copies and a storage structure, wherein a large file is divided into a plurality of data segments, the hash value of the file is calculated by the data segments, and the hash value of the whole file is calculated by the hash value of each data segment; by the method, the hash value of the corresponding data segment is updated only by recording which data segments are modified, so that the data of the whole file is prevented from being read when the consistency is checked, the consistency checking speed is greatly improved, and the consumption of the bandwidth of the storage host is reduced; and the hash value is calculated by the data segment, so that concurrent calculation can be realized more easily when the system is idle, and the data verification speed is greatly accelerated.
Drawings
FIG. 1 is a flowchart illustrating a method for rapidly verifying consistency among multiple copies of a distributed storage system according to a preferred embodiment of the present invention.
FIG. 2 is a schematic block diagram of a preferred embodiment of the storage structure of the present invention.
FIG. 3 is a diagram illustrating a first hash value of a data segment according to a preferred embodiment of a method for fast consistency check of multiple copies in distributed storage.
FIG. 4 is a flowchart illustrating a method for rapidly verifying consistency among multiple copies of a distributed storage according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
An embodiment of the present invention provides a method for rapidly checking the consistency of multiple copies in distributed storage, where the distributed storage uses a processing architecture of a control host and a storage host; see FIG. 1 to FIG. 4.
The method specifically comprises the following steps:
S100, dividing a stored file uniformly into a plurality of data segments in advance, wherein each data segment has its own first hash value and a flag bit indicating whether that first hash value has expired.
S101, dividing a stored file into a plurality of data segments in advance, wherein the size of each data segment is 4 MB, and performing initialization.
S102, providing each data segment with its own first hash value and a flag bit indicating whether that first hash value has expired.
In this embodiment, assume the large file is 100 GB. With 4 MB data segments, the file is divided into 25600 data segments in total. If the CRC32 hash algorithm is used, each data segment needs 4 B to store its first hash value, so the file holding all first hash values needs 100 KB. Each data segment also needs a 1-bit flag to indicate whether its first hash value has expired, so the flag bit file needs 3200 B. The storage overhead of the first hash values and flag bits is therefore (100 KB + 3200 B) / 100 GB ≈ 0.0001%.
The flag bits need to be loaded into memory to speed up the check; as shown above, the flag bits for a 100 GB file occupy less than 4 KB of memory.
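The figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming binary units (1 MB = 2**20 bytes) and that flag bits are packed eight to a byte; the function name `metadata_overhead` is illustrative and does not come from the patent.

```python
# Sizes in bytes. Binary units are an assumption; the patent only writes "4M", "100G".
KB, MB, GB = 2**10, 2**20, 2**30


def metadata_overhead(file_size, segment_size=4 * MB, hash_bytes=4):
    """Return (segment count, hash file bytes, flag file bytes, overhead ratio)."""
    segments = -(-file_size // segment_size)   # ceiling division
    hash_file_bytes = segments * hash_bytes    # one CRC32 (4 B) per segment
    flag_file_bytes = -(-segments // 8)        # one flag bit per segment, packed
    ratio = (hash_file_bytes + flag_file_bytes) / file_size
    return segments, hash_file_bytes, flag_file_bytes, ratio


segments, hash_total, flag_total, ratio = metadata_overhead(100 * GB)
print(segments, hash_total, flag_total)  # 25600 102400 3200
print(f"overhead: {ratio:.7%}")
```

The ratio comes out just under one part in a million, which matches the roughly 0.0001% stated above.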
S200, when a write request is received, determining the corresponding flag bits from the offset and length of the write request, and setting those flag bits to expired.
S201, when a write request is received, determining the corresponding flag bits from the offset and length of the write request.
S202, setting each such flag bit from 0 to 1, indicating that the corresponding first hash value has expired.
In the embodiment of the invention, during initialization, both the first hash value and the flag bit are set to be 0; and setting the first hash value corresponding to the unwritten data segment as 0.
When a write request arrives, the flag bits covered by the request are determined from its offset and length. Any such flag bit that is 0 is set to 1, indicating that the first hash value of the corresponding data segment has expired and is to be recalculated at the next update. If a flag bit is modified, it must be written to the storage host so that the latest flag bits are not lost in abnormal conditions such as a power failure.
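The mapping from a write request to flag bits can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: flags are held as one byte per segment in a `bytearray` for readability (the patent uses one bit per segment), and `persist` is a hypothetical callback standing in for writing the changed flags to the storage host.

```python
SEGMENT_SIZE = 4 * 2**20  # 4 MB segments, as in the embodiment


def segments_touched(offset, length, segment_size=SEGMENT_SIZE):
    """Indices of the data segments overlapped by a write of `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    first = offset // segment_size
    last = (offset + length - 1) // segment_size
    return range(first, last + 1)


def mark_expired(flags, offset, length, persist=None):
    """Set the flag of every touched segment from 0 to 1; return the indices changed."""
    changed = [i for i in segments_touched(offset, length) if flags[i] == 0]
    for i in changed:
        flags[i] = 1
    if changed and persist is not None:
        persist(changed)  # hypothetical: flush modified flags to the storage host
    return changed


flags = bytearray(4)  # a 16 MB file: four segments, all flags initialized to 0
# A 2-byte write straddling the boundary between segment 0 and segment 1.
print(mark_expired(flags, SEGMENT_SIZE - 1, 2))  # [0, 1]
```

Flags that are already 1 are left alone, so repeated writes to the same segment cause no further flag persistence.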
S300, screening out the expired flag bits, updating the first hash values corresponding to those flag bits, and then calculating a second hash value of the whole file from the first hash value of each data segment.
S301, screening out the expired flag bits, and calculating a new first hash value for each data segment whose flag bit is expired.
S302, judging whether the flag bit was set to expired while the new first hash value was being calculated; if so, executing step S301 again, and otherwise writing the new first hash value to the storage host.
S303, calculating a second hash value of the whole file from the first hash value of each data segment.
The step S302 specifically includes:
initializing the flag bit to 0 in the memory of the control host, and determining whether the flag bit was set to 1 while the new first hash value was being calculated; if so, executing step S301 again, and otherwise writing the new first hash value to the storage host.
In this embodiment, when the hash value needs to be calculated, the data segments whose flag bits are expired are identified first and their first hash values are updated. For data segments whose flag bits are not set, the first hash value is guaranteed to be current and needs no update. The hash value of the whole file is then calculated from the first hash values of all data segments.
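The text does not say how the second hash value is derived from the first hash values. The sketch below assumes one plausible choice: CRC32 (via Python's `zlib.crc32`) over the concatenation of the segment hashes, each packed as a 4-byte big-endian integer. The function names are illustrative.

```python
import struct
import zlib


def segment_hash(data: bytes) -> int:
    """First hash value of one data segment (CRC32, as in the embodiment)."""
    return zlib.crc32(data)


def file_hash(segment_hashes) -> int:
    """Second hash value: CRC32 over the packed segment hashes (an assumption)."""
    packed = b"".join(struct.pack(">I", h) for h in segment_hashes)
    return zlib.crc32(packed)


# Two replicas of a three-segment file; the third segment is unwritten (hash 0).
replica_a = [segment_hash(b"alpha"), segment_hash(b"beta"), 0]
replica_b = [segment_hash(b"alpha"), segment_hash(b"beta"), 0]
replica_c = [segment_hash(b"alpha"), segment_hash(b"BETA"), 0]

print(file_hash(replica_a) == file_hash(replica_b))
print(file_hash(replica_a) == file_hash(replica_c))
```

Only the stored 4-byte hashes are read to produce the file-level value, so comparing replicas does not require reading segment data whose flag bits are clear.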
Before a new first hash value is written into the storage host, it needs to be determined whether there is a write request to modify the data segment during the first hash value calculation of the data segment.
Specifically, the flag bit is set to 0 in memory only, without updating the flag bit on the storage host; the first hash value is then calculated, after which the flag bit in memory is checked again. If the flag bit in memory has been set again, a write request modified the data segment while the new first hash value was being calculated, so that first hash value is already expired and need not be written to the storage host.
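The clear, hash, and re-check sequence can be sketched as a retry loop. This is a single-threaded illustration under assumptions: `read_segment` and `write_hash` are hypothetical callbacks for storage host I/O, flags are one byte per segment, and a real implementation would need atomic flag operations. When the persisted flag is cleared is not covered here.

```python
import zlib


def refresh_expired(flags_mem, hashes, read_segment, write_hash):
    """Recompute the first hash value of every segment whose in-memory flag is 1."""
    pending = [i for i, flag in enumerate(flags_mem) if flag == 1]
    while pending:
        i = pending.pop()
        flags_mem[i] = 0                      # clear in memory only
        new_hash = zlib.crc32(read_segment(i))
        if flags_mem[i] == 1:                 # a write arrived while hashing
            pending.append(i)                 # stale result: discard and retry
            continue
        hashes[i] = new_hash
        write_hash(i, new_hash)               # persist only a hash known to be current


# Simulate a write that lands on segment 0 during its first hash calculation.
data = {0: b"old", 1: b"untouched"}
flags = bytearray([1, 0])
hashes = [0, 0]
written = []
reads = []


def read_segment(i):
    reads.append(i)
    if len(reads) == 1:
        data[i] = b"new"   # concurrent write modifies the segment...
        flags[i] = 1       # ...and marks its hash expired again
        return b"old"
    return data[i]


refresh_expired(flags, hashes, read_segment, lambda i, h: written.append((i, h)))
print(len(reads), hashes[0] == zlib.crc32(b"new"))
```

The stale hash of the old contents is never written; only the second calculation, which completes without an intervening write, reaches `write_hash`.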
Further, the first hash value and the flag bit are saved by adopting an additional new file.
In addition, according to the method for rapidly checking consistency of the distributed storage multiple copies, the invention also provides a storage structure, wherein the storage structure adopts a processing architecture of a control host and a storage host.
The control host is built with a virtual disk and is used for managing the life cycle of the virtual disk and completing the functions of receiving, caching and forwarding data;
The storage host consists of a plurality of storage media and is used for storing redundant data. In the distributed storage system it is the final storage place of the data; the storage resources are abstracted into a plurality of storage components, each consisting of a chain of large sparse files.
The storage structure stores a computer program, and the computer program realizes the steps of the method for rapidly checking consistency of distributed storage multiple copies when being executed by a control host.
In summary, the present invention discloses a method and a storage structure for rapidly checking the consistency of multiple copies in distributed storage, using a processing architecture of a control host and a storage host. The method includes: dividing a stored file uniformly into a plurality of data segments in advance, each data segment having its own first hash value and a flag bit indicating whether that first hash value has expired; when a write request is received, determining the affected flag bits from the offset and length of the write request and setting them to expired; and screening out the expired flag bits, updating the first hash values corresponding to them, and then calculating a second hash value of the whole file from the first hash values of all data segments. A large file is divided into data segments, a hash value is calculated per data segment, and the hash value of the whole file is calculated from the segment hash values. Because only the modified data segments are recorded and only their hash values are updated, a consistency check avoids reading the data of the whole file, which greatly improves checking speed and reduces bandwidth consumption on the storage host. Because hash values are calculated per data segment, the calculation can also be run concurrently while the system is idle, which greatly accelerates data verification.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (7)

1. A method for rapidly checking consistency of a plurality of copies of distributed storage, wherein the distributed storage adopts a processing architecture of a control host and a storage host, and the method is characterized by comprising the following steps:
A. uniformly dividing a stored file into a plurality of data segments in advance, wherein each data segment is provided with a first hash value which corresponds to the data segment independently, and a flag bit for indicating whether the corresponding first hash value is expired or not;
B. when a write request is received, calculating a corresponding flag bit according to the offset and the length of the write request, and setting the flag bit to be expired;
C. screening out expired flag bits, and calculating a second hash value of the whole file according to the first hash value of each data segment after updating the first hash value corresponding to the flag bits;
wherein the step C specifically comprises:
C1, screening out expired flag bits, and calculating a new first hash value of the expired flag bits;
C2, judging whether a flag bit is set to be expired during the calculation of the new first hash value, if so, executing the step C1, and if not, writing the new first hash value into the storage host;
C3, calculating a second hash value of the whole file according to the first hash value of each data segment.
2. The method of claim 1, wherein the first hash value and the flag are saved in an additional new file.
3. The method of claim 1, wherein, during initialization, both the first hash value and the flag bit are set to 0; and setting the first hash value corresponding to the unwritten data segment as 0.
4. The method according to claim 3, wherein the step A specifically comprises:
a1, dividing a stored file into a plurality of data segments in advance, wherein the size of each data segment is 4M, and performing initialization setting;
a2, each data segment is provided with a separately corresponding first hash value and a flag bit for indicating whether the corresponding first hash value is expired.
5. The method according to claim 4, wherein the step B specifically comprises:
B1, when a write request is received, calculating a corresponding flag bit according to the offset and the length of the write request;
b2, and setting the flag bit from 0 to 1, indicating that the flag bit has expired.
6. The method according to claim 5, wherein the step C2 is specifically: initializing the flag bit to 0 in the memory of the control host, and determining whether there is a flag bit set to 1 during the calculation of the new first hash value, if so, executing step C1, otherwise, writing the new first hash value into the storage host.
7. A storage structure is characterized in that the storage structure adopts a processing architecture of a control host and a storage host;
the control host is built with a virtual disk and is used for managing the life cycle of the virtual disk and completing the functions of receiving, caching and forwarding data;
the storage host consists of a plurality of storage media and is used for storing redundant data;
the storage structure stores a computer program, and the computer program, when executed by the control host, implements the steps of the method for rapidly checking consistency of distributed storage multiple copies according to any one of claims 1 to 6.
CN201710748653.1A 2017-08-28 2017-08-28 Method for rapidly checking consistency of distributed storage multi-copy and storage structure Expired - Fee Related CN107632781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710748653.1A CN107632781B (en) 2017-08-28 2017-08-28 Method for rapidly checking consistency of distributed storage multi-copy and storage structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710748653.1A CN107632781B (en) 2017-08-28 2017-08-28 Method for rapidly checking consistency of distributed storage multi-copy and storage structure

Publications (2)

Publication Number Publication Date
CN107632781A CN107632781A (en) 2018-01-26
CN107632781B true CN107632781B (en) 2020-05-05

Family

ID=61100574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710748653.1A Expired - Fee Related CN107632781B (en) 2017-08-28 2017-08-28 Method for rapidly checking consistency of distributed storage multi-copy and storage structure

Country Status (1)

Country Link
CN (1) CN107632781B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271399A (en) * 2018-11-19 2019-01-25 武汉达梦数据库有限公司 A kind of method of calibration of database write-in log consistency
CN111382463B (en) * 2020-04-02 2022-11-29 中国工商银行股份有限公司 Block chain system and method based on stream data
CN112559547B (en) * 2020-12-24 2023-09-19 北京百度网讯科技有限公司 Method and device for determining consistency among multiple storage object copies
CN113779558A (en) * 2021-09-10 2021-12-10 中国电信集团系统集成有限责任公司 Construction method, installation method and device of application program installation package

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6970987B1 (en) * 2003-01-27 2005-11-29 Hewlett-Packard Development Company, L.P. Method for storing data in a geographically-diverse data-storing system providing cross-site redundancy
CN103546580A (en) * 2013-11-08 2014-01-29 北京邮电大学 File copy asynchronous writing method applied to distributed file system
CN103761162A (en) * 2014-01-11 2014-04-30 深圳清华大学研究院 Data backup method of distributed file system
CN104731792A (en) * 2013-12-19 2015-06-24 中国银联股份有限公司 Method and system for verifying database consistency and method and system for positioning data difference

Also Published As

Publication number Publication date
CN107632781A (en) 2018-01-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200512

Address after: 812, block B, phase I, Tianan Innovation Technology Plaza, No. 25, Tairan 4th Road, Tianan community, Shatou street, Futian District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Lianyungang Technology Co.,Ltd.

Address before: 518000, A902, room nine, building A, building 006, Industrial Research Institute, Nanshan New South Road, Nanshan District, Shenzhen, Guangdong

Patentee before: CLOUDSOAR NETWORKS Inc.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200505

Termination date: 20210828