CN110989922A

CN110989922A - Distributed data storage method and system

Info

Publication number: CN110989922A
Application number: CN201911033427.0A
Authority: CN
Inventors: 李俊波
Original assignee: Fiberhome Telecommunication Technologies Co Ltd
Current assignee: Fiberhome Telecommunication Technologies Co Ltd
Priority date: 2019-10-28
Filing date: 2019-10-28
Publication date: 2020-04-10
Anticipated expiration: 2039-10-28
Also published as: CN110989922B

Abstract

The invention discloses a distributed data storage method and system. A data storage request sent by a user side is received, and a data block distribution strategy in one-to-one correspondence with the data storage identifier is generated according to the data storage size and a preset distributed algorithm. First marking data in one-to-one correspondence with the data blocks is generated according to a preset key generation algorithm, and the data blocks and the first marking data are stored at the same storage positions of a plurality of storage nodes according to the data block distribution strategy. When a data reading request sent by the user side is received, the data block distribution strategy is acquired according to the data storage identifier, and the data block and the first marking data of a storage node are read and verified. If the verification is unsuccessful, the data block and the first marking data at the same storage position of the next storage node are read and the data verification is repeated; if the verification is successful, the read data block is sent to the user side. All storage nodes in the data block distribution strategy are traversed according to the storage node reading sequence until the verification is successful; otherwise, user data damage information is sent to the user side.

Description

Distributed data storage method and system

Technical Field

The invention belongs to the field of distributed data storage, and particularly relates to a distributed data storage method and system.

Background

Distributed storage is a storage system that generally comprises a plurality of independent devices connected through a network. When data is written, an algorithm distributes the data blocks evenly across the independent storage devices; when data is read, the algorithm is used to read the data blocks from the different nodes back to the client. Depending on the application scenario, the system must meet requirements for reliability, availability and scalability.

Generally, a client stores data by using an algorithm to address a specific storage node over the network and then writing the data to that node. The data block size is typically 64 KB. When the client reads data, it likewise uses the algorithm to address a specific storage node over the network, and a 64 KB data block is read and returned to the client. The integrity and consistency of the data on the storage nodes are checked by a dedicated polling process, but this check runs only at intervals, and the 64 KB data block undergoes no verification on its way from the storage node to the client. If a data block is tampered with between two checks, the data block that is read may therefore contain erroneous data.

Disclosure of Invention

In view of the above defects of, or needs for improvement in, the prior art, the invention provides a distributed data storage method and a distributed data storage system. A data block distribution strategy is generated upon receiving a data storage request sent by a user side, and the data blocks and first marking data are stored at the same storage positions of a plurality of storage nodes according to the data block distribution strategy. When a data reading request sent by the user side is received, the data block and the first marking data of a storage node are read according to the data block distribution strategy and verified; if the verification is unsuccessful, the duplicate data on the next storage node is read and verified, and if the verification is successful, the read data block is sent to the user side. Thus, when a data block on a storage node is erroneous, the same data block can be found by polling the nodes according to the verification mechanism, and the user's correct data is finally returned.

To achieve the above object, according to an aspect of the present invention, there is provided a distributed data storage method, including the steps of:

S1: receiving a data storage request sent by a user side, wherein the data storage request comprises a data storage size and a data storage identifier, and generating a data block distribution strategy in one-to-one correspondence with the data storage identifier according to the data storage size and a preset distributed algorithm, wherein the data block distribution strategy comprises data block storage positions, storage node information and a storage node reading sequence; receiving a data stream sent by the user side, dividing the data stream into a plurality of data blocks according to a preset data block size, generating first marking data in one-to-one correspondence with the data blocks according to a preset key generation algorithm, and storing the data blocks and the first marking data at the same storage positions of a plurality of storage nodes according to the data block distribution strategy;

S2: receiving a data reading request sent by the user side, wherein the data reading request comprises the data storage identifier, and acquiring the data block distribution strategy according to the data storage identifier; reading the data block and the first marking data of a storage node according to the data block distribution strategy and performing data verification; if the verification is unsuccessful, reading the data block and the first marking data at the same storage position of the next storage node and repeating the data verification, and if the verification is successful, sending the read data block to the user side; and traversing all storage nodes in the data block distribution strategy according to the storage node reading sequence until the verification is successful, or otherwise sending user data damage information to the user side.

As a further improvement of the invention, the preset distributed algorithm comprises consistent hash, DHT and CRUSH algorithms.

As a further improvement of the invention, the data stream sent by the user terminal is received through storage system protocols, wherein the storage system protocols comprise TCP, UDP and HTTP protocols.

As a further improvement of the present invention, the preset key generation algorithms include MD5, SHA and HMAC algorithms.

As a further improvement of the present invention, the data verification specifically comprises: generating second marking data in one-to-one correspondence with the read data blocks according to the preset key generation algorithm, and comparing the first marking data with the second marking data, wherein the verification is successful if the first marking data and the second marking data are the same.

To achieve the above object, according to another aspect of the present invention, there is provided a distributed data storage system, which includes an interaction module and a plurality of storage nodes,

the interaction module is used for receiving a data storage request sent by a user side, wherein the data storage request comprises a data storage size and a data storage identifier; generating a data block distribution strategy in one-to-one correspondence with the data storage identifier according to the data storage size and a preset distributed algorithm, wherein the data block distribution strategy comprises data block storage positions, storage node information and a storage node reading sequence; receiving a data stream sent by the user side, dividing the data stream into a plurality of data blocks according to a preset data block size, generating first marking data in one-to-one correspondence with the data blocks according to a preset key generation algorithm, and storing the data blocks and the first marking data at the same storage positions of the plurality of storage nodes according to the data block distribution strategy;

the interaction module is also used for receiving a data reading request sent by the user side, wherein the data reading request comprises the data storage identifier, and acquiring the data block distribution strategy according to the data storage identifier; reading the data block and the first marking data of a storage node according to the data block distribution strategy and performing data verification; if the verification is unsuccessful, reading the data block and the first marking data at the same storage position of the next storage node and repeating the data verification, and if the verification is successful, sending the read data block to the user side; and traversing all storage nodes in the data block distribution strategy according to the storage node reading sequence until the verification is successful, or otherwise sending user data damage information to the user side.

Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:

The invention provides a distributed data storage method and system in which a data block distribution strategy is generated upon receiving a data storage request sent by a user side, and the data blocks and first marking data are stored at the same storage positions of a plurality of storage nodes according to the data block distribution strategy. When a data reading request sent by the user side is received, the data block and the first marking data of a storage node are read according to the data block distribution strategy and verified; if the verification is unsuccessful, the duplicate data on the next storage node is read and verified, and if the verification is successful, the read data block is sent to the user side. Thus, when a data block on a storage node is erroneous, the same data block can be found by polling the nodes according to the verification mechanism, and the user's correct data is finally returned. This solves the problem that a client reading data from a storage system obtains an erroneous data block with small probability.

In the distributed data storage method and system of the invention, the first marking data of a stored data block is generated by a high-performance encryption algorithm, the second marking data of a read data block is generated by a high-performance encryption algorithm, and data verification is performed by comparing the first marking data with the second marking data, which further ensures the reliability of data block reads.

Drawings

Fig. 1 is a schematic flowchart of a distributed data storage method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The present invention will be described in further detail with reference to specific embodiments.

Fig. 1 is a schematic flowchart of a distributed data storage method according to an embodiment of the present invention. As shown in Fig. 1, the distributed data storage method includes the following steps:

S1: receiving a data storage request sent by a user side, wherein the data storage request comprises a data storage size and a data storage identifier, and generating a data block distribution strategy in one-to-one correspondence with the data storage identifier according to the data storage size and a preset distributed algorithm, wherein the data block distribution strategy comprises data block storage positions, storage node information and a storage node reading sequence; receiving a data stream sent by the user side, dividing the data stream into a plurality of data blocks according to a preset data block size, generating first marking data in one-to-one correspondence with the data blocks according to a preset key generation algorithm, and storing the data blocks and the first marking data at the same storage positions of a plurality of storage nodes according to the data block distribution strategy;

As a preferred embodiment, the distributed data storage system may define a preset distributed algorithm that distributes the user's data blocks evenly across the distributed storage nodes and keeps the data blocks evenly distributed when nodes are added or removed. Common preset distributed algorithms include consistent hashing, DHT and CRUSH. The distribution information of the user's data blocks is calculated according to the preset distributed algorithm, which yields a data block distribution strategy in one-to-one correspondence with the data storage identifier.
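
As an illustration only, the following Python sketch shows one way a consistent-hash ring could turn a data storage identifier into an ordered list of storage nodes. The class name `HashRing`, the method `read_order`, the use of virtual nodes and the idea that the reading sequence is the clockwise walk around the ring are all assumptions made for this sketch; the patent names consistent hashing, DHT and CRUSH but does not specify an implementation.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a ring position (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns `vnodes` points so that load spreads evenly.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def read_order(self, storage_id: str, replicas: int = 3):
        """Return up to `replicas` distinct nodes, walking clockwise from the key."""
        distinct = len({node for _, node in self._ring})
        replicas = min(replicas, distinct)
        i = bisect.bisect(self._points, _point(storage_id))
        order = []
        while len(order) < replicas:
            node = self._ring[i % len(self._ring)][1]
            if node not in order:
                order.append(node)
            i += 1
        return order


ring = HashRing(["node1", "node2", "node3", "node4"])
print(ring.read_order("data-A"))  # three distinct node names, stable across calls
```

Because only the points of a newly added node are inserted into the ring, a key either keeps its previous first node or moves to the new node, which is the property the paragraph above relies on when nodes are added or removed.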

The data stream sent by the user side is received through a storage system protocol; common storage system protocols include TCP, UDP and HTTP. The storage system protocol divides the user data into data blocks (64 KB in size, as an example) as it is transmitted into the distributed data storage system. The distributed data storage system generates corresponding first marking data for each of the user's data blocks according to a preset key generation algorithm. As an example, the preset key generation algorithm may be MD5, SHA or HMAC, and is a one-way, irreversible algorithm. The user may define a custom encryption key with which the user data is encrypted, and the encryption result is data of fixed length, such as 16 bytes, 32 bytes or 64 bytes. The data blocks and the first marking data are then stored at the same storage positions of a plurality of storage nodes according to the data block distribution strategy;
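
The write path described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes MD5 as the preset key generation algorithm (giving a 16-byte marker), models each storage node as a plain Python list whose index stands for the storage position, and uses the hypothetical names `split_blocks`, `make_record` and `store`.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # preset data block size from the example (64 KB)
MARK_SIZE = 16          # an MD5 digest is 16 bytes long


def split_blocks(stream: bytes, block_size: int = BLOCK_SIZE):
    """Divide the data stream into blocks; the last block may be shorter."""
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


def make_record(block: bytes) -> bytes:
    """Append the first marking data (here an MD5 digest) to the data block."""
    return block + hashlib.md5(block).digest()


def store(stream: bytes, nodes: list) -> int:
    """Write every record at the same position (list index) on every node."""
    records = [make_record(block) for block in split_blocks(stream)]
    for node in nodes:
        node.clear()
        node.extend(records)
    return len(records)


nodes = [[], [], []]                    # three in-memory stand-ins for storage nodes
data = bytes(range(256)) * 600          # 153,600 bytes of sample user data
print(store(data, nodes))               # 3 (two full 64 KB blocks plus a remainder)
print(len(nodes[0][0]))                 # 65552 (64 KB block + 16-byte marker)
```

Keeping the marker adjacent to its block means a single read at one storage position returns everything needed for verification.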

S2: receiving a data reading request sent by the user side, wherein the data reading request comprises the data storage identifier, and looking up the data block distribution strategy according to the data storage identifier; reading the data block and the first marking data of a storage node according to the data block distribution strategy and performing data verification; if the verification is unsuccessful, reading the data block and the first marking data at the same storage position of the next storage node and repeating the data verification, and if the verification is successful, sending the read data block to the user side; and traversing all storage nodes in the data block distribution strategy according to the storage node reading sequence until the verification is successful, or otherwise sending user data damage information to the user side.

As a preferred embodiment, the data verification specifically includes: generating second marking data in one-to-one correspondence with the read data blocks according to the preset key generation algorithm, and comparing the first marking data with the second marking data, wherein the verification is successful if the first marking data and the second marking data are the same.
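
A minimal sketch of this comparison, assuming HMAC-MD5 with a user-defined key as the preset key generation algorithm (one of the options the description lists). The function names `make_mark` and `verify` and the sample key are hypothetical; `hmac.compare_digest` is used so that the comparison takes constant time.

```python
import hashlib
import hmac

MARK_SIZE = 16  # HMAC-MD5 produces a 16-byte tag


def make_mark(block: bytes, key: bytes) -> bytes:
    """Generate marking data for a data block with the user-defined key."""
    return hmac.new(key, block, hashlib.md5).digest()


def verify(block: bytes, first_mark: bytes, key: bytes) -> bool:
    """Regenerate the marker (second marking data) and compare it with the stored one."""
    second_mark = make_mark(block, key)
    return hmac.compare_digest(first_mark, second_mark)


key = b"user-defined-key"            # assumed example key
block = b"x" * 1024
first_mark = make_mark(block, key)   # stored alongside the block at write time

print(verify(block, first_mark, key))                 # True
print(verify(block[:-1] + b"y", first_mark, key))     # False: block was altered
```

With a keyed algorithm, someone who alters a block cannot produce a matching marker without the key, whereas with a plain digest the check only detects accidental corruption.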

A user sends a data reading request, and the distributed data storage system receives it. The system locates a specific storage node according to the data block distribution strategy in one-to-one correspondence with the data storage identifier, and the storage node reads the data from disk. Each read from disk returns a data block together with its first marking data (as an example, 64 KB + 16 bytes), which is split into the data block (64 KB) and the first marking data (16 bytes). Corresponding second marking data (16 bytes) is generated for the data block (64 KB) and compared with the first marking data (16 bytes). If the two are the same, the data block (64 KB) is sent to the user side. If they differ, the next node is selected according to the data block distribution strategy (including the storage node information and the storage node reading sequence), the data block and the first marking data at the same storage position of that storage node are acquired, and the data block is verified in the same way. This is repeated until the data block at that position has been tried on all storage nodes; if there is still no correct data, user data damage information is returned.
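
The read path for a single storage position can be sketched as below. The sketch assumes MD5 markers appended to each block, models the storage nodes as a dictionary of lists, and reports damage by raising a hypothetical `DataDamagedError`; none of these names come from the patent.

```python
import hashlib

MARK_SIZE = 16  # size of the MD5-based first marking data


class DataDamagedError(Exception):
    """Raised when no storage node holds a copy that passes verification."""


def read_block(nodes: dict, read_order: list, position: int) -> bytes:
    """Try each node in the reading sequence; return the first block that verifies."""
    for name in read_order:
        record = nodes[name][position]
        # Split the record into the data block and its first marking data.
        block, first_mark = record[:-MARK_SIZE], record[-MARK_SIZE:]
        second_mark = hashlib.md5(block).digest()
        if second_mark == first_mark:
            return block
    raise DataDamagedError(f"block at position {position} is damaged on all nodes")


good = b"payload" + hashlib.md5(b"payload").digest()
bad = b"pAyload" + hashlib.md5(b"payload").digest()   # block altered, marker unchanged
nodes = {"n1": [bad], "n2": [good], "n3": [good]}

print(read_block(nodes, ["n1", "n2", "n3"], 0))  # b'payload', served by n2
```

The caller never sees the failed attempt on `n1`; it either receives a verified block or an explicit damage report.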

As an example, consider replica storage in which the configured number of replicas is 3, user data A comprises data blocks 1 to 5 stored on storage nodes 1 to 3, and data block 2 on storage node 1 is damaged. The distributed data storage system selects one of the storage nodes according to the data reading request sent by the user and the data block distribution strategy. If storage node 1 is selected, data block 1 is read first. When data block 2 is read, verification shows that the data block has a problem, so data block 2 is read from storage node 2, and the remaining data blocks are then read. If a data block inconsistency occurs again, the system switches to storage node 3, and so on until the data has been read completely. In this way, based on the existing storage strategy, the data block encryption technique and the data distribution algorithm, when a data block on a storage node is erroneous, the same data block can be found by polling the nodes according to the verification mechanism, and the user's correct data is finally returned. If the bad data blocks reach the maximum limit, the user's data is damaged; otherwise, as long as the user's bad data blocks have not reached the maximum limit, the user can still obtain reliable data by this method.
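
The three-replica example can be traced with a short simulation. It assumes MD5 markers, in-memory lists in place of storage nodes, and (since the description leaves this open) that every block starts again from the first node in the reading sequence; the names `record` and `read_all` are hypothetical.

```python
import hashlib

MARK = 16  # bytes of MD5-based first marking data appended to each block


def record(block: bytes) -> bytes:
    """Build the stored record: data block followed by its marker."""
    return block + hashlib.md5(block).digest()


def read_all(nodes, read_order, block_count):
    """Return (data, served_by); served_by[i] names the node that verified block i."""
    data, served_by = [], []
    for pos in range(block_count):
        for name in read_order:
            rec = nodes[name][pos]
            if hashlib.md5(rec[:-MARK]).digest() == rec[-MARK:]:
                data.append(rec[:-MARK])
                served_by.append(name)
                break
        else:
            # Every replica at this position failed verification.
            raise RuntimeError(f"user data damaged at block {pos + 1}")
    return b"".join(data), served_by


blocks = [f"block-{i}".encode() for i in range(1, 6)]        # data blocks 1 to 5
nodes = {n: [record(b) for b in blocks] for n in ("node1", "node2", "node3")}
# Damage data block 2 on storage node 1 while leaving its marker untouched.
nodes["node1"][1] = b"garbage!" + nodes["node1"][1][-MARK:]

data, served_by = read_all(nodes, ["node1", "node2", "node3"], 5)
print(served_by)  # ['node1', 'node2', 'node1', 'node1', 'node1']
```

Only when all three copies of the same block fail verification does the read end with a damage report, which corresponds to the "maximum limit" mentioned above.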

A distributed data storage system comprises an interaction module and a plurality of storage nodes, wherein,

the interaction module is used for receiving a data storage request sent by a user side, wherein the data storage request comprises a data storage size and a data storage identifier; generating a data block distribution strategy in one-to-one correspondence with the data storage identifier according to the data storage size and a preset distributed algorithm, wherein the data block distribution strategy comprises data block storage positions, storage node information and a storage node reading sequence; receiving a data stream sent by the user side, dividing the data stream into a plurality of data blocks according to a preset data block size, generating first marking data in one-to-one correspondence with the data blocks according to a preset key generation algorithm, and storing the data blocks and the first marking data at the same storage positions of the plurality of storage nodes according to the data block distribution strategy;

the interaction module is also used for receiving a data reading request sent by the user side, wherein the data reading request comprises the data storage identifier, and looking up the data block distribution strategy according to the data storage identifier; acquiring the data block and the first marking data of a storage node according to the data block distribution strategy and performing data verification; if the verification is unsuccessful, reading the data block and the first marking data at the same storage position of the next storage node and repeating the data verification, and if the verification is successful, sending the read data block to the user side; and traversing all storage nodes in the data block distribution strategy according to the storage node reading sequence until the verification is successful, or otherwise sending user data damage information to the user side.

A user sends a data reading request, and the distributed data storage system receives it. The system locates a specific storage node according to the data block distribution strategy in one-to-one correspondence with the data storage identifier, and the storage node reads the data from disk. Each read from disk returns a data block together with its first marking data (as an example, 64 KB + 16 bytes), which is split into the data block (64 KB) and the first marking data (16 bytes). Corresponding second marking data (16 bytes) is generated for the data block (64 KB) and compared with the first marking data (16 bytes). If the two are the same, the data block (64 KB) is sent to the user side. If they differ, the next node is selected according to the data block distribution strategy (including the storage node information and the storage node reading sequence), the data block and the first marking data at the same storage position of that storage node are acquired, and the data block is verified in the same way. This is repeated until the data block at that position has been tried on all storage nodes; if there is still no correct data, user data damage information is returned.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A distributed data storage method, comprising the steps of:

S1: receiving a data storage request sent by a user side, wherein the data storage request comprises a data storage size and a data storage identifier, and generating a data block distribution strategy in one-to-one correspondence with the data storage identifier according to the data storage size and a preset distributed algorithm, wherein the data block distribution strategy comprises data block storage positions, storage node information and a storage node reading sequence; receiving a data stream sent by the user side, dividing the data stream into a plurality of data blocks according to a preset data block size, generating first marking data in one-to-one correspondence with the data blocks according to a preset key generation algorithm, and storing the data blocks and the first marking data at the same storage positions of a plurality of storage nodes according to the data block distribution strategy;

S2: receiving a data reading request sent by the user side, wherein the data reading request comprises the data storage identifier, and acquiring the data block distribution strategy according to the data storage identifier; reading the data block and the first marking data of a storage node according to the data block distribution strategy and performing data verification; if the verification is unsuccessful, reading the data block and the first marking data at the same storage position of the next storage node and repeating the data verification, and if the verification is successful, sending the read data block to the user side; and traversing all storage nodes in the data block distribution strategy according to the storage node reading sequence until the verification is successful, or otherwise sending user data damage information to the user side.

2. The distributed data storage method according to claim 1, wherein the preset distributed algorithms include consistent hash, DHT, and CRUSH algorithms.

3. A distributed data storage method according to claim 1 or 2, wherein the data stream sent by the user terminal is received through storage system protocols, and the storage system protocols include TCP, UDP and HTTP protocols.

4. A distributed data storage method according to claim 1 or 2, wherein the preset key generation algorithms comprise MD5, SHA and HMAC algorithms.

5. The distributed data storage method according to claim 1 or 2, wherein the data verification specifically comprises: generating second marking data in one-to-one correspondence with the read data blocks according to the preset key generation algorithm, and comparing the first marking data with the second marking data, wherein the verification is successful if the first marking data and the second marking data are the same.

6. A distributed data storage system comprising an interaction module and a plurality of storage nodes,

the interaction module is used for receiving a data storage request sent by a user side, wherein the data storage request comprises a data storage size and a data storage identifier; generating a data block distribution strategy in one-to-one correspondence with the data storage identifier according to the data storage size and a preset distributed algorithm, wherein the data block distribution strategy comprises data block storage positions, storage node information and a storage node reading sequence; receiving a data stream sent by the user side, dividing the data stream into a plurality of data blocks according to a preset data block size, generating first marking data in one-to-one correspondence with the data blocks according to a preset key generation algorithm, and storing the data blocks and the first marking data at the same storage positions of the plurality of storage nodes according to the data block distribution strategy;

the interaction module is further used for receiving a data reading request sent by the user side, wherein the data reading request comprises the data storage identifier, and acquiring the data block distribution strategy according to the data storage identifier; reading the data block and the first marking data of a storage node according to the data block distribution strategy and performing data verification; if the verification is unsuccessful, reading the data block and the first marking data at the same storage position of the next storage node and repeating the data verification, and if the verification is successful, sending the read data block to the user side; and traversing all storage nodes in the data block distribution strategy according to the storage node reading sequence until the verification is successful, or otherwise sending user data damage information to the user side.

7. The distributed data storage system of claim 6, wherein the preset distributed algorithms comprise consistent hash, DHT, and CRUSH algorithms.

8. A distributed data storage system according to claim 6 or 7, wherein the data stream sent by the user terminal is received by the storage system protocol, and the storage system protocol includes TCP, UDP and HTTP protocols.

9. A distributed data storage system as claimed in claim 6 or 7, wherein the preset key generation algorithms include MD5, SHA and HMAC algorithms.

10. The distributed data storage system according to claim 6 or 7, wherein the data verification specifically is: generating second marking data in one-to-one correspondence with the read data blocks according to the preset key generation algorithm, and comparing the first marking data with the second marking data, wherein the verification is successful if the first marking data and the second marking data are the same.