CN114168979A - Data copy coding method for distributed storage system and storage medium - Google Patents

Info

Publication number
CN114168979A
Authority
CN
China
Prior art keywords
data
scrambling
copy
matrix
stored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111320005.9A
Other languages
Chinese (zh)
Inventor
万胜刚
黄炜宸
刘俊伦
何绪斌
谢长生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202111320005.9A priority Critical patent/CN114168979A/en
Publication of CN114168979A publication Critical patent/CN114168979A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/604 Tools and structures for managing or administering access control systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0617 Improving the reliability of storage systems in relation to availability
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

The invention discloses a data copy encoding method for a distributed storage system and a storage medium, belonging to the field of information security. The method comprises: when a copy of original data needs to be stored, performing a data scrambling step on the original data to generate the copy, and storing the copy. The data scrambling step comprises: dividing the data to be encoded into scrambling blocks of size G and taking the scrambling blocks as matrix elements to construct a corresponding matrix $D = (d_{ij})_{n \times n}$, where G is a preset scrambling granularity, $n = 2^m$, and m is a positive integer; and scrambling the elements of the matrix D with an n-order scrambling matrix to obtain a scrambled copy matrix as the copy of the data to be encoded. The read amplification rate caused by the scrambling matrix is greater than a preset threshold, and the scrambling matrix has a lower bound on randomness; each element value of the scrambling matrix represents the position in the scrambled copy to which the element at the corresponding position in the matrix D is moved. The invention achieves fast conversion between copies while ensuring data security and reliability, and does not depend on a trusted third party.

Description

Data copy coding method for distributed storage system and storage medium
Technical Field
The invention belongs to the field of information security, and particularly relates to a data copy encoding method and a storage medium for a distributed storage system.
Background
Decentralized storage, i.e. storage of data by users in anonymous and unmanaged devices on the internet, may be a potential addition to traditional centralized storage, which saves significant capital and operational expenditure. However, the anonymity and unmanaged nature of decentralized storage devices makes possible malicious attacks on replicated data, thereby reducing the reliability of the data.
John et al. defined the Sybil attack in large-scale P2P systems: a malicious entity may present multiple identities and thereby control a significant portion of the system's nodes, undermining the system's redundancy. Shielded by anonymity, untrusted device owners can forge a large number of identities to collect as many copies of the same data as possible while actually storing far fewer copies than they collect, which severely compromises the reliability of data in the system. A considerable body of research is therefore devoted to the problems created by such attacks.
Proof of replication (PoRep) is a new type of proof of storage and is also an interactive protocol. With it, a storage provider can prove to a user that the user's data has been replicated onto the provider's own dedicated physical storage device. Proof of replication can defend against multiple attack modes, including the Sybil attack.
The Sybil attack can also be viewed in terms of the Nash equilibrium between the user and the storage provider: the provider can evaluate its own behavior so as to maximize its benefit, i.e., to reach the Nash equilibrium point. In terms of storage space alone, the provider earns the greatest benefit by committing as little storage space as possible.
Ben et al. implemented a PoRep scheme that defends against Sybil attacks in a fully open setting. The scheme relies on timing constraints: it uses a very slow sequential encoding function, makes timing assumptions about the encoding and communication processes, and treats a prover as cheating when the time it takes to generate a proof exceeds a set bound. Ingmar et al. introduced a Kademlia-based secure key-based routing protocol that relies on a trusted third party, limits the free generation of node IDs through cryptographic challenges, and introduces reliable sibling broadcast through parallel lookups over multiple disjoint paths, giving it high resistance to common attacks, including Sybil attacks. In practice, however, a fully trusted third party is difficult to find and, even when found, imposes significant overhead on the user.
In general, existing solutions suffer either from poor data access performance caused by excessive computing overhead or from an inability to achieve true decentralization because they depend on a trusted third party. These drawbacks greatly reduce the usability of decentralized storage for users and greatly limit its development.
Disclosure of Invention
Aiming at the defects and the improvement requirements of the prior art, the invention provides a data copy coding method and a storage medium for a distributed storage system, aiming at solving the technical problem that the existing data copy coding method has overlarge calculation overhead or depends on a reliable third party.
To achieve the above object, according to an aspect of the present invention, there is provided a data copy encoding method for a distributed storage system, including:
after receiving original data to be stored and a corresponding storage request, analyzing the storage request to judge whether a copy of the original data needs to be stored, if so, using the original data as data to be encoded, performing a data scrambling step on the data to generate a copy of the original data, and storing the copy;
the data scrambling step comprises the following steps:
(a1) dividing the data to be encoded into scrambling blocks of size G, taking the scrambling blocks as matrix elements, and constructing a corresponding matrix $D = (d_{ij})_{n \times n}$; G is a preset scrambling granularity, $n = 2^m$, and m is a positive integer;
(a2) scrambling the elements of the matrix D with an n-order scrambling matrix to obtain a scrambled copy matrix as the copy of the data to be encoded; the read amplification rate caused by the scrambling matrix is greater than a preset threshold, and the scrambling matrix has a lower bound on randomness, wherein each element value represents the position, in the scrambled copy, to which the element at the corresponding position in the matrix D is moved.
Compared with existing methods that implement data encoding based on cryptography or a trusted third party, the data copy encoding method for a distributed storage system provided by the invention achieves fast conversion between copies, ensures data access performance, and does not depend on a trusted third party. Because the read amplification rate caused by the scrambling matrix used for encoding is sufficiently large (greater than the preset threshold), an inverse correlation is created between the amount of storage space the storage provider uses and the other costs it must pay, so that the Nash equilibrium point between the storage provider and the user moves toward committing more storage space; this effectively prevents cheating by the storage provider and ensures data reliability. The scrambling matrix used in encoding also has a lower bound on randomness, which provides a security guarantee. In summary, the invention encodes the original data by scrambling it with a scrambling matrix that causes a sufficiently large read amplification rate and has a lower bound on randomness, thereby effectively solving the technical problem that existing data copy encoding methods incur excessive computing overhead or depend on a trusted third party.
Further, the scrambling matrix is a Latin square.
The invention takes a Latin square as the scrambling matrix. Many alternative Latin squares satisfy both requirements, namely causing a sufficiently large read amplification rate and having a lower bound on randomness, so the invention can provide a sufficiently large space of scrambled copies and further improve data reliability.
Further, the generation method of the scrambling matrix comprises the following steps:
calculating the elements of the 2nd to n-th columns of the scrambling matrix from index data pre-stored in memory through Galois field addition, so as to generate the scrambling matrix;
the index data being the first-column elements obtained through Galois field multiplication.
The generation of a Latin square involves Galois field multiplication and Galois field addition: multiplication produces the first element of each row, i.e., the 1st column of the whole matrix, and addition produces the remaining elements of each row from that row's first element. The results of the more time-consuming Galois field multiplication are stored in memory in advance as index data, and during data encoding the remaining elements of the Latin square are computed in real time through the less time-consuming Galois field addition, which effectively reduces internal and external memory overhead without affecting the encoding speed.
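The index optimization described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not give the generation formula, so the sketch uses the textbook finite-field construction L_k[i][j] = k·i + j over GF(2^m); the names gf_mul, build_index and element, the multiplier k = 2 and the reduction polynomial are all hypothetical choices, and values run from 0 to n-1 rather than from 1 to n.

```python
# Sketch only: Latin square over GF(2^m) with L_k[i][j] = k*i + j.
# Assumption: this textbook construction stands in for the patent's
# unspecified generation rule; all names here are illustrative.

def gf_mul(a: int, b: int, m: int, poly: int) -> int:
    """Multiply a and b in GF(2^m); poly is the reduction polynomial
    including its x^m term (e.g. 0b111 for x^2 + x + 1)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a >> m:               # degree reached m: reduce modulo poly
            a ^= poly
    return result

def build_index(k: int, m: int, poly: int) -> list[int]:
    """First column of L_k: one Galois field multiplication per row.
    This is the part that would be precomputed and kept in memory."""
    return [gf_mul(k, i, m, poly) for i in range(1 << m)]

def element(index: list[int], i: int, j: int) -> int:
    """Element (i, j) derived on the fly: Galois field addition in
    GF(2^m) is a bitwise XOR with the column number."""
    return index[i] ^ j

# Toy instance over GF(4) with x^2 + x + 1.
M, POLY = 2, 0b111
index = build_index(2, M, POLY)
square = [[element(index, i, j) for j in range(1 << M)] for i in range(1 << M)]
```

Only the n-entry index has to be stored; every other entry costs a single XOR when it is needed.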
Further, the column numbers of the elements in the matrix D are unchanged before and after scrambling.
Further, in step (a2), the scrambling of the elements of the plurality of columns in the matrix D is performed in parallel.
When scrambling data, the invention keeps the column order unchanged, so that read and write operations on different columns of the matrix do not interfere with one another, and scrambles the elements of multiple columns in parallel, thereby further increasing the data encoding speed.
Further, G = 2 bytes.
By setting G = 2, the invention scrambles data at a fine granularity, so that a cheater who reads part of the data of one copy must access all data of another copy or spend equivalent I/O bandwidth. This imposes a large additional I/O penalty on the cheater and shifts the Nash equilibrium of the system, thereby preventing cheating by the storage provider and ensuring data security.
Further, the data copy encoding method for a distributed storage system provided by the present invention further includes:
(b1) after receiving a request initiating a challenge, judging whether the requested target copy data is correctly stored locally; if so, going to step (b2); otherwise, going to step (b3);
(b2) reading the requested data from the stored target copy data and returning it, whereupon the challenge ends;
(b3) judging whether any other copy data is stored; if so, reading the other stored copy data, calculating the scrambling matrix for the conversion between the target copy data and the read copy data, and then either taking the read copy data as the data to be encoded and performing the data scrambling step with the calculated scrambling matrix to generate the target copy data, from which the requested data is read and returned, or reading the requested data directly from the read copy data according to the scrambling matrix and returning it, whereupon the challenge ends; otherwise, going to step (b4);
(b4) the challenge is not responded to and ends.
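The control flow of steps (b1) to (b4) can be sketched as below. This is a hedged illustration, not the patented implementation: respond_to_challenge, the dictionary of locally held copies and the convert callback are hypothetical names, "correctly stored" is simplified to "present in the dictionary", and only the first branch of (b3), regenerating the target copy, is modelled.

```python
from typing import Callable, Optional

def respond_to_challenge(
    target_id: str,
    offset: int,
    length: int,
    local_copies: dict[str, bytes],
    convert: Callable[[str, bytes, str], bytes],
) -> Optional[bytes]:
    """Return the challenged bytes, or None if the challenge goes unanswered.

    convert(src_id, src_data, dst_id) is a hypothetical stand-in for
    "compute the scrambling matrix and run the data scrambling step".
    """
    # (b1)/(b2): the target copy is held locally, so a short read suffices.
    target = local_copies.get(target_id)
    if target is not None:
        return target[offset:offset + length]

    # (b3): some other copy is held; regenerate the target copy from it,
    # which is the expensive path a cheating provider is forced onto.
    if local_copies:
        other_id, other_data = next(iter(local_copies.items()))
        rebuilt = convert(other_id, other_data, target_id)
        return rebuilt[offset:offset + length]

    # (b4): nothing usable is stored; the challenge is not answered.
    return None

# Toy conversion (byte reversal) used only to exercise the three paths.
toy_convert = lambda src_id, data, dst_id: data[::-1]
held = {"copy-1": b"abcdefgh"}
honest = respond_to_challenge("copy-1", 2, 3, held, toy_convert)
cheater = respond_to_challenge("copy-2", 0, 2, held, toy_convert)
absent = respond_to_challenge("copy-2", 0, 2, {}, toy_convert)
```

In a real deployment the difference between the first two paths would show up as response time, which is what the time-bounded challenge measures.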
Based on the above steps, the invention supports proof of data replication. Specifically, for an honest storage provider that correctly stores k copies (k being the number of copies needed to meet the service availability and data reliability targets), when a user or a user agent initiates a challenge requesting part of the target copy data, the provider only needs to read a small amount of the specified data sequentially from its locally stored data, so the correct data is returned to the user within the specified time and the challenge succeeds;
a cheating storage provider that stores fewer than k copies must read other locally stored copy data, perform the scrambling step on it to obtain the target copy, and then read the requested data and return it to the user; as a result, the data cannot be returned within the specified time and the challenge fails. Alternatively, such a provider can read the requested data from other locally stored copy data according to the scrambling matrix and return it to the user, but this incurs a very large random I/O cost, so the challenge again cannot be completed within the specified time and fails.
In general, through the above steps the invention supports proof of data replication: the scrambling creates read amplification and shifts the Nash equilibrium, so that while the proof protocol is running, a cheating storage provider cannot return correct data within the specified time unless it pays an intolerable I/O cost, and its challenges ultimately fail.
According to another aspect of the present invention, a computer-readable storage medium is provided, which includes a stored computer program, and when the computer program is executed by a processor, the computer-readable storage medium is controlled to execute the data copy encoding method for a distributed storage system provided by the present invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention uses, as the scrambling matrix, a matrix that causes a sufficiently large read amplification rate and has a lower bound on randomness, and encodes data by scrambling it according to the scrambling matrix; this ensures data security and reliability, achieves fast conversion between copies, ensures data access performance, and does not depend on a trusted third party.
(2) The invention uses Latin square as scrambling matrix, which can provide enough space for scrambling copy and improve data reliability.
(3) When scrambling data, the invention keeps the column order unchanged and scrambles the elements of multiple columns of the matrix in parallel, thereby further increasing the encoding speed.
(4) The invention provides steps supporting proof of data replication, so that a cheater who reads part of the data of one copy must access all data of another copy or spend equivalent I/O bandwidth. This imposes a large additional I/O penalty on the cheater and shifts the Nash equilibrium of the system, thereby effectively preventing cheating by the storage provider and ensuring data security.
(5) The present invention has a number of adjustable parameters, including copy size, scrambling granularity, challenge frequency, etc., by which specific I/O costs can be flexibly amplified for a particular attacker.
(6) The method utilizes a read-amplifying method to resist Sybil attacks in a low-credibility distributed storage system, and can further increase the cheating cost of a storage provider and prevent the cheating by scrambling the copies at a fine granularity.
Drawings
Fig. 1 is a schematic diagram of a data scrambling step according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a data replication certification procedure according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Before explaining the technical scheme of the invention in detail, the related technology in the field of distributed storage is briefly explained as follows:
Copy conversion: in distributed storage, after a user uploads original data, the storage provider needs to store the original data and then store a number of copies as required. A copy is obtained by processing the original data, and when the user requests the stored data from the provider, the provider needs to process the stored copy to convert it back into the original data. Conversion between the original data and a copy, and between one copy and another, is called copy conversion;
False copy: the storage provider does not store copies as the user requires. For example, the user requires 10 GB of data to be stored, but the provider, for its own benefit, actually sets aside only 1 GB of space for storage. This fraudulent behavior is called false storage, and the corresponding copy is called a false copy;
Read amplification: the phenomenon in a storage system whereby the data that actually has to be read is larger than the requested target data;
Challenge: to ensure that the storage provider really stores data as required, the user periodically issues a "challenge" to the provider, which is equivalent to a query that randomly requests part of the data; if the storage provider cannot return the correct data within the specified time, this indicates that it has not stored the data honestly, and the challenge fails.
To solve the technical problems that data encoding in existing distributed storage systems either yields poor data access performance because of excessive computing overhead or cannot achieve decentralization because it depends on a trusted third party, the invention provides a data copy encoding method for a distributed storage system and a storage medium. The overall idea is as follows: exploit the read amplification phenomenon of the storage layer, select a suitable scrambling matrix, and encode data by scrambling it, so that the storage provider's Nash equilibrium in the game model is shifted and cheating by the storage provider is prevented, while fast copy conversion is achieved, data access performance is ensured, and no trusted third party is required.
The Sybil attack can be analyzed as a Nash equilibrium between the user and the storage provider. In a distributed storage system, any behavior of a storage provider can be evaluated using Nash equilibrium: the provider evaluates its own behavior so as to maximize its benefit, i.e., to reach the Nash equilibrium point. In terms of storage space alone, the provider earns the greatest benefit by committing as little storage space as possible. If an inverse correlation can be created between the amount of storage space the provider uses and the other costs it must pay, the Nash equilibrium can be steered toward committing more storage space. At the storage level, the performance gap between random reads and sequential reads on a disk is very large and affects metrics such as bandwidth and IOPS.
The read amplification exploited by the invention is a phenomenon common in storage systems and is a suitable basis for constructing challenges that counter false copies (mainly copy conversion). A sufficiently large read amplification rate consumes I/O bandwidth, which is a precious resource for storage devices; this not only severely degrades QoS but can also greatly shorten the life of the HDDs widely deployed as storage devices in decentralized storage. Read amplification occurs only in the storage layer; in other layers such as memory, computing, and networking, the additional cost it incurs is relatively small, which helps decentralized storage run on low-end computers and makes it cheaper and more scalable. The invention needs to select a suitable scrambling matrix for data scrambling, considering both whether the amplification rate is large enough and whether a lower bound on randomness exists. If the amplification rate is too small, it is not enough to move the provider's Nash equilibrium point. In practice, whether the read amplification rate caused by a selected scrambling matrix is large enough can be judged against the I/O cost an attacker can bear: if the resulting read amplification exceeds that cost, the read amplification rate of the scrambling matrix can be considered large enough, and the Nash equilibrium between read/write performance and storage can then be moved to a suitable position. If the scrambling algorithm has no theoretical lower bound on randomness, no security guarantee can be given for the algorithm.
A Latin square is a matrix satisfying the above requirements. Its generation rule is public, and for a given order there are many alternative Latin squares. Choosing a Latin square as the scrambling matrix therefore provides, on the one hand, a space of scrambled copies far exceeding what existing data reliability requirements need and, on the other hand, ensures that anyone can encode and decode the data. In view of this, the invention preferably selects a Latin square as the scrambling matrix for data scrambling, and the following embodiments take the Latin square as an example.
To facilitate a clearer explanation of the present invention, the following definitions of latin squares and the related properties are briefly introduced as follows:
an n-order Latin square is an n × n matrix in which each row and each column is a permutation of 1, 2, 3, …, n;
if $L_1 = (a_{ij})_{n \times n}$ and $L_2 = (b_{ij})_{n \times n}$ are two n × n Latin squares ($1 \le i, j \le n$), and the $n^2$ ordered pairs in the superimposed matrix $\big((a_{ij}, b_{ij})\big)_{n \times n}$ are all different from one another, then $L_1$ and $L_2$ are said to be orthogonal, or to be a pair of orthogonal Latin squares.
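The two definitions above translate directly into checks. The following Python sketch is illustrative only; the function names and the order-3 example squares are chosen here and do not come from the patent.

```python
def is_latin_square(square: list[list[int]]) -> bool:
    """True if every row and every column is a permutation of 1..n."""
    n = len(square)
    symbols = set(range(1, n + 1))
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return rows_ok and cols_ok

def are_orthogonal(l1: list[list[int]], l2: list[list[int]]) -> bool:
    """True if superimposing l1 and l2 yields n*n distinct ordered pairs."""
    n = len(l1)
    pairs = {(l1[i][j], l2[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n

# Order-3 example squares (illustrative).
L1 = [[1, 2, 3],
      [2, 3, 1],
      [3, 1, 2]]
L2 = [[1, 2, 3],
      [3, 1, 2],
      [2, 3, 1]]

both_latin = is_latin_square(L1) and is_latin_square(L2)
orthogonal = are_orthogonal(L1, L2)
self_orthogonal = are_orthogonal(L1, L1)  # a square paired with itself
```

Pairing a square with itself only ever produces pairs of the form (x, x), so it can never reach n² distinct pairs for n > 1.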
Latin squares of different orders have the following property: if the order n is a power of a prime number, then there exist n − 1 mutually orthogonal Latin squares;
the sizes of data copies are always powers of 2, so Latin squares whose order is a power of 2 can be generated. For example, for a commonly used 32 GB copy with the scrambling granularity set to 2 bytes, a Latin square of size $2^{17} \times 2^{17}$ needs to be generated, and in theory a total of $2^{17} - 1$ mutually orthogonal Latin squares can be generated. This provides a sufficiently large copy space: any copy can be scrambled with any one of these Latin squares to generate a scrambled copy. Moreover, any two mutually orthogonal Latin squares guarantee a lower bound on randomness and a uniform distribution of consecutive data, which means that two different copies generated from the same original data still exhibit strong read amplification with respect to each other;
any sufficiently large Latin square can be used as the Latin square for scrambling in the invention, and further examples of Latin squares are not listed here.
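The claim that two copies built from different Latin squares exhibit read amplification with respect to each other can be illustrated with a toy experiment. The sketch below rests on several assumptions that are not stated in the text: the squares are built with the textbook construction L_k[i][j] = k·i + j over GF(2^m), values are 0-based, copies are laid out row by row on disk, and a "row" stands for one contiguous read. All function names are hypothetical.

```python
def gf_mul(a: int, b: int, m: int, poly: int) -> int:
    """Multiplication in GF(2^m); poly includes the x^m term."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return result

def latin_square(k: int, m: int, poly: int) -> list[list[int]]:
    """Assumed construction: L_k[i][j] = k*i + j over GF(2^m), 0-based."""
    n = 1 << m
    return [[gf_mul(k, i, m, poly) ^ j for j in range(n)] for i in range(n)]

def scramble(data: list[list[int]], square: list[list[int]]) -> list[list[int]]:
    """Move data[i][j] to row square[i][j] of the same column j."""
    n = len(data)
    copy = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            copy[square[i][j]][j] = data[i][j]
    return copy

def rows_touched(held_copy: list[list[int]], wanted_blocks: list[int]) -> int:
    """Number of rows of held_copy that contain at least one wanted block."""
    wanted = set(wanted_blocks)
    return sum(1 for row in held_copy if wanted & set(row))

M, POLY = 3, 0b1011            # GF(8) with x^3 + x + 1
N = 1 << M
original = [[i * N + j for j in range(N)] for i in range(N)]  # unique block ids
copy_a = scramble(original, latin_square(2, M, POLY))
copy_b = scramble(original, latin_square(3, M, POLY))

challenged = copy_b[0]                          # one contiguous row of copy B
honest_rows = rows_touched(copy_b, challenged)  # provider really holds copy B
cheater_rows = rows_touched(copy_a, challenged) # provider only holds copy A
```

In this toy setting the honest provider touches a single row, while the provider holding only the other copy has to touch every one of its N rows to assemble the same answer.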
It should be noted that the Latin square is only a preferred embodiment and should not be construed as the sole limitation of the invention; any other matrix that satisfies the above two requirements, i.e., that causes a sufficiently large read amplification rate and has a lower bound on randomness, can be used as the scrambling matrix in the invention.
The following are examples.
Example 1:
a data copy encoding method for a decentralized storage system, comprising:
after receiving original data to be stored and a corresponding storage request (the storage request contains a storage requirement of a user for a copy), analyzing the storage request to judge whether the copy of the original data needs to be stored, if so, taking the original data as data to be encoded, performing a data scrambling step on the data to generate the copy of the original data, and storing the copy;
the data scrambling step comprises the following steps:
(a1) dividing the data to be encoded into scrambling blocks of size G, taking the scrambling blocks as matrix elements, and constructing a corresponding matrix $D = (d_{ij})_{n \times n}$; G is a preset scrambling granularity, $n = 2^m$, and m is a positive integer; optionally, in this embodiment, the size of the data copy is 32 GB and G = 2 bytes, so that fine-grained data scrambling can be achieved; as follows from the above analysis, $n = 2^{17}$ in this embodiment;
(a2) scrambling the elements of the matrix D with an n-order scrambling matrix to obtain a scrambled copy matrix as the copy of the original data; the read amplification rate caused by the scrambling matrix is greater than a preset threshold, and the scrambling matrix has a lower bound on randomness, wherein each element value represents the position, in the scrambled copy, to which the element at the corresponding position in the matrix D is moved; in this embodiment, the scrambling matrix is specifically a Latin square C of order $2^{17}$, in which each row is a permutation of 1, 2, 3, …, n and each column is also a permutation of 1, 2, 3, …, n;
in practical applications, if the amplification between copies is unstable or the copies share runs of consecutive positions, the storage provider can get away with storing less data, which still harms data reliability; it can be shown that two scrambled copies share a run of two or more consecutive positions only when the same Latin square is selected for both; therefore, by scrambling data with Latin squares, this embodiment ensures a more stable amplification rate between the original copy and the scrambled copies.
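Step (a1) can be sketched as follows. This is a minimal illustration, assuming that the blocks fill the matrix row by row (the text does not specify the fill order); to_block_matrix is a hypothetical helper name.

```python
import math

def to_block_matrix(data: bytes, g: int = 2) -> list[list[bytes]]:
    """Split data into g-byte scrambling blocks and arrange them as an
    n x n matrix with n = 2**m, m >= 1 (row-major fill is an assumption)."""
    if g <= 0 or len(data) % g != 0:
        raise ValueError("data length must be a multiple of the granularity G")
    blocks = [data[k:k + g] for k in range(0, len(data), g)]
    n = math.isqrt(len(blocks))
    # n must satisfy n*n == number of blocks and be a power of two >= 2.
    if n * n != len(blocks) or n < 2 or n & (n - 1):
        raise ValueError("block count must equal n*n with n a power of two")
    return [blocks[r * n:(r + 1) * n] for r in range(n)]

# 32 bytes with G = 2 gives 16 blocks, i.e. a 4 x 4 matrix.
payload = bytes(range(32))
D = to_block_matrix(payload, g=2)
```

Data whose block count is not of the form n × n with n a power of two is rejected here; how such data would be padded is outside what the text specifies.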
A complete Latin square generally occupies a large amount of storage. Taking the Latin square of order 2^17 above as an example, each element needs 17 bits; because an actual storage system generally stores data in units of 2^x bytes (x is a positive integer), each element of the Latin square has to be stored in 4 bytes, so a Latin square of order 2^17 actually occupies 2^17 * 2^17 * 4 bytes (2^36 bytes, i.e. 64 GB). To reduce the memory and external storage overhead of the Latin square, this embodiment optimizes how the Latin square is stored and generated. Specifically, generating the Latin square involves Galois field multiplication and Galois field addition: the multiplication generates the first element of each row, i.e. the first column of the whole matrix, and the addition generates the remaining elements of a row from that row's first element. Based on this generation rule, this embodiment computes the first element of each row (the first column of the Latin square) in advance by Galois field multiplication and keeps it in memory as index data. During data encoding, the remaining elements of each row are computed in real time by Galois field addition from the index data already held in memory. In other words, index generation is optimized when the scrambling coordinate matrix is produced: the index data stores the results of the time-consuming Galois field multiplication, and the fast Galois field addition generates the coordinate data on the fly. A scrambling coordinate matrix that originally occupied dozens of GB of memory and external storage is thereby reduced to index data of only a few hundred KB, which lowers the memory and external storage overhead without affecting encoding speed.
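The index optimization described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: it assumes the common finite-field construction L_a[i][j] = a*i + j over GF(2^m), uses m = 4 instead of 17 so the result is easy to inspect, and the reduction polynomial and all function names are hypothetical choices.

```python
# Sketch: index-optimized generation of a Latin square over GF(2^M).
# Assumption (not from the source): L_a[i][j] = (a * i) + j in GF(2^M).

M = 4                      # field GF(2^4); the text's example uses order 2^17
N = 1 << M                 # order of the Latin square
REDUCTION_POLY = 0b10011   # x^4 + x + 1, irreducible over GF(2) (assumed choice)


def gf_mul(a, b):
    """Multiply a and b in GF(2^M) by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & N:                  # degree reached M, so reduce
            a ^= REDUCTION_POLY
    return result


def build_index(a):
    """Precompute the first column (the index data) by Galois field multiplication."""
    return [gf_mul(a, i) for i in range(N)]


def row_from_index(index, i):
    """Generate row i on the fly; addition in GF(2^M) is a bitwise XOR."""
    first = index[i]
    return [first ^ j for j in range(N)]


def full_square_bytes(m, cell_bytes=4):
    """Size of a fully materialized Latin square of order 2^m."""
    return (1 << m) * (1 << m) * cell_bytes


def index_bytes(m, cell_bytes=4):
    """Size of the index data (one element per row) for order 2^m."""
    return (1 << m) * cell_bytes
```

Under these assumptions, `full_square_bytes(17)` is 2^36 bytes (64 GB) while `index_bytes(17)` is 2^19 bytes (512 KB), which matches the orders of magnitude quoted in the text.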
To further increase the encoding speed, this embodiment also ensures, when selecting the scrambling matrix, that the column number of every element in the matrix D is the same before and after scrambling, so that the scrambling operations on different columns do not interfere with each other. On this basis, in step (a2) this embodiment scrambles the columns of the matrix D in parallel using multiple threads. Specifically, a fine-grained pipeline is set up so that data reading and data processing proceed at the same time: a data prefetching thread continuously reads original data from disk in a rolling fashion, one window at a time; several worker threads each scramble their own share of the data; and the prefetching thread polls the worker threads for their progress in order to adjust its reading speed. Because the scrambling strategy of the invention does not change the column order of the data, the write operations of different threads do not interfere with each other, so the multithreading is lock-free.
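The column-wise independence can be illustrated with a short sketch. It assumes that each element C[i][j] of the scrambling matrix holds the 0-based destination row of D[i][j] within column j; the function names are hypothetical, and the prefetching pipeline and disk I/O are not modeled. In CPython the threads here mainly illustrate the lock-free partitioning by column, not an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def scramble_column(D, C, out, j):
    """Move each block of column j to the row given by C; only column j of out is written."""
    for i in range(len(D)):
        out[C[i][j]][j] = D[i][j]


def scramble_parallel(D, C, workers=4):
    """Scramble all columns of D with one task per column and no locks."""
    n = len(D)
    out = [[None] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list(...) waits for every task and re-raises any worker exception
        list(pool.map(lambda j: scramble_column(D, C, out, j), range(n)))
    return out


# Hypothetical 4x4 input: a cyclic Latin square as the scrambling matrix.
n = 4
D = [["d%d%d" % (i, j) for j in range(n)] for i in range(n)]
C = [[(i + j) % n for j in range(n)] for i in range(n)]
scrambled = scramble_parallel(D, C)
```

Each task writes to a disjoint set of cells (one column), which is why no synchronization is needed on the output.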
Fig. 1 shows a simple example of data scrambling, in which the original data is arranged into a matrix of order 4, D = (d_ij)_{4*4}, so n = 4. From the analysis above, n - 1 = 3 mutually orthogonal Latin squares exist. A scrambling matrix C_{4*4} is selected from these order-4 orthogonal Latin squares; each of its elements gives the position, in the scrambled copy matrix, of the element at the corresponding position in the matrix D. For example, the element in the first row and first column of the scrambling matrix C has the value 4; because the column number does not change, the element in the first row and first column of D is moved to the fourth row of the first column. Every element of D is rearranged according to the scrambled coordinates given by the scrambling matrix C, which produces the scrambled copy matrix corresponding to the original data. Data recovery is the inverse of scrambling: it is done by computing the inverse of the transformation matrix and has the same algorithmic complexity as the scrambling transformation. If enough orthogonal Latin squares are available, different scrambled copies can be generated by selecting different Latin squares. Note that Fig. 1 is only an illustration used to explain the scrambling strategy of the invention; in practical applications the order of the Latin square is usually much larger than 4.
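A small worked example of scrambling and recovery is sketched below. The matrix C used here is a hypothetical order-4 Latin square, not the one drawn in Fig. 1; only its top-left value 4 is taken from the text. Row numbers in C are 1-based, as in the description, and the function names are illustrative.

```python
def scramble(D, C):
    """Place D[i][j] at row C[i][j] (1-based) of the same column j."""
    n = len(D)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[C[i][j] - 1][j] = D[i][j]
    return out


def invert(C):
    """Build the matrix that undoes C: it sends each scrambled row back to its source row."""
    n = len(C)
    inv = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            inv[C[i][j] - 1][j] = i + 1
    return inv


# Hypothetical Latin square of order 4 whose first element is 4.
C = [[4, 1, 2, 3],
     [3, 4, 1, 2],
     [2, 3, 4, 1],
     [1, 2, 3, 4]]
D = [["d11", "d12", "d13", "d14"],
     ["d21", "d22", "d23", "d24"],
     ["d31", "d32", "d33", "d34"],
     ["d41", "d42", "d43", "d44"]]

S = scramble(D, C)                 # scrambled copy; d11 now sits in row 4, column 1
recovered = scramble(S, invert(C)) # recovery uses the same routine with the inverse matrix
```

Recovery reuses `scramble` with the inverted matrix, which is one way to see why it has the same complexity as scrambling.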
Based on the above, any two scrambled copies, or a scrambled copy and the original data, can be converted into each other. For example, a storage provider that stores only copy B can serve a request for copy A in two ways. In the first, it computes the A->B Latin square from the Latin square that maps the original data to A and the one that maps the original data to B, and then converts B into A, which requires reading the complete data of B. In the second, it uses the A->B Latin square to find where the requested data is scattered within B and reads the data from those locations; because the scrambling granularity is small while the portion of data requested by a challenge is relatively large, reading directly by location incurs a very large random I/O cost. On this basis, this embodiment further provides data replication proof steps, which support proof of data replication and specifically include:
(b1) after receiving a challenge request, judging whether the requested target copy data is correctly stored locally; if so, going to step (b2); otherwise, going to step (b3);
(b2) reading the requested data from the stored target copy data and returning it, and ending the challenge;
(b3) judging whether other copy data is stored; if so, reading the other stored copy data and calculating the scrambling matrix that converts between the target copy data and the read copy data, and then either taking the read copy data as the data to be encoded, performing the data scrambling step with the calculated scrambling matrix to generate the target copy data, and reading the requested data from the target copy data and returning it, or reading the requested data directly from the read copy data according to the scrambling matrix and returning it, and ending the challenge; otherwise, going to step (b4);
(b4) not responding to the challenge, and ending the challenge.
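Steps (b1) to (b4) can be sketched as a single decision routine. The sketch is deliberately simplified and all names are hypothetical: each copy is a flat list of blocks (a single column), a permutation `perm[i]` gives the position of original block i inside the copy, "correctly stored" is reduced to "present in the local store", and only the regenerate-then-read branch of (b3) is shown.

```python
def encode(blocks, perm):
    """Scramble: block i of the input is placed at position perm[i] of the output."""
    out = [None] * len(blocks)
    for i, block in enumerate(blocks):
        out[perm[i]] = block
    return out


def respond(store, perms, target, start, length):
    """Answer a challenge for blocks [start, start + length) of copy `target`.

    store: copy id -> locally stored blocks; perms: copy id -> permutation.
    Returns (blocks, step) or None when the challenge cannot be answered.
    """
    # (b1) + (b2): the target copy is stored, so read the requested range directly.
    if target in store:
        return store[target][start:start + length], "b2"
    # (b3): another copy is stored; convert it into the target copy first.
    if store:
        other, blocks = next(iter(store.items()))
        conv = [0] * len(blocks)
        for i in range(len(blocks)):
            # original block i sits at perms[other][i] and must go to perms[target][i]
            conv[perms[other][i]] = perms[target][i]
        rebuilt = encode(blocks, conv)   # reads the complete other copy
        return rebuilt[start:start + length], "b3"
    # (b4): nothing usable is stored, so the challenge goes unanswered.
    return None
```

In the (b3) branch the whole stored copy is read and rewritten before a single requested block can be returned, which is the source of the read amplification discussed in the text.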
With these steps the method supports proof of data replication. When a user or a user agent issues a challenge to a storage provider requesting part of the target copy data, an honest storage provider that correctly stores k copies (k being the number of copies that satisfies the service availability and data reliability targets) only needs to read a small amount of the specified data sequentially from the locally stored copy; it can return the correct data to the user within the specified time, and the challenge succeeds.
A cheating storage provider that stores fewer than k copies must either read another locally stored copy, perform the scrambling step on it to obtain the target copy, and then read the required data and return it to the user, in which case the data cannot be returned within the specified time and the challenge fails; or read the requested data from another locally stored copy according to the scrambling matrix and return it to the user, in which case the provider bears a very large random I/O cost, cannot complete the challenge within the specified time, and the challenge likewise fails.
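The two options open to a provider that holds only copy B can be sketched with a conversion matrix. The example assumes 0-based, column-preserving scrambling matrices and uses two small Latin squares over GF(4) purely for illustration; `conversion_matrix` and the other names are hypothetical.

```python
def scramble(D, C):
    """Place D[i][j] at row C[i][j] (0-based) of the same column j."""
    n = len(D)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[C[i][j]][j] = D[i][j]
    return out


def conversion_matrix(C_src, C_dst):
    """T[r][j] = row in the destination copy of the block at row r, column j of the source copy."""
    n = len(C_src)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            T[C_src[i][j]][j] = C_dst[i][j]
    return T


n = 4
MUL2 = [0, 2, 3, 1]   # multiplication by 2 in GF(4) with x^2 + x + 1
C_a = [[i ^ j for j in range(n)] for i in range(n)]
C_b = [[MUL2[i] ^ j for j in range(n)] for i in range(n)]

D = [[(i, j) for j in range(n)] for i in range(n)]
A = scramble(D, C_a)
B = scramble(D, C_b)

# Option 1: read all of B and regenerate A from it.
A_from_B = scramble(B, conversion_matrix(C_b, C_a))

# Option 2: look up where each requested block of A lives inside B.
T_ab = conversion_matrix(C_a, C_b)
row0_of_A = [B[T_ab[0][j]][j] for j in range(n)]   # one scattered read per block
```

Option 1 touches the complete copy B, while option 2 needs one lookup and one small read per requested block; these correspond to the two costs described in the text.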
As shown in Fig. 2, suppose a user or a user agent challenges a storage provider by requesting part of the copy data (assume the requested data is 1 MB). An honest storage provider that correctly stores k copies only needs to sequentially read a small amount of data at the required location; it returns the correct data to the user within the specified time, and the challenge succeeds. A cheating storage provider that stores fewer than k copies has two strategies for responding to the challenge. The first is to read the stored 32 GB copy sequentially and perform the scrambling step on it to obtain the requested scrambled copy, where the scrambling matrix is obtained by converting between the scrambling matrix (i.e. the Latin square) of the target copy data and the scrambling matrix (i.e. the Latin square) of the copy that was read. This strategy is called sequential-read cheating, and its amplification rate is the ratio of the I/O time needed to read 32 GB sequentially to the I/O time needed to read 1 MB. The second is to use the converted scrambling matrix to locate the requested data within the stored copy and to read the 1 MB by random I/O at a granularity of 2 bytes, that is, 512 * 1024 random I/Os. This strategy is called random cheating, and its amplification rate is the ratio of the time needed for more than 500,000 random I/Os to the time needed for a single I/O; the amplification rate stays at this order of magnitude whichever orthogonal Latin square is selected for scrambling.
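The figures in this example can be checked with a few lines of arithmetic. The sketch below counts I/O operations and data volume only; the amplification rate in the text is a ratio of I/O times, which additionally depends on device characteristics that are not given here. The function names are hypothetical, and 1 MB and 1 GB are taken as 2^20 and 2^30 bytes, consistent with the 512 * 1024 figure in the text.

```python
MIB = 1 << 20   # 1 MB taken as 2^20 bytes
GIB = 1 << 30   # 1 GB taken as 2^30 bytes


def random_io_count(request_bytes, granularity_bytes):
    """Number of scattered reads needed when every scrambling block is read separately."""
    return request_bytes // granularity_bytes


def sequential_volume_ratio(copy_bytes, request_bytes):
    """How many times more data the sequential-read strategy must read than an honest provider."""
    return copy_bytes // request_bytes


random_ios = random_io_count(1 * MIB, 2)                  # random cheating
volume_ratio = sequential_volume_ratio(32 * GIB, 1 * MIB)  # sequential-read cheating
```

With these inputs, random cheating needs 524,288 random I/Os (512 * 1024), and sequential-read cheating reads 32,768 times as much data as the honest response.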
It can be seen that a storage provider that stores fewer than k copies has to pay an intolerable I/O cost when running the proof protocol; the read amplification thus shifts the storage provider's Nash equilibrium. Unless the storage provider is willing to pay this intolerable I/O cost, it cannot return the correct data within the specified time, and the challenge fails. Large read amplification therefore acts as a penalty for answering challenges with forged copies. Even with infrequent periodic challenges, this penalty consumes a significant amount of I/O bandwidth and even shortens the lifetime of the cheater's hard disks, so storage providers are unwilling to carry out Sybil attacks.
In general, through the above steps the invention supports proof of data replication: a storage provider that cheats has to pay an intolerable I/O cost when running the proof protocol, so the Nash equilibrium is shifted by the read amplification. Unless the storage provider pays this intolerable I/O cost, it cannot return the correct data within the specified time, and the challenge ultimately fails.
Example 2:
A computer-readable storage medium comprising a stored computer program which, when executed by a processor, controls the device on which the computer-readable storage medium is located to execute the data copy encoding method for a decentralized storage system provided in embodiment 1 above.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A method of encoding a copy of data for a distributed storage system, comprising:
after receiving original data to be stored and a corresponding storage request, analyzing the storage request to judge whether a copy of the original data needs to be stored, if so, taking the original data as data to be encoded, performing a data scrambling step on the data to generate a copy of the original data, and storing the copy;
the data scrambling step comprises:
(a1) dividing the data to be encoded into scrambling blocks of size G, and taking the scrambling blocks as matrix elements to construct a corresponding matrix D = (d_ij)_{n*n}; G is a preset scrambling granularity, n = 2^m, and m is a positive integer;
(a2) scrambling the elements in the matrix D by using a scrambling matrix of order n to obtain a scrambled copy matrix as a copy of the data to be encoded; the read amplification rate caused by the scrambling matrix is greater than a preset threshold and has a lower bound of randomness, and each element value of the scrambling matrix represents the position, in the scrambled copy matrix, of the element at the corresponding position in the matrix D.
2. The method of data copy encoding for a decentralized storage system according to claim 1, wherein the scrambling matrix is a Latin square.
3. The method of claim 2, wherein the scrambling matrix is generated in a manner comprising:
calculating elements of 2 nd to n th columns in the scrambling matrix according to index data pre-stored in a memory space through Galois field addition operation to generate the scrambling matrix;
the index data is a first column of elements obtained through Galois field multiplication operation.
4. A data copy encoding method for a decentralized storage system according to any one of claims 1 to 3, characterized in that the column numbers of the elements in the matrix D are unchanged before and after scrambling.
5. A method of encoding data replicas for a decentralized storage system according to claim 4, wherein in step (a2), the scrambling of columns of elements in the matrix D is performed in parallel.
6. A data copy encoding method for a decentralized storage system according to any one of claims 1 to 3, wherein G = 2.
7. A data copy encoding method for a decentralized storage system according to any one of claims 1 to 3, characterized in that it further comprises:
(b1) after receiving a challenge request, judging whether the requested target copy data is correctly stored locally; if so, going to step (b2); otherwise, going to step (b3);
(b2) reading the requested data from the stored target copy data and returning it, and ending the challenge;
(b3) judging whether other copy data is stored; if so, reading the other stored copy data and calculating the scrambling matrix that converts between the target copy data and the read copy data, and then either taking the read copy data as the data to be encoded, executing the data scrambling step with the calculated scrambling matrix to generate the target copy data, and reading the requested data from the target copy data and returning it, or reading the requested data from the read copy data according to the scrambling matrix and returning it, and ending the challenge; otherwise, going to step (b4);
(b4) the challenge is not responded to and ends.
8. A computer-readable storage medium, comprising a stored computer program which, when executed by a processor, controls an apparatus in which the computer-readable storage medium is located to perform the data copy encoding method for a decentralized storage system according to any one of claims 1 to 7.
CN202111320005.9A 2021-11-09 2021-11-09 Data copy coding method for distributed storage system and storage medium Pending CN114168979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111320005.9A CN114168979A (en) 2021-11-09 2021-11-09 Data copy coding method for distributed storage system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111320005.9A CN114168979A (en) 2021-11-09 2021-11-09 Data copy coding method for distributed storage system and storage medium

Publications (1)

Publication Number Publication Date
CN114168979A true CN114168979A (en) 2022-03-11

Family

ID=80478377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111320005.9A Pending CN114168979A (en) 2021-11-09 2021-11-09 Data copy coding method for distributed storage system and storage medium

Country Status (1)

Country Link
CN (1) CN114168979A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination