WO2013131253A1

WO2013131253A1 - Pollution data recovery method and apparatus for distributed storage data

Info

Publication number: WO2013131253A1
Application number: PCT/CN2012/072007
Authority: WO
Inventors: 李挥; 黄显霞; 冯俊秋; 叶顺鸿; 陈畅民; 侯韩旭; 朱兵
Original assignee: 北京大学深圳研究生院
Priority date: 2012-03-06
Filing date: 2012-03-06
Publication date: 2013-09-12

Abstract

The invention relates to a pollution data recovery method for distributed storage data. The method comprises the following steps: downloading the stored data from any k+1 of n storage nodes; computing the original data X from k of the downloaded data blocks according to Y_j = XG_j; comparing the data block Y_{k+1} downloaded from the (k+1)-th storage node with the product of the original data X obtained in the preceding step and the matrix G_{k+1}; if they are the same, exiting the loop and accepting the data; if not, downloading data from at least one storage node again, wherein at least one newly downloaded data block differs from the k+1 data blocks downloaded before and replaces at least one of them, and then returning to the step of computing the original data. The present invention also relates to an apparatus for implementing the above method. The method and apparatus are comparatively simple and incur a low computational overhead.

Description

Method and device for recovering pollution of distributed storage data

Technical field

The present invention relates to the field of distributed data storage, and more particularly to a method and apparatus for recovering pollution of distributed storage data.

Background technique

Network coding is an information transmission technology that combines coding and routing. Building on the traditional store-and-forward routing method, it allows multiple received data packets to be combined, which increases the amount of information carried in a single transmission and thereby improves the overall performance of the network. However, a malicious node can intentionally tamper with or falsify a message. If a downstream node receives a contaminated message, does not know that it is contaminated, and encodes it together with uncontaminated messages, the contamination spreads quickly downstream of the malicious node and may even reach the entire network. Contaminated messages must therefore be filtered out as early as possible during transmission.

A distributed storage system based on network coding divides the original data into several parts and stores them on different nodes. Each coded block is computed as a linear combination of several modules, following the idea of linear network coding. To obtain the original data, enough coded blocks must be collected at the same time, which has important applications in distributed storage. Storage nodes have limited communication, computing, and storage capabilities, and the purpose of storing coded modules rather than raw data is to improve system efficiency. For example, consider an (n, k) MDS code, where n storage nodes are used to store file modules: the original file is divided into k shares and encoded into n shares stored on the n different nodes. Each node stores a linear combination of the original data blocks. With suitably chosen parameters, this random linear coding technique allows a receiver to recover the original data with high probability by downloading any k modules from these nodes and solving a system of linear equations (SLEs). The receiver can therefore obtain the original file with low latency in data reconstruction and a low download volume in network communication.

In a benign environment, network coding can improve the efficiency of distributed storage systems. In a hostile environment, however, where an attacker may attack storage nodes, a potential problem arises, which we call a pollution attack. In such an attack, the attacker changes some of the stored encoding modules so that decoding goes wrong when the original file is restored and the correct file cannot be obtained. Since these encoding schemes linearly combine the raw data, a single corrupted encoding module affects the decoding of the entire file. The actual impact of pollution attacks is large and unpredictable.

In communication systems, the detection of pollution attacks in network-coding-based distributed storage relies on a cryptographic technique, the hash function. For example, in a practical P2P file sharing system, the data modules are usually hashed, and the hash values can be obtained from a trust center. By comparing the hash value held by the trust center with the hash value of each downloaded data module, a node can determine whether the downloaded module is legitimate. To apply this scheme to a P2P file sharing system based on network coding, a homomorphic hash function is introduced. In the proposed scheme, the hash value of an encoding module X can be obtained from the hash values of its submodules $x_i$ ($1 \le i \le k$), that is, when $X = \sum_{i=1}^{k} x_i$, $\mathrm{hash}(X) = \prod_{i=1}^{k} \mathrm{hash}(x_i)$. Assume that when a node joins the system for the first time, it can obtain the hash value of each module of a given file in a secure manner. These hash values can then be used to verify the integrity of the encoding modules downloaded by the node. To reduce the computational overhead caused by the homomorphic hash function, the literature [C. Gkantsidis and P. Rodriguez, "Cooperative Security for Network Coding File Distribution," Proc. IEEE INFOCOM, 2006.] proposes that nodes cooperate: when a node detects a maliciously tampered module, it notifies the other nodes. As a consequence, a given node does not authenticate every module by itself and has to rely on information sent by other nodes for verification.

In any case, every scheme that uses a hash function (homomorphic or not) needs a secure channel between the source and the sink in order to obtain the true hash of the original data modules. Another way to prevent pollution attacks is to digitally sign the data modules before they are added to the system. However, because intermediate nodes combine the data modules they receive, the digital signature scheme must also be homomorphic for the algorithm to work, similar to the case of the homomorphic hash function described above.

A homomorphic signature scheme has recently been proposed for distributed storage based on network coding. Unlike the homomorphic hash function scheme, the homomorphic signature scheme does not require a secure channel between the source and the sink. The literature [Z.Yu, T.Wei, B.Ramkumar, and Y.Guan, "An efficient signature-based scheme for securing network coding against pollution attacks", in Proc. IEEE Infocom, 2008.] first assumes that there is a trusted server, which generates the security parameters $(p, q, g)$, where $p$ and $q$ are two large primes satisfying $q \mid (p-1)$ (e.g. $|p| = 1024$ bits, $|q| = 257$ bits), and $g = (g_1, \ldots, g_r)$ is a row vector whose elements have order $q$ and are randomly selected from $Z_p$. It further assumes that the source has a pair of private and public keys $(SK, PK)$, and that the server then issues the public parameters $(p, q, g, PK)$ through the secure channel. For a file $id_f$, a group $id_g$, and the $k$ original data modules $b_i$ ($i = 1, \ldots, k$) of the group, the source calculates the signature by the following algorithm:

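Step 1: Calculate the hash value of each module $b_i = (b_{i1}, \ldots, b_{ir})$, i.e. $\sigma_i = \prod_{j=1}^{r} g_j^{b_{ij}} \bmod p$, $i = 1, \ldots, k$.

Step 2: Calculate the signature of the above hash values, $\mathrm{Sign}(SK, (id_f, id_g, \sigma_1, \ldots, \sigma_k))$.

Step 3: Generate the group signature $\sigma$, which comprises the hash values $\sigma_1, \ldots, \sigma_k$ and the signature from Step 2.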

When a node has just joined the system, it downloads from the server the signature $\sigma$ of the group for the given $id_f$ and $id_g$. Suppose the node then receives an encoded module $B$.

The node verifies the validity of $B$ by the following algorithm.

Step 1: Verify the validity of the signature $\sigma$ based on the known information $(PK, id_f, id_g, \sigma_1, \ldots, \sigma_k)$. If the signature $\sigma$ is invalid, the algorithm ends and the new node needs to re-download the signature $\sigma$ of the file.

Step 2: Compute the expected homomorphic hash of $B = (B_1, \ldots, B_r)$ from the signed hash values, i.e. $\sigma = \prod_{i=1}^{k} \sigma_i^{c_i} \bmod p$, where $c_i$ are the coding coefficients with which $B$ was formed.

Step 3: Calculate the hash of $B$ directly, i.e. $\sigma' = \prod_{j=1}^{r} g_j^{B_j} \bmod p$.

Step 4: Determine whether $\sigma$ is equal to $\sigma'$. If $\sigma = \sigma'$, the new node accepts $B$; otherwise it discards the corrupted $B$.

Since the homomorphic hash function has the property that the hash value of a linear combination of input modules equals the corresponding combination of the hash values of those modules, the signature scheme is correct.
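The homomorphic property can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the primes 23 and 11, the generators (4, 9, 2), and the function names `hhash`, `combine`, and `combined_hash` are hypothetical choices made for illustration and do not come from the cited scheme, which uses primes of about 1024 and 257 bits.

```python
# Toy parameters (assumption): Q divides P - 1, and each g in G has order Q in Z_P*.
P, Q = 23, 11
G = (4, 9, 2)

def hhash(block):
    """Hash of a block (b_1..b_r): product of g_j ** b_j mod P."""
    h = 1
    for g, b in zip(G, block):
        h = h * pow(g, b, P) % P
    return h

def combine(blocks, coeffs):
    """Linear combination of blocks over Z_Q, symbol by symbol."""
    r = len(blocks[0])
    return [sum(c * blk[j] for c, blk in zip(coeffs, blocks)) % Q for j in range(r)]

def combined_hash(hashes, coeffs):
    """Hash expected for a coded block: product of sigma_i ** c_i mod P."""
    h = 1
    for s, c in zip(hashes, coeffs):
        h = h * pow(s, c, P) % P
    return h

blocks = [[1, 2, 3], [4, 5, 6]]          # two original modules
coeffs = [7, 10]                         # coding coefficients
sigmas = [hhash(b) for b in blocks]      # per-module hashes (Step 1)
coded = combine(blocks, coeffs)          # encoded module B

# An untampered coded block passes the comparison of Step 4.
ok = hhash(coded) == combined_hash(sigmas, coeffs)

# Changing one symbol of B makes the two hashes disagree.
tampered = list(coded)
tampered[0] = (tampered[0] + 1) % Q
bad = hhash(tampered) == combined_hash(sigmas, coeffs)
```

Here `ok` is true and `bad` is false, because every generator has order Q, so exponents can be reduced modulo Q without changing the hash.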

However, the scheme has two problems. First, the homomorphic signature scheme is computationally expensive; second, it requires a public key infrastructure (PKI) to manage the signature authentication keys. These two problems make the scheme unusable in practical systems: because of its computational cost it cannot be used in sensor networks, and because of its need for a PKI it cannot be used in large-scale distributed systems.

In summary, the existing method for detecting and recovering pollution attacks in distributed storage is complicated, computationally expensive, and cannot guarantee the reliability of the distributed network storage system.

Summary of the invention

The technical problem to be solved by the present invention is that the above prior-art methods are complicated, computationally expensive, and unable to guarantee the reliability of a distributed network storage system. To address these defects, the invention provides a method and apparatus for recovering pollution of distributed storage data that are simple, have a small computational overhead, and ensure the reliability of the distributed network storage system.

The technical solution adopted by the present invention to solve the technical problem is: constructing a pollution recovery method for distributed storage data, comprising the following steps:

A) downloading the stored data from any k of the n storage nodes, or downloading the stored data from any k+1 of the n storage nodes;

B) calculating the original data X from the k downloaded data modules according to Y_j = XG_j and, where only k storage nodes were downloaded in step A), selecting a (k+1)-th storage node from the n storage nodes and downloading its stored data; where j = 1, 2, ..., n and k is less than n;

C) comparing the data module Y_{k+1} downloaded from the (k+1)-th storage node with the product of the original data X obtained in the above step and the matrix G_{k+1}; if they are the same, exiting the current loop and obtaining the data; otherwise, performing step D);

D) downloading data from at least one storage node again, at least one of the re-downloaded data modules being different from the previously downloaded k+1 data modules, replacing at least one of the previously downloaded data modules with the re-downloaded data module, and returning to step B).
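Steps A) to D) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the field is the prime field GF(257), each downloaded node is modelled as a pair (G_j, Y_j), the helper names `solve` and `recover` are hypothetical, and the re-selection policy of step D) is simply "try the next combination of k+1 nodes", which always differs from the previous one in at least one node.

```python
from itertools import combinations

Q = 257  # prime field size (assumption for this sketch)

def solve(g_cols, y_cols, q=Q):
    """Solve Y_j = X G_j for X (m rows, k columns) from k nodes; None if singular."""
    k, m = len(g_cols), len(y_cols[0])
    # One augmented row per node: coefficients G_j followed by the symbols of Y_j.
    a = [[v % q for v in list(g) + list(y)] for g, y in zip(g_cols, y_cols)]
    for c in range(k):
        piv = next((r for r in range(c, k) if a[r][c]), None)
        if piv is None:
            return None
        a[c], a[piv] = a[piv], a[c]
        inv = pow(a[c][c], -1, q)              # modular inverse (Python 3.8+)
        a[c] = [v * inv % q for v in a[c]]
        for r in range(k):
            if r != c and a[r][c]:
                f = a[r][c]
                a[r] = [(v - f * w) % q for v, w in zip(a[r], a[c])]
    return [[a[i][k + l] for i in range(k)] for l in range(m)]

def recover(nodes, k, q=Q):
    """Return (X, node indices used), or (None, None) if no consistent set is found."""
    for chosen in combinations(range(len(nodes)), k + 1):    # steps A) and D)
        *dec, chk = chosen
        x = solve([nodes[j][0] for j in dec], [nodes[j][1] for j in dec], q)  # step B)
        if x is None:
            continue
        g, y = nodes[chk]
        expected = [sum(row[i] * g[i] for i in range(k)) % q for row in x]
        if expected == list(y):                               # step C)
            return x, chosen
    return None, None

# Example: k = 2, n = 5, generator columns (1, a) for a = 1..5, node 0 polluted.
G = [[1, 1, 1, 1, 1], [1, 2, 3, 4, 5]]
X = [[1, 2], [3, 4]]
nodes = []
for j in range(5):
    g = [G[0][j], G[1][j]]
    y = [(row[0] * g[0] + row[1] * g[1]) % Q for row in X]
    nodes.append((g, y))
nodes[0] = (nodes[0][0], [4, 7])   # correct value would be [3, 7]
recovered, used = recover(nodes, k=2)
```

In this example every selection containing the polluted node 0 fails the check of step C), and the data is recovered from nodes 1 and 2 and verified with node 3.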

In the pollution recovery method for distributed storage data according to the present invention, step D) further includes:

D1) selecting k+1 storage nodes, at least one of which is different from the k+1 storage nodes from which data was previously selected and downloaded;

D2) downloading the data modules of the selected k+1 storage nodes, and replacing the previously downloaded data modules with the k+1 newly downloaded data modules.

In the pollution recovery method for distributed storage data according to the present invention, step D) further includes:

D31) setting the number τ of storage nodes to be selected and downloaded again and, if τ has already been set, incrementing τ by 1; where τ is less than k;

D41) selecting τ or τ+1 storage nodes, at least one of which is different from the k storage nodes from which data was previously selected and downloaded for obtaining the value of X;

D51) downloading the data modules of the selected τ or τ+1 storage nodes, and using the newly downloaded τ or τ+1 data modules to replace τ or τ+1 of the data modules used to calculate the value of X.

In the pollution recovery method for distributed storage data according to the present invention, step D) further includes the following step:

D61) selecting a storage node different from the (k+1)-th storage node previously used for comparison, downloading its data, and using that data to replace the (k+1)-th data module used for comparison.

In the pollution recovery method for distributed storage data according to the present invention, step D) further includes:

D32) setting the number τ of storage nodes to be selected and downloaded again, where τ is less than k;

D42) selecting τ storage nodes, at least one of which is different from the k storage nodes from which data was previously selected and downloaded for obtaining the value of X;

D52) downloading the data modules of the selected τ storage nodes, and using the newly downloaded τ data modules to replace any τ of the data modules used to calculate the value of X.

In the pollution recovery method for distributed storage data according to the present invention, step D) further includes the following step:

D62) selecting a storage node different from the (k+1)-th storage node previously used for comparison, downloading its data, and using that data to replace the (k+1)-th data module used for comparison.

In the pollution recovery method for distributed storage data according to the present invention, the following step is further included:

A0) dividing the original data X into k parts, encoding these parts with an (n, k) MDS code to obtain n linearly independent data modules Y_j, and storing the n data modules Y_j together with the generator matrix on the n storage nodes, respectively.

The invention also relates to an apparatus for implementing the above method, comprising:

a data download unit: configured to download the stored data from any k of the n storage nodes, or to download the stored data from any k+1 of the n storage nodes;

a data obtaining unit: configured to compute the original data X from the k downloaded data modules according to Y_j = XG_j, and to select a (k+1)-th storage node from the n storage nodes and download its stored data; or to compute the original data X from k of the downloaded data modules according to Y_j = XG_j;

a data comparison unit: configured to compare the data module Y_{k+1} downloaded from the (k+1)-th storage node with the product of the original data X obtained above and the matrix G_{k+1};

a data recovery unit: configured to download data from at least one storage node again according to the output of the data comparison unit, where at least one of the re-downloaded data modules is different from the previously downloaded k+1 data modules, and to replace at least one of the previously downloaded data modules with the re-downloaded data module.

In the device of the present invention, the data recovery unit further includes:

a first selection module: configured to select k+1 storage nodes, wherein at least one of the selected storage nodes is different from the k+1 storage nodes from which data was previously selected and downloaded;

a first data downloading and replacing module: configured to download the data modules of the selected k+1 storage nodes, and to replace the previously downloaded data modules with the k+1 newly downloaded data modules;

a first subset setting module: configured to set the number τ of storage nodes to be selected and downloaded again and, if τ has already been set, to increment τ by 1; where τ is less than k;

a second selection module: configured to select τ or τ+1 storage nodes, wherein at least one of the selected storage nodes is different from the k storage nodes from which data was previously selected and downloaded for obtaining the value of X;

a second data downloading and replacing module: configured to download the data modules of the selected τ or τ+1 storage nodes, and to use the newly downloaded τ or τ+1 data modules to replace any τ or τ+1 of the data modules used to calculate the value of X;

a second comparison data downloading and replacing module: configured to select a storage node different from the (k+1)-th storage node previously used for comparison, to download its data, and to use that data to replace the (k+1)-th data module used for comparison;

a second subset setting module: configured to set the number τ of storage nodes to be selected and downloaded again, where τ is less than k;

a third selection module: configured to select τ storage nodes, wherein at least one of the selected storage nodes is different from the k storage nodes from which data was previously selected and downloaded for obtaining the value of X;

a third data downloading and replacing module: configured to download the data modules of the selected τ storage nodes, and to use the newly downloaded τ data modules to replace τ of the data modules used to calculate the value of X;

a third comparison data downloading and replacing module: configured to select a storage node different from the (k+1)-th storage node previously used for comparison, to download its data, and to use that data to replace the (k+1)-th data module used for comparison.

The method and device for recovering pollution of distributed storage data according to the present invention have the following beneficial effects: the homomorphic signature and the public key infrastructure of the prior art are not required, and the correct original data can be obtained with an acceptable amount of data download and computation. The method is therefore relatively simple, its computational overhead is small, and the reliability of the distributed network storage system can be guaranteed.

Drawings

Figure 1 is a flow chart of the method in a first embodiment of the method and apparatus for recovering pollution of distributed storage data according to the present invention;

Figure 2 is a flow chart showing the steps of eliminating pollution data in the first embodiment;

Figure 3 is a schematic structural view of the device in the first embodiment;

Figure 4 is a flow chart of the method for eliminating pollution data in a second embodiment of the method and apparatus for recovering pollution of distributed storage data according to the present invention;

Figure 5 is pseudo code showing the data acquisition and recovery process in the second embodiment;

Figure 6 is a schematic structural view of the device in the second embodiment;

Figure 7 is a flow chart of the method for eliminating pollution data in a third embodiment of the method and apparatus for recovering pollution of distributed storage data according to the present invention;

Figure 8 is pseudo code showing the data acquisition and recovery process in the third embodiment;

Figure 9 is a schematic structural view of the device in the third embodiment.

Detailed description

The embodiments of the present invention will be further described below in conjunction with the accompanying drawings.

As shown in FIG. 1, in the first embodiment of the pollution recovery method and apparatus for distributed storage data of the present invention, the data is downloaded, the contaminated (or maliciously modified) part of the downloaded data is removed, and the correct original data is obtained. The process includes the following steps.

Step S11: divide the original data into k shares and, after encoding, store them in n storage nodes. In this step, the data is distributed and stored on the storage nodes of the network according to the rules of distributed (network) storage. Distributed (network) storage spreads data over multiple independent devices. Traditional network storage systems use a centralized storage server to store all data; the storage server becomes the bottleneck of system performance and the focal point of reliability and security concerns, and cannot meet the needs of large-scale storage applications. A distributed network storage system uses a scalable architecture, shares the storage load among multiple storage nodes, and uses an index server to locate stored information, which not only improves system reliability, availability, and access efficiency but also makes the system easy to expand. Generally, a distributed storage system includes k source nodes, n storage nodes, and at least one receiving node; in practice, the sets of source nodes, storage nodes, and receiving nodes may overlap. The devices at the source nodes and storage nodes have relatively little energy, while the device at the receiving node has relatively more.

In this step, a file M is divided into k equal-sized shares, and the different original data modules $X_i$ ($i = 1, 2, \ldots, k$) are stored by k different source nodes. An MDS code is used to generate n different encoding modules $Y_j$ ($j = 1, 2, \ldots, n$); the n encoding modules are linearly independent, and any k of them can reconstruct the file M, where k is less than n.

Due to the characteristics of the MDS code, we have $Y_j = XG_j$, where $X = (X_1, X_2, \ldots, X_k)$ is the row vector of all the original data modules and $G_j = (g_{1j}, g_{2j}, \ldots, g_{kj})^T$ is a column vector whose non-zero elements are chosen at random in random linear coding. Therefore $g_{ij} \in GF(q)$, where $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n$. The data $Z_j = (G_j, Y_j)$ stored by each storage node j thus satisfies $Y_j = XG_j$. The entire system can be represented by $Y = XG$, where $Y = (Y_1, Y_2, \ldots, Y_n)$ is the row vector of all coding modules and $G = (G_1, G_2, \ldots, G_n)$ is a $k \times n$ generator matrix, each column of which contains one coefficient vector.
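The relation Y = XG can be made concrete with a short Python sketch. Several choices here are assumptions for illustration only: the field is the prime field GF(257), and a Vandermonde matrix is swapped in for the random coefficients described above because any k of its columns are guaranteed to be invertible; the names `vandermonde_generator`, `encode`, and `node_contents` are hypothetical.

```python
Q = 257  # prime field size (assumption; the text only requires GF(q))

def vandermonde_generator(k, n, q=Q):
    """k x n generator matrix; column j is (1, a, a^2, ...) with a = j + 1.

    Distinct evaluation points make every set of k columns invertible,
    which is the MDS property. Requires n < q.
    """
    return [[pow(j + 1, i, q) for j in range(n)] for i in range(k)]

def encode(x, g, q=Q):
    """Y = X G over GF(q); X has m rows and k columns, G is k x n."""
    k, n = len(g), len(g[0])
    return [[sum(row[i] * g[i][j] for i in range(k)) % q for j in range(n)]
            for row in x]

def node_contents(y, g, j):
    """Z_j = (G_j, Y_j): what storage node j keeps."""
    return [g[i][j] for i in range(len(g))], [row[j] for row in y]

G = vandermonde_generator(k=2, n=4)
X = [[1, 2], [3, 4]]          # m = 2 symbols per module, k = 2 modules
Y = encode(X, G)
Z2 = node_contents(Y, G, 2)   # coefficient vector and coded module of node index 2
```

Each column of `Y` is the coded module of one storage node, and `node_contents` pairs it with the matching column of `G`, mirroring Z_j = (G_j, Y_j).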

For the theoretical basis for recovering the raw data from k data modules, see Theorems 1 and 2 in the literature [A. G. Dimakis, V. Prabhakaran, and K. Ramchandran, "Ubiquitous Access to Distributed Data in Large-Scale Sensor Networks through Decentralized Erasure Codes," Proc. Int'l Symp. Information Processing in Sensor Networks (IPSN '05), 2005.]; they are not repeated here.

It is worth mentioning that this step performs the storage of the data, which is the basis of the first embodiment. However, it is not necessarily part of the pollution removal method itself; it is explained here only to show the basis on which the method of the first embodiment operates and to give a complete understanding of the technical solution.

Step S12: arbitrarily select k+1 of the n storage nodes and download their data. In this step, k+1 of the n storage nodes are selected arbitrarily and the encoded data modules stored on them are downloaded.

Step S13: compute the original data X from any k of the downloaded data modules. In this step, k of the k+1 downloaded data modules are taken and the original data X is computed according to Y_j = XG_j.

Step S14: multiply the obtained original data X by the matrix of the (k+1)-th storage node. In this step, the matrix G_{k+1} obtained from the (k+1)-th storage node is multiplied by the original data X obtained in the preceding step, yielding a data module, namely the data module that the (k+1)-th storage node should store if X is correct. Here it is assumed that the original data X obtained in the preceding step is correct, that is, that the data modules on the k storage nodes used in that step are not contaminated (or falsified).

Step S15: determine whether the data module calculated in step S14 is the same as the downloaded (k+1)-th data module. If yes, go to step S16; otherwise, go to step S17. In this step, if the k storage nodes used to recover the original data in the above step are not contaminated, and the (k+1)-th storage node is also not contaminated, the downloaded (k+1)-th data module should be equal to the data module calculated in step S14. If the two are equal, the original data X obtained in step S13 is judged to be correct; otherwise, at least one of the above k+1 data modules has been contaminated or tampered with, so that either the original data obtained in step S13 is incorrect or the (k+1)-th storage node used for comparison is contaminated.

Step S16: the downloaded data is not contaminated; exit. In this step, since the original data has been judged to be correct, no further processing is required, so the current loop is exited and the data is obtained.

Step S17: download at least one data module different from the previously downloaded data modules, and use it to replace one or more of the k+1 downloaded data modules. In this step, at least one of the above k+1 data modules has been contaminated or tampered with, so that either the original data obtained in step S13 is incorrect or the (k+1)-th storage node used for comparison is contaminated. Therefore a new data module that has not been downloaded before must be downloaded and used to replace a previously downloaded data module; the process then returns to step S13 and recalculates with the newly downloaded data to obtain the correct original data X or a correct (k+1)-th data module for verification. In this step, at least one data module is downloaded, and at least one of the newly downloaded data modules has not been downloaded before. This step is described in detail later.

In the above steps of the first embodiment, k+1 data modules are downloaded at once; k of them are used to restore the original data, and one is used to verify whether the restored original data is correct. In some cases, k data modules may instead be downloaded first and used to recover the original data; after the original data is restored, one more data module, from a storage node different from those k, is downloaded to verify whether the obtained original data is correct. The two approaches have the same effect. For the latter, the relevant steps in the first embodiment require minor adjustments to suit the download order.

In the first embodiment, as shown in FIG. 2, step S17 further includes the following steps.

Step S21: select k+1 storage nodes, at least one of which is different from the storage nodes downloaded last time. In this step, k+1 storage nodes are first reselected. Since the storage nodes downloaded last time are known, it is easy to ensure that the selected k+1 storage nodes are not identical to them: it suffices to remove one or more entries from the list of storage nodes downloaded last time (or already downloaded) and add the same number of storage nodes that have not been downloaded.

Step S22: download the data on the selected k+1 storage nodes, and replace the previously downloaded data with the newly downloaded data. In this step, the data modules on the storage nodes selected in the previous step are downloaded and used to replace the data modules used in the last computation or verification. After this step, the process jumps to step S13.

It should be noted that, in the case where the original data is obtained by first downloading k data modules and then downloading one more data module to verify the result, steps S21 and S22 likewise first download k data modules, including ones not downloaded before, to compute the original data X, and then download one previously undownloaded data module to verify the obtained original data X.

In the first embodiment, each data module $X_i$ can itself be represented as a column vector $(x_{i1}, x_{i2}, \ldots, x_{im})^T$ containing m symbols, where $x_{il} \in GF(q)$ for all $i$ and $l = 1, 2, \ldots, m$. Thus each coding module $Y_j$ can also be represented as a column vector containing m symbols in $GF(q)$. The linear combination $Y_j = XG_j$ is computed symbol by symbol, which means that for all $j = 1, 2, \ldots, n$ and $l = 1, 2, \ldots, m$,

$$y_{jl} = \sum_{i=1}^{k} x_{il}\, g_{ij}.$$

Therefore, we can regard X and Y as matrices of size $m \times k$ and $m \times n$, respectively.

In the first embodiment, the effect of a data pollution attack on the decoding of the stored data can be analyzed as follows. It is assumed that the attacker can access t storage nodes and can observe and modify the equations (data) stored by those nodes; that is, if the attacker can access storage node j, it can change the $G_j$ and $Y_j$ stored by node j. Let $G^* = G + \Delta G$ and $Y^* = Y + \Delta Y$ be the generator matrix and the coding module vector after the attack, where $\Delta G$ and $\Delta Y$ denote the tampering applied to the matrix and the vector. Assuming that the attacker can also alter the communication links of the t storage nodes obviously gives the attacker more options, but does not enlarge the possible impact of the attack. For simplicity, the nodes whose stored data has been tampered with are called compromised nodes, and compromised nodes cannot be distinguished from ordinary storage nodes. In line with practical attack scenarios, it is assumed that the source nodes cannot be attacked and that only the output of the storage system is changed. This is because the storage nodes are exposed to attack for a long time, whereas a source node can only be attacked during the limited period in which the data is generated, so its probability of being attacked is much lower.

Without loss of generality, assume that the attacker randomly tampers with t storage nodes and that the receiving node randomly selects k storage nodes and downloads k linear equations. The set of equations downloaded by the receiving node is $Z^*_{1..k} = (G^*_{1..k}, Y^*_{1..k})$, where $G^*_{1..k} = (G^*_1, G^*_2, \ldots, G^*_k)$ and $Y^*_{1..k} = (Y^*_1, Y^*_2, \ldots, Y^*_k)$. To obtain X, the receiving node solves the SLEs $Y^*_{1..k} = XG^*_{1..k}$ and obtains the result $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$.

First assume that the attacker only tampers with the encoding modules, which means $G^* = G$. In this case,

$$X^* = Y^*_{1..k}(G_{1..k})^{-1}.$$

The change that the attack causes to the original data can be written as follows:

$$\Delta X = X^* - X = Y^*_{1..k}(G_{1..k})^{-1} - X = (Y^*_{1..k} - Y_{1..k})(G_{1..k})^{-1} = \Delta Y_{1..k}(G_{1..k})^{-1},$$

where we used $X = Y_{1..k}(G_{1..k})^{-1}$. This means that if a given row of $\Delta Y_{1..k}$ contains only zero elements, the corresponding row of $\Delta X$ contains only zero elements, while a non-zero element in a given row of $\Delta Y_{1..k}$ affects the entire corresponding row of $\Delta X$. Therefore, tampering with a given row of any of the k coding modules affects the decoding of that row across all data modules, but the effect is limited to the corresponding row.
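The row-wise effect of tampering with the encoding modules can be checked numerically. The following Python sketch is an illustration under assumptions: it works in GF(257) with k = 2 and m = 2, uses an explicit 2 x 2 inverse, and the helper names `inv2` and `matmul` are hypothetical.

```python
Q = 257  # prime field size (assumption for this sketch)

def inv2(mat, q=Q):
    """Inverse of a 2 x 2 matrix over GF(q)."""
    (a, b), (c, d) = mat
    det_inv = pow((a * d - b * c) % q, -1, q)   # modular inverse (Python 3.8+)
    return [[d * det_inv % q, -b * det_inv % q],
            [-c * det_inv % q, a * det_inv % q]]

def matmul(a, b, q=Q):
    """Matrix product over GF(q)."""
    return [[sum(x * y for x, y in zip(row, col)) % q for col in zip(*b)]
            for row in a]

G = [[1, 1], [1, 2]]          # columns are the coefficient vectors G_1, G_2
X = [[1, 2], [3, 4]]          # m = 2 rows of symbols, k = 2 data modules
Y = matmul(X, G)              # untampered coded modules

dY = [[1, 0], [0, 0]]         # attacker changes only row 0 of module Y_1
Y_star = [[(y + d) % Q for y, d in zip(ry, rd)] for ry, rd in zip(Y, dY)]

X_star = matmul(Y_star, inv2(G))                 # what the receiver decodes
dX = [[(s - x) % Q for s, x in zip(rs, rx)] for rs, rx in zip(X_star, X)]
```

Row 0 of `dX` is non-zero in every column while row 1 stays all zero, and `dX` equals `matmul(dY, inv2(G))`, in line with the expression for ΔX above.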

Next assume that the attacker only changes the coefficient vectors, which means $Y^* = Y$. In this case,

$$X^* = Y_{1..k}(G^*_{1..k})^{-1}.$$

If at least one of the k coefficient vectors is changed by the adversary, $(G^*_{1..k})^{-1}$ is completely different from $(G_{1..k})^{-1}$. Therefore, this change affects the decoding of the data blocks in all rows.

Finally, suppose the attacker changes the coefficient vectors and the encoding modules at the same time; the two effects are then combined. In general, the change to the data modules caused by an attacker follows from

$$X^* G^*_{1..k} = Y^*_{1..k} = XG_{1..k} + \Delta Y_{1..k}.$$

Thus

$$\Delta X = X^* - X = (\Delta Y_{1..k} - X\,\Delta G_{1..k})(G^*_{1..k})^{-1}.$$

It can be observed from the above formula that if $\Delta Z_{1..k}$ is entirely controlled by the attacker, meaning that all downloaded equations come from compromised nodes, the value of $\Delta X$ can be chosen by the attacker: the attacker can reconstruct X from the content stored by the nodes, and can then produce an arbitrary $X^* = X + \Delta X$ by forging the content stored in the i-th compromised node as $Y^*_i = X^* G_i$. Therefore, in the case of $t \ge k$, the attacker can not only corrupt the original data module vector but also forge it to a specific value. In fact, a small change in the encoded information stored by a node results in a large change in the decoded data. In the worst case, all data modules are destroyed.

In the first embodiment, the basic idea of the method is as follows. In most cases, the attacker cannot forge a particular solution $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$, because it cannot tamper with all k equations that were initially downloaded. Unless none of the k equations has been tampered with (in which case $X^* = X$), $X^*$ can be considered a random vector. If $X^* \neq X$, then $X^*$ takes a random value from a set of size at least q. Suppose there is another equation that has not been tampered with, $Z^*_{k+1} = Z_{k+1}$, downloaded by the receiving node. If $X^*$ is random or chosen by the attacker, the probability that it does not satisfy this untampered equation is very large, whereas if $X^* = X$ it satisfies the equation with probability 1. Therefore, we can judge whether the data module vector $X^*$ is contaminated by means of an additional equation that has not been tampered with.

Accordingly, in the first embodiment, the receiving node first downloads k equations $Z^*_{1..k}$ and calculates $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$. The receiving node then downloads another equation $Z^*_{k+1}$. If $Y^*_{k+1} = X^*G^*_{k+1}$, no attack is detected ($X^*$ is treated as the correct solution); otherwise, if $Y^*_{k+1} \neq X^*G^*_{k+1}$, an attack is signalled. The communication complexity of this algorithm is k+1 downloaded equations and its computational complexity is 1 solved system of linear equations. Since any code capable of detecting errors has a minimum Hamming distance of at least 2, any attack detection algorithm in the described system must download at least k+1 equations. Therefore, the proposed attack detection algorithm is optimal in terms of communication complexity.
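The detection step can be illustrated with a minimal Python sketch. It is not the patent's implementation: it assumes a small prime field GF(257) in place of the field of size q, represents each equation as a hypothetical pair `(g, y)` of coefficient vector and encoded column, and the helper names `mat_inv`, `encode`, `decode` and `detect` are invented for illustration.

```python
P = 257  # hypothetical prime; stands in for the field size q

def mat_inv(M, p=P):
    """Invert a square matrix over GF(p) by Gauss-Jordan elimination."""
    n = len(M)
    A = [[v % p for v in row] + [int(i == j) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col]), None)
        if pivot is None:
            raise ValueError("coefficient matrix is singular over GF(p)")
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], -1, p)  # modular inverse (Python 3.8+)
        A[col] = [(v * inv) % p for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * b) % p for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

def encode(X, g, p=P):
    """Y_j = X G_j: combine the k columns of X with coefficient vector g."""
    return [sum(x * c for x, c in zip(row, g)) % p for row in X]

def decode(equations, p=P):
    """X* = Y_{1..k} (G_{1..k})^-1 from k equations (g_j, y_j)."""
    G = [list(r) for r in zip(*(g for g, _ in equations))]  # column j is g_j
    Y = [list(r) for r in zip(*(y for _, y in equations))]  # column j is y_j
    G_inv = mat_inv(G, p)
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*G_inv)]
            for row in Y]

def detect(equations, test, p=P):
    """Return (X*, attacked); attacked is True when the test equation fails."""
    X = decode(equations, p)
    g, y = test
    return X, encode(X, g, p) != [v % p for v in y]
```

With Vandermonde coefficient vectors `[1, a, a*a]` for distinct `a`, any k = 3 equations are linearly independent, so `detect(eqs[:3], eqs[3])` decodes X and checks it against the fourth equation at the cost of one extra download.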

The above attack detection may produce a false negative decision (an undetected error), mainly in two cases: $X^*$ is a random value, or $X^*$ is a specific value forged by the attacker. Assuming that at least one of the k downloaded equations is correct, or that the attacker's modifications of the contents stored by the nodes are uncorrelated, $X^*$ can be considered a random value. Suppose the attacker does not tamper with the coefficient matrix, i.e., $G^* = G$. From the above analysis, the solution obtained by the sink is $X^* = X + \Delta Y_{1..k}(G_{1..k})^{-1} = X + \Delta X$. Further assume that the additional equation used for error detection has not been tampered with: $Z^*_{k+1} = Z_{k+1} = (G_{k+1}, Y_{k+1})$. In this case the probability of a false negative decision is

$P_{fneg} = \Pr\{Y_{k+1} = X^*G_{k+1} \mid \Delta Y_{1..k} \neq 0\}$

$= \Pr\{Y_{k+1} = (X + \Delta X)G_{k+1} \mid \Delta Y_{1..k} \neq 0\}$

$= \Pr\{\Delta X\,G_{k+1} = 0 \mid \Delta Y_{1..k} \neq 0\}$, (1)

where the last step follows from $Y_{k+1} = XG_{k+1}$. If $\Delta Y_{1..k}$ has a non-zero element in the i-th row and $G_{1..k}$ is correct, then $\Delta X$ also has some non-zero elements in the i-th row. Otherwise, if the i-th row of $\Delta Y_{1..k}$ contains only zero elements, the i-th row of $\Delta X$ contains only zero elements. The i-th row element of $\Delta X\,G_{k+1}$ can be written as

$\Delta X_{(i)}\,G_{k+1}$, (2)

where $\Delta X_{(i)}$ denotes the i-th row of $\Delta X$; expression (2) is a linear combination of the elements of $G_{k+1}$. These elements are randomly selected, and since the downloaded equation is random and not known to the attacker in advance, the probability that expression (2) is 0 equals 1/q. If the rows of $\Delta X$ are uncorrelated, then $P_{fneg} = 1/q^{t'}$, (3), where $t'$ is the number of rows of the matrix $\Delta Y_{1..k}$ that contain non-zero elements.

When the modifications made by the attacker are correlated, $P_{fneg} \le 1/q$ still holds. This follows from the fact that $X^*$ takes its value randomly from a set of size at least q. Obviously, in order to maximize the probability that the error goes undetected (and thus its own probability of success), an attacker must limit the tampering of the encoding module to a single row or make the tampered rows linearly correlated.

Now suppose the attacker does not change the coefficient matrix, i.e., $G^* = G$, but the additional equation used for detection has been tampered with, i.e., $Z^*_{k+1} = (G_{k+1}, Y^*_{k+1}) = (G_{k+1}, Y_{k+1} + \Delta Y_{k+1})$. In this case, by an argument similar to the previous simple case, $P_{fneg} = \Pr\{\Delta X\,G_{k+1} = \Delta Y_{k+1} \mid \Delta Y_{1..k} \neq 0\}$, (4). From the earlier analysis of the impact of the attack: if the i-th row of $\Delta Y_{1..k}$ contains only zero elements, the i-th row of $\Delta X$ also contains only zero elements, in which case the i-th row of $\Delta X\,G_{k+1}$ must also be zero. Therefore, if the i-th element of $\Delta Y_{k+1}$ is non-zero for such a row, the above error probability is 0 (i.e., even though the additional equation used for detection is not correct, the attack is detected). On the other hand, if $\Delta Y_{k+1}$ is zero in every row in which $\Delta Y_{1..k}$ contains only zero elements, then, due to the randomness of $G_{k+1}$, it can be concluded that $P_{fneg} \le 1/q$.

If the attacker tampers with the coefficient vectors and the encoding modules at the same time, then $\Delta G_{1..k} \neq 0$ and $\Delta Y_{1..k} \neq 0$. This situation must be handled with care, because the value $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$ is not completely random. For example, if $Y^*_i = XG^*_i$ for the tampered equations, then even though $\Delta G_{1..k} \neq 0$ and $\Delta Y_{1..k} \neq 0$, we have $\Delta X = 0$. Obviously this kind of modification cannot be seen as an attack, because the falsified equations are not contaminated. This example points out that even if all elements of the coefficient vector and of the encoding module are modified, the result may be equivalent to a correct equation with only one modified element. So if we consider the strongest possible correlation between the tampered elements, we return to the previous situation, i.e., the bound $P_{fneg} \le 1/q$, (5), also holds in this case. In the case of a random attack by the adversary, all values of $X^*$ are random and uncorrelated, so the bound reduces to $1/q^m$.

So, when not all downloaded equations belong to the same value $X^* \neq X$, the maximum probability that an error goes undetected is $P^{max}_{fneg} = 1/q$. If q is large enough, the probability of not detecting a pollution attack is therefore negligible. Of course, the larger q is, the greater the communication and storage overhead. Note that if the encoding module contains standard error detection elements, such as a CRC checksum, the attacker must tamper with at least two rows in each attacked encoding module and keep the modifications uncorrelated; in this case $P_{fneg} \le 1/q^2$. This makes it possible to select smaller fields and thus speed up the calculations over the finite field. In fact, if $q = 2^{20}$, the error probability is $2^{-40}$.
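The bound above is easy to check with exact arithmetic. The following Python sketch is only an illustration; the function name `false_negative_bound` and its parameters are assumptions, not part of the original text.

```python
from fractions import Fraction

def false_negative_bound(q, tampered_rows=1):
    """Upper bound 1/q**t' on a missed attack when t' uncorrelated rows are tampered."""
    return Fraction(1, q ** tampered_rows)

# Field of size q = 2**20 with a checksum forcing at least two tampered rows.
bound = false_negative_bound(2 ** 20, tampered_rows=2)
print(bound == Fraction(1, 2 ** 40))
```

Using `Fraction` avoids floating-point rounding for such small probabilities.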

Now consider the case in which all k+1 downloaded equations are contaminated and $X^*$ is forged by the attacker. In this case the method of the first embodiment does not detect the attack, because the additional detection equation is also satisfied. The probability that such an error goes undetected is $\Delta = \binom{t}{k+1} / \binom{n}{k+1} \le (t/n)^{k+1}$. For a practical system, the value of $\Delta$ depends mainly on the number of compromised nodes; when t is small relative to n and k is sufficiently large, the value is very small (e.g., when n = 100, k = 10, t = 20, $\Delta \approx 10^{-9}$). Therefore, an upper bound on the probability of an undetected error is $P_{fneg} \le 1/q + \Delta$. In most cases it can be assumed that n is much larger than t, so $\Delta$ is close to 0; however, under a strong attack with a large value of t, $\Delta$ cannot be ignored.

In addition, the attack detection can also produce a false positive decision (an attack reported although the computed solution is correct). This happens, for example, when the k equations downloaded at the beginning are correct, i.e., $Z^*_{1..k} = Z_{1..k}$, so that the receiving node computes the correct solution $X^* = Y^*_{1..k}(G^*_{1..k})^{-1} = Y_{1..k}(G_{1..k})^{-1} = X$, but the additional equation used for attack detection is not correct. The probability of a false positive decision can therefore be defined as $P_{fpos} = \Pr\{\Delta Z_{k+1} \neq 0 \mid \Delta Z_{1..k} = 0\}$, (6). Given that the k equations downloaded at the beginning are correct, the probability that the (k+1)-th equation is also correct is

$(n-k-t)/(n-k)$, (7)

where t is the number of storage nodes compromised by the attacker. From equation (7),

$P_{fpos} = 1 - (n-k-t)/(n-k) = t/(n-k)$. (8)

Although $P_{fpos}$ cannot be ignored, a false positive by its nature has no serious impact on the system.

In summary, the method in the first embodiment decides, for any set of k+1 equations, whether the set contains a contaminated equation. If an attack is found, we only know that the set contains some contaminated equations, but not how many or which ones. Therefore, the recovery algorithm has to find, among the downloaded equations, a correct set from which to restore the original data.
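The two probabilities discussed above, the chance that all k+1 downloaded equations come from compromised nodes and the false positive probability, can be evaluated numerically. The sketch below assumes that the k+1 nodes are drawn uniformly at random from the n storage nodes; the function names are illustrative only.

```python
from math import comb

def all_polluted_probability(n, k, t):
    """Delta: probability that all k+1 randomly chosen nodes are among the t compromised."""
    return comb(t, k + 1) / comb(n, k + 1)

def false_positive_probability(n, k, t):
    """P_fpos = 1 - (n-k-t)/(n-k) = t/(n-k): k correct equations, polluted test equation."""
    return t / (n - k)

delta = all_polluted_probability(100, 10, 20)
print(delta, (20 / 100) ** 11)            # Delta and its upper bound (t/n)^(k+1)
print(false_positive_probability(100, 10, 20))
```

For n = 100, k = 10, t = 20 the computed Delta is on the order of 10^-9 and stays below the bound (t/n)^(k+1).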

From the above analysis, it can be seen that the file cannot be successfully recovered when the number of correct equations is less than k+1, because at least k correct equations are needed to obtain the correct solution of the data module vector, and one additional correct equation is needed to verify that solution. Since q is large enough and $\Delta$ is usually negligible, the results obtained by the above method can be regarded as correct with overwhelming probability.

In the first embodiment, the invention also relates to an apparatus for implementing the above-described pollution recovery method for distributed storage data. As shown in FIG. 3, the apparatus includes a data distribution and storage unit 31, a data download unit 32, a data obtaining unit 33, a data comparing unit 34, and a data recovery unit 35. The data download unit 32 is configured to download the stored data from any k of the n storage nodes, or from any k+1 of the n storage nodes. The data obtaining unit 33 is configured, in one case (when k data modules are downloaded first), to compute the original data X from the k downloaded data modules according to Y_j = XG_j and then to select a (k+1)-th storage node among the n storage nodes and download its stored data; in the other case (when k+1 data modules are downloaded at once), it computes the original data X from k of the downloaded data modules according to Y_j = XG_j. The data comparing unit 34 is configured to compare the data module Y_k+1 downloaded from the (k+1)-th storage node with the product of the original data X obtained above and the matrix G_k+1; if they are the same, the k+1 downloaded data modules are judged to be uncontaminated; otherwise, at least one data module is judged to be contaminated and needs to be cleared. The data recovery unit 35 is configured, according to the output (or judgment) of the data comparing unit, to download the data of at least one storage node again, where at least one of the re-downloaded data modules differs from the k+1 data modules downloaded last time, and to replace at least one of the previously downloaded data modules with the re-downloaded data module. The data distribution and storage unit 31 is configured to divide the original data X into k shares X_i, where i = 1, 2, ..., k, and to obtain n linearly independent data modules Y_j by encoding with the MDS code described above, where j = 1, 2, ..., n; the n data modules Y_j and the generation matrices G_j are stored on the n storage nodes respectively, where k is less than n. The data distribution and storage unit 31 is not strictly essential; it is listed here for ease of understanding.

In this embodiment, the data recovery unit 35 further includes a first selection module 351 and a first data download and replace module 352. The first selection module 351 is configured to select k+1 storage nodes, where at least one of the selected storage nodes differs from the k+1 storage nodes from which data was previously downloaded. The first data download and replace module 352 is configured to download the data modules of the selected k+1 storage nodes and to replace the previously downloaded data modules with the newly downloaded k+1 data modules.
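The recovery loop of the first embodiment can be sketched in Python as follows. This is an illustration of the control flow only: `download` and `detect` are hypothetical callbacks supplied by the caller (`detect(equations, test)` is assumed to return a pair `(solution, attacked)`), and `max_tries` is an added safeguard that the original text does not mention.

```python
import random

def recover_by_resampling(nodes, k, download, detect, max_tries=1000, rng=random):
    """Redraw k+1 storage nodes until the extra equation confirms the solution.

    nodes must be a sequence (e.g. a list) of node identifiers.
    """
    tried = set()
    for _ in range(max_tries):
        chosen = rng.sample(nodes, k + 1)
        key = frozenset(chosen)
        if key in tried:
            continue  # the new selection must differ in at least one node
        tried.add(key)
        equations = [download(node) for node in chosen]
        # The first k equations give X*; the (k+1)-th one is the test equation.
        solution, attacked = detect(equations[:k], equations[k])
        if not attacked:
            return solution
    return None  # no clean set of k+1 nodes was found within max_tries
```

Because a whole new set of k+1 nodes is drawn after each failed check, only one system of linear equations is solved per attempt, at the price of more downloads.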

In the second embodiment of the present invention, most of the steps of the method and the units of the apparatus are substantially the same as in the first embodiment. The difference lies in how data judged to be contaminated is cleared, which in turn makes the structure of the data recovery unit in the apparatus slightly different. As shown in FIG. 4, in the second embodiment, the step of downloading at least one data module different from the previously downloaded data modules and using it to replace one or more of the k+1 downloaded data modules specifically includes:

Step S41: set the number of storage nodes to be downloaded again, or increment it by 1 if it has already been set. In this step, the number τ of storage nodes to be downloaded again is set, where τ is less than k. If the original data X has already been computed more than once without passing verification (that is, the comparison found a mismatch), τ was already set after the first computation and comparison; in that case τ is incremented by 1.

Step S42: select the storage nodes to be downloaded again according to that number. In this step, τ storage nodes (τ being the value after any increments) are selected from among the n storage nodes. It is necessary to ensure that at least one selected storage node differs from the k storage nodes from which the data used to obtain the X value was previously downloaded; for example, storage nodes entirely different from those used in the previous computation may be selected. Once selected, the data modules on these storage nodes are downloaded.

Step S43: replace data used in the last computation with the newly downloaded data. The newly downloaded data modules are taken as a subset and replace the same number of data modules among those used in the last computation, yielding the data for the current computation.

Step S44: select a storage node different from the one used in the previous comparison and download its data module for comparison. In this step, a storage node different from the (k+1)-th storage node previously used for comparison is selected, its data is downloaded, and this data replaces the (k+1)-th data module used for comparison. After this step, the process returns to computing the original data and verifying it by comparison. Thus, as long as the correct original data has not been found, the above process is repeated until the contaminated data is cleared and the correct original data is obtained.

Fig. 5 shows, in pseudo code, the process of data acquisition and recovery in the second embodiment. In Figure 5, the first line downloads $Z^*_{1..k+1}$; the second line uses $Z^*_{k+1}$ as the test equation to run the attack detection algorithm on $Z^*_{1..k}$. If no attack is detected, $Z^*_{1..k}$ is clean and the algorithm ends; otherwise, the iterative process of cleaning $S = Z^*_{1..k}$ starts on lines 5-24 of Figure 5. In each iteration (lines 7-24 of Figure 5), a new equation is downloaded in line 8; the newly downloaded equation is denoted e and, in line 10, is used as the test equation of the attack detection algorithm for the current iteration. The remaining equations downloaded so far that are not in the set S constitute the cleaning set C. Every possible subset C' of C with |C'| = r less than k is taken, and the equations in C' replace r equations in the set S in all possible ways. After each replacement, the attack detection algorithm is run on the resulting set S' using e as the detection equation. If no attack is detected, S' is clean and the algorithm ends.
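The iteration described for Figure 5 can be sketched in Python as follows. This is a hedged reading of the description, not a transcription of the figure: `stream` is a hypothetical iterator over downloaded equations, `detect(equations, test)` is an assumed callback returning `(solution, attacked)`, and the choice to move each used test equation into the cleaning set C is an assumption.

```python
from itertools import combinations

def clean_growing(stream, k, detect):
    """Second-embodiment style cleaning with a cleaning set that grows each round."""
    S = [next(stream) for _ in range(k)]
    e = next(stream)                      # first test equation, Z*_{k+1}
    solution, attacked = detect(S, e)
    if not attacked:
        return solution
    C = []                                # cleaning set: downloaded, not in S, not e
    for new_equation in stream:
        C.append(e)                       # assumption: the old test equation joins C
        e = new_equation                  # fresh test equation for this iteration
        for r in range(1, min(len(C), k - 1) + 1):      # |C'| = r < k
            for subset in combinations(C, r):
                for removed in combinations(range(k), r):
                    trial = [eq for i, eq in enumerate(S) if i not in removed]
                    trial += list(subset)
                    solution, attacked = detect(trial, e)
                    if not attacked:
                        return solution
    return None                           # stream exhausted without a clean set
```

Each iteration costs one extra download, while the number of systems solved grows with the number of subsets tried, which matches the trade-off between communication and computation described in the text.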

Corresponding to the above steps, the structure of the data recovery unit of the apparatus in the second embodiment differs slightly from that in the first embodiment. As shown in Figure 6, the data recovery unit in the second embodiment includes a first subset setting module 51, a second selection module 52, a second data download and replace module 53, and a second comparison data download and replace module 54. The first subset setting module 51 is configured to set the number τ of storage nodes to be selected and downloaded again, where τ is less than k; if the value was already set in a previous cycle, τ is incremented by 1. The second selection module 52 is configured to select τ storage nodes (τ being the value after any increment), where at least one of the selected storage nodes differs from the k storage nodes from which the data used to obtain the X value was previously downloaded. The second data download and replace module 53 is configured to download the data modules of the selected storage nodes, whose number is given by the first subset setting module 51, and to use the downloaded data modules as a subset that replaces the same number of data modules among those used to calculate the X value. The second comparison data download and replace module 54 is configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and replace the (k+1)-th data module used for comparison.

FIG. 7, FIG. 8, and FIG. 9 respectively show a flow chart of the process of clearing the polluted data in the third embodiment of the present invention, the corresponding pseudo code, and the structure of the data recovery unit. As shown in FIG. 7, the third embodiment is substantially the same as the second embodiment; in steps S61-S64, the difference from the second embodiment is that in step S61, τ does not change no matter how many data replacements are performed. Fig. 8 shows, in pseudo code, the process of data acquisition and recovery in the third embodiment. In Fig. 8, lines 1-4 first perform the same steps as in the second embodiment: the equations $Z^*_{1..k+1}$ are downloaded and attack detection is performed. If no attack is detected, data acquisition ends; otherwise, a cleaning process starts on lines 5-26. The size w of the cleaning set C, defined in line 5, is a fixed value given as an input parameter. In line 6, the equations $Z^*_{k+2}$, ... are downloaded; $Z^*_{1..k}$ is used to initialize the set S, and the newly downloaded equations are used to initialize the cleaning set C. These two sets change in each iteration, and index variables indicate their respective first equations in the current iteration; a further index denotes the test equation used for attack detection. Line 7 of Figure 8 initializes these variables and begins the iterative process. In each iteration, a new equation is downloaded on line 9 and used as the test equation e for the attack detection in line 12. Lines 13-15 take all possible subsets C' of C with |C'| = r not exceeding r_max, and lines 16-18 replace r equations in S with the equations in C' in all possible ways. Line 19 runs the attack detection algorithm on the replaced set S' with e as the test equation. Line 20 indicates that if no attack is detected, S' is clean and the algorithm ends. Otherwise, line 25 increments the index values and the next iteration continues.
Note that the sets are composed of equations from a sliding window of size k + w + 1. In the third embodiment, the end conditions are: either cleaning succeeds, or all the equations have been downloaded.

Referring to FIG. 9, for the apparatus in the third embodiment, the data recovery unit includes a second subset setting module 71, a third selection module 72, a third data download and replace module 73, and a third comparison data download and replace module 74. The second subset setting module 71 is configured to set the number τ of storage nodes to be selected and downloaded again, where τ is less than k. The third selection module 72 is configured to select τ storage nodes, where at least one of the selected storage nodes differs from the k storage nodes from which the data used to obtain the X value was previously downloaded. The third data download and replace module 73 is configured to download the data modules of the selected τ storage nodes and to replace any τ data modules among those used to calculate the X value with the newly downloaded τ data modules. The third comparison data download and replace module 74 is configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and replace the (k+1)-th data module used for comparison.

In the second and third embodiments, the replacement of data relies mainly on the concept of cleaning. Let C denote the set of downloaded equations used for cleaning and let e be an additional equation. The equations in C replace a subset of size |C| of S (the set of equations used to calculate the original data), and the resulting set of equations is denoted S'. Attack detection is then performed on the set S', using the equation e as the detection equation; in other words, the system of linear equations (SLE) formed by S' is solved and the solution is checked against equation e. If no attack is detected, the solution obtained is treated as the correct data vector; otherwise we start again from S, replace another subset of size |C| of S with the equations in C, and run the attack detection algorithm again. These steps are repeated until either the cleaning succeeds or all subsets of S of size |C| have been replaced.

If e is correct, C contains only correct equations, and the number of attacked equations in S does not exceed |C|, then the above method eventually succeeds, because eventually all attacked (polluted) equations (data modules) in S are replaced by correct equations from C. In the case of failure, either e has been tampered with, or C contains a contaminated equation, or the number of attacked equations in S is greater than |C|. In this case, we can download another set of equations C' with |C'| ≥ |C|, download another test equation e', and clean S again.

The basic idea of the second embodiment is to start the cleaning with a small cleaning set C (e.g., initially assuming that there is only one attacked equation in the set S), and then to repeatedly increase the size of C if the cleaning fails. In this way, sooner or later we obtain a cleaning set C in which the number of correct equations matches the number of attacked equations in the set S. In each iteration, all possible subsets of the equations in C are tried. Therefore, the correct equations in C are eventually used in place of the attacked equations in S, and a clean set S' is finally obtained.

In the second embodiment, $Z^*_{k+1}$ is first downloaded and used as a test equation to determine whether $Z^*_{1..k}$ is contaminated. If no attack is detected, $Z^*_{1..k}$ is clean, and the loop ends with the data obtained; otherwise, the iterative cleaning process starts. In each iteration, a new equation is downloaded and denoted e; it is used as the test equation of the attack detection algorithm in the current iteration. The remaining downloaded equations that are not in the set S constitute the cleaning set C. Every possible subset C' of C with |C'| = r not greater than k is taken, and the equations in C' replace r equations in the set S in all possible ways. After each replacement, the attack detection algorithm is run on the resulting set S' using e as the detection equation. If no attack is detected, S' is clean.

In the second embodiment, a new equation is downloaded in each iteration, and thus its communication complexity depends on the number of iterations performed by the algorithm.

In terms of computational complexity, since only a few simple operations are required in the second embodiment, the computational complexity will be greatly reduced as long as the number of equations attacked in the system is acceptable.

According to the above analysis, the embodiments described in the present invention are better suited to practical use than schemes based on homomorphic digital signatures. First, there is no need for a PKI, and no key management scheme is needed, because no cryptographic techniques are used. Second, although computational overhead is also required, it falls only on the entities in the distributed storage system that need to retrieve the original information. In wireless sensor networks, where computation is constrained, this entity is often a base station with sufficient computing capacity. In schemes based on homomorphic digital signatures, by contrast, both the source and the storage nodes must perform a large amount of computation, and these nodes are usually resource-limited sensor nodes.

As an improvement on, or compromise with, the second embodiment, the third embodiment changes the fixed-size sets S and C in each iteration instead of increasing the size of C. The sets S and C are formed by a fixed-size window over the equations Z*, so the probability of success of this method is not equal to 1 in every case. If the number of attacked equations contained in S does not exceed r_max, the recovery succeeds, where r_max is an input parameter that limits the computational complexity of the method by limiting the size of the subsets of equations taken from the sets C and S.

In the third embodiment, the same steps as in the second embodiment are performed first: the equations are downloaded and attack detection is performed. If there is no attack, data acquisition ends; otherwise, a cleaning process starts. The size of the cleaning set C is a fixed value given as an input parameter. The equations $Z^*_{k+2}$, ... are downloaded; $Z^*_{1..k}$ is used to initialize the set S, and the newly downloaded equations are used to initialize the cleaning set C. These two sets change in each iteration, and index variables indicate their respective first equations in the current iteration; likewise, a further index denotes the test equation used for attack detection. These variables are initialized and the iterative process begins. In each iteration, a new equation is downloaded and used as the test equation e for the attack detection in line 12 of Figure 8. All possible subsets C' of C with |C'| = r not exceeding r_max are taken, and the equations in C' replace r equations in S in all possible ways; the attack detection algorithm is run on the replaced set S' with e as the test equation. If no attack is detected, S' is clean; otherwise, the index values are incremented and the next iteration continues. Note that the sets consist of equations from a sliding window of size k + w + 1. The end conditions of the algorithm are: either cleaning succeeds, or all equations have been downloaded.

The success probability, average communication cost, and computational complexity of the above embodiments are briefly described below. In the second embodiment, the probability of success is a function of the number of attacked equations t: it exceeds 90% up to a threshold value of t, after which it begins to fall. For r_max = 4, 5 and 6, the thresholds are approximately t = 85, t = 100 and t = 110, respectively. If the r_max value is increased, the method in the second embodiment can recover from a stronger attack (in which the set S contains more attacked equations). However, its computational complexity increases accordingly.

In the first embodiment, the probability of success remains 1 up to the threshold t = n - k - 1; with n = 1000 and k = 100 the threshold is t = 899, which is much larger than the threshold in the second embodiment. Setting this aside, the method in the second embodiment can eliminate four contaminated equations in a set of size k = 100, which suggests that 40 contaminated equations can be tolerated in the entire system of size n = 1000. However, even when the number of contaminated equations reaches 85, the probability of success is still high. The reason is that when t = 85, the average number of attacked equations in a set of size k = 100 is 8.5, which means that the number of attacked equations is relatively small. By sliding a window of size k = 100 across the entire set of equations Z*, the probability of obtaining a set in which the number of attacked equations does not exceed 4 is very large. Similar results are obtained for other r_max values.

For the average communication complexity (i.e., the number of equations downloaded) in the second embodiment, t = 120 can be used as a cut-off point: beyond this value the probability of success is no longer satisfactory, so the average communication complexity for t > 120 is of no interest. The average communication complexity increases with the number of attacked equations t, because the larger t is, the harder it is to find a set S of size k in which the number of attacked equations does not exceed the fixed value r_max while the set C of size w contains at least r_max correct equations. However, since the average number of downloaded equations is less than half of the total number of equations n, the communication complexity is always acceptable. In particular, when the number of attacked equations t is 50, that is, only 5% of the storage nodes are attacked, the communication complexity of the system is very small. We can also conclude that the communication complexity increases as the value of r_max decreases. As the later computational complexity analysis shows, the reduction in communication complexity is obtained at the expense of increased computational complexity.

The computational complexity in the second embodiment (i.e., the number of SLEs that need to be solved) is a function of the number of attacked equations t. The computational complexity increases rapidly as t increases, and also increases as r_max increases. In fact, increasing r_max by 1 results in a significant increase in computational complexity. The best compromise is r_max = 4: the method in the second embodiment can then handle t = 30 attacked equations (i.e., 3% of the total number of equations) with a very low communication complexity, while the computational complexity is also acceptable (solving 10⁸ * 2²⁶ SLEs).

In addition, in some very special cases, the following method may also be employed to deal with the false negative and false positive cases of the attack detection described above. Suppose that for a set S of size k and an additional equation e, the attack detection finds no attack. Then either the solution is the correct data module vector X, or it was forged by the attacker. The decision between the two can be based on the number of remaining equations that the solution satisfies. First, S and e are found according to the methods of the above three embodiments, so that X* is the solution of the SLE formed by the set S, and a counter of satisfied equations is kept. Then an equation e' not in S ∪ {e} is selected; if X* satisfies it, the counter is incremented by 1. This step is repeated until the counter exceeds w/2 or all possible equations have been tried. In the former case X* = X, that is, the correct solution has been found, and the current data acquisition ends; otherwise, e and the equations in the set S that satisfy the solution are discarded, and the method repeats the above process on the remaining set of equations.
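The counting check described above can be sketched in Python. The sketch assumes a prime field GF(257), a data vector with a single row, equations given as hypothetical pairs `(g, y)`, and a caller-supplied `threshold` standing in for the w/2 limit; all names are illustrative.

```python
def satisfies(X, equation, p=257):
    """True when the candidate solution X reproduces the stored value: X . g == y (mod p)."""
    g, y = equation
    return sum(x * c for x, c in zip(X, g)) % p == y % p

def majority_confirms(X, remaining, threshold, p=257):
    """Count remaining equations satisfied by X; accept once the count exceeds threshold."""
    count = 0
    for equation in remaining:
        if satisfies(X, equation, p):
            count += 1
            if count > threshold:
                return True
    return False  # all equations tried without exceeding the threshold
```

A forged solution that was tailored to the k+1 downloaded equations is unlikely to satisfy many of the remaining untampered equations, so it fails this count while the correct solution passes.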

So far, a solution to the pollution attack problem in distributed storage based on network coding has been described, and a cleaning scheme has been proposed to detect and recover from pollution attacks. A notable feature of this approach is that it is not based on checksums or digital signatures, which are commonly used in cryptography to provide data integrity services, but instead exploits the redundancy inherent in distributed storage systems.

The number of coding modules required in the above embodiments is larger than the number required to obtain the original data; the additional coding modules are used for attack detection and recovery. Attack detection and recovery only require solving systems of linear equations over a finite field. Since no cryptographic algorithm is used, the present invention does not need to rely on a PKI or on a secure channel established in advance.

The above methods have significant advantages in both communication and computational load. For the above three embodiments: the first embodiment provides the lowest possible average computational complexity in the system; the second embodiment is optimal in terms of communication complexity, ensures recovery under a strong attack (in which most coding modules are tampered with), and remains a practical solution in most systems; the third embodiment is a compromise for very large systems: it is not the most effective in terms of resilience (success probability) and communication load, but its computational load is acceptable for very large systems.

The above embodiments can be applied to any distributed storage system based on network coding, such as P2P distributed file storage or a wireless sensor network. In addition, the above method requires neither additional coding on the storage nodes nor additional information in the coding modules; only the receiving node needs to perform a certain amount of computation. For this reason, the present invention is particularly suitable for wireless sensor networks in which the storage nodes are sensor nodes with limited resources and the receiving node is a relatively powerful base station.

It is worth mentioning that the embodiments of the present invention differ mainly in the method used to replace the contaminated data, and that in some cases an actual operation may combine the replacement methods of the different embodiments. For example, the method of the second embodiment may be used first to replace the contaminated data, and when a certain condition is met, for example when a set time has elapsed and the correct data has still not been found, the replacement method may be switched to the method described in the first embodiment or the third embodiment. In summary, the technical features of the above embodiments can be reasonably combined with each other to form new embodiments.
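
The switching between replacement methods after a set time can be sketched in Python as follows. This is only a sketch under assumptions: the replacement methods are modelled as iterables of candidate node selections, the check against the (k+1)th data module is hidden behind a caller-supplied try_decode function, and the names recover_with_fallback, primary, fallback, time_budget and clock are hypothetical.

```python
import time


def recover_with_fallback(primary, fallback, try_decode, time_budget,
                          clock=time.monotonic):
    """Try candidates from `primary`; after `time_budget` seconds without
    success, switch to candidates from `fallback`.

    primary, fallback: iterables of candidate node selections (assumed form).
    try_decode: returns the verified original data, or None if the
                comparison with the (k+1)th data module fails.
    """
    start = clock()
    for candidate in primary:
        if clock() - start >= time_budget:
            break  # set time consumed: switch the replacement method
        data = try_decode(candidate)
        if data is not None:
            return data
    # Either the time budget ran out or the primary method was exhausted.
    for candidate in fallback:
        data = try_decode(candidate)
        if data is not None:
            return data
    return None
```

The clock parameter is injectable only so that the switching behaviour can be exercised deterministically; in ordinary use the default monotonic clock would be kept.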

The above-mentioned embodiments merely illustrate several embodiments of the present invention. Although their description is relatively specific and detailed, it is not to be construed as limiting the scope of the invention. It should be noted that a number of variations and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention. Therefore, the scope of the invention should be determined by the appended claims.

Claims


1. A method for recovering pollution of distributed storage data, comprising the steps of:

B) calculating on the downloaded k data according to Y_j = XG_j to obtain the downloaded original data X, and selecting the (k+1)th storage node from the n storage nodes and downloading its stored data; or calculating on the downloaded k data according to Y_j = XG_j to obtain the downloaded original data X; wherein j = 1, 2, ..., n, and k is less than n;

C) comparing the data module Y_{k+1} downloaded from the (k+1)th storage node with the product of the original data X obtained in the above step and the matrix G_{k+1}; if they are the same, exiting the current loop and obtaining the data; otherwise, performing step D);

D) downloading the data of at least one storage node again, at least one of the re-downloaded data modules being different from the previously downloaded k+1 data, replacing at least one of the previously downloaded data modules with the re-downloaded data module, and returning to step B).

2. The method for recovering pollution of distributed storage data according to claim 1, wherein step D) further comprises:

D1) selecting k+1 storage nodes, at least one of the selected storage nodes being different from the k+1 storage nodes previously selected for downloading data;

D31) setting the number τ of storage nodes to be selected for downloading again; if τ has already been set, incrementing τ by 1; wherein τ is less than k;

4. The method for recovering pollution of distributed storage data according to claim 3, wherein step D) further comprises the following step:

D61) selecting a storage node different from the (k+1)th storage node previously used for comparison and downloading its data, to replace the (k+1)th data module used for comparison.

6. The method for recovering pollution of distributed storage data according to claim 5, wherein step D) further comprises the following step:

D62) selecting a storage node different from the (k+1)th storage node previously used for comparison and downloading its data, to replace the (k+1)th data module used for comparison.

7. The method for recovering pollution of distributed storage data according to any one of claims 1-6, further comprising the step of:

A0) dividing the original data X into k parts X_i, i = 1, 2, ..., k, and encoding them with a maximum distance separable (MDS) code to obtain n linearly independent data modules Y_j, j = 1, 2, ..., n; the n data modules Y_j and the generator matrices are respectively stored on the n storage nodes.

8. A device for implementing the method for recovering pollution of distributed storage data according to claim 1, comprising:

a data download unit: configured to download the stored data from any k of the n storage nodes, or to download the stored data from any k+1 of the n storage nodes;

a data obtaining unit: configured to calculate on the downloaded k data according to Y_j = XG_j to obtain the downloaded original data X, and to select the (k+1)th storage node from the n storage nodes and download its stored data; or to calculate on the downloaded k data according to Y_j = XG_j to obtain the downloaded original data X;

a data comparison unit: configured to compare the data module Y_{k+1} downloaded from the (k+1)th storage node with the product of the original data X obtained above and the matrix G_{k+1};

a data recovery unit: configured to download the data of at least one storage node again according to the output of the data comparison unit, at least one of the re-downloaded data modules being different from the previously downloaded k+1 data, and to replace at least one of the previously downloaded data modules with the re-downloaded data module.

9. The device according to claim 8, wherein the data recovery unit further comprises: a first selection module: configured to select k+1 storage nodes, at least one of the selected storage nodes being different from the k+1 storage nodes previously selected for downloading data;

The device according to claim 8, wherein the data recovery unit further comprises: a first subset setting module: configured to set the number τ of storage nodes to be selected for downloading again, and, if τ has already been set, to increment τ by 1; wherein τ is less than k;

a second comparison data downloading and replacing module: configured to select a storage node different from the (k+1)th storage node previously used for comparison and download its data, to replace the (k+1)th data module used for comparison.

The device according to claim 8, wherein the data recovery unit further comprises: a second subset setting module: configured to set the number τ of storage nodes to be selected for downloading again, wherein τ is less than k;

a third comparison data downloading and replacing module: configured to select a storage node different from the (k+1)th storage node previously used for comparison and download its data, to replace the (k+1)th data module used for comparison.