WO2013131253A1 - Pollution data recovery method and apparatus for distributed storage data - Google Patents

Pollution data recovery method and apparatus for distributed storage data

Info

Publication number
WO2013131253A1
WO2013131253A1 PCT/CN2012/072007 CN2012072007W WO2013131253A1 WO 2013131253 A1 WO2013131253 A1 WO 2013131253A1 CN 2012072007 W CN2012072007 W CN 2012072007W WO 2013131253 A1 WO2013131253 A1 WO 2013131253A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
downloaded
module
storage nodes
storage
Prior art date
Application number
PCT/CN2012/072007
Other languages
French (fr)
Chinese (zh)
Inventor
李挥
黄显霞
冯俊秋
叶顺鸿
陈畅民
侯韩旭
朱兵
Original Assignee
北京大学深圳研究生院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学深圳研究生院 filed Critical 北京大学深圳研究生院
Priority to PCT/CN2012/072007 priority Critical patent/WO2013131253A1/en
Publication of WO2013131253A1 publication Critical patent/WO2013131253A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1443Transmit or communication errors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/14Multichannel or multilink protocols

Definitions

  • The present invention relates to the field of distributed data storage, and more particularly to a pollution recovery method and apparatus for distributed storage data.
  • Network coding is an information transmission technology that combines coding and routing. Building on the traditional store-and-forward routing method, it allows information from multiple received data packets to be fused, increasing the amount of information carried in a single transmission and thereby improving the overall performance of the network.
  • A malicious node can intentionally tamper with or falsify a message. If a downstream node receives a contaminated message without knowing that it is contaminated and encodes it together with other uncontaminated messages, the contamination spreads quickly downstream of the malicious node and may even reach the entire network. Contaminated messages must therefore be filtered out as early as possible during transmission.
  • A distributed storage system based on network coding divides the original data into several parts and stores them on different nodes. Each coded block is computed as a linear combination of multiple modules, following the idea of linear network coding. To obtain the original data, enough coded blocks must be obtained at the same time; this approach has significant applications in distributed storage.
  • Storage nodes have limited communication, computing, and storage capabilities. The purpose of storing coded modules rather than raw data is to improve system effectiveness. For example, consider an (n, k) MDS code: n storage nodes are used to store file modules, the original file is divided into k shares, and these are encoded into n shares stored on the n different nodes. Each node stores a linear combination of the original data blocks.
  • With appropriately chosen parameters, this random linear coding technique allows a receiver to recover the original data with high probability by downloading any k modules from these nodes.
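  • The (n, k) coding idea above can be sketched as follows. This is a minimal illustration, not the coding used in the patent: it assumes symbols in a small prime field GF(257) and swaps the random coefficients for a deterministic Vandermonde generator matrix, so that any k shares are guaranteed to decode. The names `solve_mod`, `encode`, and `decode` are hypothetical.

```python
from itertools import combinations

P = 257  # small prime; all symbols live in GF(P) (assumed for illustration)

def solve_mod(rows, rhs, p=P):
    """Solve the k x k linear system rows * x = rhs over GF(p)."""
    k = len(rows)
    aug = [[v % p for v in r] + [b % p] for r, b in zip(rows, rhs)]
    for col in range(k):
        piv = next(i for i in range(col, k) if aug[i][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)          # modular inverse (Python 3.8+)
        aug[col] = [v * inv % p for v in aug[col]]
        for i in range(k):
            if i != col and aug[i][col]:
                f = aug[i][col]
                aug[i] = [(a - f * b) % p for a, b in zip(aug[i], aug[col])]
    return [row[k] for row in aug]

def encode(x, n, p=P):
    """Return n shares (G_j, Y_j), where G_j = [1, a, a^2, ...] with a = j + 1."""
    k = len(x)
    shares = []
    for j in range(n):
        g = [pow(j + 1, i, p) for i in range(k)]
        shares.append((g, sum(a * b for a, b in zip(x, g)) % p))
    return shares

def decode(shares, p=P):
    """Recover the k original symbols from any k shares."""
    return solve_mod([g for g, _ in shares], [y for _, y in shares], p)

original = [10, 20, 30]           # k = 3 original blocks
shares = encode(original, n=5)    # n = 5 storage nodes
recovered = [decode(list(c)) for c in combinations(shares, 3)]
```

  • Every 3-subset of the 5 shares yields the same original blocks, which is the property the text relies on.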
  • Network coding can improve the effectiveness of distributed storage systems.
  • In a hostile environment, however, an attacker may attack a storage node. We call this potential problem a pollution attack.
  • The attacker changes some of the stored encoding modules so that decoding goes wrong when the original file is restored and the correct file cannot be obtained. Since these encoding schemes linearly combine the raw data, a single corrupted encoding module will affect the decoding of the entire file. The actual impact of pollution attacks is huge and unpredictable.
  • The detection of pollution attacks in distributed storage based on network coding introduces a cryptographic technique: the hash function.
  • Each data module is typically hashed, and a trust center is used to hold the hash values, from which they can be obtained. By comparing the hash value held by the trust center with the hash value of each downloaded data module, a node can determine whether the downloaded module is legitimate.
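  • The hash comparison described above can be illustrated with a short sketch. It assumes SHA-256 as the hash function and models the trust center as a plain dictionary; both choices, and the names `digest` and `is_legitimate`, are hypothetical and not taken from the patent.

```python
import hashlib

def digest(module: bytes) -> str:
    """Hash of one stored data module (SHA-256 is an assumed choice)."""
    return hashlib.sha256(module).hexdigest()

# Hash values published when the modules were created (the "trust center").
modules = [b"module-0", b"module-1", b"module-2"]
trusted_hashes = {j: digest(m) for j, m in enumerate(modules)}

def is_legitimate(node_id: int, downloaded: bytes, trusted: dict) -> bool:
    """Compare the downloaded module's hash with the trusted value."""
    return digest(downloaded) == trusted.get(node_id)

ok = is_legitimate(1, b"module-1", trusted_hashes)
tampered = is_legitimate(1, b"module-X", trusted_hashes)
```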
  • To this end, a homomorphic hash function is introduced.
  • The hash value of each module of a given file can be obtained in a secure manner.
  • These hash values can be used to verify the integrity of the encoding modules downloaded by a node.
  • The literature [C. Gkantsidis and P. Rodriguez, "Cooperative Security for Network Coding File Distribution," Proc. IEEE INFOCOM, 2006] requires that, when maliciously tampered modules are detected, nodes cooperate and notify each other. In this scheme, a given node cannot authenticate each module by itself and can only rely on information sent by other nodes for verification.
  • each scheme that uses a hash function needs to establish a secure channel between the source and the sink in order to get the true hash of the original data module.
  • Another solution to prevent pollution attacks requires digital signatures of data modules before they are added to the system.
  • the intermediate node combines the data modules received from different sinks, and the data signature scheme must also have a homomorphic attribute, similar to the case of the homomorphic hash function described above.
  • a homomorphic signature scheme has recently been proposed in distributed storage based on network coding. Unlike the homomorphic hash function scheme, the homomorphic signature scheme does not require a secure channel between the source and the sink.
  • the literature [Z.Yu, T.Wei, B.Ramkumar, and Y.Guan, "An efficient signature-based scheme for securing network coding against pollution attacks",
  • Step 1: verify the validity of the signature based on the known information. If the signature is invalid, the algorithm ends and the new node needs to re-download the signature of the file.
  • the hash value of the linear combination of the input modules is equal to a combination of the hash values of the modules, so the signature scheme is correct.
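  • The homomorphic property stated above can be checked numerically. The sketch below assumes the common exponentiation-based construction h(x) = g^x mod p with toy parameters (p = 23, q = 11, g = 2, where g has order q modulo p); these parameters and the names `h` and `combine` are illustrative assumptions, not the scheme of the cited literature.

```python
P_MOD, Q, G = 23, 11, 2   # toy parameters: G has order Q modulo P_MOD

def h(x: int) -> int:
    """Homomorphic hash of a block value x taken modulo Q."""
    return pow(G, x % Q, P_MOD)

def combine(hashes, coeffs) -> int:
    """Combine block hashes with the same coefficients used for coding."""
    out = 1
    for hv, c in zip(hashes, coeffs):
        out = out * pow(hv, c, P_MOD) % P_MOD
    return out

blocks = [3, 7]    # original blocks
coeffs = [4, 5]    # linear coding coefficients
coded = sum(c * b for c, b in zip(coeffs, blocks)) % Q

lhs = h(coded)                                  # hash of the linear combination
rhs = combine([h(b) for b in blocks], coeffs)   # combination of the hashes
```

  • Here `lhs` and `rhs` agree, so a node that knows only the hashes of the original blocks can verify a coded block without decoding it.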
  • First, the homomorphic signature scheme is computationally expensive; second, it requires a public key infrastructure (PKI) to manage the signature authentication keys.
  • The technical problem to be solved by the present invention is that the above methods of the prior art are complicated, have a large computational cost, and cannot guarantee the reliability of the distributed network storage system. To address this, the invention provides a pollution recovery method and apparatus for distributed storage data that is simple, has a small computational overhead, and ensures the reliability of a distributed network storage system.
  • the technical solution adopted by the present invention to solve the technical problem is: constructing a pollution recovery method for distributed storage data, comprising the following steps:
  • C) comparing the data module Y_{k+1} downloaded from the (k+1)-th storage node with the product of the original data X obtained in the above step and the matrix G_{k+1}; if they are the same, exiting the current loop with the data obtained; otherwise, performing step D);
  • D) downloading data from at least one storage node again, at least one of the re-downloaded data modules being different from the previously downloaded k+1 data modules, replacing at least one of the previously downloaded data modules with the re-downloaded data modules, and returning to step B).
  • The step D) further includes: D1) selecting k+1 storage nodes, at least one of which is different from the k+1 storage nodes from which data was previously selected and downloaded;
  • The step D) further includes: D31) setting the number of storage nodes to be selected and downloaded again or, if that number has already been set, incrementing it by 1, wherein the number is less than k;
  • The step D) further includes the following step: D61) selecting a storage node different from the (k+1)-th storage node previously used for comparison, downloading its data, and using it to replace the (k+1)-th data module used for comparison.
  • The step D) further includes:
  • The step D) further includes the following step: D62) selecting a storage node different from the (k+1)-th storage node previously used for comparison, downloading its data, and using it to replace the (k+1)-th data module used for comparison.
  • A0) dividing the original data X into k parts and applying an (n, k) encoding to them to obtain n linearly independent data modules Y_j; the n data modules Y_j and the corresponding generator matrices are stored on the n storage nodes, respectively.
  • the invention also relates to an apparatus for implementing the above method, comprising:
  • a data download unit configured to download the data stored by any k of the n storage nodes, or to download the data stored by any k+1 of the n storage nodes;
  • a data recovery unit configured to download data from at least one storage node again according to the output of the data comparison unit, where at least one of the re-downloaded data modules is different from the previously downloaded k+1 data modules, and to replace at least one of the previously downloaded data modules with the re-downloaded data module.
  • the data recovery unit further includes:
  • a first selection module configured to select k+1 storage nodes, wherein at least one of the selected storage nodes is different from k+1 storage nodes that have previously selected and downloaded data;
  • the first data downloading and replacing module is configured to download the data module of the selected k+1 storage nodes, and replace the last downloaded data module with the k+1 data modules downloaded again.
  • the data recovery unit further includes:
  • a first subset setting module configured to set the number of storage nodes to be selected and downloaded again and, if that number has already been set, to increment it by 1, where the number is less than k;
  • a second selection module configured to select that number of storage nodes (the set number, or the set number plus 1 after an increment), wherein at least one of the selected storage nodes is different from the k storage nodes from which data was previously selected and downloaded for obtaining the X value;
  • a second data downloading and replacing module configured to download the data modules from the selected storage nodes and to use the newly downloaded data modules to replace the same number of data modules among those used to calculate the X value;
  • a second comparison data downloading and replacing module configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and replace the (k+1)-th data module used for comparison.
  • the data recovery unit further includes:
  • a second subset setting module configured to set the number of storage nodes to be selected and downloaded again, where the number is less than k;
  • a third selection module configured to select that number of storage nodes, wherein at least one of the selected storage nodes is different from the k storage nodes from which data was previously selected and downloaded for obtaining the X value;
  • a third data downloading and replacing module configured to download the data modules from the selected storage nodes and to use the newly downloaded data modules to replace the same number of data modules among those used to calculate the X value;
  • a third comparison data downloading and replacing module configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and replace the (k+1)-th data module used for comparison.
  • The method and device for pollution recovery of distributed storage data embodying the present invention have the following beneficial effects: since the homomorphic signature and the public key of the prior art are not required, the correct original data can be obtained with an acceptable amount of data download and computation. The method is therefore relatively simple, its computational overhead is small, and the reliability of the distributed network storage system can be guaranteed.
  • FIG. 1 is a flowchart of the method in a first embodiment of the method and apparatus for recovering pollution of distributed storage data according to the present invention;
  • FIG. 2 is a flowchart showing the steps of eliminating pollution data in the first embodiment;
  • FIG. 3 is a schematic structural view of the device in the first embodiment;
  • FIG. 4 is a flowchart of the method for eliminating pollution data in a second embodiment of the method and apparatus for recovering pollution of distributed storage data according to the present invention;
  • FIG. 5 is pseudo code showing the data acquisition and recovery process in the second embodiment;
  • FIG. 6 is a schematic structural view of the device in the second embodiment;
  • FIG. 7 is a flowchart of the method for eliminating pollution data in a third embodiment of the method and apparatus for recovering pollution of distributed storage data according to the present invention;
  • FIG. 8 is pseudo code showing the data acquisition and recovery process in the third embodiment;
  • FIG. 9 is a schematic structural view of the apparatus in the third embodiment.
  • Step S11 divides the original data into k shares, and stores them in n storage nodes after encoding:
  • the data needs to be distributed and stored in the storage node of the network according to the provisions of distributed (network) storage;
  • Distributed (network) storage is to distribute data on multiple independent devices.
  • Traditional network storage systems use centralized storage servers to store all data. The storage server becomes the bottleneck of system performance and is also the critical point for reliability and security.
  • A distributed storage system includes k source nodes, n storage nodes, and at least one receiving node; in practice, the sets of source nodes, storage nodes, and receiving nodes may overlap. The energy of the devices at the source nodes and storage nodes is relatively low, and the energy of the device at the receiving node is relatively high.
  • The data is encoded into n different encoding modules, numbered 1, 2, ..., n.
  • The n encoding modules are linearly independent, and any k of them can reconstruct the file M, where k is less than n.
  • this step performs the storage of data, which is the basis of the first embodiment.
  • this step does not necessarily exist in terms of data pollution removal methods. It is merely explained for the basis of use of the method described in the first embodiment and a complete understanding of the technical solution.
  • Step S14 multiplies the obtained original data X by the matrix of the (k+1)-th storage node:
  • The matrix G_{k+1} obtained from the (k+1)-th storage node is multiplied by the original data X obtained in the above step.
  • In doing so, it is assumed that the original data X obtained in the above steps is correct, that is, that the data modules on the k storage nodes used in the above steps have not been contaminated (or falsified).
  • Step S15 determines whether the data module calculated in step S14 is the same as the downloaded (k+1)-th data module. If yes, go to step S16; otherwise, go to step S17. In this step, if the k storage nodes used to recover the original data in the above step are not contaminated, and the (k+1)-th storage node is also not contaminated, the downloaded (k+1)-th data module should be equal to the data module calculated in step S14. If the two are equal, it is determined that the original data X obtained in step S13 is correct; otherwise, at least one of the above k+1 data modules has been contaminated or tampered with: either the original data obtained in step S13 is incorrect, or the (k+1)-th storage node used for comparison is contaminated.
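  • The check in steps S13 to S15 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: symbols live in the prime field GF(257), each node j stores a Vandermonde column G_j together with Y_j = X G_j, and the attacker alters only the stored Y_j. The names `solve_mod`, `make_nodes`, and `detect` are hypothetical.

```python
P = 257  # small prime field GF(P), assumed for illustration

def solve_mod(rows, rhs, p=P):
    """Solve the k x k linear system rows * x = rhs over GF(p)."""
    k = len(rows)
    aug = [[v % p for v in r] + [b % p] for r, b in zip(rows, rhs)]
    for col in range(k):
        piv = next(i for i in range(col, k) if aug[i][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for i in range(k):
            if i != col and aug[i][col]:
                f = aug[i][col]
                aug[i] = [(a - f * b) % p for a, b in zip(aug[i], aug[col])]
    return [row[k] for row in aug]

def make_nodes(x, n, p=P):
    """Node j stores (G_j, Y_j) with Y_j = X . G_j."""
    k = len(x)
    nodes = []
    for j in range(n):
        g = [pow(j + 1, i, p) for i in range(k)]
        nodes.append((g, sum(a * b for a, b in zip(x, g)) % p))
    return nodes

def detect(nodes, decode_ids, check_id, p=P):
    """Solve for X from k modules (S13), then test the (k+1)-th (S14, S15)."""
    x = solve_mod([nodes[j][0] for j in decode_ids],
                  [nodes[j][1] for j in decode_ids], p)
    g, y = nodes[check_id]
    return x, sum(a * b for a, b in zip(x, g)) % p == y

original = [5, 7, 11]
nodes = make_nodes(original, n=6)
x_clean, ok_clean = detect(nodes, [0, 1, 2], 3)

polluted = list(nodes)
g1, y1 = nodes[1]
polluted[1] = (g1, (y1 + 1) % P)      # attacker alters Y_1 only
x_bad, ok_bad = detect(polluted, [0, 1, 2], 3)
```

  • With clean nodes the comparison succeeds and X is returned; with the altered module the recomputed value no longer matches the downloaded (k+1)-th module, so the pollution is detected.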
  • Step S16: the downloaded data is not contaminated; exit. In this step, since the original data has been judged to be correct, no further processing is required, so the current loop is exited with the data obtained.
  • Step S17: download at least one data module different from the previously downloaded data modules, and replace one or more of the downloaded k+1 data modules with it.
  • In this case, at least one of the above k+1 data modules has been contaminated or tampered with: either the original data obtained in step S13 is incorrect, or the (k+1)-th storage node used for comparison is contaminated. To this end, new data modules that have not yet been downloaded need to be downloaded and used to replace the existing data modules; the process then returns to step S13 and recalculates using the newly downloaded data, so as to obtain the correct original data X or a correct (k+1)-th data module for verification.
  • In this step, at least one data module is downloaded, and at least one of the newly downloaded data modules has not been downloaded before. This step will be described in detail later.
  • In the first embodiment, k+1 data modules are downloaded at a time; k of them are used to restore the original data, and one is used to verify whether the restored original data is correct.
  • Alternatively, k data modules may be downloaded first and used to recover the original data; after the original data is restored, one more data module is downloaded from a storage node different from those k, and is used to verify whether the obtained original data is correct.
  • the effect of these two methods is the same.
  • the relevant steps in the first embodiment also require minor adjustments to suit the way they are downloaded.
  • Step S17 further includes the following steps. Step S21: select k+1 storage nodes, at least one of which is different from the storage nodes downloaded last time. In this step, k+1 storage nodes are first reselected. Since the storage nodes downloaded last time are known, it is easy to ensure that the selected k+1 storage nodes are not identical to them: it suffices to remove one or more entries from the list of storage nodes that were downloaded last time (or have already been downloaded) and add the same number of storage nodes that have not been downloaded.
  • Step S22: download the data on the selected k+1 storage nodes and use it to replace the data downloaded last time. In this step, the data modules on the storage nodes selected in the previous step are downloaded, and the data modules downloaded this time replace the data modules used for the last calculation or verification. After this step is performed, the process jumps to step S13.
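  • The reselection loop of steps S21 and S22 might look like the sketch below. It is a simplified reading, not the patent's procedure: candidate selections are simply enumerated as (k+1)-combinations of node indices (so each selection differs from the previous one in at least one node), the field GF(257) and the Vandermonde coding are assumed, and all function names are hypothetical.

```python
from itertools import combinations

P = 257  # small prime field GF(P), assumed for illustration

def solve_mod(rows, rhs, p=P):
    """Solve the k x k linear system rows * x = rhs over GF(p)."""
    k = len(rows)
    aug = [[v % p for v in r] + [b % p] for r, b in zip(rows, rhs)]
    for col in range(k):
        piv = next(i for i in range(col, k) if aug[i][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for i in range(k):
            if i != col and aug[i][col]:
                f = aug[i][col]
                aug[i] = [(a - f * b) % p for a, b in zip(aug[i], aug[col])]
    return [row[k] for row in aug]

def make_nodes(x, n, p=P):
    """Node j stores (G_j, Y_j) with Y_j = X . G_j."""
    k = len(x)
    return [([pow(j + 1, i, p) for i in range(k)],
             sum(a * pow(j + 1, i, p) for i, a in enumerate(x)) % p)
            for j in range(n)]

def recover_by_reselection(nodes, k, p=P):
    """Try successive selections of k+1 nodes until the check passes."""
    for picked in combinations(range(len(nodes)), k + 1):
        *decode_ids, check_id = picked            # k for decoding, 1 for checking
        x = solve_mod([nodes[j][0] for j in decode_ids],
                      [nodes[j][1] for j in decode_ids], p)
        g, y = nodes[check_id]
        if sum(a * b for a, b in zip(x, g)) % p == y:
            return x, picked
    return None, None

original = [5, 7, 11]
nodes = make_nodes(original, n=6)
g0, y0 = nodes[0]
nodes[0] = (g0, (y0 + 3) % P)                     # node 0 is polluted
x, used = recover_by_reselection(nodes, k=3)
```

  • In this run every selection containing the polluted node fails the check, and the first selection that avoids it returns the original data.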
  • When the original data is obtained by downloading k data modules first and then downloading one more data module to verify it, steps S21 and S22 likewise first download k data modules that have not been downloaded before, compute the original data X from them, and then download one previously undownloaded data module to verify the obtained original data X.
  • Each coding module can also be represented as a column vector containing m symbols over a finite (Galois) field.
  • The effect of a data pollution attack on the decoding of the stored data can be analyzed. It is assumed that the attacker can access t storage nodes and can observe and modify the equations (data) stored by those nodes; that is, if the attacker can access storage node j, the attacker can change the data stored by node j.
  • The nodes whose stored data has been tampered with are defined as compromised nodes; compromised nodes and ordinary storage nodes cannot be distinguished.
  • the source node cannot be attacked, but only the output result of the storage system is changed. This is because the storage node is exposed to the attack for a long time, and the source node can only be attacked within a limited period of time during which the data is generated, and the probability of being attacked is greatly reduced.
  • the attacker randomly tampers with t storage nodes, and the receiving node randomly selects k storage nodes to download k linear equations.
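  • Under the stated model (t of the n storage nodes tampered with at random, k nodes selected at random), the chance that a download avoids every tampered node follows from a standard counting argument. The formula C(n-t, k) / C(n, k) is not given in the source; it is added here as an illustration, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_all_clean(n: int, t: int, k: int) -> Fraction:
    """P(all k uniformly chosen nodes are untampered) when t of n are tampered."""
    return Fraction(comb(n - t, k), comb(n, k))

p_clean = prob_all_clean(n=10, t=2, k=3)
p_polluted = 1 - p_clean      # at least one downloaded equation is tampered
```

  • For n = 10, t = 2, k = 3 this gives 7/15, so more than half of the random selections would contain at least one tampered equation, which is why a detection and cleanup step is needed.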
  • the communication complexity required for this algorithm is k+1 and the computational complexity is 1. According to the prior knowledge that any error correction code has a Hamming distance of at least 2, it is concluded that any attack detection algorithm in the described system must download at least k+1 equations. Therefore, the proposed attack detection algorithm is optimal in terms of communication complexity.
  • a false negative decision (error not detected) may occur, mainly in two cases: f is a random value; or a specific f value forged by the attacker.
  • T can be considered a random value.
  • CT G.
  • the invention also relates to an apparatus for implementing the above-described pollution recovery method for distributed storage data.
  • the apparatus includes a data distribution and storage unit 31, a data download unit 32, a data obtaining unit 33, a data comparing unit 34, and a data restoring unit 35.
  • The data downloading unit 32 is configured to download the data stored by any k of the n storage nodes, or to download the data stored by any k+1 of the n storage nodes. The data obtaining unit 33, in one case (when k data modules are downloaded first), computes the original data X from the k downloaded data modules according to Y_j = X G_j and then selects a (k+1)-th storage node among the n storage nodes; in the other case, it computes the original data X from k of the downloaded data modules according to Y_j = X G_j.
  • The data comparison unit 34 is used to compare the data module Y_{k+1} downloaded from the (k+1)-th storage node with the product of the original data X obtained in the above step and the matrix G_{k+1}. If they are the same, it is determined that the downloaded k+1 data modules are not contaminated; otherwise, it is determined that at least one data module is contaminated and needs to be cleared. The data recovery unit 35 is configured, according to the output (or judgment) of the data comparison unit, to download the data of at least one storage node again, where at least one of the re-downloaded data modules is different from the k+1 data modules downloaded last time, and to replace at least one of the previously downloaded data modules with the re-downloaded data module.
  • The data allocation and storage unit 31 is not actually essential; it is listed here for ease of understanding.
  • The data recovery unit 35 further includes a first selection module 351 and a first data download and replace module 352. The first selection module 351 is configured to select k+1 storage nodes, where at least one of the selected storage nodes is different from the k+1 storage nodes from which data was previously selected and downloaded. The first data download and replace module 352 is configured to download the data modules of the selected k+1 storage nodes and to replace the data modules downloaded last time with the k+1 newly downloaded data modules.
  • In the second embodiment, the step of downloading at least one data module different from the previously downloaded data modules and using it to replace one or more of the downloaded k+1 data modules specifically includes:
  • Step S41 sets the number of storage nodes to be downloaded again; if it has already been set, it is incremented by 1:
  • In this step, the number of storage nodes to be downloaded again is set, and this number is less than k. If the original data X has failed verification several times (that is, the comparison found the values unequal), the number of storage nodes was already set after the first calculation and comparison; in that case, the number is incremented by one.
  • Step S42 selects the storage nodes to be downloaded again according to the number of storage nodes. In this step, the set number of storage nodes (or the number after one or more increments) is selected; the selected storage nodes may be any of the above n storage nodes, but at least one of them must be different from the k storage nodes from which data was previously selected and downloaded for obtaining the X value. For example, storage nodes completely different from the data sources of the previous calculation may be selected. Once selected, the data modules on these storage nodes are downloaded.
  • Step S43 replaces data used in the last calculation with the newly downloaded data:
  • The newly downloaded data modules are used as a subset to replace the same number of data modules, yielding the data used for the current calculation.
  • Step S44 selects a storage node different from the one used in the previous comparison and downloads its data module for comparison. In this step, a storage node different from the (k+1)-th storage node previously used for comparison is selected, its data is downloaded, and that data replaces the (k+1)-th data module used for comparison. After this step, the process jumps back to the calculation that obtains the original data and to the comparison that verifies it. Thus, as long as the correct original data has not been found, the above process is repeated until the contaminated data is cleared and the correct original data is obtained.
  • Fig. 5 shows the process of realizing data acquisition and recovery in the second embodiment by using pseudo code.
  • A new equation is downloaded in line 8; this newly downloaded equation is denoted e and is used in line 10 as the test equation for executing the attack detection algorithm in the current iteration.
  • The remaining equations downloaded so far that are not contained in the set S constitute a cleanup set C. Take every possible subset C' of C such that |C'| = r is less than k, and use the equations in C' to replace r equations in the set S in all possible ways.
  • After each replacement, the attack detection algorithm is executed on the resulting set using e as the detection equation. If no attack is detected, the set is clean and the algorithm ends.
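  • One possible reading of the cleanup procedure of the second embodiment is sketched below. Fig. 5 itself is not reproduced in the text, so the loop structure is an assumption: each iteration downloads one new test equation e, moves the previous test equation into the cleanup set C, and tries replacing r equations of S by subsets of C for increasing r < k. The field GF(257), the Vandermonde coding, and all names are hypothetical.

```python
from itertools import combinations

P = 257  # small prime field GF(P), assumed for illustration

def solve_mod(rows, rhs, p=P):
    """Solve the k x k linear system rows * x = rhs over GF(p)."""
    k = len(rows)
    aug = [[v % p for v in r] + [b % p] for r, b in zip(rows, rhs)]
    for col in range(k):
        piv = next(i for i in range(col, k) if aug[i][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for i in range(k):
            if i != col and aug[i][col]:
                f = aug[i][col]
                aug[i] = [(a - f * b) % p for a, b in zip(aug[i], aug[col])]
    return [row[k] for row in aug]

def make_nodes(x, n, p=P):
    """Node j stores equation (G_j, Y_j) with Y_j = X . G_j."""
    return [([pow(j + 1, i, p) for i in range(len(x))],
             sum(a * pow(j + 1, i, p) for i, a in enumerate(x)) % p)
            for j in range(n)]

def detect(nodes, ids, e_id, p=P):
    """Solve for X from equations `ids`, then test it against equation e."""
    x = solve_mod([nodes[j][0] for j in ids], [nodes[j][1] for j in ids], p)
    g, y = nodes[e_id]
    return x, sum(a * b for a, b in zip(x, g)) % p == y

def cleanup_growing(nodes, k, p=P):
    s_ids, e_id, c_ids = list(range(k)), k, []
    x, ok = detect(nodes, s_ids, e_id, p)        # initial attack detection
    if ok:
        return x
    for new_id in range(k + 1, len(nodes)):      # download one new equation
        c_ids.append(e_id)                       # old test equation joins C
        e_id = new_id
        for r in range(1, min(len(c_ids), k - 1) + 1):
            for c_sub in combinations(c_ids, r):
                for s_out in combinations(s_ids, r):
                    trial = [j for j in s_ids if j not in s_out] + list(c_sub)
                    x, ok = detect(nodes, trial, e_id, p)
                    if ok:
                        return x
    return None                                  # all equations used up

original = [5, 7, 11]
nodes = make_nodes(original, n=6)
g1, y1 = nodes[1]
nodes[1] = (g1, (y1 + 9) % P)                    # equation 1 is polluted
result = cleanup_growing(nodes, k=3)
```

  • In this run the polluted equation 1 is swapped out of S in the first cleanup iteration, after which the test equation is satisfied and the original data is returned.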
  • the structure of the data recovery unit of the apparatus in the second embodiment is slightly different from that in the first embodiment.
  • the data recovery unit in the second embodiment includes: a first subset setting module 51, a second selection module 52, a second data download and replace module 53 and a second comparison data download and replace module 54;
  • The first subset setting module 51 is configured to set the number of storage nodes to be selected and downloaded again; if this number has already been set in a previous cycle, it is incremented by 1. The number is less than k.
  • The second selection module 52 is used to select that number of storage nodes (the set number, or the number after one or more increments); at least one of the selected storage nodes is different from the k storage nodes from which data was previously selected and downloaded for obtaining the X value.
  • The second data downloading and replacing module 53 is configured to download the data modules from the selected storage nodes, whose number is given by the first subset setting module 51, and to use the newly downloaded data modules as a subset that replaces the same number of data modules among those used to calculate the X value.
  • The second comparison data download and replace module 54 is used to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and replace the (k+1)-th data module used for comparison.
  • FIG. 7, FIG. 8, and FIG. 9 respectively show a flowchart of the process of clearing the pollution data, the pseudo code, and the structure of the data recovery unit in the third embodiment of the present invention. As shown in FIG. 7, the third embodiment is substantially the same as the second embodiment. In steps S61 to S64, the difference from the second embodiment is that in step S61 the number of storage nodes to be downloaded again does not change, regardless of how many data replacements are performed.
  • Fig. 8 shows the process of realizing data acquisition and recovery in the third embodiment by using pseudo code.
  • The same steps as in the second embodiment are first performed in lines 1-4: k+1 equations are downloaded and attack detection is performed. If no attack is detected, this data acquisition ends; otherwise, a cleanup process starts in lines 5-26.
  • Line 5 defines the size w of the cleanup set C as a fixed value determined by an input parameter.
  • Further equations, from the (k+2)-th onward, are then downloaded; downloaded equations are used to initialize the set S and to initialize the cleanup set C.
  • These two sets change in each iteration, and index variables (one of them c) indicate their respective first equations in the current iteration.
  • Line 7 in Figure 8 initializes these variables and begins the iterative process. In each iteration, a new equation e is downloaded in line 9 and used as the test equation for the attack detection in line 12.
  • Line 19 uses e as the test equation to perform the attack detection algorithm on the replaced set. Line 20 indicates that if no attack is detected, S' is clean and the algorithm ends. Otherwise, line 25 increments the index variables and the next iteration continues. Note that the sets are formed from a sliding window of equations of fixed size (k + w + 1). In the third embodiment, the termination conditions are: either the cleanup succeeds, or all the equations have been downloaded.
  • The structure of the data recovery unit includes: a second subset setting module 71, a third selection module 72, a third data downloading and replacing module 73, and a third comparison data downloading and replacing module 74.
  • The second subset setting module 71 is configured to set the number of storage nodes to be selected and downloaded again, where this number is less than k. The third selecting module 72 is configured to select that number of storage nodes, at least one of which is different from the k storage nodes from which data was previously selected and downloaded for obtaining the X value. The third data download and replace module 73 is configured to download the data modules of the selected storage nodes and to use the newly downloaded data modules to replace the same number of data modules among those used to calculate the X value. The third comparison data download and replace module 74 is used to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and replace the (k+1)-th data module used for comparison.
  • The set of downloaded equations used for cleanup is defined as C, and e is an additional equation. The equations in C are used to replace a subset of size |C| of S (that is, the data set used to calculate the original data), and the resulting new set of equations is denoted S'. Attack detection is then performed on the set S', using equation e as the detection equation.
  • The system of linear equations (SLE) formed by S' is solved, and the solution is checked to determine whether it satisfies equation e. If no attack is detected, the solution obtained is treated as the correct data encoding vector; otherwise, S is used again and another subset of size |C| in S is replaced with the equations in C, after which the attack detection algorithm is executed again. These steps are repeated until either the cleanup is successful or all subsets of S of size |C| have been replaced.
  • Here, an equation corresponds to a data module.
  • the basic idea of the second embodiment is to start the cleaning with a cleaning set C (e.g., initially assuming that there is only one attacked equation in the set S), and then to repeatedly increase the size of the set C if the cleaning fails. In this way, sooner or later a cleaning set C is obtained in which the number of correct equations equals the number of attacked equations in the set S. In each iteration, all possible subsets of the equations in C are tried. Therefore, the correct equations in set C will eventually replace the attacked equations in set S, and a final clean set is obtained.
  • this new downloaded equation is defined as e; it is used as a test equation for performing the attack detection algorithm in the current iteration.
  • the remaining downloaded equations in the set S do not constitute a clean set: take every possible subset C' of C such that |C'| = r is not greater than k, and use the equations in C' to replace r equations in set S in all possible ways. After each replacement, the attack detection algorithm is executed on the resulting set using e as the test equation. If there is no attack, S' is clean.
  • the embodiments described in the present invention have a better application in practice than the scheme based on the homomorphic digital signature.
  • both the source and the storage node require a large amount of computation, and these nodes are usually resource-limited sensing nodes.
  • the third embodiment changes the fixed-size sets S and C in each iteration instead of increasing the size of C.
  • Sets S and C are taken from a fixed-size window over the equations, so the probability of success of this method is not equal to 1 in every case. If the number of attacked equations contained in S does not exceed r_max, the recovery will succeed, where r_max is an input parameter that limits the computational complexity of this method by limiting the size of the subsets of equations taken from sets C and S.
  • the same steps as in the second embodiment are first performed, an equation is downloaded, and attack detection is performed. If there is no attack, the data acquisition is ended; otherwise, a clearing process is started; the size of the cleanup set C is defined as a fixed value "", where "is an input parameter; the download equation ⁇ : +2 ⁇ , and initialized with . ⁇ Set S, used to initialize the cleanup set (:. These two sets are changed in each iteration, and the variables ⁇ and c are used to indicate their respective first equations in the current iteration. The same 4 represents the test for attack detection. The equation will be initialized and the iterative process begins.
  • the probability of success is a function of the number t of attacked equations: it exceeds 90% while t stays below a threshold, after which it begins to fall.
  • set S a stronger attack equations
  • the average communication complexity increases with the number of equations t being attacked, because the larger t is, the harder it is to find the attacked equation, and the number of attacked equations contained in the set S of size k does not exceed A fixed value r max , while at the same time containing at least r max correct equations in the set C of size.
  • the average number of downloaded equations is less than half of the total number of equations n; the communication complexity is therefore always acceptable.
  • the communication complexity of the system is very small.
  • the communication complexity increases as the value of the r-span decreases. It can be seen from the later computational complexity analysis that the reduction in communication complexity is obtained at the expense of increased computational complexity.
  • the above embodiments require more coding modules than are needed to obtain the original data; these additional coding modules are used for attack detection and recovery. Attack detection and recovery only require solving systems of linear equations over a finite field. Since no cryptographic algorithm is used, the present invention does not need to rely on a PKI or on a secure channel established in advance.
  • the above methods have significant advantages in both communication and computational load.
  • the first embodiment provides the lowest possible average computational complexity in the system; the second embodiment is optimal in terms of communication complexity, ensures recovery even under a strong attack (most coding modules tampered with), and is still the more practical solution in most systems; the third embodiment is a compromise for very large systems: its resilience (success probability) and communication load are less favorable, but its computational load remains acceptable for very large systems.
  • the above embodiments can be applied to any distributed storage system based on network coding, to P2P file distribution, or to a wireless sensor network.
  • the above method does not require additional coding on the storage nodes or additional information to be added to the coding modules; only the receiving node needs to perform a certain amount of computation.
  • the present invention is particularly suitable for wireless sensor networks in which the storage node is a sensor node with limited capacity resources and the receiving node is a relatively strong base station.
  • the pollution data replacement methods may also be combined: for example, the method of the second embodiment may be used first to perform the pollution data replacement, and when a certain condition is met (for example, a set time has elapsed and the correct data has still not been found), the replacement method may be switched to the method described in the first embodiment or the third embodiment.
  • the technical features can be reasonably combined with each other into a new embodiment.
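The combination of replacement methods mentioned above (start with one embodiment and switch when a set time has elapsed without finding correct data) could be organized roughly as in the following sketch. All names here (`recover_with_fallback`, `primary_step`, `fallback`, the injectable `clock`) are hypothetical and only illustrate the switching condition; they are not taken from the source.

```python
import time

def recover_with_fallback(primary_step, fallback, time_budget_s, clock=time.monotonic):
    """Run primary_step() repeatedly until it yields data or the time budget is spent.

    primary_step: callable returning recovered data, or None if this attempt failed
                  (for example, one replacement attempt of the second embodiment).
    fallback:     callable that runs the alternative method (first or third embodiment).
    Returns a pair (data, name_of_method_used).
    """
    deadline = clock() + time_budget_s
    while clock() < deadline:
        result = primary_step()
        if result is not None:
            return result, "primary"
    # The set time has elapsed without success: switch to the other method.
    return fallback(), "fallback"
```

Passing the clock as a parameter is a design choice made only so that the switching rule can be exercised without waiting for real time to pass.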

Abstract

The invention relates to a pollution data recovery method for distributed storage data. The method comprises the following steps: downloading the stored data from any k+1 of n storage nodes; solving Yj=XGj for k of the downloaded data blocks to obtain the original data X; comparing the data block Yk+1 downloaded from the (k+1)-th storage node with the product of the original data X obtained in the preceding step and the matrix Gk+1; if they are the same, exiting the loop and accepting the data; if not, downloading data from at least one storage node again, wherein at least one newly downloaded data block is different from the k+1 data blocks downloaded before and replaces at least one of the previously downloaded data blocks, and then returning to the original-data calculation step. The present invention also relates to an apparatus for implementing the above method. With the pollution data recovery method and apparatus for distributed storage data of the present invention, the method is simpler and the calculation overhead is lower.

Description

Method and apparatus for pollution recovery of distributed storage data
Technical field
The present invention relates to the field of distributed data storage, and more particularly to a method and apparatus for pollution recovery of distributed storage data.
Background art
Network coding is an information transmission technology that combines coding and routing. On the basis of the traditional store-and-forward routing method, it increases the amount of information carried by a single transmission by allowing the information in multiple received packets to be combined, thereby improving the overall performance of the network. However, a malicious node can intentionally tamper with or forge messages. If a downstream node receives a polluted message and, unaware of the pollution, encodes it together with unpolluted messages, the polluted message quickly spreads downstream of the malicious node and may even spread to the entire network. Polluted messages must therefore be filtered out as early as possible during transmission.
A distributed storage system based on network coding divides the original data into several parts and stores them on different nodes; each coded packet is computed by combining multiple modules using linear network coding. To obtain the original data, enough coded blocks must be obtained, which has important applications in distributed storage. Storage nodes have limited communication, computing, and storage capabilities, and the purpose of storing coded modules rather than raw data is to improve the effectiveness of the system. For example, consider an (n, k) MDS code: there are n storage nodes for storing file modules, and the original file is divided into k parts and encoded into n parts, which are stored on the n different nodes. Each node stores a linear combination of the original data blocks. With suitably chosen parameters, this random linear coding technique lets a receiver recover the original data with high probability by downloading any k modules from these nodes and solving a system of linear equations (SLEs). The receiver can therefore obtain the original file with low latency in data reconstruction and a low download volume in network communication.
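The (n, k) example above can be sketched in a few lines. The prime field GF(257), the Vandermonde generator matrix, and all function names below are assumptions chosen for illustration; the text itself only requires that each node store a linear combination of the original blocks and that any k modules allow the SLEs to be solved.

```python
P = 257  # small prime field, an assumption made only for this illustration

def generator_matrix(n, k):
    """k x n Vandermonde matrix: column j is (1, a, a^2, ...) with a = j + 1."""
    return [[pow(j + 1, i, P) for j in range(n)] for i in range(k)]

def encode(x, G):
    """Coded module of node j: sum_i x[i] * G[i][j] mod P."""
    k, n = len(G), len(G[0])
    return [sum(x[i] * G[i][j] for i in range(k)) % P for j in range(n)]

def solve_mod_p(A, b):
    """Solve A z = b over GF(P) by Gauss-Jordan elimination (A must be invertible)."""
    k = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, P)          # modular inverse (Python 3.8+)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(k):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [(vr - f * vc) % P for vr, vc in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]

def decode(nodes, G, y):
    """Recover the original blocks from the k coded modules held by `nodes`."""
    k = len(G)
    A = [[G[i][j] for i in range(k)] for j in nodes]   # one equation per node
    return solve_mod_p(A, [y[j] for j in nodes])

x = [10, 20, 30]                 # k = 3 original blocks (one field symbol each)
G = generator_matrix(6, 3)       # n = 6 storage nodes
y = encode(x, G)
recovered = decode([1, 3, 5], G, y)   # any 3 of the 6 nodes suffice
```

A Vandermonde matrix is used here because any k of its columns are linearly independent, which gives the "any k modules" property deterministically rather than only with high probability.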
In a benign environment, network coding can improve the effectiveness of a distributed storage system. In a hostile environment, however, for example when an attacker may compromise storage nodes, there is a potential problem that we call a pollution attack: the attacker alters some of the stored coded modules, so that decoding goes wrong when the original file is recovered and the correct file cannot be obtained. Because these coding schemes combine the original data linearly, a single corrupted coded module affects the decoding of the entire file. The practical impact of a pollution attack is large and hard to estimate.
In communication systems, detection of pollution attacks in network-coding-based distributed storage has introduced a cryptographic technique: the hash function. For example, in practical P2P file sharing systems, data modules are usually hashed, and the hash values are made available through a trusted center. By comparing the hash value held by the trusted center with the hash value of each downloaded data module, a node can decide whether the downloaded module is legitimate. To apply this approach to P2P file sharing systems based on network coding, homomorphic hash functions were introduced. In the proposed scheme, the hash value of a coded module X can be obtained from the hash values of its constituent sub-modules $x_i$ ($1 \le i \le k$): when $X = \sum_{i=1}^{k} x_i$, $\mathrm{hash}(X) = \prod_{i=1}^{k} \mathrm{hash}(x_i)$. It is assumed that when a node first joins the system, it can obtain the hash value of each module of a given file in a secure way. These hash values can then be used to verify the integrity of the coded modules that the node downloads. To reduce the computational overhead of homomorphic hash functions, [C. Gkantsidis and P. Rodriguez, "Cooperative Security for Network Coding File Distribution," Proc. IEEE INFOCOM, 2006] requires that nodes cooperate and notify one another when a maliciously tampered module is detected. In this way, a given node cannot authenticate every module on its own and can only rely on information sent by other nodes for verification.
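The homomorphic property used by such schemes can be illustrated with a toy hash of the form $h(b) = \prod_j g_j^{b_j} \bmod p$. The tiny parameters (p = 23, q = 11), the generators, and the use of explicit combination coefficients are assumptions for illustration only; real deployments use primes of the sizes quoted in the literature, and the exact hash used by the cited scheme may differ.

```python
p, q = 23, 11        # toy primes with q dividing p - 1
g = [2, 3, 4]        # elements of order q in Z_p^* (each satisfies g_j^q = 1 mod p)

def h(block):
    """Toy homomorphic hash: product of g_j ** block[j] modulo p."""
    out = 1
    for gj, bj in zip(g, block):
        out = out * pow(gj, bj, p) % p
    return out

def combine(coeffs, blocks):
    """Linear combination of blocks; exponents live in Z_q, so reduce mod q."""
    return [sum(c * b[j] for c, b in zip(coeffs, blocks)) % q for j in range(len(g))]

blocks = [[1, 2, 3], [4, 5, 6]]      # two original modules
coeffs = [3, 7]                      # coding coefficients
coded = combine(coeffs, blocks)

lhs = h(coded)                       # hash of the coded module
rhs = 1
for c, b in zip(coeffs, blocks):     # same value from the hashes of the originals
    rhs = rhs * pow(h(b), c, p) % p
```

Here `lhs` and `rhs` agree, so a node holding only the hashes of the original modules can check a coded module without seeing the originals; changing any symbol of `coded` changes `lhs`.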
In any case, every scheme that uses a hash function (homomorphic or not) requires a secure channel between the source and the sink, since only then can the true hash values of the original data modules be obtained. Another approach to preventing pollution attacks requires the data modules to be digitally signed before the source adds them to the system. However, for the algorithm to work when intermediate nodes combine data modules received from different sinks, the signature scheme must also be homomorphic, similar to the case of the homomorphic hash functions described above.
Homomorphic signature schemes have recently been proposed for network-coding-based distributed storage. Unlike schemes based on homomorphic hash functions, a homomorphic signature scheme does not need a secure channel to be established in advance between the source and the sink. In [Z. Yu, T. Wei, B. Ramkumar, and Y. Guan, "An efficient signature-based scheme for securing network coding against pollution attacks," Proc. IEEE INFOCOM, 2008], it is first assumed that a trusted server exists and generates the security parameters $(p, q, g)$, where p and q are two large primes satisfying $q \mid (p-1)$ (e.g. $|p| = 1024$ bits, $|q| = 257$ bits), and $g = (g_1, \ldots, g_r)$ is a row vector whose elements have order q and are chosen at random from $Z_p$. It is further assumed that the source holds a private key SK and a public key PK, and that the server publishes the public parameters over a secure channel. For a file $id_f$, a group $id_g$, and the k original data modules $b_i$ ($i = 1, \ldots, k$) of that group, the source computes the signature with the following algorithm:
Step 1: compute the hash value of each module $b_i = (b_{i1}, \ldots, b_{ir})$, i.e. $\sigma_i = \prod_{j=1}^{r} g_j^{b_{ij}} \bmod p$, $i = 1, \ldots, k$.
Step 2: compute the signature of the above hash values, i.e. $\mathrm{Sign}(SK, (id_f, \ldots))$.
Step 3: generate the group signature, i.e. $\sigma = (\sigma_1, \ldots)$.
When a node has just joined the system, it downloads the signature of the file from the server. For the given $id_f$, $id_g$, $\sigma$ and a module B of the group (given in the original as an image, Figure imgf000005_0001), the node verifies the validity of B with the following algorithm.
Step 1': verify the validity of the signature σ using the known information (PK, …). If σ is invalid, the algorithm ends and the newly joined node must download the file's signature σ again.
Step 2: compute the homomorphic hash of $B = (B_1, \ldots, B_r)$, i.e. $\sigma = \prod_{j=1}^{r} g_j^{B_j} \bmod p$.
Step 3: compute the hash of B, i.e. $\sigma' = \prod \ldots \bmod p$.
Step 4: determine whether σ equals σ'. If $\sigma = \sigma'$, the new node accepts B; otherwise it discards the corrupted B.
The signature scheme is correct because the homomorphic hash function has the following property: the hash value of a linear combination of input modules equals a combination of the hash values of the individual modules.
However, the scheme has two other problems. First, homomorphic signature schemes are computationally very expensive. Second, a public key infrastructure (PKI) is needed to manage the signature verification keys. These two problems prevent the scheme from being used in practical systems: its high computational complexity rules it out for sensor networks, and its need for a PKI rules it out for large-scale P2P distributed systems.
In summary, existing methods for detecting and recovering from pollution attacks in distributed storage are relatively complex, computationally expensive, and unable to guarantee the reliability of a distributed network storage system.
Summary of the invention
The technical problem to be solved by the present invention is that the above prior-art methods are relatively complex, computationally expensive, and unable to guarantee the reliability of a distributed network storage system. The invention provides a pollution recovery method and apparatus for distributed storage data that is relatively simple, has low computational overhead, and can guarantee the reliability of a distributed network storage system.
The technical solution adopted by the present invention to solve this technical problem is to construct a pollution recovery method for distributed storage data, comprising the following steps:
A) downloading the stored data from any k of the n storage nodes, or downloading the stored data from any k+1 of the n storage nodes;
B) solving $Y_j = XG_j$ for the k downloaded data modules to obtain the original data X, and selecting a (k+1)-th storage node from the n storage nodes and downloading its stored data; where j = 1, 2, ..., n, and k is less than n;
C) comparing the data module $Y_{k+1}$ downloaded from the (k+1)-th storage node with the product of the original data X obtained in the above step and the matrix $G_{k+1}$; if they are the same, exiting the loop with the data obtained; otherwise, performing step D);
D) downloading data from at least one storage node again, wherein at least one of the newly downloaded data modules is different from the k+1 data modules downloaded before; replacing at least one of the previously downloaded data modules with the newly downloaded data modules; and returning to step B).
In the pollution recovery method for distributed storage data according to the present invention, step D) further comprises:
D1) selecting k+1 storage nodes, at least one of which is different from the k+1 storage nodes that were previously selected and whose data has already been downloaded;
D2) downloading the data modules from the selected k+1 storage nodes, and using the k+1 newly downloaded data modules to replace the previously downloaded data modules.
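Steps A) to D), with the D1)/D2) variant in which all k+1 modules are downloaded afresh, can be sketched as follows. The prime field GF(257), the Vandermonde generator matrix, the random re-selection of nodes, and every function name are assumptions made for illustration; in particular, random sampling may occasionally repeat a previous selection, which a stricter implementation of D1) would exclude.

```python
import random

P = 257  # illustrative prime field; the source does not fix a field size

def solve_mod_p(A, b):
    """Solve A z = b over GF(P) by Gauss-Jordan elimination (A must be invertible)."""
    k = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, P)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(k):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [(vr - f * vc) % P for vr, vc in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]

def product(x, G, j):
    """X * G_j over GF(P), i.e. the module node j should hold."""
    return sum(xi * row[j] for xi, row in zip(x, G)) % P

def recover(G, stored, k, rng, max_rounds=1000):
    """Repeat: download k+1 modules, decode from k, test against the (k+1)-th."""
    n = len(stored)
    for _ in range(max_rounds):
        chosen = rng.sample(range(n), k + 1)        # A) first time, D1)/D2) afterwards
        decode_nodes, check = chosen[:k], chosen[k]
        A = [[G[i][j] for i in range(k)] for j in decode_nodes]
        x = solve_mod_p(A, [stored[j] for j in decode_nodes])   # B) solve Y_j = X G_j
        if product(x, G, check) == stored[check]:               # C) compare with Y_{k+1}
            return x
    return None

k, n = 3, 6
G = [[pow(j + 1, i, P) for j in range(n)] for i in range(k)]    # columns are the G_j
x_true = [42, 7, 99]
stored = [product(x_true, G, j) for j in range(n)]
stored[2] = (stored[2] + 1) % P                                  # node 2 is polluted
result = recover(G, stored, k, random.Random(0))
```

With a single polluted node, any selection of k+1 modules that includes it is inconsistent and is rejected in step C), so the loop can only accept a selection made entirely of clean modules. With several colluding polluted nodes a single test module could in principle be fooled, which this sketch does not address.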
In the pollution recovery method for distributed storage data according to the present invention, step E) further comprises:
D31) setting the number τ of storage nodes selected for re-download and, if τ has already been set, increasing τ by 1; where τ is less than k;
D41) selecting τ or τ+1 storage nodes, at least one of which is different from the k storage nodes that were previously selected and whose data was downloaded to obtain the value of X;
D51) downloading the data modules from the selected τ or τ+1 storage nodes, and using the τ or τ+1 newly downloaded data modules to replace any τ or τ+1 of the data modules last used to calculate the value of X.
In the pollution recovery method for distributed storage data according to the present invention, step D) further comprises the following step:
D61) selecting a storage node different from the (k+1)-th storage node previously used for comparison, downloading its data, and using it to replace the (k+1)-th data module used for comparison.
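The growing-τ replacement of steps D31) to D61) can be sketched as an exhaustive search: τ starts at 1, τ freshly selected modules replace every possible τ-subset of the modules used to compute X, and each candidate is tested against another module. The field GF(257), the Vandermonde matrix, the enumeration order, and all names are assumptions for illustration, not the source's pseudocode (which is only available as a figure).

```python
from itertools import combinations

P = 257  # illustrative prime field

def solve_mod_p(A, b):
    """Solve A z = b over GF(P) by Gauss-Jordan elimination (A must be invertible)."""
    k = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, P)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(k):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [(vr - f * vc) % P for vr, vc in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]

def product(x, G, j):
    return sum(xi * row[j] for xi, row in zip(x, G)) % P

def decode(nodes, G, stored):
    A = [[G[i][j] for i in range(len(G))] for j in nodes]
    return solve_mod_p(A, [stored[j] for j in nodes])

def recover_growing_tau(G, stored, k):
    n = len(stored)
    S = list(range(k))                       # modules first used to compute X
    x = decode(S, G, stored)
    if product(x, G, k) == stored[k]:        # initial test with node k as the (k+1)-th
        return x
    spare = range(k, n)
    for tau in range(1, k):                  # D31) tau starts at 1 and grows, tau < k
        for fresh in combinations(spare, tau):         # D41) tau newly selected nodes
            for drop in combinations(S, tau):          # D51) replace any tau modules
                cand = [j for j in S if j not in drop] + list(fresh)
                x = decode(cand, G, stored)
                for e in range(n):                     # D61) try another test module
                    if e not in cand and product(x, G, e) == stored[e]:
                        return x
    return None

k, n = 3, 6
G = [[pow(j + 1, i, P) for j in range(n)] for i in range(k)]
x_true = [5, 200, 17]
stored = [product(x_true, G, j) for j in range(n)]
stored[1] = (stored[1] + 3) % P              # one polluted module inside S
result = recover_growing_tau(G, stored, k)
```

Starting with τ = 1 keeps the number of extra downloads small when few modules are polluted, at the price of enumerating many subsets when the attack is strong, which matches the communication versus computation trade-off described for this embodiment.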
In the pollution recovery method for distributed storage data according to the present invention, step E) further comprises:
D32) setting the number τ of storage nodes selected for re-download, where τ is less than k;
D42) selecting τ storage nodes, at least one of which is different from the k storage nodes that were previously selected and whose data was downloaded to obtain the value of X;
D52) downloading the data modules from the selected τ storage nodes, and using the τ newly downloaded data modules to replace any τ of the data modules last used to calculate the value of X.
In the pollution recovery method for distributed storage data according to the present invention, step E) further comprises the following step:
D62) selecting a storage node different from the (k+1)-th storage node previously used for comparison, downloading its data, and using it to replace the (k+1)-th data module used for comparison.
The pollution recovery method for distributed storage data according to the present invention further comprises the following step:
A0) dividing the original data X into k parts to obtain $X_i$, and encoding them with an (n, k) MDS code to obtain n linearly independent data modules $Y_j$; the n data modules $Y_j$ are stored on the n storage nodes together with the corresponding generator matrix columns $G_j$.
The present invention also relates to an apparatus for implementing the above method, comprising:
a data download unit, configured to download the stored data from any k of the n storage nodes, or to download the stored data from any k+1 of the n storage nodes;
a data obtaining unit, configured to solve $Y_j = XG_j$ for the k downloaded data modules to obtain the original data X, and to select a (k+1)-th storage node from the n storage nodes and download its stored data; or to solve $Y_j = XG_j$ for the k downloaded data modules to obtain the original data X;
a data comparison unit, configured to compare the data module $Y_{k+1}$ downloaded from the (k+1)-th storage node with the product of the original data X obtained above and the matrix $G_{k+1}$;
a data recovery unit, configured to download, depending on the output of the data comparison unit, data from at least one storage node again, wherein at least one of the newly downloaded data modules is different from the k+1 data modules downloaded before, and to replace at least one of the previously downloaded data modules with the newly downloaded data modules.
In the apparatus according to the present invention, the data recovery unit further comprises:
a first selection module, configured to select k+1 storage nodes, at least one of which is different from the k+1 storage nodes that were previously selected and whose data has already been downloaded;
a first data downloading and replacing module, configured to download the data modules from the selected k+1 storage nodes and to use the k+1 newly downloaded data modules to replace the previously downloaded data modules.
In the apparatus according to the present invention, the data recovery unit further comprises:
a first subset setting module, configured to set the number τ of storage nodes selected for re-download and, if τ has already been set, to increase τ by 1; where τ is less than k;
a second selection module, configured to select τ or τ+1 storage nodes, at least one of which is different from the k storage nodes that were previously selected and whose data was downloaded to obtain the value of X;
a second data downloading and replacing module, configured to download the data modules from the selected τ or τ+1 storage nodes and to use the τ or τ+1 newly downloaded data modules to replace any τ or τ+1 of the data modules last used to calculate the value of X;
a second comparison data downloading and replacing module, configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and use it to replace the (k+1)-th data module used for comparison.
In the apparatus according to the present invention, the data recovery unit further comprises:
a second subset setting module, configured to set the number τ of storage nodes selected for re-download, where τ is less than k;
a third selection module, configured to select τ storage nodes, at least one of which is different from the k storage nodes that were previously selected and whose data was downloaded to obtain the value of X;
a third data downloading and replacing module, configured to download the data modules from the selected τ storage nodes and to use the τ newly downloaded data modules to replace any τ of the data modules last used to calculate the value of X;
a third comparison data downloading and replacing module, configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and use it to replace the (k+1)-th data module used for comparison.
Implementing the pollution recovery method and apparatus for distributed storage data of the present invention has the following beneficial effects: since the homomorphic signatures and public keys of the prior art are not needed, guaranteed-correct original data can be obtained with an acceptable amount of data download and computation. The method is therefore relatively simple, has low computational overhead, and can guarantee the reliability of a distributed network storage system.
Brief description of the drawings
Figure 1 is a flow chart of the method in the first embodiment of the pollution recovery method and apparatus for distributed storage data of the present invention;
Figure 2 is a flow chart of the step of eliminating polluted data in the first embodiment;
Figure 3 is a schematic structural diagram of the apparatus in the first embodiment;
Figure 4 is a flow chart of the step of eliminating polluted data in the second embodiment of the pollution recovery method and apparatus for distributed storage data of the present invention;
Figure 5 is pseudocode representing the data acquisition and recovery process in the second embodiment;
Figure 6 is a schematic structural diagram of the apparatus in the second embodiment;
Figure 7 is a flow chart of the step of eliminating polluted data in the third embodiment of the pollution recovery method and apparatus for distributed storage data of the present invention;
Figure 8 is pseudocode representing the data acquisition and recovery process in the third embodiment;
Figure 9 is a schematic structural diagram of the apparatus in the third embodiment.
Detailed description
The embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in Figure 1, in the first embodiment of the pollution recovery method and apparatus for distributed storage data of the present invention, the process of downloading data, removing the polluted (or maliciously modified) part of the downloaded data, and obtaining the correct original data comprises:
Step S11: divide the original data into k parts, encode them, and store them on n storage nodes. In this step, the data is stored dispersed across the storage nodes of the network in accordance with the rules of distributed (network) storage. Specifically, distributed (network) storage stores data dispersed across multiple independent devices. A traditional network storage system uses a centralized storage server to hold all data; the storage server becomes the bottleneck of system performance and the focal point of reliability and security concerns, and cannot meet the needs of large-scale storage applications. A distributed network storage system instead adopts a scalable architecture, uses multiple storage nodes to share the storage load, and uses an index server to locate stored information; this improves the reliability, availability, and access efficiency of the system and also makes it easy to scale. In general, a distributed storage system contains k source nodes, n storage nodes, and at least one receiving node. In practice, the sets of source nodes, storage nodes, and receiving nodes may well overlap. The devices of source nodes and storage nodes have relatively low energy, whereas the devices of receiving nodes have relatively high energy.
In this step, a file M is divided into k parts of equal size, and k different source nodes store the different original data modules $X_i$ ($i = 1, 2, \ldots, k$). Encoding them with an MDS code produces n different coded modules $Y_j$ ($j = 1, 2, \ldots, n$). Any k of these n coded modules are linearly independent, so any k of them suffice to reconstruct the file M; here k is less than n.
By the properties of the MDS code, $Y_j = XG_j = \sum_{i=1}^{k} g_{ij} X_i$, where $X = (X_1, X_2, \ldots, X_k)$ is the row vector of all original data modules and $G_j = (g_{1j}, g_{2j}, \ldots, g_{kj})^T$ is a column vector whose nonzero elements are the coefficients of a random linear code. Hence $g_{ij} \in GF(q)$ for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n$. The data stored by each storage node j is therefore $Z_j = (G_j, Y_j)$ with $Y_j = XG_j$. The whole system can be written as $Y = XG$, where $Y = (Y_1, Y_2, \ldots, Y_n)$ is the row vector of all coded modules and $G = (G_1, G_2, \ldots, G_n)$ is a $k \times n$ generator matrix whose columns are the coefficient vectors.
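The encoding $Y_j = XG_j$ and the stored pair $Z_j = (G_j, Y_j)$ can be sketched as follows. This is a minimal illustration, not the patent's implementation: the prime field size `P`, the Vandermonde generator matrix, and the function names are assumptions chosen so that any k columns of G are linearly independent.

```python
# Illustrative assumptions (not from the source): a prime field GF(P) and a
# Vandermonde generator matrix. With distinct nonzero evaluation points
# 1..n (n < P), every k columns of G form an invertible k x k matrix.
P = 257

def vandermonde(k, n, p=P):
    """Return the k x n generator matrix with G[i][j] = (j + 1)^i mod p."""
    return [[pow(j + 1, i, p) for j in range(n)] for i in range(k)]

def encode(X, G, p=P):
    """Return the n coded modules Y_j = sum_i g_ij * X_i, symbol by symbol.

    X is a list of k original modules, each a list of m symbols in GF(p).
    """
    k, n, m = len(G), len(G[0]), len(X[0])
    return [[sum(G[i][j] * X[i][l] for i in range(k)) % p for l in range(m)]
            for j in range(n)]

X = [[1, 2, 3], [4, 5, 6]]      # k = 2 original modules of m = 3 symbols
G = vandermonde(2, 4)           # n = 4 storage nodes
Y = encode(X, G)
# Each storage node j keeps its coefficient vector and its coded module.
stored = [([G[i][j] for i in range(2)], Y[j]) for j in range(4)]
```

Here `stored[j]` plays the role of $Z_j$; a receiving node only ever needs k such pairs to solve for X.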
For the theoretical basis of recovering the original data from k coded modules, see Theorems 1 and 2 in [A. G. Dimakis, V. Prabhakaran, and K. Ramchandran, "Ubiquitous Access to Distributed Data in Large-Scale Sensor Networks through Decentralized Erasure Codes," Proc. Fourth Int'l Symp. Information Processing in Sensor Networks (IPSN'05), 2005]; the details are not repeated here.
It is worth mentioning that this step performs the storage of the data and is the foundation of the first embodiment. As far as the pollution removal method itself is concerned, however, this step need not be present. It is described only to explain the basis on which the method of the first embodiment operates and to give a complete understanding of the technical solution.
Step S12: arbitrarily select k+1 of the n storage nodes and download their data. In this step, k+1 of the n storage nodes are chosen arbitrarily and the coded data modules they store are downloaded.
Step S13: compute the original data X from any k of the downloaded modules. In this step, any k of the k+1 downloaded data modules are taken and the equations $Y_j = XG_j$ are solved to obtain the original data X.
Step S14: multiply the obtained original data X by the matrix of the (k+1)-th storage node. In this step, the matrix $G_{k+1}$ obtained from the (k+1)-th storage node is multiplied by the original data X from the previous step. The result is the data module that the (k+1)-th storage node should hold if X is correct. Here it is assumed that the original data X obtained in the previous step is correct, that is, that none of the data modules on the k storage nodes used in that step has been polluted (or tampered with).
Step S15: is the computed data module the same as the downloaded (k+1)-th data module? If so, go to step S16; otherwise, go to step S17. If the k storage nodes used to recover the original data are not polluted and the (k+1)-th storage node is not polluted either, the downloaded (k+1)-th data module must equal the data module computed in step S14. If the two are equal, the original data X obtained in step S13 is judged correct. Otherwise, at least one of the k+1 data modules has been polluted or tampered with: either the original data obtained in step S13 is incorrect, or the (k+1)-th storage node used for the comparison is polluted.
Step S16: the downloaded data is not polluted; exit. In this step, since the obtained original data has been judged correct, no further processing is needed, and the data acquisition loop exits.
Step S17: download at least one data module different from those previously downloaded and use it to replace one or more of the k+1 downloaded data modules. At this point at least one of the k+1 data modules has been polluted or tampered with: either the original data obtained in step S13 is incorrect, or the (k+1)-th storage node used for the comparison is polluted. New data modules that have not been downloaded before are therefore downloaded to replace earlier ones, and the process returns to step S13 to repeat the computation with the newly downloaded data, so as to obtain the correct original data X or a correct (k+1)-th data module for verification. At least one data module is downloaded in this step, and at least one of the newly downloaded modules has not been downloaded before. This step is described in detail later.
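Steps S13 to S15 can be sketched as follows: solve for X from k stored pairs $(G_j, Y_j)$, then test the (k+1)-th pair against $XG_{k+1}$. This is a hedged sketch under stated assumptions: the prime field size `P`, the Gauss-Jordan solver, and the names `solve` and `check` are illustrative choices, not taken from the source.

```python
P = 257  # prime field size; an illustrative assumption

def solve(nodes, p=P):
    """Recover X from k stored pairs (G_j, Y_j) by Gauss-Jordan elimination mod p.

    Each pair contributes one equation sum_i G_j[i] * X_i = Y_j. Returns the
    k original modules, or None if the chosen coefficient vectors are singular.
    """
    k = len(nodes)
    rows = [list(g) + list(y) for g, y in nodes]    # augmented matrix [G_j | Y_j]
    for col in range(k):
        pivot = next((r for r in range(col, k) if rows[r][col] % p), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, p)            # modular inverse (Python 3.8+)
        rows[col] = [v * inv % p for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[col])]
    return [row[k:] for row in rows]

def check(X, node, p=P):
    """Steps S14/S15: does X * G_{k+1} equal the downloaded Y_{k+1}?"""
    g, y = node
    expected = [sum(g[i] * X[i][l] for i in range(len(X))) % p
                for l in range(len(y))]
    return expected == list(y)

# Three nodes storing X = [[1, 2, 3], [4, 5, 6]] with G_j = (1, j + 1).
nodes = [([1, 1], [5, 7, 9]), ([1, 2], [9, 12, 15]), ([1, 3], [13, 17, 21])]
clean_ok = check(solve(nodes[:2]), nodes[2])

# Pollute one symbol of the second node's coded module and repeat the test.
polluted = [nodes[0], ([1, 2], [9, 12, 16]), nodes[2]]
polluted_ok = check(solve(polluted[:2]), polluted[2])
```

With clean nodes the verification passes (step S16); with the polluted module it fails, which is the condition that sends the process to step S17.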
In the steps of the first embodiment described above, k+1 data modules are downloaded at once; k of them are used to recover the original data and one is used to verify that the recovered original data is correct. In some cases it is also possible to download k data modules first and use them to recover the original data, and then, once the original data has been recovered, to download one more data module from a storage node different from those k in order to verify the recovered data. The two approaches have the same effect. For the latter, of course, the relevant steps of the first embodiment need minor adjustments to suit the download order.
In the first embodiment, as shown in Fig. 2, step S17 further includes the following steps.
Step S21: select k+1 storage nodes, at least one of which differs from the storage nodes of the previous download. In this step, k+1 storage nodes are selected anew. Because the storage nodes of the previous download are known, it is easy to ensure that the newly selected k+1 storage nodes are not identical to them: it suffices to remove one or more entries from the list of storage nodes already downloaded and add the same number of storage nodes that have not been downloaded.
Step S22: download the data on the selected k+1 storage nodes and use it to replace the previously downloaded data. In this step, the data modules on the storage nodes selected in the previous step are downloaded and used in place of the data modules of the last computation or verification. After this step, the process jumps to step S13.
Note that in the variant where k modules are downloaded first to obtain the original data and one more module is then downloaded to verify it, steps S21 and S22 likewise first download k modules not downloaded before and compute the original data X from them, and then download one more module not downloaded before to verify that X.
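The loop formed by steps S13 to S17 together with S21 and S22 can be sketched as below. The sketch makes several assumptions that are not in the source: a prime field `P`, a deterministic enumeration of (k+1)-node selections via `itertools.combinations` (the source allows any selection that differs from the previous one in at least one node), and the helper names `solve`, `check`, and `recover`.

```python
from itertools import combinations

P = 257  # prime field size; an illustrative assumption

def solve(nodes, p=P):
    """Recover X from k pairs (G_j, Y_j) by Gauss-Jordan elimination mod p."""
    k = len(nodes)
    rows = [list(g) + list(y) for g, y in nodes]
    for col in range(k):
        pivot = next((r for r in range(col, k) if rows[r][col] % p), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, p)
        rows[col] = [v * inv % p for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[col])]
    return [row[k:] for row in rows]

def check(X, node, p=P):
    """Compare X * G_{k+1} with the downloaded Y_{k+1}."""
    g, y = node
    return [sum(g[i] * X[i][l] for i in range(len(X))) % p
            for l in range(len(y))] == list(y)

def recover(stored, k, p=P):
    """Try (k+1)-node selections until one passes verification."""
    for chosen in combinations(range(len(stored)), k + 1):  # S21: a new selection
        picked = [stored[j] for j in chosen]                # S22: download it
        X = solve(picked[:k], p)                            # S13
        if X is not None and check(X, picked[k], p):        # S14, S15
            return X, chosen                                # S16
    return None, None   # no selection verified: fewer than k+1 clean nodes

X_true = [[1, 2, 3], [4, 5, 6]]
stored = [([1, j + 1],
           [(X_true[0][l] + (j + 1) * X_true[1][l]) % P for l in range(3)])
          for j in range(5)]
stored[0] = (stored[0][0], [6, 7, 9])   # node 0 is polluted
X, chosen = recover(stored, k=2)
```

In this example every selection containing node 0 fails verification, and the first selection without it returns the original data. A selection that passes could in principle still be a false negative, which the probability analysis of the first embodiment bounds.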
In the first embodiment, each data module $X_i$ can itself be represented as a column vector of m symbols, $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T$, where $x_{il} \in GF(q)$ for all $i = 1, 2, \ldots, k$ and $l = 1, 2, \ldots, m$. Each coded module $Y_j$ can likewise be represented as a column vector of m symbols over $GF(q)$. The linear combination $Y_j = XG_j$ is computed symbol by symbol, which means that for all $j = 1, 2, \ldots, n$ and $l = 1, 2, \ldots, m$, $y_{jl} = \sum_{i=1}^{k} g_{ij} x_{il}$. We can therefore regard X and Y as matrices of size $m \times k$ and $m \times n$, respectively.
In the first embodiment, the effect of a data pollution attack on the decoding of stored data can be analyzed as follows. Assume the attacker can access t storage nodes and can observe and modify the equations (data) stored on them; that is, if the attacker can access storage node j, it can change the $G_j$ and $Y_j$ stored by node j. Let $G^* = G + \Delta G$ and $Y^* = Y + \Delta Y$ be the generator matrix and the coded module vector after the attack, where the modifications are captured by the matrix $\Delta G$ and the vector $\Delta Y$. Assume further that the attacker can alter the communication links of the t storage nodes; this clearly gives the attacker more options but does not enlarge the possible impact of the attack. For simplicity, nodes whose stored data has been tampered with are called compromised nodes, and compromised nodes cannot be distinguished from ordinary storage nodes. In line with how attacks occur in practice, assume that source nodes cannot be attacked and that only the output of the storage system is changed. The reason is that storage nodes are exposed to attack for long periods, whereas a source node can only be attacked during the limited period in which the data is generated, so the chance of it being attacked is far smaller.
Without loss of generality, assume the attacker tampers with t storage nodes at random while the receiving node randomly selects k storage nodes and downloads k linear equations. The set of equations downloaded by the receiving node is $Z^*_{1..k} = (G^*_{1..k}, Y^*_{1..k})$, where $G^*_{1..k} = (G^*_1, G^*_2, \ldots, G^*_k)$ and $Y^*_{1..k} = (Y^*_1, Y^*_2, \ldots, Y^*_k)$. To obtain X, the receiving node solves the system of linear equations $Y^*_{1..k} = XG^*_{1..k}$ and obtains the result $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$.
First assume the attacker tampers only with the coded modules, which means $G^* = G$. In this case $X^* = Y^*_{1..k}(G_{1..k})^{-1} = (Y_{1..k} + \Delta Y_{1..k})(G_{1..k})^{-1}$, and the change the attack makes to the original data can be written as $\Delta X = X^* - X = Y^*_{1..k}(G_{1..k})^{-1} - X = \Delta Y_{1..k}(G_{1..k})^{-1}$, where we have used $X = Y_{1..k}(G_{1..k})^{-1}$. This means that if a given row of $\Delta Y_{1..k}$ contains only zero elements, the corresponding row of $\Delta X$ also contains only zero elements, and that a nonzero element in a given row of $\Delta Y_{1..k}$ affects the entire corresponding row of $\Delta X$. Tampering with a given row of any one of the k coded modules therefore affects the decoding of all the data modules, but the effect is confined to that row.
Next assume the attacker tampers only with the coefficient vectors, which means $Y^* = Y$. In this case $X^* = Y_{1..k}(G^*_{1..k})^{-1}$. If at least one of the k coefficient vectors has been changed by the adversary, that is, $G^*_{1..k} \neq G_{1..k}$, then $(G^*_{1..k})^{-1}$ is completely different from $(G_{1..k})^{-1}$. Such a change therefore affects the decoding of every row of the data modules.
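The two cases above can be contrasted numerically for k = 2. The sketch below is illustrative only: the prime field `P`, the closed-form 2 x 2 solver `decode2`, and the helper `changed_positions` are assumptions introduced for the example. A "row" in the sense of the text corresponds to one symbol position across all data modules.

```python
P = 257  # prime field size; an illustrative assumption

def decode2(node_a, node_b, p=P):
    """Solve a*x1 + b*x2 = u and c*x1 + d*x2 = v (mod p) for every symbol (k = 2)."""
    (a, b), ya = node_a
    (c, d), yb = node_b
    det_inv = pow((a * d - b * c) % p, -1, p)       # inverse of the determinant
    x1 = [(d * u - b * v) * det_inv % p for u, v in zip(ya, yb)]
    x2 = [(a * v - c * u) * det_inv % p for u, v in zip(ya, yb)]
    return [x1, x2]

def changed_positions(Xa, Xb):
    """Symbol positions (rows of the m x k matrix X) where any module differs."""
    return {l for l in range(len(Xa[0]))
            if any(a[l] != b[l] for a, b in zip(Xa, Xb))}

# Clean equations for X = [[1, 2, 3], [4, 5, 6]].
n1, n2 = ([1, 1], [5, 7, 9]), ([1, 2], [9, 12, 15])
X = decode2(n1, n2)

# Case 1: only a coded module is tampered with (symbol 0 of node 1).
X_y = decode2(([1, 1], [6, 7, 9]), n2)

# Case 2: only a coefficient is tampered with (first coefficient of node 1).
X_g = decode2(([2, 1], [5, 7, 9]), n2)

rows_hit_by_y = changed_positions(X, X_y)
rows_hit_by_g = changed_positions(X, X_g)
```

In this instance the tampered coded symbol corrupts only position 0 of the decoded modules, while the tampered coefficient corrupts all three positions, matching the analysis.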
Finally, assume the attacker changes both the coefficient vectors and the coded modules; the effects then combine. In the general case, the change the attacker causes in the data modules follows from $X^* G^*_{1..k} = Y^*_{1..k}$, that is, $(X + \Delta X)(G_{1..k} + \Delta G_{1..k}) = Y_{1..k} + \Delta Y_{1..k}$. Using $Y_{1..k} = XG_{1..k}$, this gives $\Delta X = (\Delta Y_{1..k} - X\,\Delta G_{1..k})(G^*_{1..k})^{-1}$. From this formula it can be seen that if all the downloaded equations come from compromised nodes, so that $\Delta G_{1..k}$ and $\Delta Y_{1..k}$ are under the attacker's control, the value of $\Delta X$ can be chosen by the attacker: the attacker can reconstruct X from the contents stored on the nodes, and can produce an arbitrary $X^* = X + \Delta X$ by forging the content stored on the i-th compromised node as $Y^*_i = X^* G_i$. In the scenario $t \geq k$, therefore, the attacker can not only tamper with the original data module vector but can also forge a specific value. In fact, changing a small part of the coded information stored on the nodes leads to large changes in the decoded data. In the worst case, all data modules are destroyed.
The basic idea of the method adopted in the first embodiment is as follows. In most cases the attacker cannot forge a specific solution $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$, because it cannot tamper with all k of the equations downloaded initially. Except when none of these k equations has been tampered with, in which case $X^* = X$, $X^*$ can be regarded as a random vector. If $X^* \neq X$, then $X^*$ takes a random value from a set of size at least q. Suppose there is another equation that has not been tampered with, $Y_{k+1} = XG_{k+1}$ (for example, the receiving node downloads $Z^*_{k+1} = Z_{k+1}$). If $X^*$ is random or chosen by the attacker, it very probably does not satisfy this additional untampered equation, whereas if $X^* = X$ it satisfies the equation with probability 1. We can therefore use one additional untampered equation to judge whether the data module vector $X^*$ is polluted.
Accordingly, in the first embodiment the receiving node first downloads k equations $Z^*_{1..k}$ and computes $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$. The receiving node then downloads one more equation $Z^*_{k+1}$. If $Y^*_{k+1} = X^* G^*_{k+1}$, no attack is detected (the sink treats $X^*$ as the correct solution); otherwise, if $Y^*_{k+1} \neq X^* G^*_{k+1}$, an attack signal is raised. The communication complexity of this algorithm is k+1 and its computational complexity is 1. From the known fact that any error-correcting code has a Hamming distance of at least 2, it follows that any attack detection algorithm for the described system must download at least k+1 equations. The proposed attack detection algorithm is therefore optimal in terms of communication complexity.
The attack detection described above may produce a false negative decision (an undetected error), which arises in two situations: $X^*$ is a random value, or $X^*$ is a specific value forged by the attacker. If at least one of the k downloaded equations is correct, or if the attacker's modifications to the stored contents are uncorrelated, $X^*$ can be regarded as a random value. Suppose the attacker has not tampered with the coefficient matrix, that is, $G^* = G$; by the analysis above, the solution obtained by the sink is then $X^* = X + \Delta Y_{1..k}(G_{1..k})^{-1} = X + \Delta X$. Suppose further that the additional equation used for error detection has not been tampered with, that is, $Z^*_{k+1} = Z_{k+1} = (G_{k+1}, Y_{k+1})$. In this case the probability $P_{fneg}$ of a false negative decision is
$P_{fneg} = \Pr\{Y_{k+1} = X^* G_{k+1} \mid \Delta Y_{1..k} \neq 0\} = \Pr\{Y_{k+1} = (X + \Delta X) G_{k+1} \mid \Delta Y_{1..k} \neq 0\} = \Pr\{\Delta X\,G_{k+1} = 0 \mid \Delta Y_{1..k} \neq 0\}$, (1)
where the last step uses $Y_{k+1} = XG_{k+1}$.
If $\Delta Y_{1..k}$ has a nonzero element in row i and $G_{1..k}$ is correct, then $\Delta X$ also has nonzero elements in row i. Conversely, if row i of $\Delta Y_{1..k}$ contains only zero elements, row i of $\Delta X$ also contains only zero elements. The i-th element of $\Delta X\,G_{k+1}$ can be written as $\sum_{j=1}^{k} (\Delta X)_{ij}\, g_{j,k+1}$, (2) where $(\Delta X)_{ij}$ is the entry in row i and column j of $\Delta X$. By the analysis above, expression (2) is a nontrivial linear combination of the elements $g_{j,k+1}$. Since these elements are chosen at random, and since the downloaded equations are random and not known to the attacker in advance, the probability that expression (2) equals 0 is $1/q$. If the elements of $\Delta X$ are mutually uncorrelated, then $P_{fneg} = 1/q^{t'}$, (3) where $t'$ is the number of rows of the matrix $\Delta Y_{1..k}$ that contain nonzero elements.
When the attacker's modifications are correlated, $P_{fneg} \leq 1/q$ still holds. This shows that $X^*$ takes a random value from a set of size at least q. Clearly, to maximize the probability that detection fails (and thus minimize the probability that detection succeeds), the attacker must confine its modifications of the coded modules to a single row or make the tampered rows linearly dependent.
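The $1/q$ figure for a single tampered row can be checked by exhaustive counting in a small field. The sketch below is an idealized illustration with assumed names and parameters: it takes q prime and treats the verification vector $G_{k+1}$ as uniformly distributed over all of $GF(q)^k$, which the source's random coding coefficients only approximate.

```python
from fractions import Fraction
from itertools import product

def false_negative_rate(delta_row, q):
    """Fraction of vectors g in GF(q)^k (q prime) with delta_row . g = 0 (mod q).

    delta_row is one nonzero row of Delta X; a zero inner product means the
    tampering in that row goes unnoticed by the extra equation.
    """
    k = len(delta_row)
    hits = sum(1 for g in product(range(q), repeat=k)
               if sum(d * c for d, c in zip(delta_row, g)) % q == 0)
    return Fraction(hits, q ** k)

rate = false_negative_rate([3, 0, 5], 7)
```

For any nonzero row the solutions form a hyperplane of $q^{k-1}$ vectors, so the rate is exactly $1/q$; independent tampered rows multiply these factors, which is the shape of expression (3).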
Now suppose the attacker has not tampered with the coefficient matrix, that is, $G^* = G$, but the additional equation used for detection has been tampered with, meaning $Z^*_{k+1} = (G_{k+1}, Y^*_{k+1}) = (G_{k+1}, Y_{k+1} + \Delta Y_{k+1})$. Reasoning as in the previous case gives $P_{fneg} = \Pr\{\Delta X\,G_{k+1} = \Delta Y_{k+1} \mid \Delta Y_{1..k} \neq 0\}$. (4) From the earlier analysis of the effect of an attack, if row i of $\Delta Y_{1..k}$ contains only zero elements, then row i of $\Delta X$ also contains only zero elements, and the i-th element of $\Delta X\,G_{k+1}$ is then necessarily zero. Therefore, if the i-th element of $\Delta Y_{k+1}$ is nonzero, the false negative probability is 0 (that is, the attack is detected even though the additional equation used for detection is itself incorrect). On the other hand, if $\Delta Y_{k+1}$ has a zero element in every row in which $\Delta Y_{1..k}$ contains only zero elements, then by the randomness of $G_{k+1}$ we again obtain $P_{fneg} \leq 1/q$.
Suppose instead that the attacker tampers with both the coefficient vectors and the coded modules, so that $\Delta G \neq 0$ and $\Delta Y \neq 0$. This case must be handled with care, because the value $X^* = Y^*_{1..k}(G^*_{1..k})^{-1}$ cannot be completely random. For example, if $Y^*_{1..k} = XG^*_{1..k}$, then $\Delta X = 0$ even though $\Delta G_{1..k} \neq 0$ and $\Delta Y_{1..k} \neq 0$. Such tampering clearly cannot be regarded as an attack, because the modified equations are not polluted. The example shows that even when every element of the coefficient vectors and coded modules has been tampered with, the result may be equivalent to a correct equation with only a single tampered element. Taking the greatest possible correlation between the tampered elements into account therefore brings us back to the previous case; that is, $P_{fneg} = \Pr\{Y^*_{k+1} = X^* G^*_{k+1} \mid \Delta X \neq 0\} \leq 1/q$ (5) holds here as well. When the adversary attacks at random, all values of $X^*$ may be random and mutually uncorrelated, in which case $P_{fneg}$ drops to $1/q^m$.
Consequently, as long as the downloaded equations do not all belong to the same value $X^* \neq X$, the maximum probability of an undetected error is $P_{fneg} = 1/q$. If q is large enough, the probability of failing to detect a pollution attack is therefore negligible. Naturally, a larger q also means more traffic and more storage overhead. Note that if the coded modules contain a standard error detection element, for example a CRC checksum, the attacker must tamper with at least two rows of every coded module it attacks and the modifications remain uncorrelated, which gives $P_{fneg} \leq 1/q^2$. This allows a smaller field to be chosen, which in turn speeds up computation over the finite field. With $q = 2^{20}$, for instance, the error probability is $1/q^2 = 2^{-40}$.
Now consider the case in which all k+1 downloaded equations are polluted and $X^*$ is forged by the attacker. In this case the method of the first embodiment cannot detect the attack, because $X^*$ also satisfies the additional detection equation. The probability of this kind of undetected error is $\Delta = \binom{t}{k+1} \big/ \binom{n}{k+1} \approx (t/n)^{k+1}$ (for a practical system, the value of $\Delta$ depends mainly on the number of attacked nodes). When t is small relative to n and k is large enough, this value is very small (for example, $\Delta \approx 10^{-9}$ for n = 100, k = 10, t = 20). We can therefore bound the probability of an undetected error from above: $P_{fneg} < 1/q + \Delta$. In most cases n can be assumed to be much larger than t, so $\Delta$ is close to 0; if we consider a strong attack with a large t, however, $\Delta$ can no longer be neglected.
Attack detection can also produce a false positive decision (correct data judged incorrect). This happens when the k equations downloaded initially are correct, meaning $Z^*_{1..k} = Z_{1..k}$, so that the receiving node computes the correct solution $X^* = Y_{1..k}(G_{1..k})^{-1} = X$, while the additional equation used for attack detection is incorrect. The probability of a false positive decision can therefore be defined as $P_{fpos} = \Pr\{\Delta Z_{k+1} \neq 0 \mid \Delta Z_{1..k} = 0\}$. (6) Given that the k equations downloaded initially are correct, the probability that the (k+1)-th equation is also correct is $\frac{n-k-t}{n-k}$, (7) where t is the number of storage nodes attacked by the attacker. From (7) we obtain $P_{fpos} = 1 - \frac{n-k-t}{n-k} = \frac{t}{n-k}$. (8) Although $P_{fpos}$ is not negligible, a false positive decision has no serious impact on the system.
In summary, the method of the first embodiment can be applied to any set of k+1 equations, regardless of whether that set contains polluted equations. When an attack is detected, we know only that the set contains some polluted equations; we do not know how many there are or which ones they are. The recovery algorithm can therefore be viewed as a search, among the $z$ = [equation image imgf000015_0001] candidate sets, for one correct set with which to restore the original data.
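The two probabilities discussed above, $\Delta = \binom{t}{k+1}/\binom{n}{k+1}$ and $P_{fpos} = 1 - \frac{n-k-t}{n-k}$, can be evaluated directly for the example parameters in the text. The function names below are illustrative, and the field size $q = 2^{20}$ used for the bound $1/q + \Delta$ is borrowed from the earlier example in the text.

```python
from fractions import Fraction
from math import comb

def all_polluted_prob(n, k, t):
    """Delta: probability that k+1 randomly chosen nodes are all compromised."""
    return Fraction(comb(t, k + 1), comb(n, k + 1))

def false_positive_prob(n, k, t):
    """P_fpos per equations (7) and (8): the verifier is polluted, the first k are clean."""
    return 1 - Fraction(n - k - t, n - k)

n, k, t = 100, 10, 20
delta = all_polluted_prob(n, k, t)
approx = Fraction(t, n) ** (k + 1)          # the (t/n)^(k+1) approximation
p_fpos = false_positive_prob(n, k, t)
bound = Fraction(1, 2 ** 20) + delta        # upper bound 1/q + Delta for q = 2^20
```

For these parameters `float(delta)` is on the order of $10^{-9}$, in line with the text, while the approximation $(t/n)^{k+1}$ is somewhat larger, so it behaves as a rough upper estimate. The false positive probability is $t/(n-k) = 2/9$.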
The analysis above shows that the file cannot be recovered when fewer than k+1 equations are correct: at least k correct equations are needed to obtain the correct data module vector, and one additional correct equation is needed to verify the solution of the system. Because q can be chosen sufficiently large and $\Delta$ can usually be neglected, the result obtained by the method above is correct except with negligible probability.
The first embodiment also relates to an apparatus that implements the pollution recovery method for distributed storage data described above. As shown in Fig. 3, the apparatus includes a data distribution and storage unit 31, a data download unit 32, a data obtaining unit 33, a data comparison unit 34, and a data recovery unit 35. The data download unit 32 downloads the stored data from any k of the n storage nodes, or from any k+1 of the n storage nodes. The data obtaining unit 33 operates in one of two ways. In one case (when k data modules are downloaded first), it solves $Y_j = XG_j$ for the k downloaded modules to obtain the original data X, after which a (k+1)-th storage node is selected among the n storage nodes and its stored data is downloaded. In the other case (when k+1 data modules are downloaded at once), it solves $Y_j = XG_j$ for k of the downloaded modules to obtain the original data X. The data comparison unit 34 compares the data module $Y_{k+1}$ downloaded from the (k+1)-th storage node with the product of the original data X obtained above and the matrix $G_{k+1}$; if they are the same, the k+1 downloaded data modules are judged to be free of pollution; otherwise, at least one data module is judged to be polluted and must be removed. The data recovery unit 35 acts on the output (or judgment) of the data comparison unit: it downloads the data of at least one storage node again, with at least one of the newly downloaded data modules differing from the k+1 modules of the previous download, and uses the newly downloaded modules to replace at least one of the previously downloaded modules. The data distribution and storage unit 31 divides the original data X into k parts to obtain $X_i$, where $i = 1, 2, \ldots, k$, and encodes them with an MDS code to obtain n linearly independent data modules $Y_j$, where $j = 1, 2, \ldots, n$; the n data modules $Y_j$ are stored, each together with its generator matrix column $G_j$, on the n storage nodes, where k is less than n. The data distribution and storage unit 31 is not strictly indispensable; it is listed here for ease of understanding.
In this embodiment, the data recovery unit 35 further includes a first selection module 351 and a first data download and replacement module 352. The first selection module 351 selects k+1 storage nodes, at least one of which differs from the k+1 storage nodes previously selected and downloaded. The first data download and replacement module 352 downloads the data modules on the selected k+1 storage nodes and uses the newly downloaded k+1 data modules in place of those of the previous download.
In the second embodiment of the present invention, most steps of the method and most units of the apparatus are essentially the same as in the first embodiment. The difference lies in how data judged to be polluted is removed, which in turn makes the structure of the data recovery unit in the apparatus slightly different. As shown in Fig. 4, in the second embodiment the step of downloading at least one data module different from those previously downloaded, and using it to replace one or more of the k+1 downloaded data modules, specifically includes the following steps.
Step S41: set the number of storage nodes to download again, or increment it by 1 if it is already set. In this step, a number τ of storage nodes to download again is set, where τ is less than k. If the original data X obtained has already failed verification (that is, the comparison was unequal) on earlier passes, then τ was already set after the first computation and comparison; in that case, τ is incremented by 1.
Step S42: select the storage nodes to download again according to the above number. In this step, τ storage nodes (or the number obtained after incrementing one or more times) are selected. They may be chosen arbitrarily from the n storage nodes, provided that at least one selected node differs from the k storage nodes previously selected and downloaded from to obtain X; for example, storage nodes entirely different from the data sources of the previous computation may be selected. Once selected, the data modules on these storage nodes are downloaded.
Step S43: replace data used in the previous computation with the re-downloaded data. The re-downloaded data modules, taken as a subset, replace the same number of data modules among those used previously, yielding the data used for the current computation.
Step S44: select a storage node different from the one last used for comparison and download from it the data module used for comparison. In this step, a storage node different from the (k+1)-th storage node previously used for comparison is selected, its data is downloaded, and that data replaces the (k+1)-th data module used for comparison. After this step, the procedure jumps back to computing the original data and verifying it by comparison. Thus, as long as the correct original data has not been found, the above process repeats until the contaminated data is cleaned out and the correct original data is obtained.
FIG. 5 shows, in pseudocode, the process of data retrieval and recovery in the second embodiment. In FIG. 5, line 1 first downloads z_1, ..., z_{k+1}; line 2 runs the attack detection algorithm on z_1, ..., z_k using z_{k+1} as the test equation. If no attack is detected, z_1, ..., z_k is clean and the algorithm ends; otherwise, lines 5-24 of FIG. 5 start the iterative process of cleaning S = z_1, ..., z_k. In each iteration of lines 7-24, line 8 downloads a new equation, denoted e, and line 10 uses it as the test equation for the attack detection algorithm in the current iteration. The equations downloaded so far that are not in S form the cleaning set C. Every possible subset C' of C with |C'| = r less than k is taken, and the equations in C' replace r equations of S in every possible way. After each replacement, the attack detection algorithm is run on the resulting set S' with e as the test equation. If no attack is detected, S' is clean and the algorithm ends.
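The iterative cleaning described for the second embodiment can be sketched in Python as follows. This is a hedged illustration rather than the pseudocode of FIG. 5: it assumes scalar equations over GF(257) with Vandermonde coefficient vectors, treats `modules[j]` as the equation downloaded j-th, and lets each former test equation join the cleaning set C when a new test equation is downloaded. The function names are hypothetical.

```python
from itertools import combinations

P = 257  # illustrative prime modulus; an assumption, not from the source


def solve_mod_p(eqs, p=P):
    """Gauss-Jordan elimination mod p; eqs = [(g, y), ...]; None if singular."""
    k = len(eqs)
    a = [[v % p for v in g] + [y % p] for g, y in eqs]
    for col in range(k):
        piv = next((r for r in range(col, k) if a[r][col]), None)
        if piv is None:
            return None
        a[col], a[piv] = a[piv], a[col]
        inv = pow(a[col][col], -1, p)
        a[col] = [v * inv % p for v in a[col]]
        for r in range(k):
            if r != col and a[r][col]:
                f = a[r][col]
                a[r] = [(v - f * u) % p for v, u in zip(a[r], a[col])]
    return [row[k] for row in a]


def satisfies(x, eq, p=P):
    g, y = eq
    return sum(xi * gi for xi, gi in zip(x, g)) % p == y % p


def encode(x, n, p=P):
    """Equations (G_j, Y_j) with Y_j = X G_j, using Vandermonde columns G_j."""
    k = len(x)
    cols = [[pow(j, i, p) for i in range(k)] for j in range(1, n + 1)]
    return [(g, sum(xi * gi for xi, gi in zip(x, g)) % p) for g in cols]


def clean_growing(modules, k):
    """Clean S = first k equations with a cleaning set C that grows per iteration."""
    S = list(modules[:k])
    e = modules[k]                     # first test equation
    x = solve_mod_p(S)
    if x is not None and satisfies(x, e):
        return x                       # no attack detected
    C = []
    for nxt in range(k + 1, len(modules)):
        C.append(e)                    # downloaded so far, not in S
        e = modules[nxt]               # newly downloaded test equation
        for r in range(1, min(len(C), k - 1) + 1):   # |C'| = r, r < k
            for c_sub in combinations(C, r):
                for out in combinations(range(k), r):
                    cand = [q for i, q in enumerate(S) if i not in out]
                    cand += list(c_sub)              # the replaced set S'
                    x = solve_mod_p(cand)
                    if x is not None and satisfies(x, e):
                        return x
    return None
```

The number of candidate sets grows combinatorially with r, which is why the description ties the practicality of this approach to the number of attacked equations.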
Corresponding to the above steps, the structure of the data recovery unit of the apparatus in the second embodiment differs slightly from that of the first embodiment. As shown in FIG. 6, the data recovery unit in the second embodiment includes a first subset setting module 51, a second selection module 52, a second data download and replacement module 53, and a second comparison data download and replacement module 54. The first subset setting module 51 sets the number τ of storage nodes selected for re-download, where τ is less than k; if the value was already set in a previous loop, τ is incremented by 1. The second selection module 52 selects τ or τ+1 storage nodes (or the number obtained after several increments), at least one of which differs from the k storage nodes previously selected and downloaded from to obtain X. The second data download and replacement module 53 downloads the data modules from the selected storage nodes, whose number is given by the first subset setting module 51, and uses the τ or τ+1 re-downloaded data modules, as a subset, in place of any τ or τ+1 of the data modules last used to compute X. The second comparison data download and replacement module 54 selects a storage node different from the (k+1)-th storage node previously used for comparison, downloads its data, and substitutes it for the (k+1)-th data module used for comparison.
FIG. 7, FIG. 8, and FIG. 9 respectively show the flowchart of the contaminated-data cleaning process, the pseudocode, and the structure of the data recovery unit in the third embodiment of the present invention. As shown in FIG. 7, the third embodiment is largely the same as the second; in its steps S61-S64, the difference from the second embodiment is that in step S61 the value of τ does not change, no matter how many data replacements are performed.
FIG. 8 shows, in pseudocode, the process of data retrieval and recovery in the third embodiment. In FIG. 8, lines 1-4 first perform the same steps as in the second embodiment: k+1 equations are downloaded and attack detection is run. If no attack is detected, this data retrieval ends; otherwise, lines 5-26 start a cleaning process. Line 5 defines the size w of the cleaning set C as a fixed value determined by α, where α is an input parameter. Line 6 downloads the equations z_{k+2}, ..., z_{k+w+1}, initializes the set S with z_1, ..., z_k, and initializes the cleaning set C with z_{k+2}, ..., z_{k+w+1}. Both sets change in every iteration, and the variables s and c mark their respective first equations in the current iteration; similarly, e marks the test equation used for attack detection. Line 7 initializes these variables and starts the iteration. In each iteration, line 9 downloads a new equation, which serves as the test equation for the attack detection at line 12. Lines 13-15 take every possible subset C' of C with |C'| = r not exceeding r_max, and lines 16-18 replace r equations of S with the equations of C' in every possible way. Line 19 runs the attack detection algorithm on the resulting set S' with e as the test equation. Line 20 states that if no attack is detected, S' is clean and the algorithm ends; otherwise, line 25 increments the index variables and the next iteration begins. Note that the sets consist of equations from a rolling window of size k + w + 1 over Z*. The third embodiment ends either when cleaning succeeds or when all equations in Z* have been downloaded.
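The rolling-window idea of the third embodiment can be sketched as follows. The sketch makes several assumptions that are not stated in the source: equations are scalars over GF(257) with Vandermonde coefficient vectors, the window at offset o is laid out as S = Z[o : o+k], C = Z[o+k : o+k+w], with the test equation at Z[o+k+w], and the window advances by one equation per iteration. The names `clean_rolling`, `w`, and `r_max` follow the description, while the helper names are hypothetical.

```python
from itertools import combinations

P = 257  # illustrative prime modulus; an assumption, not from the source


def solve_mod_p(eqs, p=P):
    """Gauss-Jordan elimination mod p; eqs = [(g, y), ...]; None if singular."""
    k = len(eqs)
    a = [[v % p for v in g] + [y % p] for g, y in eqs]
    for col in range(k):
        piv = next((r for r in range(col, k) if a[r][col]), None)
        if piv is None:
            return None
        a[col], a[piv] = a[piv], a[col]
        inv = pow(a[col][col], -1, p)
        a[col] = [v * inv % p for v in a[col]]
        for r in range(k):
            if r != col and a[r][col]:
                f = a[r][col]
                a[r] = [(v - f * u) % p for v, u in zip(a[r], a[col])]
    return [row[k] for row in a]


def satisfies(x, eq, p=P):
    g, y = eq
    return sum(xi * gi for xi, gi in zip(x, g)) % p == y % p


def encode(x, n, p=P):
    """Equations (G_j, Y_j) with Y_j = X G_j, using Vandermonde columns G_j."""
    k = len(x)
    cols = [[pow(j, i, p) for i in range(k)] for j in range(1, n + 1)]
    return [(g, sum(xi * gi for xi, gi in zip(x, g)) % p) for g in cols]


def clean_rolling(Z, k, w, r_max):
    """Slide a window of k + w + 1 equations over Z; |C| stays fixed at w."""
    for o in range(len(Z) - (k + w + 1) + 1):
        S = Z[o:o + k]                 # equations used to compute X
        C = Z[o + k:o + k + w]         # fixed-size cleaning set
        e = Z[o + k + w]               # test equation for this iteration
        for r in range(min(r_max, w, k) + 1):   # r = 0 tests S unchanged
            for c_sub in combinations(C, r):
                for out in combinations(range(k), r):
                    cand = [q for i, q in enumerate(S) if i not in out]
                    cand += list(c_sub)          # the replaced set S'
                    x = solve_mod_p(cand)
                    if x is not None and satisfies(x, e):
                        return x
    return None                        # all windows over Z exhausted
```

Because both w and r_max are fixed, the work per window is bounded, at the price that recovery fails when no window contains a set S with at most r_max attacked equations.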
Referring to FIG. 9, in the apparatus of the third embodiment the data recovery unit includes a second subset setting module 71, a third selection module 72, a third data download and replacement module 73, and a third comparison data download and replacement module 74. The second subset setting module 71 sets the number τ of storage nodes selected for re-download, where τ is less than k. The third selection module 72 selects τ storage nodes, at least one of which differs from the k storage nodes previously selected and downloaded from to obtain X. The third data download and replacement module 73 downloads the data modules from the selected τ storage nodes and uses the τ re-downloaded data modules in place of any τ of the data modules last used to compute X. The third comparison data download and replacement module 74 selects a storage module different from the (k+1)-th storage module previously used for comparison, downloads its data, and substitutes it for the (k+1)-th data module used for comparison.
In the second and third embodiments, data replacement relies mainly on the concept of cleaning. Let C denote the set of downloaded equations used for cleaning and let e be one additional equation. The equations in C replace a subset of size |C| in S (the data set used to compute the original data), and the new equation set is denoted S'. Attack detection is then run on S' with e as the test equation; in other words, the system of linear equations (SLE) formed by S' is solved and its solution is checked against equation e. If no attack is detected, the solution obtained is taken to be the correct data encoding vector. Otherwise S is used again, and the equations in C replace another subset of size |C| in S, after which the attack detection algorithm is run again. These steps are repeated until either cleaning succeeds or every subset of size |C| in S has been replaced.
If e is correct, C contains only correct equations, and the number of attacked equations in S does not exceed |C|, the above method eventually succeeds, because all attacked (contaminated) equations (data modules) in S are eventually replaced by correct equations from C. In the failure case, either e has been tampered with, or C contains contaminated equations, or the number of attacked equations in S exceeds |C|. In that case, another equation set C' with |C'| ≥ |C| can be downloaded, together with another test equation e', and S can be cleaned again.
The basic idea of the second embodiment is to start cleaning with a cleaning set C (for example, initially assuming that S contains only one attacked equation) and then, whenever cleaning fails, to enlarge C repeatedly. In this way, sooner or later a cleaning set C is obtained that contains as many correct equations as there are attacked equations in S. In each iteration, every possible subset of the equations in C is selected. Eventually, therefore, the attacked equations in S are replaced by correct equations from C, and a final cleaned set is obtained.
In the second embodiment, z_1, ..., z_{k+1} are first downloaded, and z_{k+1} is used as the test equation on z_1, ..., z_k to determine whether the data is contaminated. If no contamination is detected, z_1, ..., z_k is clean and this round of data retrieval ends; otherwise the iterative process of cleaning S = z_1, ..., z_k begins. In each iteration a new equation, denoted e, is downloaded and used as the test equation for the attack detection algorithm in the current iteration. The equations downloaded so far that are not in S form the cleaning set C. Every possible subset C' of C with |C'| = r not greater than k is taken, and the equations in C' replace r equations of S in every possible way. After each replacement, the attack detection algorithm is run on the resulting set S' with e as the test equation. If no attack is detected, S' is clean.
In the second embodiment, one new equation is downloaded in each iteration, so its communication complexity depends on the number of iterations the algorithm performs.
As for computational complexity, the second embodiment requires only a few simple operations, so as long as the number of attacked equations in the system is acceptable, the computational complexity is greatly reduced.
According to the above analysis, the embodiments described in the present invention are better suited to practical use than schemes based on homomorphic digital signatures. First, no PKI is needed, and since no cryptographic techniques are used, no key management scheme is needed either. Second, although computational overhead is still incurred, it falls only on the entity in the distributed storage system that needs to retrieve the original information. In a wireless sensor network, despite constraints on computational complexity, this entity is usually a base station with sufficient computing capacity. In a homomorphic digital signature scheme, by contrast, both the source and the storage nodes must perform a large amount of computation, and these nodes are usually resource-constrained sensor nodes.
As an improvement on, or compromise with, the second embodiment, the third embodiment changes the fixed-size sets S and C in each iteration instead of increasing the size of C. The sets S and C consist of equations from a fixed-size window over Z*, so the method is not guaranteed to succeed in every case (its success probability is not equal to 1). Recovery succeeds if the number of attacked equations contained in S does not exceed r_max, where r_max is an input parameter that limits the computational complexity of the method by limiting the size of the equation subsets taken from C and S.
In the third embodiment, the same steps as in the second embodiment are performed first: k+1 equations are downloaded and attack detection is run. If there is no attack, this data retrieval ends; otherwise a cleaning process begins. The size w of the cleaning set C is defined as a fixed value determined by α, where α is an input parameter. The equations z_{k+2}, ..., z_{k+w+1} are downloaded; S is initialized with z_1, ..., z_k and C with z_{k+2}, ..., z_{k+w+1}. Both sets change in every iteration, and the variables s and c mark their respective first equations in the current iteration; similarly, e marks the test equation used for attack detection. These variables are initialized and the iteration begins. In each iteration, a new equation is downloaded and used as the test equation for the attack detection at line 12 of FIG. 8. Every possible subset C' of C with |C'| = r not exceeding r_max is taken, and r equations of S are replaced with the equations of C' in every possible way. The attack detection algorithm is then run on the resulting set S' with e as the test equation. If no attack is detected, S' is clean; otherwise the index variables are incremented and the next iteration begins. Note that the sets consist of equations from a rolling window of size k + w + 1 over Z*. The algorithm ends either when cleaning succeeds or when all equations in Z* have been downloaded.
The success probability, average communication load, and computational complexity of the above embodiments are briefly discussed below. In the second embodiment, the success probability is a function of the number t of attacked equations: it stays above 90% up to a threshold value of t and begins to fall beyond that threshold. For r_max = 4, r_max = 5, and r_max = 6 the thresholds are approximately t = 85, t = 100, and t = 110, respectively. Increasing r_max allows the method in the second embodiment to recover from stronger attacks (in which S contains more attacked equations), but the computational complexity increases accordingly.
In the first embodiment, the success probability stays at 1 up to the threshold t = n - k - 1; for n = 1000 and k = 100 the threshold is t = 899, far larger than the thresholds of the second embodiment. Setting this aside, the method in the second embodiment can clean 4 contaminated equations from a set of size 100, which means that 40 contaminated equations can be tolerated in a whole system of size n = 1000. However, the success probability remains high even when the number of contaminated equations reaches 85. The reason is that when t = 85, the average number of attacked equations in a set of size 100 is 8.5, so the system contains sets with comparatively few attacked equations. Clearly, when a window of size 100 is rolled over the whole equation set Z*, the probability that one of the resulting sets contains no more than 4 attacked equations is very high. Similar results hold for other values of r_max.
As for the average communication complexity of the second embodiment (that is, the number of equations downloaded), t = 120 can be taken as a cut-off: beyond it the success probability is no longer satisfactory, so the average communication complexity for t > 120 is of no interest. The average communication complexity grows with the number t of attacked equations, because the larger t is, the harder it is to find a set S of size k that contains no more than the fixed number r_max of attacked equations together with a cleaning set C that contains at least r_max correct equations. Nevertheless, since the average number of equations downloaded is less than half of the total number n of equations, the communication complexity is always acceptable. In particular, when the number of attacked equations is t = 50, that is, when only 5% of the storage nodes are attacked, the communication complexity of the system is very small. It can also be concluded that the communication complexity increases as r_max decreases. The later analysis of computational complexity shows that the reduction in communication complexity is obtained at the cost of increased computational complexity.
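The figures quoted above can be checked with a few lines of Python. The sketch assumes, for illustration only, that the t attacked equations are spread uniformly at random over the n equations, so that the number of attacked equations in one randomly chosen set of size k follows a hypergeometric distribution. The function names are hypothetical, and the observation that 899 equals n - k - 1 for n = 1000, k = 100 is an arithmetic consistency check rather than a statement from the source.

```python
from math import comb


def expected_attacked(t, k, n):
    """Mean number of attacked equations in a set of size k drawn from n."""
    return t * k / n


def tolerated_in_system(r_max, k, n):
    """Attacked equations tolerated system-wide if each set of size k holds r_max."""
    return r_max * n // k


def prob_at_most(n, t, k, r_max):
    """P(a random set of size k contains at most r_max of the t attacked equations)."""
    total = comb(n, k)
    good = sum(comb(t, i) * comb(n - t, k - i)
               for i in range(min(r_max, t, k) + 1))
    return good / total


n, k = 1000, 100
first_embodiment_threshold = n - k - 1   # matches the quoted threshold of 899
```

Here `expected_attacked(85, 100, 1000)` gives 8.5 and `tolerated_in_system(4, 100, 1000)` gives 40, matching the text. `prob_at_most` covers a single random set only; the probability that at least one of many overlapping windows qualifies is higher and is not computed here.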
The computational complexity of the second embodiment (that is, the number of SLEs that must be solved) is a function of the number t of attacked equations. The computational complexity grows rapidly as t increases and also grows as r_max increases; in fact, increasing r_max by 1 raises the computational complexity by a significant order. The best trade-off is r_max = 4, for which the method in the second embodiment can handle t = 30 attacked equations (3% of the total number of equations) with very low communication complexity and with acceptable computational complexity (solving about 10^8 ≈ 2^26 SLEs).
In addition, in some very special cases, the following method can be used to deal with false negatives and false positives in the preceding attack detection results. Given a set S of size k and an additional equation e for which the attack detection finds no attack, the solution is either the correct data X or a forgery crafted by the attacker. Which of the two holds can be decided from the number of remaining equations that the solution satisfies. First, S and e are found using the methods of the three embodiments above, and the solution of the SLE formed by S is taken as the candidate. An equation outside S and other than e is then selected; if the candidate satisfies it, a counter is incremented by 1. This is repeated until the counter exceeds n/2 or all available equations have been tried. In the former case the candidate equals X, that is, the correct solution has been found and this data retrieval ends. Otherwise, e and the equations in the set S that satisfy the candidate are discarded, and the method repeats the above process on the remaining set of equations.
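The counting check described above can be sketched in Python. The sketch is illustrative only: it assumes scalar equations over GF(257) with Vandermonde coefficient vectors, takes the acceptance threshold as an explicit parameter (half of the total number of equations is one plausible choice), and uses hypothetical function names.

```python
P = 257  # illustrative prime modulus; an assumption, not from the source


def solve_mod_p(eqs, p=P):
    """Gauss-Jordan elimination mod p; eqs = [(g, y), ...]; None if singular."""
    k = len(eqs)
    a = [[v % p for v in g] + [y % p] for g, y in eqs]
    for col in range(k):
        piv = next((r for r in range(col, k) if a[r][col]), None)
        if piv is None:
            return None
        a[col], a[piv] = a[piv], a[col]
        inv = pow(a[col][col], -1, p)
        a[col] = [v * inv % p for v in a[col]]
        for r in range(k):
            if r != col and a[r][col]:
                f = a[r][col]
                a[r] = [(v - f * u) % p for v, u in zip(a[r], a[col])]
    return [row[k] for row in a]


def satisfies(x, eq, p=P):
    g, y = eq
    return sum(xi * gi for xi, gi in zip(x, g)) % p == y % p


def encode(x, n, p=P):
    """Equations (G_j, Y_j) with Y_j = X G_j, using Vandermonde columns G_j."""
    k = len(x)
    cols = [[pow(j, i, p) for i in range(k)] for j in range(1, n + 1)]
    return [(g, sum(xi * gi for xi, gi in zip(x, g)) % p) for g in cols]


def validate_by_vote(S, others, threshold):
    """Accept the solution of S only if more than `threshold` other equations agree.

    S: the k equations that passed attack detection.
    others: equations outside S (and other than the test equation e).
    Returns the accepted solution, or None if it is rejected as a likely forgery.
    """
    candidate = solve_mod_p(S)
    if candidate is None:
        return None
    votes = 0
    for eq in others:
        if satisfies(candidate, eq):
            votes += 1
            if votes > threshold:
                return candidate       # enough independent equations agree
    return None
```

In this setting a forged solution can agree with at most k - 1 unpolluted Vandermonde equations, so a forgery is rejected whenever the threshold is larger than the number of forged equations in `others` plus k - 1.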
This concludes the description of a solution to the pollution attack problem in network-coding-based distributed storage, with an explicit scheme for detecting pollution attacks and recovering from them. A notable feature of the scheme is that it does not rely on checksums or digital signatures, which are the usual cryptographic means of providing data integrity, but instead exploits the redundancy inherent in distributed storage systems.
The above embodiments require more coded modules than are needed to obtain the original data, and use these additional coded modules for attack detection and recovery. Both attack detection and recovery only require solving systems of linear equations over a finite field. Since no cryptographic algorithms are used, the present invention does not depend on a PKI or on a secure channel established in advance.
The above methods have significant advantages in both communication and computational load. Of the three embodiments, the first provides the lowest possible average computational complexity in the system. The second is optimal in communication complexity and ensures recovery from a fairly strong attack (in which most coded modules have been tampered with), and it remains a practical solution for most systems. The third is a compromise for very large systems: it is less effective than the other two in recovery capability (success probability) and in communication load, but its computational load is acceptable for very large systems.
The above embodiments can be used in any network-coding-based distributed storage system, in P2P file distribution, or in wireless sensor networks. Moreover, the above methods do not require the storage nodes to perform additional encoding or the coded modules to carry additional information; only the receiving node needs to perform a certain amount of computation. For this reason, the present invention is particularly suitable for wireless sensor networks, in which the storage nodes are resource-constrained sensor nodes and the receiving node is a more powerful base station.
It is worth noting that the embodiments of the present invention differ mainly in how contaminated data is replaced. In some cases a single practical operation may combine the replacement methods of different embodiments. For example, the method of the second embodiment may be used first, and when a certain condition is met (for example, a set time has elapsed without the correct data being found), the replacement method may be switched to that of the first or third embodiment. In short, the technical features of the above embodiments can be reasonably combined with one another to form new embodiments.
The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they are not to be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, all of which fall within its scope of protection. The scope of protection of this patent is therefore defined by the appended claims.

Claims

1. A pollution recovery method for distributed storage data, characterized by comprising the following steps:
A) downloading the stored data from any k of the n storage nodes, or downloading the stored data from any k+1 of the n storage nodes;
B) processing the k downloaded data modules according to Y_j = XG_j to obtain the downloaded original data X, and selecting a (k+1)-th storage node from the n storage nodes and downloading its stored data; or processing the k downloaded data modules according to Y_j = XG_j to obtain the downloaded original data X; where j = 1, 2, ..., n and k is less than n;
C) comparing the data module Y_{k+1} downloaded from the (k+1)-th storage node with the product of the original data X obtained in the above step and the matrix G_{k+1}; if they are equal, exiting this round of data retrieval; otherwise, performing step D);
D) downloading the data of at least one storage node again, where at least one of the re-downloaded data modules differs from the previously downloaded k+1 data modules, replacing at least one of the previously downloaded data modules with the re-downloaded data modules, and returning to step B).
2. The pollution recovery method for distributed storage data according to claim 1, characterized in that step D) further comprises:
D1) selecting k+1 storage nodes, at least one of which differs from the k+1 storage nodes previously selected and downloaded from;
D2) downloading the data modules from the selected k+1 storage nodes, and using the k+1 re-downloaded data modules in place of the previously downloaded data modules.
3. The pollution recovery method for distributed storage data according to claim 1, characterized in that step D) further comprises:
D31) setting the number τ of storage nodes selected for re-download, or incrementing τ by 1 if it is already set, where τ is less than k;
D41) selecting τ or τ+1 storage nodes, at least one of which differs from the k storage nodes previously selected and downloaded from to obtain X;
D51) downloading the data modules from the selected τ or τ+1 storage nodes, and using the τ or τ+1 re-downloaded data modules in place of any τ or τ+1 of the data modules last used to compute X.
4. The pollution recovery method for distributed storage data according to claim 3, wherein step D) further comprises the following step:
D61) selecting a storage module different from the (k+1)-th storage module previously used for comparison, downloading its data, and using that data in place of the (k+1)-th data module used for comparison.
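One pass through steps D31) to D61) might look like the following Python sketch. The function name `step_d`, the choice to take the first unused node ids, and the decision to replace the leading τ entries of the decode set are all assumptions made for illustration; the claims leave the selection policy open, and the sketch only tracks the immediately preceding selection.

```python
def step_d(decode_nodes, check_node, all_nodes, tau=None):
    """One pass of step D; returns (new_decode_nodes, new_check_node, tau)."""
    k = len(decode_nodes)
    tau = 1 if tau is None else tau + 1  # D31: set tau, or add 1 if already set
    if tau >= k:
        raise ValueError("tau must stay below k")
    # Nodes that took no part in the previous decode or comparison.
    unused = [j for j in all_nodes if j not in decode_nodes and j != check_node]
    if len(unused) < tau + 1:
        raise ValueError("not enough unused nodes for this sketch")
    fresh = unused[:tau]                     # D41: tau nodes not used last time
    new_decode = fresh + decode_nodes[tau:]  # D51: replace tau of the old modules
    new_check = unused[tau]                  # D61: a different comparison node
    return new_decode, new_check, tau
```

Starting from decode nodes [1, 2, 3] and comparison node 4 out of eight nodes, the first pass swaps in one fresh node and the second pass swaps in two, after which τ would reach k and the sketch stops.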
5. The pollution recovery method for distributed storage data according to claim 1, wherein step D) further comprises:
D32) setting the number τ of storage nodes selected for re-download, wherein τ is less than k;
D42) selecting τ storage nodes, at least one of the selected storage nodes being different from the k storage nodes that were previously selected and from which data has been downloaded to obtain the value of X;
D52) downloading the data modules from the selected τ storage nodes, and replacing any τ of the data modules last used to compute the value of X with the τ re-downloaded data modules.
6. The pollution recovery method for distributed storage data according to claim 5, wherein step D) further comprises the following step:
D62) selecting a storage module different from the (k+1)-th storage module previously used for comparison, downloading its data, and using that data in place of the (k+1)-th data module used for comparison.
7. The pollution recovery method for distributed storage data according to any one of claims 1 to 6, further comprising the following step:
A0) dividing the original data X into k parts indexed by i = 1, 2, ..., k, and applying maximum distance separable (MDS) encoding to these parts to obtain n linearly independent data modules Y_j, the n data modules Y_j being stored, each together with its generator matrix G_j, on the n storage nodes respectively.
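The encoding step can be sketched in Python as follows. The sketch assumes a Vandermonde generator over the prime field GF(257), byte-wise splitting with zero padding, and node ids 1 to n with n below 257; none of these choices come from the claims, which only require an MDS code with linearly independent modules. The helper names are hypothetical.

```python
P = 257  # assumed prime modulus, larger than any byte value

def generator_column(j, k):
    """Hypothetical G_j: Vandermonde column (j^0, ..., j^(k-1)) mod P."""
    return [pow(j, i, P) for i in range(k)]

def split(data, k):
    """Split a byte string into k equal-length parts, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [list(padded[i * size:(i + 1) * size]) for i in range(k)]

def encode_all(data, k, n):
    """Return {j: (Y_j, G_j)} for nodes j = 1..n, with Y_j = X * G_j symbol-wise."""
    parts = split(data, k)
    nodes = {}
    for j in range(1, n + 1):
        g = generator_column(j, k)
        y = [sum(g[i] * parts[i][s] for i in range(k)) % P
             for s in range(len(parts[0]))]
        nodes[j] = (y, g)  # each node stores its module together with G_j
    return nodes
```

Because the columns for distinct node ids form a Vandermonde matrix, any k of them are linearly independent, which is the property the decoding step relies on. Encoded symbols can take the value 256, so a real system would need a wider symbol type or a field such as GF(2^8).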
8. An apparatus for implementing the pollution recovery method for distributed storage data according to claim 1, comprising:
a data download unit, configured to download the stored data from each of any k of the n storage nodes, or from each of any k+1 of the n storage nodes;
a data obtaining unit, configured to decode the k downloaded data modules according to Y_j = X·G_j to obtain the downloaded original data X, and to select a (k+1)-th storage node from the n storage nodes and download its stored data; or to decode the k downloaded data modules according to Y_j = X·G_j to obtain the downloaded original data X;
a data comparison unit, configured to compare the data module Y_{k+1} downloaded from the (k+1)-th storage node with the product of the original data X obtained above and the matrix G_{k+1};
a data recovery unit, configured to download, according to the output of the data comparison unit, data from at least one storage node again, at least one of the re-downloaded data modules being different from the k+1 data modules previously downloaded, and to replace at least one of the previously downloaded data modules with the re-downloaded data modules.
9. The apparatus according to claim 8, wherein the data recovery unit further comprises:
a first selection module, configured to select k+1 storage nodes, at least one of the selected storage nodes being different from the k+1 storage nodes that were previously selected and from which data has been downloaded;
a first data download and replacement module, configured to download the data modules from the selected k+1 storage nodes, and to replace the previously downloaded data modules with the k+1 re-downloaded data modules.
10. The apparatus according to claim 8, wherein the data recovery unit further comprises:
a first subset setting module, configured to set the number τ of storage nodes selected for re-download and, if τ has already been set, to increment τ by 1; wherein τ is less than k;
a second selection module, configured to select τ or τ+1 storage nodes, at least one of the selected storage nodes being different from the k storage nodes that were previously selected and from which data has been downloaded to obtain the value of X;
a second data download and replacement module, configured to download the data modules from the selected τ or τ+1 storage nodes, and to replace any τ or τ+1 of the data modules last used to compute the value of X with the τ or τ+1 re-downloaded data modules;
a second comparison data download and replacement module, configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and use that data in place of the (k+1)-th data module used for comparison.
11. The apparatus according to claim 8, wherein the data recovery unit further comprises:
a second subset setting module, configured to set the number τ of storage nodes selected for re-download, wherein τ is less than k;
a third selection module, configured to select τ storage nodes, at least one of the selected storage nodes being different from the k storage nodes that were previously selected and from which data has been downloaded to obtain the value of X;
a third data download and replacement module, configured to download the data modules from the selected τ storage nodes, and to replace any τ of the data modules last used to compute the value of X with the τ re-downloaded data modules;
a third comparison data download and replacement module, configured to select a storage node different from the (k+1)-th storage node previously used for comparison, download its data, and use that data in place of the (k+1)-th data module used for comparison.
PCT/CN2012/072007 2012-03-06 2012-03-06 Pollution data recovery method and apparatus for distributed storage data WO2013131253A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/072007 WO2013131253A1 (en) 2012-03-06 2012-03-06 Pollution data recovery method and apparatus for distributed storage data

Publications (1)

Publication Number Publication Date
WO2013131253A1 true WO2013131253A1 (en) 2013-09-12

Family

ID=49115868

Country Status (1)

Country Link
WO (1) WO2013131253A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110401703A (en) * 2019-07-10 2019-11-01 东华大学 Cloud storage data reconstruction method based on multistage network coding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010095183A1 (en) * 2009-02-17 2010-08-26 日本電気株式会社 Storage system
CN102024016A (en) * 2010-11-04 2011-04-20 天津曙光计算机产业有限公司 Rapid data restoration method for distributed file system (DFS)
CN102193746A (en) * 2010-03-11 2011-09-21 Lsi公司 System and method for optimizing redundancy restoration in distributed data layout environments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12870360

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12870360

Country of ref document: EP

Kind code of ref document: A1