CN114912631A - Federated learning method based on blockchain medical data sharing

Info

Publication number: CN114912631A
Application number: CN202210390877.0A
Authority: CN
Inventors: 谢光武
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-04-14
Filing date: 2022-04-14
Publication date: 2022-08-16

Abstract

The invention relates to the field of data security, and in particular to a federated learning method based on blockchain medical data sharing. Drawing on the properties of the blockchain and on privacy protection techniques, the method protects the privacy of federated learning while medical data is being shared, thereby avoiding the leakage of medical data during federated learning.

Description

Federated learning method based on blockchain medical data sharing

Technical Field

The invention relates to the field of data security, and in particular to a federated learning method based on blockchain medical data sharing.

Background

In the recent medical reform, the informatization of medical institutions has been strengthened, but many problems and challenges remain in areas such as the interoperability and mutual recognition of medical data, data security, transparency, and privacy protection.

In recent years, artificial intelligence has developed rapidly in the medical field and can assist diagnosis in some scenarios. Improving the accuracy of a machine learning model requires a large amount of data, so data often has to be shared across organizations.

Federated learning performs the computation by distributing the machine learning model from the model owner to the individual nodes (the data owners), rather than by gathering the data of the data owners in one place. Its classification performance is comparable to that of local training, and because more data is covered, the resulting model is more widely applicable and generalizes better. However, federated learning by itself offers no privacy protection. Research shows that an inversion attack (reverse attack) can reconstruct a picture with high fidelity from model weights and gradient updates. To protect patient privacy during this process, technical means are needed that balance the privacy of the data against its usefulness.

A blockchain lets its participants jointly maintain a reliable database in a decentralized manner, relying on cryptography rather than trust. The emergence of blockchain technology offers a new approach to data sharing in the medical industry.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide a federated learning method based on blockchain medical data sharing which, drawing on the properties of the blockchain and on privacy protection techniques, protects the privacy of federated learning while medical data is being shared, thereby avoiding the leakage of medical data during federated learning.

To achieve this purpose, the invention adopts the following technical solution.

A federated learning method based on blockchain medical data sharing comprises the following steps:

step 1, establishing a blockchain network;

step 2, the model owner uploads the machine learning model to the InterPlanetary File System (IPFS) of the blockchain network, and the federated learning process is planned by setting up an Orchester consisting of a plurality of smart contracts of the contract layer in the blockchain network;

step 3, the data owner acquires the encrypted data and the breakpoint from the IPFS of the blockchain network, and then decrypts the encrypted data;

step 4, training the machine learning model with the decrypted data according to the federated learning process planned by the Orchester, to obtain a training gradient of the machine learning model;

step 5, the data owner adds noise to the obtained training gradient to achieve differential privacy, and then sends the training gradient to the security aggregator;

step 6, the distributed tamper-proof ledger of the blockchain network records the training events of the machine learning process; the security aggregator generates a nonce for each training record, sends the nonce to the medical institution participating in the training, and records the hash value hash(K, nonce) of the training record on the blockchain;

step 7, the security aggregator encrypts the training gradients transmitted by the data owners, collects the training gradients transmitted by all data owners, and updates the machine learning model.

Compared with the prior art, the invention has the following beneficial effects: drawing on the properties of the blockchain and on privacy protection techniques, it protects the privacy of federated learning while medical data is being shared, and avoids the leakage of medical data during federated learning.

Drawings

The invention is described in further detail below with reference to the figures and specific embodiments.

FIG. 1 is a schematic overall flow chart of the present invention for training a machine learning model;

FIG. 2 is an original melanoma photograph from the training set;

FIG. 3 is a schematic diagram of the process of restoring a training set photograph in test 3;

FIG. 4 is the training set photograph restored after 180 iterations of the inversion attack in test 4.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.

Referring to FIG. 1, a federated learning method based on blockchain medical data sharing includes the following steps:

step 1, establishing a blockchain network;

specifically, the blockchain network is built on the FISCO BCOS underlying blockchain platform, and a plurality of medical institutions, which are the owners of the medical data, serve as the blockchain network nodes; an InterPlanetary File System (IPFS) is set up to store the medical data in a link mode;

step 2, the model owner (for example, a scientific research institution) uploads the machine learning model to the IPFS of the blockchain network, and the federated learning process is planned by setting up an Orchester consisting of a plurality of smart contracts of the contract layer in the blockchain network;

wherein the Orchester contains a medical institution list, and the medical institutions in the list hold data for training the machine learning model; in the list, a medical institution can set a whitelist or a blacklist of addresses (hashes of public keys) so that its medical data can be used only by certain parties or cannot be used by certain parties, which gives the data owner control over its data; the medical institutions communicate through secure communication channels;
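The whitelist and blacklist check described above can be sketched as follows. This is a minimal illustration, not the smart contract logic of the Orchester: the class `AccessPolicy`, the function `address_of`, the use of SHA-256 to derive an address, and the rule that an empty whitelist means "no whitelist configured" are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


def address_of(public_key: bytes) -> str:
    # Assumption: the address is the SHA-256 digest of the public key bytes.
    return hashlib.sha256(public_key).hexdigest()


@dataclass
class AccessPolicy:
    """Hypothetical access policy of one medical institution in the list."""
    whitelist: set = field(default_factory=set)
    blacklist: set = field(default_factory=set)

    def allows(self, address: str) -> bool:
        # A blacklisted address is always refused.
        if address in self.blacklist:
            return False
        # With no whitelist configured, every non-blacklisted address is allowed.
        return not self.whitelist or address in self.whitelist


researcher = address_of(b"researcher-public-key")  # hypothetical key material
stranger = address_of(b"stranger-public-key")
policy = AccessPolicy(whitelist={researcher})
print(policy.allows(researcher), policy.allows(stranger))  # True False
```

Checking the blacklist first means that an address appearing on both lists is refused, which is the more conservative choice for medical data.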

specifically, differential privacy is achieved by randomly perturbing a data set, which reduces the disclosure of information about individuals while preserving the statistical properties of the data as a whole and the ability to draw inferences from it.

Assume the raw data is $\{x_1, x_2, \ldots, x_n\}$ and the function to be calculated is $f(x_1, x_2, \ldots, x_n)$. Noise can be added to the raw data before the calculation, so that $f(x_1 + r_1, x_2 + r_2, \ldots, x_n + r_n)$ is calculated instead. The added noise protects privacy to a certain extent, but the result calculated after adding the noise must remain close to the true result, so a trade-off has to be made between the privacy and the usability of the data.
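The trade-off can be illustrated numerically. The sketch below assumes that f is the arithmetic mean and that each r_i is drawn from a Laplace distribution; neither choice is specified in the text, and the data values are made up. It measures how far the noisy result drifts from the true result as the noise scale grows.

```python
import random
import statistics


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_mean(values, scale, rng):
    # Evaluate f(x_1 + r_1, ..., x_n + r_n) with f = mean.
    return statistics.fmean(x + laplace_noise(scale, rng) for x in values)


def average_error(values, scale, rng, trials=2000):
    # Mean absolute difference between the noisy and the true result.
    true_value = statistics.fmean(values)
    return statistics.fmean(
        abs(noisy_mean(values, scale, rng) - true_value) for _ in range(trials)
    )


rng = random.Random(0)
values = [rng.uniform(0.0, 100.0) for _ in range(100)]  # hypothetical raw data
for scale in (0.1, 1.0, 10.0):
    print(scale, round(average_error(values, scale, rng), 3))
```

A larger scale hides each individual value better but moves the computed mean further from the true mean, which is the trade-off the paragraph describes.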

Intuitively, if the result obtained by querying and calculating the whole database is almost the same after the data is modified, the privacy of the data is protected to some extent.

Consider a randomized algorithm M (for a given input, the output of a randomized algorithm is not a fixed value but a random value following some distribution). If

$$\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(y) \in S] + \delta$$

holds for all adjacent data sets $x$, $y$ and all sets of outputs $S$, then the algorithm M is said to satisfy $(\varepsilon, \delta)$-differential privacy. Here $\varepsilon$ is the privacy budget consumed by a single query. The smaller $\varepsilon$ is, the better the privacy protection.
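The inequality can be checked exhaustively on a very small mechanism. The sketch below uses randomized response on a single bit, which is a textbook mechanism chosen only for illustration and is not the mechanism of this method: the true bit is reported with probability e^ε / (1 + e^ε) and flipped otherwise. The function names are hypothetical.

```python
import math


def randomized_response_probs(true_bit: int, epsilon: float) -> dict:
    """Output distribution of randomized response for a data set holding one bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}


def satisfies_dp(mechanism_eps: float, claimed_eps: float, delta: float = 0.0) -> bool:
    # The two adjacent data sets are x = {0} and y = {1}.
    px = randomized_response_probs(0, mechanism_eps)
    py = randomized_response_probs(1, mechanism_eps)
    subsets = [(), (0,), (1,), (0, 1)]  # every possible output set S
    for a, b in ((px, py), (py, px)):
        for s in subsets:
            pa = sum(a[o] for o in s)
            pb = sum(b[o] for o in s)
            # Small tolerance for floating-point rounding.
            if pa > math.exp(claimed_eps) * pb + delta + 1e-12:
                return False
    return True


print(satisfies_dp(1.0, 1.0))       # the mechanism meets its own budget
print(satisfies_dp(1.0, 0.5))       # but not a stricter budget with delta = 0
print(satisfies_dp(1.0, 0.5, 0.3))  # a positive delta absorbs the excess
```

The third call shows the role of δ: it is an additive slack that lets the bound hold with a smaller ε than the mechanism would otherwise support.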

If two data sets $x$ and $y$ differ in only one record, i.e., $\|x - y\|_1 = 1$, then they are called "adjacent data sets". Defining the guarantee over adjacent data sets means that protection is provided for every individual record. When the two probabilities are close, the outputs are highly similar and it is difficult to tell an output on $x$ from an output on $y$, so the record in which $x$ and $y$ differ is protected.

If an external observer who obtains a result computed from a data set cannot tell whether the record of any particular individual was used, the computation on that data set satisfies differential privacy.
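Applied to the training gradient of step 5, the noise addition might look like the sketch below. The text does not state the noise distribution or whether the gradient is clipped first; clipping to a fixed L2 norm followed by Gaussian noise is a common choice and is assumed here. The function `privatize_gradient` and its parameters `clip_norm` and `noise_multiplier` are hypothetical names.

```python
import math
import random


def privatize_gradient(gradient, clip_norm, noise_multiplier, rng):
    # Clip the gradient so that its L2 norm is at most `clip_norm`;
    # this bounds how much a single data owner can influence the result.
    norm = math.sqrt(sum(g * g for g in gradient))
    factor = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    # Add independent Gaussian noise whose standard deviation scales with the bound.
    sigma = noise_multiplier * clip_norm
    return [g * factor + rng.gauss(0.0, sigma) for g in gradient]


rng = random.Random(0)
gradient = [3.0, -4.0, 12.0]  # hypothetical training gradient with L2 norm 13
noisy = privatize_gradient(gradient, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
print(noisy)  # this noisy vector is what would be sent to the security aggregator
```

The privacy parameters (ε, δ) that result from a given `noise_multiplier` depend on the accounting method used and are not derived here.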

The training record needs to satisfy two properties: 1. a medical institution that participated in the training can use the record to prove its participation; 2. apart from the security aggregator, no one else can learn whether a given medical institution participated in the training.

To achieve this, two public parameters must be set for the blockchain network: a prime number $q$ and a primitive root $a$ of $q$. In a given round of training, the private keys of a medical institution and of the security aggregator are $pri_1$ and $pri_2$ respectively, and both parties calculate and disclose:

to obtain

Since only the medical institution participating in the training has K, only the medical institution participating in the training can prove that the piece of training record belongs to itself. Meanwhile, other people cannot deduce who participates in the training from the hash (K, nonce) of the training record.
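The setup described here (a public prime q, a primitive root a, and one private key per party) matches a Diffie-Hellman key exchange. Assuming that this is the intended construction, each party would disclose a^pri mod q and derive the same K from the other party's disclosed value. The sketch below follows that assumption with toy parameters (q = 23, a = 5); the SHA-256 hash, the byte encoding of K, and all function names are likewise assumptions.

```python
import hashlib
import secrets

Q = 23  # toy public prime; a real deployment would use a much larger one
A = 5   # 5 is a primitive root modulo 23


def public_value(private_key: int) -> int:
    # Value each party discloses (assumed form: a^pri mod q).
    return pow(A, private_key, Q)


def shared_key(own_private: int, other_public: int) -> int:
    # K as derived by either party from its own private key
    # and the value disclosed by the other party.
    return pow(other_public, own_private, Q)


def training_record(k: int, nonce: bytes) -> str:
    # hash(K, nonce); SHA-256 over K's bytes followed by the nonce is an assumption.
    return hashlib.sha256(k.to_bytes(32, "big") + nonce).hexdigest()


pri_1 = secrets.randbelow(Q - 2) + 1  # medical institution
pri_2 = secrets.randbelow(Q - 2) + 1  # security aggregator
k_institution = shared_key(pri_1, public_value(pri_2))
k_aggregator = shared_key(pri_2, public_value(pri_1))

nonce = secrets.token_bytes(16)                   # generated by the aggregator
on_chain = training_record(k_aggregator, nonce)   # written to the blockchain
# The institution later proves participation by recomputing the same hash.
print(training_record(k_institution, nonce) == on_chain)  # True
```

With a prime this small an outsider could simply try every possible K, so the toy parameters only demonstrate the mechanics, not the secrecy of the record.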

Specifically, the encryption takes the form of worker selection: the security aggregator randomly selects only a part of the collected gradients to generate the output, and a medical institution does not know whether the gradient it uploaded was selected. This prevents an attacker from obtaining the data of any particular party, and works together with differential privacy to protect privacy.
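Worker selection can be sketched as follows. The text says only that a random part of the collected gradients is used to generate the output; averaging the selected gradients, the `fraction` parameter, and the institution identifiers are assumptions made for the example.

```python
import random


def aggregate_selected(gradients, fraction, rng):
    """Average a random subset of the collected gradients.

    `gradients` maps a (hypothetical) institution id to its gradient vector.
    Returns the aggregate and the chosen ids; the ids stay inside the aggregator.
    """
    count = max(1, round(fraction * len(gradients)))
    chosen = rng.sample(sorted(gradients), count)
    dim = len(gradients[chosen[0]])
    aggregate = [sum(gradients[w][i] for w in chosen) / count for i in range(dim)]
    return aggregate, chosen


collected = {
    "hospital_a": [0.2, -0.1],
    "hospital_b": [0.4, 0.3],
    "hospital_c": [-0.6, 0.5],
    "hospital_d": [0.0, 0.1],
}
update, _selected = aggregate_selected(
    collected, fraction=0.5, rng=random.SystemRandom()
)
print(update)  # only the aggregate is published, never the selection
```

Because the selection is never revealed, an institution cannot tell from the published update whether its own gradient contributed to it.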

Simulation test results

A set of melanoma photos from the International Skin Imaging Collaboration (ISIC) is taken as the training set of the machine learning model; the photos in the set are divided into benign tumor photos and malignant tumor photos; the training set is stored in the IPFS;

80 benign tumor photos and 80 malignant tumor photos are randomly selected from the training set, and these 160 photos are used as the test set;

test 1: training the machine learning model with a general federated learning method, and testing the trained machine learning model on the test set;

test 2: training the machine learning model with the federated learning method of the present invention, and testing the trained machine learning model on the test set;

The recognition performance of the two federated learning methods is shown in Table 1, where Accuracy is the proportion of test set photos that are classified correctly; Sensitivity is the ratio of the number of malignant melanoma photos that are successfully identified to the number of all malignant melanoma photos in the test set; ROC-AUC is the area under the ROC (receiver operating characteristic) curve, and the closer it is to 1 the better; and MCC is the Matthews correlation coefficient, which measures the quality of a binary classification even when the two classes differ greatly in size.

TABLE 1

         Accuracy   Sensitivity   ROC-AUC   MCC
Test 1   0.92       0.86          0.92      0.85
Test 2   0.85       0.81          0.88      0.78

As can be seen from Table 1, although the recognition accuracy of test 2 is lower than that of test 1, it is still high; that is, the federated learning method of the present invention does not greatly affect the recognition accuracy of the machine learning model, which still maintains a high recognition accuracy.
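Three of the four metrics in Table 1 can be computed directly from a confusion matrix, as in the sketch below, which treats malignant as the positive class. The counts in the usage example are invented for a test set of 80 malignant and 80 benign photos and are not the results of test 1 or test 2. ROC-AUC is omitted because it requires the classifier's scores rather than its final decisions.

```python
import math


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, sensitivity and MCC from the four confusion matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Sensitivity: share of malignant photos that were identified as malignant.
    sensitivity = tp / (tp + fn)
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "mcc": mcc}


# Hypothetical counts: 70 of 80 malignant and 74 of 80 benign photos are correct.
metrics = binary_metrics(tp=70, fn=10, fp=6, tn=74)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four counts, which is why it stays informative when one class is much larger than the other, whereas accuracy alone can look high for a classifier that always predicts the majority class.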

Test 3: attacking the general federated learning method with an inversion attack to restore the photos in the training set;

test 4: attacking the federated learning method of the present invention with an inversion attack to restore the photos in the training set.

FIG. 2 shows an original melanoma photograph from the training set.

FIG. 3 is a schematic diagram of the process of restoring the training set photograph in test 3.

Comparing FIG. 2 and FIG. 3, the shape of the melanoma is largely visible after 30 iterations of the inversion attack, and the original image is essentially restored after 50 iterations. This means that the image information is leaked and the privacy of the medical data cannot be guaranteed.

FIG. 4 shows the training set photograph restored after 180 iterations of the inversion attack in test 4.

Comparing FIG. 2 and FIG. 4, the inversion attack cannot restore the photos in the training set, which indicates that the federated learning method of the present invention does not leak the image information and can ensure the privacy of the medical data.

Although the present invention has been described in detail in this specification with reference to specific embodiments and illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the present invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A federated learning method based on blockchain medical data sharing, characterized by comprising the following steps:

step 1, establishing a block chain network;

2. The federated learning method based on blockchain medical data sharing according to claim 1, wherein in step 1 the blockchain network is built on the FISCO BCOS underlying blockchain platform, and the data owners serve as blockchain network nodes; and an InterPlanetary File System (IPFS) is set up to store the medical data in a link mode.

3. The federated learning method based on blockchain medical data sharing according to claim 1, wherein the differential privacy in step 5 is specifically defined as follows: for a randomized algorithm M, if

$$\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(y) \in S] + \delta$$

holds for all adjacent data sets $x$, $y$ and all sets of outputs $S$, then the algorithm M is said to satisfy $(\varepsilon, \delta)$-differential privacy, where $\varepsilon$ is the privacy budget consumed by a single query.

4. The federated learning method based on blockchain medical data sharing according to claim 1, wherein the K of hash(K, nonce) in step 6 is specifically obtained as follows: the blockchain network sets two public parameters, a prime number $q$ and a primitive root $a$ of $q$; in a given round of training, the private keys of a data owner and of the security aggregator are $pri_1$ and $pri_2$ respectively, and both parties calculate and disclose:

then obtain

5. The federated learning method based on blockchain medical data sharing according to claim 1, wherein in step 7 the security aggregator encrypts the training gradients transmitted by the data owners, specifically by worker selection: the security aggregator randomly selects only a part of the collected gradients to generate the output, and a medical institution does not know whether the gradient it uploaded was selected.