CN114912631A - Block chain medical data sharing-based federal learning method - Google Patents

Block chain medical data sharing-based federal learning method

Publication number
CN114912631A
Authority
CN
China
Prior art keywords
training
data
block chain
federal learning
aggregator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210390877.0A
Other languages
Chinese (zh)
Inventor
谢光武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202210390877.0A
Publication of CN114912631A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Storage Device Security (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention relates to the field of data security, and in particular to a federated learning method based on blockchain medical data sharing. Drawing on the characteristics of the blockchain and on privacy protection techniques, the method protects the privacy of federated learning while medical data is being shared, and so avoids the leakage of medical data during federated learning.

Description

Block chain medical data sharing-based federal learning method
Technical Field
The invention relates to the field of data security, and in particular to a federated learning method based on blockchain medical data sharing.
Background
Recent healthcare reforms have strengthened the informatization of medical institutions, but many problems and challenges remain in areas such as the interoperability and mutual recognition of medical data, data security, transparency, and privacy protection.
In recent years, artificial intelligence has developed rapidly in the medical field and can assist diagnosis in some scenarios. Improving the accuracy of a machine learning model requires large amounts of data, so data often has to be shared across organizations.
Federated learning performs the computation by distributing the machine learning model from the model owner to the individual nodes (the data owners), rather than by pooling the data of the data owners in one place. Its classification performance is comparable to that of local training, and because more data is involved it is more widely applicable and generalizes better. However, federated learning has no privacy protection capability of its own. Research shows that an inversion attack can reconstruct a high-fidelity image from model weights and gradient updates. To protect patient privacy during this process, technical means are needed that balance the privacy of the data against its usefulness.
A blockchain allows its participants to jointly maintain a reliable database in a decentralized manner, relying on cryptography rather than trust. The emergence of blockchain technology offers a new approach to data sharing in the medical industry.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a federated learning method based on blockchain medical data sharing. Drawing on the characteristics of the blockchain and on privacy protection techniques, it protects the privacy of federated learning while medical data is being shared, and so avoids the leakage of medical data during federated learning.
To achieve this aim, the invention adopts the following technical solution.
A federated learning method based on blockchain medical data sharing comprises the following steps:
step 1, a blockchain network is established;
step 2, the model owner uploads the machine learning model to the InterPlanetary File System (IPFS) of the blockchain network, and the federated learning process is planned by an Orchester composed of several smart contracts in the contract layer of the blockchain network;
step 3, the data owner obtains the encrypted data and the checkpoint from the IPFS of the blockchain network, and then decrypts the encrypted data;
step 4, the machine learning model is trained on the decrypted data according to the federated learning process planned by the Orchester, yielding a training gradient of the machine learning model;
step 5, the data owner adds noise to the training gradient to achieve differential privacy, and then sends the training gradient to the security aggregator;
step 6, training events in the machine learning process are recorded on the distributed tamper-proof ledger of the blockchain network; the security aggregator generates a nonce for each training record, sends the nonce to the medical institution participating in the training, and records the hash value hash(K, nonce) of the training record on the blockchain;
step 7, the security aggregator encrypts the training gradients sent by the data owners, collects the training gradients sent by all data owners, and updates the machine learning model.
Compared with the prior art, the invention has the following beneficial effect: drawing on the characteristics of the blockchain and on privacy protection techniques, it protects the privacy of federated learning while medical data is being shared, and so avoids the leakage of medical data during federated learning.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
FIG. 1 is a schematic overall flow chart of the present invention for training a machine learning model;
FIG. 2 is an original drawing of a photograph of melanoma in a training set;
FIG. 3 is a schematic diagram of the process of restoring the training set photograph in test 3;
FIG. 4 is the training set photograph restored after 180 iterations of the inversion attack in test 4.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
Referring to FIG. 1, a federated learning method based on blockchain medical data sharing comprises the following steps:
step 1, a blockchain network is established;
specifically, the blockchain network is built on the FISCO BCOS blockchain platform, with the medical institutions that own the medical data acting as blockchain network nodes; an InterPlanetary File System (IPFS) is set up to store the medical data off-chain;
step 2, the model owner (a research institution) uploads the machine learning model to the IPFS of the blockchain network, and the federated learning process is planned by an Orchester composed of several smart contracts in the contract layer of the blockchain network;
the Orchester contains a list of medical institutions, each of which holds data for training the machine learning model; in this list, a medical institution can set a whitelist or a blacklist of addresses (hashes of public keys) so that its medical data may be used only by certain parties, or may not be used by certain parties, which keeps the data under the control of its owner; the medical institutions communicate over secure communication channels;
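The whitelist and blacklist rule for the medical institution list can be illustrated with a minimal Python sketch. It is not the Orchester's smart-contract code: the `InstitutionEntry` class, the `address_of` helper, the use of SHA-256 as the address hash, and the rule that the blacklist takes precedence over the whitelist are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def address_of(public_key: bytes) -> str:
    """Derive an address as the hex hash of a public key (SHA-256 is an assumption)."""
    return hashlib.sha256(public_key).hexdigest()


@dataclass
class InstitutionEntry:
    """One row of a hypothetical medical institution list."""
    name: str
    whitelist: set = field(default_factory=set)  # if non-empty, only these addresses may use the data
    blacklist: set = field(default_factory=set)  # these addresses may never use the data

    def allows(self, requester: str) -> bool:
        # Assumed precedence: a blacklisted address is always refused.
        if requester in self.blacklist:
            return False
        # A non-empty whitelist restricts access to the listed addresses.
        if self.whitelist:
            return requester in self.whitelist
        return True


researcher = address_of(b"researcher-public-key")
stranger = address_of(b"stranger-public-key")
hospital = InstitutionEntry("Hospital A", whitelist={researcher})
print(hospital.allows(researcher), hospital.allows(stranger))
```

With a whitelist set, only the listed address is admitted; an entry with only a blacklist admits everyone except the listed addresses.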
step 3, the data owner obtains the encrypted data and the checkpoint from the IPFS of the blockchain network, and then decrypts the encrypted data;
step 4, the machine learning model is trained on the decrypted data according to the federated learning process planned by the Orchester, yielding a training gradient of the machine learning model;
step 5, the data owner adds noise to the training gradient to achieve differential privacy, and then sends the training gradient to the security aggregator;
specifically, differential privacy is achieved by randomly perturbing a data set, which reduces the disclosure of information about individuals while preserving the statistical properties and inference capability of the data as a whole.
Assume the raw data is $\{x_1, x_2, \ldots, x_n\}$ and the function to be computed is $f(x_1, x_2, \ldots, x_n)$. Noise can be added to the raw data before the computation, giving $f(x_1 + r_1, x_2 + r_2, \ldots, x_n + r_n)$. The added noise protects privacy to a certain extent, but the result computed after adding noise must still be close to the true result, so a trade-off has to be made between the privacy and the usability of the data.
Intuitively, if the result of a query computed over the whole database is almost unchanged after the data has been modified, then the privacy of the data is protected to some extent.
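One common way to add noise to a training gradient, as step 5 requires, is to clip the gradient to a bounded L2 norm and then add Gaussian noise scaled to that bound. The sketch below assumes this clip-then-noise scheme; the source does not state which noise distribution is used, and the function names and the parameters `max_norm` and `noise_multiplier` are hypothetical.

```python
import math
import random


def clip_l2(gradient, max_norm):
    """Scale the gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def privatize(gradient, max_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip the gradient, then add independent Gaussian noise to every coordinate."""
    rng = rng or random.Random()
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    return [g + rng.gauss(0.0, sigma) for g in clip_l2(gradient, max_norm)]


gradient = [3.0, 4.0]                      # L2 norm 5 before clipping
clipped = clip_l2(gradient, max_norm=1.0)  # same direction, L2 norm 1
noisy = privatize(gradient, rng=random.Random(0))
print(clipped, noisy)
```

Clipping bounds how much any single update can influence the aggregate, which is what lets the noise scale be chosen independently of the data. A larger `noise_multiplier` strengthens privacy and lowers accuracy, the trade-off described above.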
Formally, consider a randomized algorithm $M$ (for a given input, the output of a randomized algorithm is not a fixed value but a random value following some distribution). If
$$\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(y) \in S] + \delta$$
holds for all adjacent data sets $x$, $y$ and all sets of outputs $S \subseteq \mathrm{Range}(M)$, then $M$ is said to satisfy $(\varepsilon, \delta)$-differential privacy. $\varepsilon$ is the privacy budget consumed by a single query; the smaller $\varepsilon$ is, the better the privacy protection.
Two data sets $x$, $y$ that differ in only one record, i.e. $\|x - y\|_1 = 1$, are called "adjacent data sets". Defining the guarantee over adjacent data sets means that every individual record is protected. When the two probabilities are close, the output distributions are so similar that an output produced from $x$ is hard to distinguish from one produced from $y$, so the record in which $x$ and $y$ differ is protected.
In other words, when an external observer who obtains a result computed from a data set cannot tell whether a particular individual's record was used, the algorithm satisfies differential privacy.
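The differential privacy inequality can be checked exhaustively for a small mechanism. The sketch below uses binary randomized response, which is not the mechanism of the invention and is chosen only because its output distribution can be enumerated; the function names are hypothetical. The mechanism reports the true bit with probability $e^{\varepsilon}/(1+e^{\varepsilon})$, and the two possible inputs 0 and 1 play the role of adjacent data sets.

```python
import math
from itertools import combinations


def response_distribution(true_bit, epsilon):
    """Output distribution of randomized response for one private bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}


def satisfies_dp(mechanism_epsilon, claimed_epsilon, delta=0.0):
    """Check Pr[M(x) in S] <= e^eps * Pr[M(y) in S] + delta for every output set S."""
    outputs = (0, 1)
    subsets = [s for r in range(len(outputs) + 1) for s in combinations(outputs, r)]
    for x, y in ((0, 1), (1, 0)):  # the adjacent inputs, in both orders
        px = response_distribution(x, mechanism_epsilon)
        py = response_distribution(y, mechanism_epsilon)
        for s in subsets:
            lhs = sum(px[o] for o in s)
            rhs = math.exp(claimed_epsilon) * sum(py[o] for o in s) + delta
            if lhs > rhs + 1e-12:  # small tolerance for floating-point rounding
                return False
    return True


print(satisfies_dp(1.0, 1.0))  # the mechanism meets its own budget
print(satisfies_dp(1.0, 0.5))  # but not a stricter (smaller) budget
```

The second call fails because a smaller $\varepsilon$ demands that the two output distributions be closer than this mechanism makes them; allowing a positive $\delta$ relaxes the bound again.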
step 6, training events in the machine learning process are recorded on the distributed tamper-proof ledger of the blockchain network; the security aggregator generates a nonce for each training record, sends the nonce to the medical institution participating in the training, and records the hash value hash(K, nonce) of the training record on the blockchain;
the training record must satisfy two properties: 1. a medical institution that participated in the training can use the record to prove its participation; 2. apart from the security aggregator, no one else can learn whether a given medical institution participated in the training.
To achieve this, the blockchain network sets two public parameters: a prime number $q$ and a primitive root $a$ of $q$. Let the private keys of a medical institution and of the security aggregator in a given training round be $pri_1$ and $pri_2$, respectively. The two parties compute and publish
$$a^{pri_1} \bmod q$$
$$a^{pri_2} \bmod q$$
from which each party obtains
$$K = a^{pri_1 \cdot pri_2} \bmod q$$
Since only the medical institution participating in the training holds $K$, only that institution can prove that the training record belongs to it. Meanwhile, others cannot deduce from the hash(K, nonce) of the training record who participated in the training.
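The key agreement and training record described above have the shape of a Diffie-Hellman exchange followed by a hash. The following sketch assumes that reading. The toy parameters `Q = 23` and `A = 5`, the sample private keys, the function names, and the realization of hash(K, nonce) as SHA-256 over an ad hoc byte encoding are all assumptions; a real deployment would need a much larger prime.

```python
import hashlib
import secrets

# Toy public parameters: prime q and a primitive root a of q (illustration only).
Q = 23
A = 5


def public_value(private_key):
    """Value a party publishes: a^pri mod q."""
    return pow(A, private_key, Q)


def shared_key(other_public, private_key):
    """K computed from the other party's published value and one's own private key."""
    return pow(other_public, private_key, Q)


def training_record(k, nonce):
    """hash(K, nonce), realized here as SHA-256 over an assumed byte encoding."""
    return hashlib.sha256(str(k).encode() + b"|" + nonce).hexdigest()


pri_1, pri_2 = 6, 15  # private keys of the institution and the aggregator
pub_1, pub_2 = public_value(pri_1), public_value(pri_2)

k_institution = shared_key(pub_2, pri_1)
k_aggregator = shared_key(pub_1, pri_2)

nonce = secrets.token_bytes(16)                   # generated by the aggregator
on_chain = training_record(k_aggregator, nonce)   # value written to the blockchain

# The institution later proves participation by recomputing the same hash.
print(k_institution == k_aggregator, training_record(k_institution, nonce) == on_chain)
```

A party that does not hold $K$ cannot reproduce the on-chain value from the nonce alone, which is what ties the record to one institution without naming it.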
step 7, the security aggregator encrypts the training gradients sent by the data owners, collects the training gradients sent by all data owners, and updates the machine learning model.
Specifically, the protection used here is worker selection: the security aggregator randomly selects only a part of the collected gradients to generate its output, and a medical institution does not know whether the gradient it uploaded was selected. This prevents an attacker from obtaining the data of any particular party and, together with differential privacy, protects privacy.
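Worker selection as described above can be sketched as averaging a random subset of the submitted gradients. The source does not say how the subset size is chosen or how the selected gradients are combined, so the `fraction` parameter, the plain average, and the function name are assumptions.

```python
import random


def aggregate_selected(gradients, fraction, rng):
    """Average a random subset of the submitted gradients.

    gradients maps an institution id to its gradient (a list of floats).
    Returns the averaged gradient and the set of selected ids; the aggregator
    would keep that set private.
    """
    ids = sorted(gradients)
    k = max(1, round(fraction * len(ids)))  # always use at least one gradient
    chosen = rng.sample(ids, k)
    dim = len(gradients[ids[0]])
    average = [sum(gradients[i][d] for i in chosen) / k for d in range(dim)]
    return average, set(chosen)


submitted = {
    "hospital_a": [0.10, -0.20],
    "hospital_b": [0.30, 0.00],
    "hospital_c": [-0.10, 0.40],
    "hospital_d": [0.20, 0.20],
}
update, selected = aggregate_selected(submitted, fraction=0.5, rng=random.Random())
print(update, len(selected))
```

Only `update` would leave the aggregator, so no institution can tell from the output whether its own gradient contributed to it.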
Simulation test results
A set of melanoma photographs from the International Skin Imaging Collaboration (ISIC) is used as the training set of the machine learning model; the photographs are divided into benign tumor photographs and malignant tumor photographs; the training set is stored in IPFS;
80 benign tumor photographs and 80 malignant tumor photographs are randomly selected from the training set, and these 160 photographs are used as the test set;
test 1: the machine learning model is trained with a conventional federated learning method, and the trained model is evaluated on the test set;
test 2: the machine learning model is trained with the federated learning method of the invention, and the trained model is evaluated on the test set;
the recognition accuracy of the two federated learning methods is shown in Table 1, where Accuracy is the proportion of correctly classified photographs in the test set; Sensitivity is the ratio of the number of photographs correctly identified as malignant melanoma to the number of all malignant melanoma photographs in the test set; ROC-AUC is the area under the ROC (Receiver Operating Characteristic) curve, and the closer it is to 1 the better; and MCC is the Matthews correlation coefficient, which measures classification quality in binary classification when the two classes differ greatly in size.
TABLE 1
          Accuracy   Sensitivity   ROC-AUC   MCC
Test 1    0.92       0.86          0.92      0.85
Test 2    0.85       0.81          0.88      0.78
As Table 1 shows, although the recognition accuracy of test 2 is lower than that of test 1, it remains high; that is, the federated learning method of the invention does not greatly affect the recognition accuracy of the machine learning model, which still maintains a high recognition accuracy.
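Three of the metrics in Table 1 (Accuracy, Sensitivity, and MCC) follow directly from a confusion matrix, as the sketch below shows. The counts used are hypothetical values for a balanced 160-photograph test set and are not the patent's results; ROC-AUC is omitted because it requires per-photograph scores rather than counts.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, and Matthews correlation coefficient from counts.

    tp / fn: malignant photographs classified correctly / incorrectly
    tn / fp: benign photographs classified correctly / incorrectly
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, sensitivity, mcc


# Hypothetical counts: 80 malignant and 80 benign photographs.
accuracy, sensitivity, mcc = binary_metrics(tp=70, fn=10, fp=6, tn=74)
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} mcc={mcc:.3f}")
```

MCC uses all four cells of the confusion matrix, which is why it stays informative when one class is much larger than the other.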
test 3: an inversion attack is used against a conventional federated learning method to restore a photograph from the training set;
test 4: an inversion attack is used against the federated learning method of the invention to restore a photograph from the training set;
FIG. 2 shows an original melanoma photograph from the training set.
FIG. 3 shows the process of restoring the training set photograph in test 3.
Comparing FIG. 2 and FIG. 3, it can be seen that the shape of the melanoma is essentially visible after 30 iterations of the inversion attack, and that the original image is essentially restored after 50 iterations. This means that the image information is leaked and the privacy of the medical data cannot be guaranteed.
FIG. 4 shows the training set photograph restored after 180 iterations of the inversion attack in test 4.
Comparing FIG. 2 and FIG. 4, it can be seen that the inversion attack cannot restore the photograph in the training set, which indicates that the federated learning method of the invention does not leak image information and can ensure the privacy of the medical data.
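Why an unprotected gradient leaks its input can be seen in a case much simpler than the iterative attack used in tests 3 and 4. For a single linear unit with a bias and a squared-error loss, the weight gradient is the input multiplied by the bias gradient, so dividing one by the other returns the input exactly. The sketch below illustrates only this analytic special case, with hypothetical names and values, and then shows that adding noise to the gradient breaks the recovery.

```python
import random


def linear_gradients(weights, bias, x, target):
    """Gradients of 0.5 * (w.x + b - target)^2 with respect to w and b, for one sample."""
    error = sum(w * xi for w, xi in zip(weights, x)) + bias - target
    return [error * xi for xi in x], error


def recover_input(grad_w, grad_b):
    """Since grad_w = grad_b * x, dividing recovers x whenever grad_b is non-zero."""
    return [g / grad_b for g in grad_w]


x = [0.2, 0.7, 0.5]  # stand-in for the pixel values of a private image
grad_w, grad_b = linear_gradients([0.1, -0.3, 0.4], 0.05, x, target=1.0)
exact = recover_input(grad_w, grad_b)

# The same computation on a noised gradient no longer returns the input.
rng = random.Random(0)
noisy_w = [g + rng.gauss(0.0, 1.0) for g in grad_w]
noisy_b = grad_b + rng.gauss(0.0, 1.0)
distorted = recover_input(noisy_w, noisy_b)

print(exact)
print(distorted)
```

Deep networks do not admit this closed-form recovery, which is why practical attacks optimize a candidate image iteratively. The point of the sketch is that the shared gradient carries the information the attacker needs unless it is perturbed.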
Although the present invention has been described in detail in this specification with reference to specific embodiments and illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the present invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (5)

1. A federated learning method based on blockchain medical data sharing, characterized by comprising the following steps:
step 1, a blockchain network is established;
step 2, the model owner uploads the machine learning model to the InterPlanetary File System (IPFS) of the blockchain network, and the federated learning process is planned by an Orchester composed of several smart contracts in the contract layer of the blockchain network;
step 3, the data owner obtains the encrypted data and the checkpoint from the IPFS of the blockchain network, and then decrypts the encrypted data;
step 4, the machine learning model is trained on the decrypted data according to the federated learning process planned by the Orchester, yielding a training gradient of the machine learning model;
step 5, the data owner adds noise to the training gradient to achieve differential privacy, and then sends the training gradient to the security aggregator;
step 6, training events in the machine learning process are recorded on the distributed tamper-proof ledger of the blockchain network; the security aggregator generates a nonce for each training record, sends the nonce to the medical institution participating in the training, and records the hash value hash(K, nonce) of the training record on the blockchain;
step 7, the security aggregator encrypts the training gradients sent by the data owners, collects the training gradients sent by all data owners, and updates the machine learning model.
2. The federated learning method based on blockchain medical data sharing according to claim 1, wherein in step 1 the blockchain network is built on the FISCO BCOS blockchain platform, with the data owners acting as blockchain network nodes; and an InterPlanetary File System (IPFS) is set up to store the medical data off-chain.
3. The federated learning method based on blockchain medical data sharing according to claim 1, wherein the differential privacy in step 5 is defined as follows: for a randomized algorithm $M$, if
$$\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(y) \in S] + \delta$$
holds for all adjacent data sets $x$, $y$ and all sets of outputs $S \subseteq \mathrm{Range}(M)$, then $M$ is said to satisfy $(\varepsilon, \delta)$-differential privacy, where $\varepsilon$ is the privacy budget consumed by a single query;
when an external observer who obtains a result computed from a data set cannot tell whether a particular individual's record was used, the algorithm satisfies differential privacy.
4. The federated learning method based on blockchain medical data sharing according to claim 1, wherein K in hash(K, nonce) in step 6 is obtained as follows: the blockchain network sets two public parameters, a prime number $q$ and a primitive root $a$ of $q$; the private keys of a data owner and of the security aggregator in a given training round are $pri_1$ and $pri_2$, respectively, and the two parties compute and publish:
$$a^{pri_1} \bmod q$$
$$a^{pri_2} \bmod q$$
and then obtain
$$K = a^{pri_1 \cdot pri_2} \bmod q$$
5. The federated learning method based on blockchain medical data sharing according to claim 1, wherein in step 7 the security aggregator encrypts the training gradients sent by the data owners, specifically by worker selection: the security aggregator randomly selects only a part of the collected gradients to generate its output, and a medical institution does not know whether the gradient it uploaded was selected.
CN202210390877.0A 2022-04-14 2022-04-14 Block chain medical data sharing-based federal learning method Pending CN114912631A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210390877.0A CN114912631A (en) 2022-04-14 2022-04-14 Block chain medical data sharing-based federal learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210390877.0A CN114912631A (en) 2022-04-14 2022-04-14 Block chain medical data sharing-based federal learning method

Publications (1)

Publication Number Publication Date
CN114912631A true CN114912631A (en) 2022-08-16

Family

ID=82765705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210390877.0A Pending CN114912631A (en) 2022-04-14 2022-04-14 Block chain medical data sharing-based federal learning method

Country Status (1)

Country Link
CN (1) CN114912631A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665913A (en) * 2023-07-13 2023-08-29 之江实验室 Cross-institution patient matching system and method
CN116682543A (en) * 2023-08-03 2023-09-01 山东大学齐鲁医院 Sharing method and system of regional rehabilitation information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665913A (en) * 2023-07-13 2023-08-29 之江实验室 Cross-institution patient matching system and method
CN116665913B (en) * 2023-07-13 2023-10-13 之江实验室 Cross-institution patient matching system and method
CN116682543A (en) * 2023-08-03 2023-09-01 山东大学齐鲁医院 Sharing method and system of regional rehabilitation information
CN116682543B (en) * 2023-08-03 2023-11-10 山东大学齐鲁医院 Sharing method and system of regional rehabilitation information

Similar Documents

Publication Publication Date Title
Hao et al. Towards efficient and privacy-preserving federated deep learning
Sharma et al. Hiding data in images using cryptography and deep neural network
Avudaiappan et al. Medical image security using dual encryption with oppositional based optimization algorithm
Zhang et al. HF-TPE: High-fidelity thumbnail-preserving encryption
WO2020034754A1 (en) Secure multi-party computation method and apparatus, and electronic device
Mugunthan et al. Smpai: Secure multi-party computation for federated learning
Lugan et al. Secure architectures implementing trusted coalitions for blockchained distributed learning (TCLearn)
CN114912631A (en) Block chain medical data sharing-based federal learning method
CN105100083B (en) A kind of secret protection and support user's revocation based on encryption attribute method and system
Doss et al. Memetic optimization with cryptographic encryption for secure medical data transmission in IoT-based distributed systems
KR102289419B1 (en) Method and apparatus for authentification of user using biometric
CN104363215A (en) Encryption method and system based on attributes
Niu et al. Toward verifiable and privacy preserving machine learning prediction
Koppu et al. A fast enhanced secure image chaotic cryptosystem based on hybrid chaotic magic transform
CN111800252A (en) Information auditing method and device based on block chain and computer equipment
Kim et al. Efficient Privacy‐Preserving Fingerprint‐Based Authentication System Using Fully Homomorphic Encryption
Tang et al. A secure and trustworthy medical record sharing scheme based on searchable encryption and blockchain
Kalapaaking et al. Blockchain-based federated learning with SMPC model verification against poisoning attack for healthcare systems
Li et al. SPFM: Scalable and privacy-preserving friend matching in mobile cloud
Liu et al. A color image encryption scheme based on a novel 3d chaotic mapping
Liu et al. Face image publication based on differential privacy
Fan et al. Lightweight privacy and security computing for blockchained federated learning in IoT
CN112380404B (en) Data filtering method, device and system
Manisha et al. CBRC: a novel approach for cancelable biometric template generation using random permutation and Chinese Remainder Theorem
Eltaieb et al. Efficient implementation of cancelable face recognition based on elliptic curve cryptography

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination