CN116305233A - Scientific research data management method and system based on federal migration learning - Google Patents
- Publication number
- CN116305233A (application CN202211579085.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/629—Protecting access to data via a platform, e.g. using keys or access control rules to features or functions of an application
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a scientific research data management method and system based on federated transfer learning. A data acquisition module collects data from each business system; a data preprocessing module converts the collected data into the required format, cleans and corrects it, and removes outliers; a data feature extraction module performs stratified sampling on the key attributes of the database tables and forms feature vectors for character-type data; a data classification and grading model is then built on a federated transfer learning architecture. Noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy is protected when the model is used for prediction. The invention improves model performance while keeping the data secure.
Description
Technical Field
The invention relates to a scientific research data management method and system based on federated transfer learning, and belongs to the technical field of data management.
Background
Scientific research management is a core means of promoting an enterprise's technological development and allocating its research resources. Enterprise research management data is diverse and voluminous: it spans business fields such as project management, industry-research integration, human resources, achievement transfer, and finance, and it produces large amounts of structured, semi-structured, and unstructured professional data. Current enterprise research management faces several challenges across its business systems, including information asymmetry, poor data interconnection, incomplete data, and wide domain span, which seriously hinder the integrated sharing and innovative application of research data. Enterprises therefore need to break down the barriers between systems, promote data governance, data sharing, and data accumulation, support decision making, and accelerate digital transformation.
Disclosure of Invention
Purpose of the invention: to provide a scientific research data management method and system based on federated transfer learning that improves model performance while keeping the data secure.
Technical solution: a scientific research data management method based on federated transfer learning comprises the following steps:
step 1: the data acquisition module acquires data from each department of the scientific research management business and passes it to the data preprocessing module;
step 2: the data preprocessing module cleans and corrects the data acquired from each department, removes outliers, and passes the processed data to the data feature extraction module;
step 3: the data feature extraction module performs stratified sampling on the key attributes of the database tables, extracts the length distribution and character distribution features of character-type strings, extracts word vectors of the strings using natural language processing, and performs named entity recognition, forming the data feature values of each department of the scientific research management business;
step 4: a data classification catalog system is built as a three-level catalog of domain, module, and activity; the classified data is then graded according to the impact that would result if its security attributes were compromised, and a data sensitivity grading catalog system is built from the preset content sensitivity of the different data sets;
step 5: on a federated transfer learning distributed architecture, a residual convolutional neural network takes the data feature values of the scientific research management business systems as input. The data category in the data classification catalog is the output of the data classification model, and the data sensitivity level in the data sensitivity grading catalog is the output of the data grading model. Both models are trained, and a differential privacy algorithm that adaptively allocates the privacy budget adds noise to the models during training, so that data privacy is protected when the models are used for prediction;
step 6: the data under test is fed to the data classification model and the data grading model to obtain the corresponding classification and grading results. According to these results, noise is added to each data set under a matching privacy budget, and the data set is then published as a desensitized and declassified data set.
Further, step 3 includes:
step 3.1: extracting the character pattern distribution features of the strings by testing whether each string matches preset regular expressions;
step 3.2: segmenting the strings into words with natural language processing techniques, then extracting word vectors with techniques such as one-hot encoding, TF-IDF, and Word2Vec to build the text feature vector of the field.
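The string features of steps 3 and 3.1 can be illustrated with a minimal sketch. The patterns, the feature names, and the `string_features` helper are assumptions made for illustration; the source does not give the actual expressions or feature set.

```python
import re

# Hypothetical patterns; the source only says "preset regular expressions".
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "mobile": re.compile(r"^1[3-9]\d{9}$"),    # assumed 11-digit mobile format
    "id_card": re.compile(r"^\d{17}[\dXx]$"),  # assumed 18-character ID format
}

def string_features(value: str) -> dict:
    """Length, character-class ratios, and regex match flags for one string."""
    n = len(value)
    feats = {
        "length": n,
        "digit_ratio": sum(c.isdigit() for c in value) / n if n else 0.0,
        "alpha_ratio": sum(c.isalpha() for c in value) / n if n else 0.0,
    }
    for name, pattern in PATTERNS.items():
        feats[f"is_{name}"] = bool(pattern.match(value))
    return feats

features = string_features("13812345678")
```

Aggregating such per-string features over a sampled column would give one possible form of the length and character distribution features described above.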
Further, step 5 includes:
step 5.1: data normalization
The data in the data set is divided into a training set and a test set, and both are normalized by the min-max method:
$X_k' = \dfrac{X_k - X_{\min}}{X_{\max} - X_{\min}}$
where the difference between a data value $X_k$ and the column minimum $X_{\min}$ is divided by the range $X_{\max} - X_{\min}$. All data is thereby mapped to $[0, 1]$, which removes order-of-magnitude differences between dimensions;
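As a minimal sketch of the min-max method, the function below normalizes one column. Mapping a constant column to zeros is an assumption; the source does not cover that case.

```python
def min_max_normalize(column):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

normalized = min_max_normalize([10.0, 20.0, 40.0])
```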
step 5.2: construction of the neural network model
The neuron activation function of the BP neural network is the rectified linear unit (ReLU), and the loss function is the MSE distance loss, which sums the squared distances between the predicted values $y_i$ and the true values $\hat{y}_i$:
$L = \sum_i (y_i - \hat{y}_i)^2$
For the data classification model, the number of input-layer nodes equals the number of data dimensions used for classification, and the number of output-layer nodes equals the number of data categories; for the data grading model, the number of input-layer nodes equals the number of data dimensions used for grading, and the number of output-layer nodes equals the number of sensitivity levels;
step 5.3: the training set obtained in step 5.1 is fed to the neural network model constructed in step 5.2 for training, and the network is updated iteratively by gradient descent:
$\theta_{t+1} = \theta_t - l_r \nabla J(\theta_t)$
where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration, $l_r$ is the network learning rate, and $J(\theta_t)$ is the loss function. After iterative training finishes, the test set is fed to the trained neural network, and each classification is judged correct or not according to whether the predicted output matches the actual result;
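The iterative update of step 5.3 subtracts the learning rate times the loss gradient from the current parameters at every iteration. The sketch below applies that rule to a one-parameter quadratic loss chosen purely for illustration; the loss, starting point, and learning rate are assumptions, not values from the source.

```python
def gradient_descent(grad, theta0, learning_rate, steps):
    """Repeat theta <- theta - learning_rate * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Illustrative loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0,
                         learning_rate=0.1, steps=100)
```

The iterate approaches the minimizer at 3 because each step shrinks the remaining error by a constant factor.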
step 5.4: the differential privacy algorithm that adaptively allocates the privacy budget works as follows: after the data is normalized in step 5.1, privacy budgets of different strengths are set for the training model of each scientific research management department according to the size of its data set. A data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a smaller privacy budget is set. The data set size $\sigma$ and the privacy budget $\epsilon$ are related by $\epsilon = 2^{-\sigma}$.
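A short sketch of the size-to-budget mapping in step 5.4 follows. The department names, record counts, and the choice to normalize each size by the largest department are assumptions; the source states only that sizes are normalized and that the budget is two raised to the negative normalized size.

```python
def privacy_budget(sigma: float) -> float:
    """Privacy budget for a normalized data set size sigma: 2 ** (-sigma)."""
    return 2.0 ** (-sigma)

# Hypothetical record counts per department.
sizes = {"project": 8000, "hr": 2000, "finance": 500}

# Assumed normalization: divide by the largest data set, giving values in (0, 1].
largest = max(sizes.values())
sigmas = {dept: n / largest for dept, n in sizes.items()}
budgets = {dept: privacy_budget(s) for dept, s in sigmas.items()}
```

Under these assumptions the smallest department receives the largest budget, matching the rule that small data sets get less noise.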
Further, step 6 includes:
step 6.1: initializing the loss function, noise scale, and learning rate parameters of each department's training model, namely: loss function $L$, data set $S(j)$, auxiliary model $h_j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$;
step 6.2: the initial model for training the $j$-th department model is $W_j$, with $W_j = h_j$. It is trained on the $j$-th department data set $I$ with batch size $B_j$, and the parameters are updated by stochastic gradient descent:
$W_j^{k+1} = W_j^{k} - \gamma_k \nabla f_j(W_j^{k})$
where $\gamma_k$ is the learning rate, $W_j^{k}$ denotes the training parameters of the $j$-th department in the $k$-th training round, and $f_j$ is the loss function of the $j$-th department;
step 6.3: after training, the gradient is normalized, where $W_k$ is the initial weight parameter with which each department model enters training and $W_j^{k}$ denotes the training parameters of the $j$-th department model in the $k$-th training round;
step 6.4: noise is added to the gradient, where $W_j^{k}$ denotes the training parameters of the $j$-th client in the $k$-th training round, $N(0, (r\sigma)^2 I)$ is normally distributed noise, $r$ is the number of department data sets participating in training, $\sigma$ is the privacy budget, and $I$ is the department data set;
step 6.5: after each department model completes the current training round, it uploads its neural network parameters to a central server; the central server aggregates the updated parameters received from all clients participating in the round, updates the global parameters, and sends them to every department participating in the round;
step 6.6: the trained neural network models classify and grade the data set: data at each sensitivity level is matched with a corresponding privacy budget and published after noise is added. The privacy budgets for the ultra-high, high, medium, and low sensitivity attribute groups are 0.7, 0.5, 0.3, and 0.1, respectively.
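The release step in 6.6 can be sketched as a lookup from sensitivity level to budget followed by noise addition. The budgets are the four values listed above; the Laplace mechanism, the unit sensitivity, and the `release` helper are assumptions, since the source does not name the mechanism used for publication.

```python
import random

# Budgets per sensitivity group, as listed in step 6.6.
LEVEL_BUDGETS = {"ultra_high": 0.7, "high": 0.5, "medium": 0.3, "low": 0.1}

def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of Laplace noise for a given query sensitivity and budget."""
    return sensitivity / epsilon

def release(values, level, sensitivity=1.0, rng=None):
    """Return a noisy copy of `values` for publication at the given level."""
    rng = rng or random.Random()
    b = laplace_scale(sensitivity, LEVEL_BUDGETS[level])
    # The difference of two i.i.d. exponentials with mean b is Laplace(0, b).
    return [v + rng.expovariate(1 / b) - rng.expovariate(1 / b) for v in values]

published = release([0.2, 0.5, 0.9], "medium", rng=random.Random(42))
```

In this mechanism a smaller budget produces a larger noise scale, so the budget table directly controls how much each group is perturbed.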
A scientific research data management system based on federated transfer learning comprises a data acquisition module, a data preprocessing module, a data feature extraction module, a data classification and grading module, and a data desensitization and declassification module;
the data acquisition module is used for: acquiring data from the scientific research management business systems for project management, scientific and technological achievement management, human resource management, and financial management, and passing the collected multi-source heterogeneous data (structured, semi-structured, and unstructured) to the data preprocessing module;
the data preprocessing module is used for: converting the business system data collected by the data acquisition module into the required format, correcting or removing abnormal data, and passing the processed data to the data feature extraction module;
the data feature extraction module is used for: extracting the length distribution and character distribution features of character-type strings, extracting word vectors of the strings using natural language processing, and performing named entity recognition, forming the data feature values of each department of the scientific research management business;
the data classification and grading module is used for: building a data classification and grading model on a federated transfer learning architecture. A data classification catalog system is built as a three-level catalog of domain, module, and activity; the classified data is graded according to the impact that would result if its security attributes were compromised, and a data sensitivity grading catalog system is built from the preset content sensitivity of the different data sets. To protect the data privacy of each department, on a federated transfer learning distributed architecture, a residual convolutional neural network takes the data feature values of the scientific research management business systems as input; the data category in the data classification catalog is the output of the data classification model, and the data sensitivity level in the data sensitivity grading catalog is the output of the data grading model. Both models are trained, and a differential privacy algorithm that adaptively allocates the privacy budget adds noise to the models during training, so that data privacy is protected when the models are used for prediction;
the data desensitization and declassification module is used for: feeding the data under test to the data classification model and the data grading model to obtain the corresponding classification and grading results and, according to these results, adding noise to each data set under a matching privacy budget and then publishing it as a desensitized and declassified data set.
Further, the data classification and grading module performs the following:
a. data normalization
The data in the data set is divided into a training set and a test set, and both are normalized by the min-max method:
$X_k' = \dfrac{X_k - X_{\min}}{X_{\max} - X_{\min}}$
where the difference between a data value $X_k$ and the column minimum $X_{\min}$ is divided by the range $X_{\max} - X_{\min}$. All data is thereby mapped to $[0, 1]$, which removes order-of-magnitude differences between dimensions;
b. construction of the neural network model
The neuron activation function of the BP neural network is the rectified linear unit (ReLU), and the loss function is the MSE distance loss, which sums the squared distances between the predicted values $y_i$ and the true values $\hat{y}_i$:
$L = \sum_i (y_i - \hat{y}_i)^2$
For the data classification model, the number of input-layer nodes equals the number of data dimensions used for classification, and the number of output-layer nodes equals the number of data categories. The data grading model is likewise built on a BP neural network: its number of input-layer nodes equals the number of data dimensions used for grading, and its number of output-layer nodes equals the number of sensitivity levels, which are divided into four levels: ultra-high, high, medium, and low sensitivity.
c. The training set obtained in step a is fed to the neural network model constructed in step b for training, and the network is updated iteratively by gradient descent:
$\theta_{t+1} = \theta_t - l_r \nabla J(\theta_t)$
where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration, $l_r$ is the network learning rate, and $J(\theta_t)$ is the loss function. After iterative training finishes, the test set is fed to the trained neural network, and each classification is judged correct or not according to whether the predicted output matches the actual result;
d. after the data set of each department is normalized, privacy budgets of different strengths are set for the training model of each scientific research management department according to the size of its data set. A data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a smaller privacy budget is set.
Further, the data desensitization and declassification module performs the following:
(1) Initializing the loss function, noise scale, and learning rate parameters of each department's training model, namely: loss function $L$, data set $S(j)$, auxiliary model $h_j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$;
(2) The initial model for training the $j$-th department model is $W_j$, with $W_j = h_j$. It is trained on the $j$-th department data set $I$ with batch size $B_j$, and the parameters are updated by stochastic gradient descent:
$W_j^{k+1} = W_j^{k} - \gamma_k \nabla f_j(W_j^{k})$
where $\gamma_k$ is the learning rate, $W_j^{k}$ denotes the training parameters of the $j$-th department in the $k$-th training round, and $f_j$ is the loss function of the $j$-th department;
(3) After training, the gradient is normalized, where $W_k$ is the initial weight parameter with which each department model enters training and $W_j^{k}$ denotes the training parameters of the $j$-th department model in the $k$-th training round;
(4) Noise is added to the gradient, where $W_j^{k}$ denotes the training parameters of the $j$-th client in the $k$-th training round, $N(0, (r\sigma)^2 I)$ is normally distributed noise, $r$ is the number of department data sets participating in training, $\sigma$ is the privacy budget, and $I$ is the department data set;
(5) After each department model completes the current training round, it uploads its neural network parameters to a central server; the central server aggregates the updated parameters received from all clients participating in the round, updates the global parameters, and sends them to every department participating in the round;
(6) The trained neural network models classify and grade the data set: data at each sensitivity level is matched with a corresponding privacy budget and published after noise is added.
Beneficial effects:
1. Neural networks provide intelligent classification and grading of scientific research management data. This removes the heavy manual effort that classifying and grading sensitive research management data currently requires, effectively protects data privacy, and improves working efficiency.
2. On the federated transfer learning architecture, a differential privacy technique based on stochastic gradient descent is applied while training the data classification and grading models, which protects the privacy of the training data and reduces the risk of leaking the network model parameters.
3. The privacy budget is allocated according to the size of each department's data set, which reduces the impact of noise on the model and improves model performance while still providing an effective differential privacy guarantee.
4. Different privacy budgets are allocated to data sets of different sensitivity levels, providing privacy protection of different strengths. This addresses the low data availability and the insufficient protection of sensitive attributes that result from allocating the privacy budget evenly.
Drawings
FIG. 1 is a schematic block diagram of the system according to the present invention.
FIG. 2 is a flowchart of the scientific research data management algorithm based on federated transfer learning.
FIG. 3 is a schematic diagram of the BP neural network topology for data classification and grading.
FIG. 4 shows the federated transfer learning fusion architecture for the department models.
Detailed Description
The technical solution of the invention is further described below with reference to the accompanying drawings.
The invention mainly comprises the following:
1. Based on scientific research management business systems such as research project management, scientific and technological achievement management, human resource management, and financial management, a data classification catalog system is built as a three-level catalog of domain, module, and activity, and a data grading catalog system is built from the preset content sensitivity of the different data resources.
2. On the federated transfer learning architecture, a neural network model for classifying and grading data across departments is built. The model is trained jointly on the premise that each department's important private data never leaves its local environment, which improves model performance while keeping the data secure.
3. To fully protect the data privacy of each department, noise is added to the neural network model during training using an adaptive gradient descent method based on differential privacy, and the privacy budget is allocated according to the size of each department's data set, so that data privacy is effectively protected when the model is used for classification and grading prediction.
4. The data feature vector of each data set is extracted with methods such as natural language processing. After normalization, the feature vectors are fed to the trained data classification neural network and the trained data grading neural network, which classify and grade the business data of each department automatically.
5. Using a differential privacy technique, privacy budgets of different strengths are allocated to data of different levels according to the grading of each department's data resources, which protects the privacy of the original data and produces a noisy public data set.
As shown in FIG. 1, the scientific research data management method and system based on federated transfer learning provided by the invention comprise the following steps:
Step 1: Data is acquired from each scientific research management business system, such as project management, human resources, and finance; it includes structured, semi-structured, and unstructured data.
Step 2: The business system data collected by the data acquisition module is converted into the required format, cleaned, and corrected, and outliers are removed.
Step 3: Stratified sampling is performed on the key attributes of the database tables, and the feature vector of each attribute field is computed from the field's metadata, the statistical features of its content, and similar information.
Step 3.1: The character pattern distribution features of the strings are extracted by testing whether each string matches preset regular expressions, such as those for email addresses, mobile phone numbers, and identity card numbers.
Step 3.2: The strings are segmented into words with natural language processing techniques, word vectors are extracted with techniques such as one-hot encoding, TF-IDF, and Word2Vec, and the text feature vector of the field is built.
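Of the word-vector techniques named in Step 3.2, TF-IDF is the simplest to show end to end. The sketch below assumes the strings have already been segmented into word lists; the sample tokens and the unsmoothed weighting (term frequency times the log of document count over document frequency) are illustrative choices, not details from the source.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical field values, already segmented into words.
docs = [["project", "budget", "report"], ["project", "staff", "salary"]]
vectors = tf_idf(docs)
```

A term that occurs in every document receives weight zero, so the vectors emphasize the words that distinguish one field from another.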
Step 4: A data classification catalog system is built as a three-level catalog of domain, module, and activity.
Table 1 scientific research data classification identification table
Step 5: On the basis of the data classification, the classified data is graded according to the impact that would result if its security attributes were compromised, forming a unified grading standard.
Table 2 scientific research data grading identification table
Step 6: To fully protect the data privacy of each department, a federated transfer learning distributed architecture is adopted. Using differential privacy on a residual convolutional neural network, with the data feature values from each scientific research management business scenario as input and the data category and data sensitivity level as output, a data classification model and a data grading model are trained. Noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy is protected when the model is used for prediction.
Step 6.1: Training the data classification model
Step 6.1.1: Data normalization
The data in the data set is divided into a training set and a test set. Both are normalized by the min-max method, $x_k' = (x_k - x_{\min})/(x_{\max} - x_{\min})$, which maps all data to $[0, 1]$ and removes order-of-magnitude differences between dimensions.
Step 6.1.2: Construction of the neural network model
The number of input-layer nodes of the network is the number of data dimensions used for classification, and the number of output-layer nodes is the number of data categories. The neuron activation function of the BP neural network is the rectified linear unit (ReLU), $\phi(x) = \max(0, x)$, and the loss function is the MSE distance loss, $L(y, v^{(m)}) = \| y - v^{(m)} \|^2$.
Similarly, a data grading model based on a BP neural network is constructed. The number of input-layer nodes of the network is the number of data dimensions used for grading, and the number of output-layer nodes is the number of sensitivity levels.
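The activation and loss named in Step 6.1.2 are small enough to write out directly. This sketch implements the rectifier and the squared-distance loss on plain lists; the function names and sample vectors are illustrative.

```python
def relu(values):
    """Element-wise rectifier: max(0, x) for each component."""
    return [max(0.0, v) for v in values]

def squared_distance_loss(y, v):
    """Squared Euclidean distance between target y and network output v."""
    return sum((a - b) ** 2 for a, b in zip(y, v))

activated = relu([-1.0, 0.5, 2.0])
loss = squared_distance_loss([1.0, 0.0], [0.5, 0.5])
```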
Step 6.1.3: inputting the training set obtained in step 6.1.1 into the neural network model constructed in step 6.1.2 for training, and updating the network iteratively with the gradient descent algorithm, whose formula is:
$$\theta_{t+1} = \theta_t - l_r \nabla J(\theta_t)$$
where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration, $l_r$ is the network learning rate, and $J(\theta_t)$ is the loss function.
After the iterative training is finished, the test-set data is input into the trained neural network, and a classification is judged correct if the predicted output is consistent with the actual result.
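The training and evaluation loop of step 6.1.3 can be illustrated on a toy problem. The sketch below substitutes a single linear layer for the full network to keep the gradient explicit; the data, learning rate, step count, and function names are assumptions for the example.

```python
import numpy as np

def train_linear(x, y_onehot, lr=0.1, steps=500):
    """Plain gradient descent: theta <- theta - lr * grad J(theta)."""
    theta = np.zeros((x.shape[1], y_onehot.shape[1]))
    for _ in range(steps):
        residual = x @ theta - y_onehot
        grad = 2.0 * x.T @ residual / len(x)  # gradient of the mean squared error
        theta = theta - lr * grad
    return theta

def accuracy(theta, x, labels):
    # A prediction counts as correct when the arg-max output equals the label.
    predicted = np.argmax(x @ theta, axis=1)
    return float(np.mean(predicted == labels))

# Toy data: one normalized feature plus a constant bias column.
x_train = np.array([[0.0, 1.0], [0.1, 1.0], [0.9, 1.0], [1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
x_test = np.array([[0.05, 1.0], [0.95, 1.0]])
y_test = np.array([0, 1])

theta = train_linear(x_train, np.eye(2)[y_train])
acc = accuracy(theta, x_test, y_test)
```

Test accuracy is computed only after training ends, mirroring the separation between the training set and the test set in step 6.1.1.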
Step 6.1.4: the data sets of the different business areas of departments such as project management, human resources, finance, and scientific quality differ in size. If the same degree of privacy protection is maintained for all departments, i.e., the privacy budget ε of each department is kept equal, the noise added by each department's training model differs because the data volumes are unequal: a department with a small data volume adds more noise than a department with a large data volume, which inevitably has a negative effect on the training of the federated learning model.
Therefore, after the data set sizes of the departments are normalized, privacy budgets of different strengths are set for each department's training model according to its data set size. A data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a relatively small privacy budget can be set. The correspondence between data set size and privacy budget is shown in Table 3.
Table 3 data set size and privacy budget correspondence table
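The size-dependent allocation of step 6.1.4 can be sketched as a simple mapping. The concrete values of Table 3 are not reproduced here, so the endpoint budgets `eps_small` and `eps_large`, the linear interpolation between them, and the department sizes are all hypothetical.

```python
def allocate_budgets(sizes, eps_small=1.0, eps_large=0.1):
    """Give smaller data sets a larger privacy budget (less noise).

    sizes: mapping of department name -> number of records.
    eps_small / eps_large: hypothetical budgets for the smallest and the
    largest data set; intermediate sizes are interpolated linearly.
    """
    lo, hi = min(sizes.values()), max(sizes.values())
    span = (hi - lo) or 1  # all departments equal in size -> same budget
    budgets = {}
    for dept, n in sizes.items():
        rel = (n - lo) / span  # normalized size: 0 = smallest, 1 = largest
        budgets[dept] = eps_small - rel * (eps_small - eps_large)
    return budgets

# Hypothetical department data set sizes.
budgets = allocate_budgets({"finance": 1000, "human_resources": 5000, "projects": 9000})
```

Any monotonically decreasing mapping from normalized size to budget would match the description; a lookup table with size bands is an equally valid choice.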
Step 6.1.5: based on the federated transfer learning distributed architecture, a differential privacy algorithm is used to add noise to the model parameters during model training.
(1) Initializing the parameters used in each department's training, such as the model loss function, noise scale, and learning rate, including: loss function $L$, data set $S(j)$, auxiliary model $h_j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$.
(2) The initial model for training the $j$-th department's model is $W_j$, with $W_j = h_j$. Training uses the $j$-th department's data set $I$ with batch size $B_j$, and the gradient is updated by the stochastic gradient descent method according to the following formula:
where $\gamma_k$ is the learning rate, the updated parameters are the training parameters of the $j$-th department in the $k$-th training round, and $f_j$ is the loss function of the $j$-th department;
(3) After training, the gradient is normalized:
where $W_k$ denotes the initial weight parameters with which each department model participates in training, and the formula also uses the training parameters of the $j$-th department model in the $k$-th training round.
(4) Adding noise to the gradient, with the formula as follows:
where the noised quantity is the training parameters of the $j$-th client in the $k$-th training round, $r$ is the number of department data sets participating in training, $\sigma$ is the privacy budget, and $I$ is the data set of the departments.
(5) After the models of all departments complete the training of the round, the neural network parameters are uploaded to a central server. The central server side aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to all departments participating in the training of the round.
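Sub-steps (1) to (5) can be sketched as one federated round. This is an illustration under stated assumptions rather than the formulas of the text: a linear model with squared loss stands in for the residual network, the normalization step is realized as clipping the update norm to a hypothetical `clip_norm`, the noise is Gaussian with standard deviation `sigma * clip_norm`, and the server aggregates by a plain average.

```python
import numpy as np

def local_update(w_global, x, y, lr, batch_size, rng):
    # One stochastic gradient descent step on a random batch of department data.
    idx = rng.choice(len(x), size=min(batch_size, len(x)), replace=False)
    xb, yb = x[idx], y[idx]
    grad = 2.0 * xb.T @ (xb @ w_global - yb) / len(xb)
    return w_global - lr * grad

def clip_and_noise(w_local, w_global, clip_norm, sigma, rng):
    # Bound the update norm, then add zero-mean Gaussian noise.
    delta = w_local - w_global
    delta = delta / max(1.0, np.linalg.norm(delta) / clip_norm)
    return delta + rng.normal(0.0, sigma * clip_norm, size=delta.shape)

def federated_round(w_global, departments, lr, batch_size, clip_norm, sigmas, seed=0):
    # departments: name -> (features, targets); sigmas: name -> noise scale.
    rng = np.random.default_rng(seed)
    deltas = []
    for name, (x, y) in departments.items():
        w_local = local_update(w_global, x, y, lr, batch_size, rng)
        deltas.append(clip_and_noise(w_local, w_global, clip_norm, sigmas[name], rng))
    # Central server: average the noised updates and refresh the global model.
    return w_global + np.mean(deltas, axis=0)

# Hypothetical departments with synthetic data.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(20, 3))
y = x @ np.array([1.0, -2.0, 0.5])
departments = {"finance": (x, y), "human_resources": (x[:10], y[:10])}
w_new = federated_round(np.zeros(3), departments, lr=0.1, batch_size=8,
                        clip_norm=0.5, sigmas={"finance": 0.5, "human_resources": 0.2})
```

Only the noised updates leave each department; the raw records stay local, which is the point of combining the federated architecture with differential privacy.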
Step 6.2: training a data grading model:
Similar to step 6.1, a federated transfer learning distributed architecture is adopted, and a differential privacy technique is applied on the basis of a residual convolutional neural network. The data feature values under each business scenario of scientific research management are taken as input and the data sensitivity level is taken as output to train the data grading model. Noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy is protected when the model is used for prediction.
Data sets at different levels differ in sensitivity and in the strength of protection they require, so data at different levels must be matched with corresponding privacy budgets. The privacy budgets corresponding to the ultra-high sensitivity attribute group, the high sensitivity attribute group, the medium sensitivity attribute group, and the low sensitivity attribute group are 0.1, 0.3, 0.5, and 0.7, respectively.
For example, suppose a financial statement belongs to the ultra-high sensitivity group: it requires stronger protection and therefore more added noise, so it is allocated the smaller privacy budget of 0.1. The department organization structure belongs to the low sensitivity group: it requires little added noise, so it is allocated the larger privacy budget of 0.7.
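The matching of sensitivity level to privacy budget can be sketched as follows. The budgets 0.1 (ultra-high) and 0.7 (low) come from the example above; the intermediate values, the use of the Laplace mechanism, the unit query sensitivity, and the function names are assumptions made for illustration.

```python
import numpy as np

# A smaller budget means stronger protection and therefore more noise.
PRIVACY_BUDGET = {"ultra_high": 0.1, "high": 0.3, "medium": 0.5, "low": 0.7}

def laplace_scale(level, query_sensitivity=1.0):
    # Laplace mechanism: noise scale b = sensitivity / epsilon.
    return query_sensitivity / PRIVACY_BUDGET[level]

def release(values, level, query_sensitivity=1.0, seed=None):
    """Return a noised copy of numeric values for publication."""
    rng = np.random.default_rng(seed)
    noise = rng.laplace(0.0, laplace_scale(level, query_sensitivity), size=np.shape(values))
    return np.asarray(values, dtype=float) + noise

# Hypothetical numeric fields from two differently graded data sets.
financial = release([120.0, 80.5, 310.0], "ultra_high", seed=0)
org_chart = release([12.0, 7.0, 30.0], "low", seed=0)
```

With these budgets the ultra-high group receives noise with seven times the scale of the low group, matching the direction of the financial statement and organization structure example.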
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (7)
1. A scientific research data management method based on federated transfer learning, characterized by comprising the following steps:
step 1: the data acquisition module acquires data of each department of scientific research management service and transmits the data to the data preprocessing module;
step 2: the data preprocessing module cleans, corrects and removes abnormal values of the data of each department of the scientific research management service acquired by the data acquisition module, and transmits the processed data to the data characteristic extraction module;
step 3: the data feature extraction module performs stratified sampling according to key attributes of the database tables, extracts the length distribution features and character distribution features of character-type data strings, extracts word vectors of the strings by using a natural language processing method, and performs named entity recognition to form the data feature values of each department of the scientific research management service;
step 4: constructing a data classification catalog system according to a three-level catalog of domain, module, and activity; grading the classified data according to the impact caused when its security attributes are compromised, and constructing a data sensitivity grading catalog system according to the preset content sensitivity of different data sets;
step 5: based on a federated transfer learning distributed architecture, a residual convolutional neural network is adopted; the data feature values of the scientific research management service system are taken as input, the data category in the data classification catalog is taken as the output of the data classification model, and the data sensitivity level in the data sensitivity grading catalog is taken as the output of the data grading model; the data classification model and the data grading model are trained, and a differential privacy algorithm that adaptively allocates the differential privacy budget adds noise to the models during training, so that data privacy is protected when the models are used for prediction;
step 6: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to the classification and grading results of different data, adding noise to the data set with different privacy budgets before publishing it, to obtain the desensitized and declassified data set.
2. The scientific research data management method based on federated transfer learning according to claim 1, wherein the step 3 comprises:
step 3.1: extracting the character pattern distribution features of the strings, and checking whether the strings match preset regular expressions;
step 3.2: after the strings are segmented into words by natural language processing techniques, extracting word vectors using one-hot, TF-IDF, or Word2Vec techniques, and constructing the text feature vector of the field.
3. The scientific research data management method based on federated transfer learning according to claim 1, wherein the step 5 comprises:
step 5.1: data normalization
Dividing the data in the data set into a training set and a test set, and normalizing the data in both sets by the min-max method, with the calculation formula:
$$X_k = \frac{X_k - X_{\min}}{X_{\max} - X_{\min}}$$
where the difference between data $X_k$ and the column minimum $X_{\min}$ is divided by the difference $X_{\max} - X_{\min}$, converting all data to $[0, 1]$ to remove order-of-magnitude differences between the dimensions;
step 5.2: construction of neural network model
The neuron activation function of the BP neural network is the rectified linear unit (ReLU), and the loss function is the distance loss function MSE, which calculates the sum of squared distances between the predicted values $y_i$ and the true values, given by:
the number of input-layer nodes of the data classification model is the number of data dimensions used for data classification, and the number of output-layer nodes is the number of data classes; the number of input-layer nodes of the data grading model is the number of data dimensions used for data grading, and the number of output-layer nodes is the number of sensitivity levels;
step 5.3: inputting the training set obtained in step 5.1 into the neural network model constructed in step 5.2 for training, and updating the network iteratively with the gradient descent algorithm, whose formula is:
$$\theta_{t+1} = \theta_t - l_r \nabla J(\theta_t)$$
where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration, $l_r$ is the network learning rate, and $J(\theta_t)$ is the loss function; after the iterative training is finished, the test-set data is input into the trained neural network, and a classification is judged correct if the predicted output is consistent with the actual result;
step 5.4: the differential privacy algorithm that adaptively allocates the privacy budget is as follows: after the data are normalized in step 5.1, privacy budgets of different strengths are set for the training model of each scientific research management business department according to the size of its data set; a data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a smaller privacy budget is set.
4. The scientific research data management method based on federated transfer learning according to claim 1, wherein the step 6 comprises:
step 6.1: initializing the parameters used in each department's training, such as the model loss function, noise scale, and learning rate, including: loss function $L$, data set $S(j)$, auxiliary model $h_j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$;
step 6.2: the initial model for training the $j$-th department's model is $W_j$, with $W_j = h_j$; training uses the $j$-th department's data set $I$ with batch size $B_j$, and the gradient is updated by the stochastic gradient descent method according to the following formula:
where $\gamma_k$ is the learning rate, the updated parameters are the training parameters of the $j$-th department in the $k$-th training round, and $f_j$ is the loss function of the $j$-th department;
step 6.3: after training, the gradient is normalized:
where $W_k$ denotes the initial weight parameters with which each department model participates in training, and the formula also uses the training parameters of the $j$-th department model in the $k$-th training round;
step 6.4: adding noise to the gradient, with the formula as follows:
where the noised quantity is the training parameters of the $j$-th client in the $k$-th training round, $N(0, (r\sigma)^2 \cdot I)$ is normally distributed noise, $r$ is the number of department data sets participating in training, $\sigma$ is the privacy budget, and $I$ is the data set of the departments;
step 6.5: after the training of the round is completed by each department model, the neural network parameters are uploaded to a central server, the central server aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to each department participating in the training of the round;
step 6.6: classifying and grading the data set with the trained neural network models: data at different sensitivity levels are matched with corresponding privacy budgets and are released after noise is added.
5. A scientific research data management system based on federated transfer learning, characterized by comprising a data acquisition module, a data preprocessing module, a data characteristic extraction module, a data classification and grading module, and a data desensitization and decryption module;
the data acquisition module is used for: acquiring data from each scientific research management business system, including project management, scientific and technological achievement management, human resource management, and financial management, and transmitting the acquired multi-source heterogeneous data, comprising structured, semi-structured, and unstructured data, to the data preprocessing module;
the data preprocessing module is used for: converting various service system data acquired by the data acquisition module into a required format, correcting or eliminating abnormal data, and transmitting the processed data to the data characteristic extraction module;
the data characteristic extraction module is used for: extracting the length distribution features and character distribution features of character-type data strings; extracting word vectors of the strings by using a natural language processing method, and performing named entity recognition to form the data feature values of each department of the scientific research management service;
the data classification and grading module is used for: constructing a data classification and grading model based on a federated transfer learning architecture, and constructing a data classification catalog system according to a three-level catalog of domain, module, and activity; grading the classified data according to the impact caused when its security attributes are compromised, and constructing a data sensitivity grading catalog system according to the preset content sensitivity of different data sets; in order to ensure the data privacy of each department, based on a federated transfer learning distributed architecture, a residual convolutional neural network is adopted, the data feature values of the scientific research management service system are taken as input, the data category in the data classification catalog is taken as the output of the data classification model, and the data sensitivity level in the data sensitivity grading catalog is taken as the output of the data grading model; the data classification model and the data grading model are trained, and a differential privacy algorithm that adaptively allocates the differential privacy budget adds noise to the models during training, so that data privacy is protected when the models are used for prediction;
the data desensitization and decryption module is used for: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to the classification and grading results of different data, adding noise to the data set with different privacy budgets before publishing it, to obtain the desensitized and declassified data set.
6. The scientific research data management system based on federated transfer learning according to claim 5, wherein the data classification and grading module comprises:
a. data normalization
Dividing the data in the data set into a training set and a test set, and normalizing the data in both sets by the min-max method, with the calculation formula:
$$X_k = \frac{X_k - X_{\min}}{X_{\max} - X_{\min}}$$
where the difference between data $X_k$ and the column minimum $X_{\min}$ is divided by the difference $X_{\max} - X_{\min}$, converting all data to $[0, 1]$ to remove order-of-magnitude differences between the dimensions;
b. construction of neural network model
The neuron activation function of the BP neural network is the rectified linear unit (ReLU), and the loss function is the distance loss function MSE, which calculates the sum of squared distances between the predicted values $y_i$ and the true values, given by:
the number of input-layer nodes of the data classification model is the number of data dimensions used for data classification, and the number of output-layer nodes is the number of data classes; the number of input-layer nodes of the data grading model is the number of data dimensions used for data grading, and the number of output-layer nodes is the number of sensitivity levels;
c. inputting the training set obtained in step a into the neural network model constructed in step b for training, and updating the network iteratively with the gradient descent algorithm, whose formula is:
$$\theta_{t+1} = \theta_t - l_r \nabla J(\theta_t)$$
where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration, $l_r$ is the network learning rate, and $J(\theta_t)$ is the loss function; after the iterative training is finished, the test-set data is input into the trained neural network, and a classification is judged correct if the predicted output is consistent with the actual result;
d. after the data set of each department is normalized, setting privacy budgets of different strengths for the training model of each scientific research management business department according to the size of its data set; a data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a smaller privacy budget is set.
7. The scientific research data management system based on federated transfer learning according to claim 5, wherein the data desensitization and decryption module comprises:
(1) Initializing the parameters used in each department's training, such as the model loss function, noise scale, and learning rate, including: loss function $L$, data set $S(j)$, auxiliary model $h_j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$;
(2) The initial model for training the $j$-th department's model is $W_j$, with $W_j = h_j$; training uses the $j$-th department's data set $I$ with batch size $B_j$, and the gradient is updated by the stochastic gradient descent method according to the following formula:
where $\gamma_k$ is the learning rate, the updated parameters are the training parameters of the $j$-th department in the $k$-th training round, and $f_j$ is the loss function of the $j$-th department;
(3) After training, the gradient is normalized:
where $W_k$ denotes the initial weight parameters with which each department model participates in training, and the formula also uses the training parameters of the $j$-th department model in the $k$-th training round;
(4) Adding noise to the gradient, with the formula as follows:
where the noised quantity is the training parameters of the $j$-th client in the $k$-th training round, $N(0, (r\sigma)^2 \cdot I)$ is normally distributed noise, $r$ is the number of department data sets participating in training, $\sigma$ is the privacy budget, and $I$ is the data set of the departments;
(5) After the training of the round is completed by each department model, the neural network parameters are uploaded to a central server, the central server aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to each department participating in the training of the round;
(6) Classifying and grading the data set with the trained neural network models: data at different sensitivity levels are matched with corresponding privacy budgets and are released after noise is added.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211579085.4A CN116305233A (en) | 2022-12-08 | 2022-12-08 | Scientific research data management method and system based on federal migration learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116305233A (en) | 2023-06-23 |
Family
ID=86782213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211579085.4A (Pending) | Scientific research data management method and system based on federal migration learning | 2022-12-08 | 2022-12-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116305233A (en) |
- 2022-12-08: CN application CN202211579085.4A (publication CN116305233A), status: Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117056979A (en) * | 2023-10-11 | 2023-11-14 | 杭州金智塔科技有限公司 | Service processing model updating method and device based on user privacy data |
CN117056979B (en) * | 2023-10-11 | 2024-03-29 | 杭州金智塔科技有限公司 | Service processing model updating method and device based on user privacy data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109783639B (en) | Mediated case intelligent dispatching method and system based on feature extraction | |
CN103177114B (en) | Based on the shift learning sorting technique across data field differentiating stream shape | |
CN110704694B (en) | Organization hierarchy dividing method based on network representation learning and application thereof | |
CN109190698B (en) | Classification and identification system and method for network digital virtual assets | |
CN116305233A (en) | Scientific research data management method and system based on federal migration learning | |
CN112085086A (en) | Multi-source transfer learning method based on graph convolution neural network | |
CN113869384B (en) | Privacy protection image classification method based on field self-adaption | |
CN111985680A (en) | Criminal multi-criminal name prediction method based on capsule network and time sequence | |
Zhang et al. | A review on the construction of business intelligence system based on unstructured image data | |
Niu et al. | Using image feature extraction to identification of ancient ceramics based on partial differential equation | |
CN113139603A (en) | Federal learning method based on EMD distance fusion multi-source heterogeneous data | |
CN112464289A (en) | Method for cleaning private data | |
CN106529601A (en) | Image classification prediction method based on multi-task learning in sparse subspace | |
CN116051924A (en) | Divide-and-conquer defense method for image countermeasure sample | |
CN115034762A (en) | Post recommendation method and device, storage medium, electronic equipment and product | |
CN110880047A (en) | Enterprise project service closed-loop information processing modeling method | |
Lv | Cloud Computation-Based Clustering Method for Nonlinear Complex Attribute Big Data | |
CN109271593A (en) | Military project group personal information labeling process based on cluster | |
Kanikar et al. | Extracting actionable association rules from multiple datasets | |
CN114372559A (en) | Construction method of deep adaptive network based on robust soft label | |
Zhang et al. | [Retracted] Application of Artificial Neural Network Algorithm in Facial Biological Image Information Scanning and Recognition | |
Rahman | Reframing in Clustering: An Introductory Survey | |
Ge et al. | Research on Seafood Traceable Data Based on k-Modes Clustering Algorithm | |
CN117935936A (en) | Single cell RNA-seq data clustering method, device, equipment and medium | |
Ghore | Data Mining used of Neural Networks Approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||