CN116305233A - Scientific research data management method and system based on federated transfer learning - Google Patents

Scientific research data management method and system based on federated transfer learning

Info

Publication number
CN116305233A
CN116305233A (application CN202211579085.4A)
Authority
CN
China
Prior art keywords
data
training
model
classification
department
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211579085.4A
Other languages
Chinese (zh)
Inventor
徐舒
徐艺
郭旭周
顾勇
许小伟
张跃
刘思娴
朱鹏
孙昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Panda Electronics Co Ltd
Nanjing Panda Information Industry Co Ltd
Original Assignee
Nanjing Panda Electronics Co Ltd
Nanjing Panda Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Panda Electronics Co Ltd, Nanjing Panda Information Industry Co Ltd filed Critical Nanjing Panda Electronics Co Ltd
Priority to CN202211579085.4A
Publication of CN116305233A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/629 Protecting access to data via a platform, e.g. using keys or access control rules to features or functions of an application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a scientific research data management method and system based on federated transfer learning. A data acquisition module acquires the data of each service system; a data preprocessing module converts the collected data of each service system into the required format and cleans it, corrects it and removes abnormal values; a data feature extraction module performs hierarchical sampling according to the key attributes of the database tables to form feature vectors for character-type data; a data classification and grading model is constructed on a federated transfer learning architecture; and noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy can be protected when the model is used for prediction. The invention improves the model effect while ensuring the security of the data.

Description

Scientific research data management method and system based on federated transfer learning
Technical Field
The invention relates to a scientific research data management method and system based on federated transfer learning, and belongs to the technical field of data management.
Background
Scientific research management is a core means of promoting the scientific development of an enterprise and the allocation of its scientific research resources. Enterprise scientific research management involves many data types and large data volumes, covers business fields such as project management, production-research integration, human resources, achievement transformation and finance, and generates a large amount of structured, semi-structured and unstructured professional data. Current enterprise scientific research management faces complex challenges such as asymmetric information, difficulty in interconnecting data, incomplete data and a large span of fields across the various business systems, which seriously hinder the integrated sharing and innovative application of scientific research data. Enterprise scientific research management needs to break the barriers between systems, promote data governance, data sharing and data accumulation, provide support for decision making and accelerate the digital transformation of the enterprise.
Disclosure of Invention
The invention aims to: provide a scientific research data management method and system based on federated transfer learning that improve the model effect while ensuring the security of the data.
The technical scheme is as follows: a scientific research data management method based on federated transfer learning comprises the following steps:
step 1: the data acquisition module acquires data of each department of scientific research management service and transmits the data to the data preprocessing module;
step 2: the data preprocessing module cleans, corrects and removes abnormal values of the data of each department of the scientific research management service acquired by the data acquisition module, and transmits the processed data to the data characteristic extraction module;
step 3: the data feature extraction module performs hierarchical sampling according to key attributes of the database table, extracts length distribution features and character distribution features of character type data character strings, extracts word vectors of the character strings by using a natural language processing method, and performs named entity recognition to form data feature values of each department of scientific research management service;
step 4: constructing a data classification catalog system according to the domain, module and activity three-level catalogs; grading the classified data according to the impact caused if their security attributes are compromised, and constructing a data sensitivity grading catalog system according to the preset content sensitivity degrees of different data sets;
step 5: based on a federated transfer learning distributed architecture, a residual convolutional neural network is adopted; the data feature values of the scientific research management service systems are taken as input, the data category in the data classification catalog is taken as the output of the data classification model, and the data sensitivity level in the data sensitivity grading catalog is taken as the output of the data grading model; the data classification model and the data grading model are trained, and a differential privacy algorithm that adaptively allocates the differential privacy budget is used to add noise to the models during training, so that data privacy can be protected when the models are used for prediction;
step 6: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to the classification and grading results of different data, applying different privacy budgets to add noise to the data set before it is published, thereby obtaining a desensitized, declassified data set.
Further, the step 3 includes:
step 3.1: extracting the pattern distribution features of the character strings, and using preset regular expressions to check whether each character string matches them;
step 3.2: after segmenting the character string with natural language processing techniques, extracting word vectors with One-hot, TF-IDF and Word2Vec techniques to construct the text feature vector of the field.
Further, the step 5 includes:
step 5.1: data normalization
Dividing data in a data set into a training set and a testing set, and carrying out normalization processing on the data in the training set and the testing set by a maximum and minimum method, wherein a calculation formula is as follows:
x_k' = (x_k - X_min) / (X_max - X_min)
wherein the difference between the data X_k and the column minimum X_min is divided by the difference X_max - X_min, so that all data are converted to the interval [0,1] to eliminate order-of-magnitude differences between the dimensions;
step 5.2: construction of neural network model
The neuron excitation function of the BP neural network is the linear rectification function ReLU, and the loss function is the distance loss function MSE, i.e. the sum of the squared distances between the predicted value y_i and the true value ŷ_i:
MSE = (1/n) · Σ_i (y_i - ŷ_i)²
the number of input-layer nodes of the data classification model is the number of data feature dimensions and the number of its output-layer nodes is the number of data categories; the number of input-layer nodes of the data grading model is the number of data feature dimensions and the number of its output-layer nodes is the number of sensitivity levels;
step 5.3: inputting the training set obtained in the step 5.1 into the neural network model constructed in the step 5.2 for training, and adopting a gradient descent algorithm to update the network in an iterative manner, wherein the gradient descent algorithm has the formula:
θ_{t+1} = θ_t - l_r · ∇J(θ_t)
wherein θ_t is the parameter set of the neural network at the t-th iteration, l_r is the network learning rate, and J(θ_t) is the loss function; after the iterative training is finished, the test-set data are input into the trained neural network, and a classification is judged correct if the output prediction is consistent with the actual label;
step 5.4: the differential privacy algorithm that adaptively allocates the privacy budget is as follows: after the data are normalized as in step 5.1, privacy budgets of different intensities are set for the training model of each scientific research management business department according to the size of its data set; a data set with a small data volume has a large global sensitivity, so a larger privacy budget is set to reduce the noise level, while a data set with a large data volume has a small global sensitivity, so a smaller privacy budget is set; the correspondence between the data set size and the privacy budget ε is preset.
Further, the step 6 includes:
step 6.1: initializing the parameters of each department's training model, namely the loss function L, the data set S(j), the auxiliary model h_j, the batch size B_j, the noise scale σ_j and the learning rate γ_j;
Step 6.2: the model trained by the j-th department is initialized as W_j, with W_j = h_j; training is performed on the j-th department's data set I with batch size B_j, and the gradient is updated by stochastic gradient descent:
W_j^{k+1} = W_j^k - γ_k · ∇f_j(W_j^k)
wherein γ_k is the learning rate, W_j^k is the training parameter of the j-th department in the k-th training round, and f_j is the loss function of the j-th department;
step 6.3: after training, the gradient is normalized:
ΔW_j^k = (W_j^k - W_k) / ‖W_j^k - W_k‖
wherein W_k is the initial weight parameter with which each department model participates in the k-th training round, and W_j^k is the training parameter of the j-th department model in the k-th training round;
step 6.4: and (3) adding noise to the gradient, wherein the formula is as follows:
W̃_j^k = ΔW_j^k + N(0, r·σ²·I)
wherein ΔW_j^k is the normalized training parameter update of the j-th client in the k-th training round, N(0, r·σ²·I) is normally distributed noise, r is the number of department data sets participating in the training, σ is the privacy budget, and I is the identity matrix;
step 6.5: after the training of the round is completed by each department model, the neural network parameters are uploaded to a central server, the central server aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to each department participating in the training of the round;
step 6.6: classifying and grading the data set with the trained neural network models: data of different sensitivity levels are matched with the corresponding privacy budgets and released after noise is added; the privacy protection budgets corresponding to the ultra-high sensitivity attribute group, the high sensitivity attribute group, the medium sensitivity attribute group and the low sensitivity attribute group are 0.1, 0.3, 0.5 and 0.7 respectively.
A scientific research data management system based on federated transfer learning comprises a data acquisition module, a data preprocessing module, a data characteristic extraction module, a data classification and grading module and a data desensitization and declassification module;
the data acquisition module is used for: acquiring data from each scientific research management business system, covering project management, scientific and technological achievement management, human resource management and financial management, and transmitting the acquired multi-source heterogeneous data, comprising structured, semi-structured and unstructured data, to the data preprocessing module;
the data preprocessing module is used for: converting various service system data acquired by the data acquisition module into a required format, correcting or eliminating abnormal data, and transmitting the processed data to the data characteristic extraction module;
the data characteristic extraction module is used for: extracting the length distribution features and character distribution features of character-type data strings, extracting word vectors of the character strings with natural language processing methods and performing named entity recognition, to form the data feature values of each scientific research management service department;
The data classification and grading module is used for: based on a federal transfer learning architecture, constructing a data classification hierarchical model, and constructing a data classification directory system according to domain, module and activity tertiary directories; classifying classified data according to influences caused by the destroyed safety attributes, and constructing a data sensitivity classification catalog system according to preset content sensitivity degrees of different data sets; the method comprises the steps of carrying out a first treatment on the surface of the In order to ensure the data privacy of each department, based on a federal transfer learning distributed architecture, a residual convolutional neural network is adopted, the data characteristic value of a scientific research management service system is taken as input, the data category in a data classification catalog is taken as a data classification model to be output, the data sensitivity level in the data sensitivity classification catalog is taken as a data classification model to be output, the data classification model and the data classification model are trained, and a differential privacy algorithm for adaptively distributing differential privacy budget is adopted to noise the model in the training process, so that the data privacy can be protected when the model is used for prediction;
the data desensitization and declassification module is used for: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to the classification and grading results of different data, applying different privacy budgets to add noise to the data set before it is published, thereby obtaining a desensitized, declassified data set.
Further, the data classification and grading module comprises:
a. data normalization
Dividing data in a data set into a training set and a testing set, and carrying out normalization processing on the data in the training set and the testing set by a maximum and minimum method, wherein a calculation formula is as follows:
x_k' = (x_k - X_min) / (X_max - X_min)
wherein the difference between the data X_k and the column minimum X_min is divided by the difference X_max - X_min, so that all data are converted to the interval [0,1] to eliminate order-of-magnitude differences between the dimensions;
b. construction of neural network model
The neuron excitation function of the BP neural network is the linear rectification function ReLU, and the loss function is the distance loss function MSE, i.e. the sum of the squared distances between the predicted value y_i and the true value ŷ_i:
MSE = (1/n) · Σ_i (y_i - ŷ_i)²
The number of input-layer nodes of the data classification model is the number of data feature dimensions and the number of its output-layer nodes is the number of data categories; similarly, a data grading model based on a BP neural network is constructed, whose number of input-layer nodes is the number of data feature dimensions and whose number of output-layer nodes is the number of sensitivity levels, divided into four levels: ultra-high, high, medium and low sensitivity.
c. Inputting the training set obtained in the step a into the neural network model constructed in the step b for training, and adopting a gradient descent algorithm to update the network in an iterative manner, wherein the gradient descent algorithm has the formula:
θ_{t+1} = θ_t - l_r · ∇J(θ_t)
wherein θ_t is the parameter set of the neural network at the t-th iteration, l_r is the network learning rate, and J(θ_t) is the loss function; after the iterative training is finished, the test-set data are input into the trained neural network, and a classification is judged correct if the output prediction is consistent with the actual label;
d. after normalizing the data set of each department, privacy budgets of different intensities are set for the training model of each scientific research management service department according to the size of its data set; a data set with a small data volume has a large global sensitivity, so a larger privacy budget is set to reduce the noise level, while a data set with a large data volume has a small global sensitivity, so a smaller privacy budget is set.
Further, the data desensitization and declassification module comprises the following steps:
(1) Initializing the parameters of each department's training model, namely the loss function L, the data set S(j), the auxiliary model h_j, the batch size B_j, the noise scale σ_j and the learning rate γ_j;
(2) The model trained by the j-th department is initialized as W_j, with W_j = h_j; training is performed on the j-th department's data set I with batch size B_j, and the gradient is updated by stochastic gradient descent:
W_j^{k+1} = W_j^k - γ_k · ∇f_j(W_j^k)
wherein γ_k is the learning rate, W_j^k is the training parameter of the j-th department in the k-th training round, and f_j is the loss function of the j-th department;
(3) After training, the gradient is normalized:
ΔW_j^k = (W_j^k - W_k) / ‖W_j^k - W_k‖
wherein W_k is the initial weight parameter with which each department model participates in the k-th training round, and W_j^k is the training parameter of the j-th department model in the k-th training round;
(4) Noise is added to the normalized gradient:
W̃_j^k = ΔW_j^k + N(0, r·σ²·I)
wherein ΔW_j^k is the normalized training parameter update of the j-th client in the k-th training round, N(0, r·σ²·I) is normally distributed noise, r is the number of department data sets participating in the training, σ is the privacy budget, and I is the identity matrix;
(5) After the training of the round is completed by each department model, the neural network parameters are uploaded to a central server, the central server aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to each department participating in the training of the round;
(6) Classifying and grading the data set with the trained neural network models: data of different sensitivity levels are matched with the corresponding privacy budgets and released after noise is added.
The beneficial effects are that:
1. Intelligent classification and grading of scientific research management data is realized with a neural network, solving the problem that the classification and grading of sensitive scientific research management data currently depend on labour-intensive manual work, effectively protecting data privacy and improving working efficiency.
2. Based on the federated transfer learning architecture, a differential privacy computation technique based on stochastic gradient descent is adopted while training the data classification and grading models, protecting the privacy of the training data while reducing the risk of leaking the network model parameters.
3. The privacy budget is distributed according to the size of the data set of each department, so that the influence of noise on the model is reduced, and the performance of the model is further improved while effective differential privacy assurance is provided.
4. Different privacy budgets are allocated for data sets with different sensitivity levels, so that privacy protection degrees with different intensities are provided, and the problems of low data availability and insufficient protection of sensitive attributes caused by average allocation of the privacy budgets are well solved.
Drawings
FIG. 1 is a schematic block diagram of a system according to the present invention
FIG. 2 is a flowchart of a scientific research data management algorithm based on federated transfer learning
FIG. 3 is a schematic diagram of a data classification hierarchical BP neural network topology
FIG. 4 is the federated transfer learning fusion architecture for the department models
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The invention mainly comprises
1. Based on scientific research management business systems such as scientific research project management, scientific and technological achievement management, human resource management and financial management, a data classification catalog system is constructed according to the domain, module and activity three-level catalogs, and a data grading catalog system is constructed according to the preset content sensitivity degrees of different data resources.
2. Based on the federated transfer learning architecture, a neural network model for classifying data across departments is constructed; joint training of the model is realized without the important private data of each department leaving its local environment, improving the model effect while ensuring data security.
3. In order to fully ensure the data privacy of each department, noise is added to the neural network model during training with an adaptive gradient descent method based on differential privacy computation, and the privacy budget is allocated according to the size of each department's data set, so that data privacy is effectively protected when the model is used for data classification and grading prediction.
4. The data feature vector of each data set is extracted with methods such as natural language processing. After normalization, the feature vectors are input into the trained data classification neural network and data grading neural network respectively, realizing intelligent classification and grading of each department's business data (a brief end-to-end sketch follows this list).
5. A differential privacy encryption technique is adopted to allocate privacy budgets of different intensities to data of different levels according to the grading of each department's data resources, protecting the original data and generating a noise-added public data set.
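As a brief end-to-end sketch of items 4 and 5, the Python fragment below normalizes a feature vector, feeds it to the two trained networks and looks up the privacy budget for the predicted sensitivity level; the dimensions, hidden-layer width and budget values are assumptions made only for this illustration, and the networks are constructed untrained here solely so the fragment is self-contained.

```python
import torch
from torch import nn

# Assumed dimensions, sensitivity levels and privacy budgets (illustrative only).
FEATURE_DIM, NUM_CLASSES = 32, 10
LEVELS = ["ultra_high", "high", "medium", "low"]
BUDGETS = {"ultra_high": 0.1, "high": 0.3, "medium": 0.5, "low": 0.7}

# In practice these two BP networks would already have been trained as described
# below; they are built untrained here only so that the snippet runs.
classification_net = nn.Sequential(nn.Linear(FEATURE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))
grading_net = nn.Sequential(nn.Linear(FEATURE_DIM, 64), nn.ReLU(), nn.Linear(64, len(LEVELS)))

def classify_and_grade(features: torch.Tensor) -> tuple[int, str, float]:
    """Normalize a raw feature vector, predict category and sensitivity level,
    and look up the privacy budget used when the record is published."""
    x = (features - features.min()) / (features.max() - features.min() + 1e-12)
    with torch.no_grad():
        category = classification_net(x).argmax().item()
        level = LEVELS[grading_net(x).argmax().item()]
    return category, level, BUDGETS[level]

print(classify_and_grade(torch.rand(FEATURE_DIM)))   # e.g. (3, 'medium', 0.5)
```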
As shown in fig. 1, the scientific research data management method and system based on federated transfer learning provided by the invention comprise the following steps:
step 1: and acquiring data of each business system for scientific research management such as project management, human resources, finance and the like, wherein the data comprises structured data, semi-structured data and unstructured data.
Step 2: and converting various service system data acquired by the data acquisition module into a required format, and cleaning, correcting and removing abnormal values.
Step 3: and carrying out hierarchical sampling according to the key attributes of the database table, and calculating the feature vector of the attribute field according to the metadata information of the data field, the statistical features of the data content and the like.
Step 3.1: extracting the pattern distribution features of the character strings, and using preset regular expressions, such as those for mailboxes, mobile phone numbers and identity card numbers, to check whether each character string matches them.
Step 3.2: after segmenting the character string with natural language processing techniques, word vectors are extracted with One-hot, TF-IDF and Word2Vec techniques to construct the text feature vector of the field.
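A minimal Python sketch of Step 3.1 and Step 3.2 follows; the concrete regular expressions, the whitespace tokenizer standing in for a real word segmenter, and the sample field values are assumptions made for the example rather than definitions taken from the patent.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Step 3.1: pattern-distribution features -- test a field value against preset
# regular expressions (mailbox, mobile phone number and identity card number are
# assumed examples of such patterns).
PATTERNS = {
    "mailbox": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "mobile":  re.compile(r"^1\d{10}$"),
    "id_card": re.compile(r"^\d{17}[\dXx]$"),
}

def pattern_features(value: str) -> list[int]:
    """One binary feature per preset regular expression."""
    return [int(bool(p.match(value))) for p in PATTERNS.values()]

# Step 3.2: after word segmentation, build TF-IDF text feature vectors for a field.
def text_features(field_values: list[str]):
    # A whitespace split stands in for a proper word segmenter here.
    vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
    return vectorizer.fit_transform(field_values)

if __name__ == "__main__":
    print(pattern_features("user@example.com"))                      # [1, 0, 0]
    print(text_features(["project budget report", "staff salary table"]).shape)
```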
Step 4: constructing a data classification catalog system according to the domain, module and activity three-level catalogs.
Table 1 scientific research data classification identification table
Step 5: on the basis of the data classification, grading the classified data according to the impact caused if their security attributes are compromised, to form a unified grading standard.
Table 2 scientific research data grading identification table
Step 6: in order to fully ensure the data privacy of each department, a federated transfer learning distributed architecture is adopted and a differential privacy encryption technique is applied to a residual convolutional neural network; the data feature values under each scientific research management service scenario are taken as input, the data category and the data sensitivity level are taken as outputs, and the data classification model and the data grading model are trained. Noise is added during training to obtain deep learning models with differential privacy protection, so that data privacy can be protected when the models are used for prediction.
Step 6.1: training data classification model
Step 6.1.1: data normalization
Dividing the data in the data set into a training set and a test set, and normalizing the data in both sets by the maximum-minimum method x_k' = (x_k - x_min)/(x_max - x_min), so that all data are converted to the interval [0,1] to eliminate order-of-magnitude differences between the dimensions.
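A short NumPy sketch of this column-wise maximum-minimum normalization is given below; the sample array is invented for illustration.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Column-wise maximum-minimum normalization, mapping every column into [0, 1]."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard against constant columns
    return (x - x_min) / span

features = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])
print(min_max_normalize(features))   # every column now lies in [0, 1]
```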
Step 6.1.2: construction of neural network model
The number of input-layer nodes of the network is the number of data feature dimensions and the number of output-layer nodes is the number of data categories. The neuron excitation function of the BP neural network is the linear rectification function ReLU, i.e. φ(x) = max(0, x), and the loss function is the distance loss function MSE, i.e. L(y, v^(m)) = ‖y - v^(m)‖².
Similarly, a data grading model based on a BP neural network is constructed. The number of input-layer nodes of that network is the number of data feature dimensions and the number of output-layer nodes is the number of sensitivity levels.
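The sketch below builds the two BP networks of Step 6.1.2 with PyTorch; the hidden-layer width and the concrete numbers of feature dimensions, data categories and sensitivity levels are assumptions made only for the example.

```python
import torch
from torch import nn

def build_bp_network(input_dim: int, output_dim: int, hidden: int = 64) -> nn.Sequential:
    """A BP (fully connected) network with the ReLU excitation function."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden),
        nn.ReLU(),                       # phi(x) = max(0, x)
        nn.Linear(hidden, output_dim),
    )

feature_dim = 32    # assumed number of data feature dimensions
num_classes = 10    # assumed number of data categories
num_levels = 4      # ultra-high / high / medium / low sensitivity

classification_model = build_bp_network(feature_dim, num_classes)
grading_model = build_bp_network(feature_dim, num_levels)
loss_fn = nn.MSELoss()                   # distance (MSE) loss described above
```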
Step 6.1.3: inputting the training set obtained in the step 6.1.1 into the neural network model constructed in the step 6.1.2 for training, and adopting a gradient descent algorithm to update the network in an iterative manner, wherein the gradient descent algorithm has the formula:
θ_{t+1} = θ_t - l_r · ∇J(θ_t)
wherein θ_t is the parameter set of the neural network at the t-th iteration, l_r is the network learning rate, and J(θ_t) is the loss function.
After the iterative training is finished, the data of the test set is input into a trained neural network, and whether the classification is correct or not is judged according to whether the output prediction result is consistent with the actual result or not.
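A sketch of such a training-and-evaluation loop (gradient descent on the MSE loss followed by an accuracy check on the test set) is shown below; it assumes the PyTorch models built above and one-hot encoded targets, which are choices of this example rather than requirements stated in the text.

```python
import torch
from torch import nn

def train_and_evaluate(model: nn.Module,
                       x_train: torch.Tensor, y_train: torch.Tensor,
                       x_test: torch.Tensor, y_test: torch.Tensor,
                       lr: float = 0.01, epochs: int = 200) -> float:
    """Iteratively update the network by gradient descent and report test accuracy.
    y_train / y_test are one-hot targets so that the MSE loss applies directly."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        predicted = model(x_test).argmax(dim=1)
        actual = y_test.argmax(dim=1)
        return (predicted == actual).float().mean().item()   # share of correct classifications
```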
Step 6.1.4: the data sets of different business fields, such as project management, human resources, finance and scientific quality, differ in size. With unequal amounts of data, if the same degree of privacy protection is maintained, i.e. the privacy budget ε of every department is kept equal, the noise added to each department's training model differs: the department with a small data volume adds more noise than the department with a large data volume, which inevitably harms the training of the federated learning model.
Therefore, after normalizing the sizes of the data sets of the departments, privacy budgets with different intensities are set for the training model of each department according to the sizes of the data sets. The global sensitivity of a data set with small data volume is high, and the larger privacy budget is set, so that the noise level is reduced; the global sensitivity of a data set with a large data volume is small and a relatively small privacy budget can be set. The correspondence of data set size to privacy budget is shown in table 3.
Table 3 data set size and privacy budget correspondence table
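Table 3 is reproduced only as an image in the original publication, so its exact thresholds are not available here; the helper below merely illustrates the stated principle, namely that small data sets receive a larger privacy budget (less noise) and large data sets a smaller one, with assumed size thresholds and ε values.

```python
def privacy_budget_for(dataset_size: int) -> float:
    """Map a department's data set size to a privacy budget epsilon.
    The thresholds and epsilon values below are illustrative assumptions,
    not the values of Table 3."""
    if dataset_size < 1_000:
        return 2.0      # small set, high global sensitivity -> larger budget, less noise
    if dataset_size < 10_000:
        return 1.0
    return 0.5          # large set, low global sensitivity -> smaller budget
```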
Step 6.1.5: based on the federated transfer learning distributed architecture, a differential privacy algorithm is adopted to add noise to the model parameters during model training.
(1) Initializing the parameters of each department's training model, namely the loss function L, the data set S(j), the auxiliary model h_j, the batch size B_j, the noise scale σ_j and the learning rate γ_j.
(2) The model trained by the j-th department is initialized as W_j, with W_j = h_j; training is performed on the j-th department's data set I with batch size B_j, and the gradient is updated by stochastic gradient descent:
W_j^{k+1} = W_j^k - γ_k · ∇f_j(W_j^k)
wherein γ_k is the learning rate, W_j^k is the training parameter of the j-th department in the k-th training round, and f_j is the loss function of the j-th department.
(3) After training, the gradient is normalized:
ΔW_j^k = (W_j^k - W_k) / ‖W_j^k - W_k‖
wherein W_k is the initial weight parameter with which each department model participates in the k-th training round, and W_j^k is the training parameter of the j-th department model in the k-th training round.
(4) Noise is added to the normalized gradient:
W̃_j^k = ΔW_j^k + N(0, r·σ²·I)
wherein ΔW_j^k is the normalized training parameter update of the j-th client in the k-th training round, N(0, r·σ²·I) is normally distributed noise, r is the number of department data sets participating in the training, σ is the privacy budget, and I is the identity matrix.
(5) After the models of all departments complete the training of the round, the neural network parameters are uploaded to a central server. The central server side aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to all departments participating in the training of the round.
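The NumPy sketch below runs through one round of steps (2) to (5) for several departments and the central server, using the formulas as reconstructed above; the gradient values, learning rate and noise scale are invented, and plain averaging is assumed for the server-side aggregation, which the text describes only as aggregating the received updates.

```python
import numpy as np

def dp_local_update(w_init: np.ndarray, grad: np.ndarray, lr: float,
                    r: int, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """One department's differentially private update for the current round.
    w_init: global weights received from the central server
    grad:   gradient of the department's loss at w_init
    r:      number of department data sets participating in training
    sigma:  noise scale / privacy budget parameter of this department
    """
    w_local = w_init - lr * grad                                    # (2) SGD step
    delta = w_local - w_init
    delta /= max(np.linalg.norm(delta), 1e-12)                      # (3) normalize the update
    delta += rng.normal(0.0, np.sqrt(r) * sigma, size=delta.shape)  # (4) Gaussian noise
    return w_init + delta                                           # noisy parameters to upload

def aggregate(updates: list[np.ndarray]) -> np.ndarray:
    """(5) Central server: average the uploaded parameters (an assumed choice)."""
    return np.mean(np.stack(updates), axis=0)

rng = np.random.default_rng(0)
w_global = np.zeros(8)
uploads = [dp_local_update(w_global, rng.normal(size=8), lr=0.1, r=3, sigma=0.3, rng=rng)
           for _ in range(3)]            # three departments participate in this round
w_global = aggregate(uploads)            # sent back to every participating department
print(w_global)
```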
Step 6.2: training a data grading model:
similar to step 6.1, a federated transfer learning distributed architecture is adopted and a differential privacy encryption technique is applied to a residual convolutional neural network; the data feature values under each scientific research management service scenario are taken as input and the data sensitivity level as output to train the data grading model. Noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy can be protected when the model is used for prediction.
The sensitivity of data sets of different levels differs, and so does the strength with which they need to be protected, so data of different levels must be matched with corresponding privacy budgets. The privacy protection budgets corresponding to the ultra-high sensitivity attribute group, the high sensitivity attribute group, the medium sensitivity attribute group and the low sensitivity attribute group are 0.1, 0.3, 0.5 and 0.7 respectively.
For example: assuming the financial statements belong to the ultra-high sensitivity group, the strength with which they must be protected is higher and more noise must be added, so the smaller privacy budget of 0.1 is allocated; the department organization structure belongs to the low-sensitivity data group, little noise needs to be added, and the larger privacy budget of 0.7 is allocated to it.
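A small sketch of this level-dependent publication step follows; the Laplace mechanism and the unit sensitivity are assumptions of the example (the text only specifies that noise matched to the privacy budget is added before release), while the budget values follow the mapping given above.

```python
import numpy as np

# Privacy budget per sensitivity level, following the mapping above.
BUDGETS = {"ultra_high": 0.1, "high": 0.3, "medium": 0.5, "low": 0.7}

def publish_with_noise(values: np.ndarray, level: str, sensitivity: float = 1.0,
                       seed: int | None = None) -> np.ndarray:
    """Add level-dependent noise to a numeric column before publication.
    The Laplace mechanism with unit sensitivity is an assumed choice."""
    rng = np.random.default_rng(seed)
    scale = sensitivity / BUDGETS[level]        # smaller budget -> larger noise
    return values + rng.laplace(0.0, scale, size=values.shape)

salaries = np.array([8000.0, 12000.0, 15000.0])
print(publish_with_noise(salaries, "ultra_high", seed=0))   # heavily noised financial figures
print(publish_with_noise(salaries, "low", seed=0))          # lightly noised low-sensitivity data
```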
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. The scientific research data management method based on federated transfer learning is characterized by comprising the following steps:
step 1: the data acquisition module acquires data of each department of scientific research management service and transmits the data to the data preprocessing module;
step 2: the data preprocessing module cleans, corrects and removes abnormal values of the data of each department of the scientific research management service acquired by the data acquisition module, and transmits the processed data to the data characteristic extraction module;
step 3: the data feature extraction module performs hierarchical sampling according to key attributes of the database table, extracts length distribution features and character distribution features of character type data character strings, extracts word vectors of the character strings by using a natural language processing method, and performs named entity recognition to form data feature values of each department of scientific research management service;
step 4: constructing a data classification catalog system according to the domain, module and activity three-level catalogs; grading the classified data according to the impact caused if their security attributes are compromised, and constructing a data sensitivity grading catalog system according to the preset content sensitivity degrees of different data sets;
step 5: based on a federated transfer learning distributed architecture, a residual convolutional neural network is adopted; the data feature values of the scientific research management service systems are taken as input, the data category in the data classification catalog is taken as the output of the data classification model, and the data sensitivity level in the data sensitivity grading catalog is taken as the output of the data grading model; the data classification model and the data grading model are trained, and a differential privacy algorithm that adaptively allocates the differential privacy budget is used to add noise to the models during training, so that data privacy can be protected when the models are used for prediction;
step 6: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to the classification and grading results of different data, applying different privacy budgets to add noise to the data set before it is published, thereby obtaining a desensitized, declassified data set.
2. The scientific research data management method based on federated transfer learning according to claim 1, wherein the step 3 includes:
step 3.1: extracting the pattern distribution features of the character strings, and using preset regular expressions to check whether each character string matches them;
step 3.2: after segmenting the character string with natural language processing techniques, extracting word vectors with One-hot, TF-IDF and Word2Vec techniques to construct the text feature vector of the field.
3. The scientific research data management method based on federated transfer learning according to claim 1, wherein the step 5 comprises:
step 5.1: data normalization
Dividing data in a data set into a training set and a testing set, and carrying out normalization processing on the data in the training set and the testing set by a maximum and minimum method, wherein a calculation formula is as follows:
x_k' = (x_k - X_min) / (X_max - X_min)
wherein the difference between the data X_k and the column minimum X_min is divided by the difference X_max - X_min, so that all data are converted to the interval [0,1] to eliminate order-of-magnitude differences between the dimensions;
step 5.2: construction of neural network model
The neuron excitation function of the BP neural network is the linear rectification function ReLU, and the loss function is the distance loss function MSE, i.e. the sum of the squared distances between the predicted value y_i and the true value ŷ_i:
MSE = (1/n) · Σ_i (y_i - ŷ_i)²
the number of input-layer nodes of the data classification model is the number of data feature dimensions and the number of its output-layer nodes is the number of data categories; the number of input-layer nodes of the data grading model is the number of data feature dimensions and the number of its output-layer nodes is the number of sensitivity levels;
step 5.3: inputting the training set obtained in the step 5.1 into the neural network model constructed in the step 5.2 for training, and adopting a gradient descent algorithm to update the network in an iterative manner, wherein the gradient descent algorithm has the formula:
θ_{t+1} = θ_t - l_r · ∇J(θ_t)
wherein θ_t is the parameter set of the neural network at the t-th iteration, l_r is the network learning rate, and J(θ_t) is the loss function; after the iterative training is finished, the test-set data are input into the trained neural network, and a classification is judged correct if the output prediction is consistent with the actual label;
step 5.4: the differential privacy algorithm for adaptively distributing the privacy budget is as follows: after normalizing the data in step 5.1, setting privacy budgets with different intensities for training models of each scientific research management business department according to the size of the data set, wherein the data set with small data volume has large global sensitivity, and setting larger privacy budgets so as to reduce noise level; the global sensitivity of a data set with a large data volume is small and a smaller privacy budget is set.
4. The scientific research data management method based on federated transfer learning according to claim 1, wherein the step 6 comprises:
step 6.1: initializing the parameters of each department's training model, namely the loss function L, the data set S(j), the auxiliary model h_j, the batch size B_j, the noise scale σ_j and the learning rate γ_j;
Step 6.2: the model trained by the j-th department is initialized as W_j, with W_j = h_j; training is performed on the j-th department's data set I with batch size B_j, and the gradient is updated by stochastic gradient descent:
W_j^{k+1} = W_j^k - γ_k · ∇f_j(W_j^k)
wherein γ_k is the learning rate, W_j^k is the training parameter of the j-th department in the k-th training round, and f_j is the loss function of the j-th department;
step 6.3: after training, the gradient is normalized:
ΔW_j^k = (W_j^k - W_k) / ‖W_j^k - W_k‖
wherein W_k is the initial weight parameter with which each department model participates in the k-th training round, and W_j^k is the training parameter of the j-th department model in the k-th training round;
step 6.4: and (3) adding noise to the gradient, wherein the formula is as follows:
W̃_j^k = ΔW_j^k + N(0, r·σ²·I)
wherein ΔW_j^k is the normalized training parameter update of the j-th client in the k-th training round, N(0, r·σ²·I) is normally distributed noise, r is the number of department data sets participating in the training, σ is the privacy budget, and I is the identity matrix;
step 6.5: after the training of the round is completed by each department model, the neural network parameters are uploaded to a central server, the central server aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to each department participating in the training of the round;
step 6.6: classifying and grading the data set with the trained neural network models: data of different sensitivity levels are matched with the corresponding privacy budgets and released after noise is added.
5. The scientific research data management system based on federated transfer learning is characterized by comprising a data acquisition module, a data preprocessing module, a data characteristic extraction module, a data classification and grading module and a data desensitization and declassification module;
the data acquisition module is used for: acquiring data from each scientific research management business system, covering project management, scientific and technological achievement management, human resource management and financial management, and transmitting the acquired multi-source heterogeneous data, comprising structured, semi-structured and unstructured data, to the data preprocessing module;
the data preprocessing module is used for: converting various service system data acquired by the data acquisition module into a required format, correcting or eliminating abnormal data, and transmitting the processed data to the data characteristic extraction module;
the data characteristic extraction module is used for: extracting the length distribution features and character distribution features of character-type data strings, extracting word vectors of the character strings with natural language processing methods and performing named entity recognition, to form the data feature values of each scientific research management service department;
The data classification and grading module is used for: based on a federal transfer learning architecture, constructing a data classification hierarchical model, and constructing a data classification directory system according to domain, module and activity tertiary directories; classifying classified data according to influences caused by the destroyed safety attributes, and constructing a data sensitivity classification catalog system according to preset content sensitivity degrees of different data sets; the method comprises the steps of carrying out a first treatment on the surface of the In order to ensure the data privacy of each department, based on a federal transfer learning distributed architecture, a residual convolutional neural network is adopted, the data characteristic value of a scientific research management service system is taken as input, the data category in a data classification catalog is taken as a data classification model to be output, the data sensitivity level in the data sensitivity classification catalog is taken as a data classification model to be output, the data classification model and the data classification model are trained, and a differential privacy algorithm for adaptively distributing differential privacy budget is adopted to noise the model in the training process, so that the data privacy can be protected when the model is used for prediction;
the data desensitization and declassification module is used for: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to the classification and grading results of different data, applying different privacy budgets to add noise to the data set before it is published, thereby obtaining a desensitized, declassified data set.
6. The scientific research data management system based on federated transfer learning of claim 5, wherein the data classification and grading module comprises:
a. data normalization
Dividing data in a data set into a training set and a testing set, and carrying out normalization processing on the data in the training set and the testing set by a maximum and minimum method, wherein a calculation formula is as follows:
x_k' = (x_k - X_min) / (X_max - X_min)
wherein the difference between the data X_k and the column minimum X_min is divided by the difference X_max - X_min, so that all data are converted to the interval [0,1] to eliminate order-of-magnitude differences between the dimensions;
b. construction of neural network model
The neuron excitation function of the BP neural network is the linear rectification function ReLU, and the loss function is the distance loss function MSE, i.e. the sum of the squared distances between the predicted value y_i and the true value ŷ_i:
MSE = (1/n) · Σ_i (y_i - ŷ_i)²
the number of input-layer nodes of the data classification model is the number of data feature dimensions and the number of its output-layer nodes is the number of data categories; the number of input-layer nodes of the data grading model is the number of data feature dimensions and the number of its output-layer nodes is the number of sensitivity levels;
c. inputting the training set obtained in the step a into the neural network model constructed in the step b for training, and adopting a gradient descent algorithm to update the network in an iterative manner, wherein the gradient descent algorithm has the formula:
θ_{t+1} = θ_t - l_r · ∇J(θ_t)
wherein θ_t is the parameter set of the neural network at the t-th iteration, l_r is the network learning rate, and J(θ_t) is the loss function; after the iterative training is finished, the test-set data are input into the trained neural network, and a classification is judged correct if the output prediction is consistent with the actual label;
d. after normalizing the data set of each department, privacy budgets of different intensities are set for the training model of each scientific research management service department according to the size of its data set; a data set with a small data volume has a large global sensitivity, so a larger privacy budget is set to reduce the noise level, while a data set with a large data volume has a small global sensitivity, so a smaller privacy budget is set.
7. The scientific research data management system based on federated transfer learning of claim 5, wherein the data desensitization and declassification module comprises the following steps:
(1) Initializing the parameters of each department's training model, namely the loss function L, the data set S(j), the auxiliary model h_j, the batch size B_j, the noise scale σ_j and the learning rate γ_j;
(2) The model trained by the j-th department is initialized as W_j, with W_j = h_j; training is performed on the j-th department's data set I with batch size B_j, and the gradient is updated by stochastic gradient descent:
W_j^{k+1} = W_j^k - γ_k · ∇f_j(W_j^k)
wherein γ_k is the learning rate, W_j^k is the training parameter of the j-th department in the k-th training round, and f_j is the loss function of the j-th department;
(3) After training, the gradient is normalized:
ΔW_j^k = (W_j^k - W_k) / ‖W_j^k - W_k‖
wherein W_k is the initial weight parameter with which each department model participates in the k-th training round, and W_j^k is the training parameter of the j-th department model in the k-th training round;
(4) Noise is added to the normalized gradient:
W̃_j^k = ΔW_j^k + N(0, r·σ²·I)
wherein ΔW_j^k is the normalized training parameter update of the j-th client in the k-th training round, N(0, r·σ²·I) is normally distributed noise, r is the number of department data sets participating in the training, σ is the privacy budget, and I is the identity matrix;
(5) After the training of the round is completed by each department model, the neural network parameters are uploaded to a central server, the central server aggregates the received updated parameters of all clients participating in the training of the round, updates the global parameters and sends the updated parameters to each department participating in the training of the round;
(6) Classifying and grading the data set with the trained neural network models: data of different sensitivity levels are matched with the corresponding privacy budgets and released after noise is added.
CN202211579085.4A 2022-12-08 2022-12-08 Scientific research data management method and system based on federal migration learning Pending CN116305233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211579085.4A CN116305233A (en) 2022-12-08 2022-12-08 Scientific research data management method and system based on federal migration learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211579085.4A CN116305233A (en) 2022-12-08 2022-12-08 Scientific research data management method and system based on federal migration learning

Publications (1)

Publication Number Publication Date
CN116305233A true CN116305233A (en) 2023-06-23

Family

ID=86782213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211579085.4A Pending CN116305233A (en) 2022-12-08 2022-12-08 Scientific research data management method and system based on federal migration learning

Country Status (1)

Country Link
CN (1) CN116305233A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056979A (en) * 2023-10-11 2023-11-14 杭州金智塔科技有限公司 Service processing model updating method and device based on user privacy data
CN117056979B (en) * 2023-10-11 2024-03-29 杭州金智塔科技有限公司 Service processing model updating method and device based on user privacy data

Similar Documents

Publication Publication Date Title
CN109783639B (en) Mediated case intelligent dispatching method and system based on feature extraction
CN103177114B (en) Based on the shift learning sorting technique across data field differentiating stream shape
CN110704694B (en) Organization hierarchy dividing method based on network representation learning and application thereof
CN109190698B (en) Classification and identification system and method for network digital virtual assets
CN116305233A (en) Scientific research data management method and system based on federal migration learning
CN112085086A (en) Multi-source transfer learning method based on graph convolution neural network
CN113869384B (en) Privacy protection image classification method based on field self-adaption
CN111985680A (en) Criminal multi-criminal name prediction method based on capsule network and time sequence
Zhang et al. A review on the construction of business intelligence system based on unstructured image data
Niu et al. Using image feature extraction to identification of ancient ceramics based on partial differential equation
CN113139603A (en) Federal learning method based on EMD distance fusion multi-source heterogeneous data
CN112464289A (en) Method for cleaning private data
CN106529601A (en) Image classification prediction method based on multi-task learning in sparse subspace
CN116051924A (en) Divide-and-conquer defense method for image countermeasure sample
CN115034762A (en) Post recommendation method and device, storage medium, electronic equipment and product
CN110880047A (en) Enterprise project service closed-loop information processing modeling method
Lv Cloud Computation-Based Clustering Method for Nonlinear Complex Attribute Big Data
CN109271593A (en) Military project group personal information labeling process based on cluster
Kanikar et al. Extracting actionable association rules from multiple datasets
CN114372559A (en) Construction method of deep adaptive network based on robust soft label
Zhang et al. [Retracted] Application of Artificial Neural Network Algorithm in Facial Biological Image Information Scanning and Recognition
Rahman Reframing in Clustering: An Introductory Survey
Ge et al. Research on Seafood Traceable Data Based on k-Modes Clustering Algorithm
CN117935936A (en) Single cell RNA-seq data clustering method, device, equipment and medium
Ghore Data Mining used of Neural Networks Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination