CN116305233A

CN116305233A - Scientific research data management method and system based on federated transfer learning

Info

Publication number: CN116305233A
Application number: CN202211579085.4A
Authority: CN
Inventors: 徐舒; 徐艺; 郭旭周; 顾勇; 许小伟; 张跃; 刘思娴; 朱鹏; 孙昊
Original assignee: Nanjing Panda Electronics Co Ltd; Nanjing Panda Information Industry Co Ltd
Current assignee: Nanjing Panda Electronics Co Ltd; Nanjing Panda Information Industry Co Ltd
Priority date: 2022-12-08
Filing date: 2022-12-08
Publication date: 2023-06-23

Abstract

The invention discloses a scientific research data management method and system based on federated transfer learning. A data acquisition module acquires data from each business system; a data preprocessing module converts the acquired data into the required format, cleans and corrects it, and removes outliers; a data feature extraction module performs stratified sampling according to the key attributes of the database tables and forms feature vectors for character-type data; and a data classification and grading model is constructed on a federated transfer learning architecture. Noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy is protected when the model is used for prediction. The invention improves model performance while keeping the data secure.

Description

Scientific research data management method and system based on federated transfer learning

Technical Field

The invention relates to a scientific research data management method and system based on federated transfer learning, and belongs to the technical field of data management.

Background

Scientific research management is a core means of promoting an enterprise's scientific development and allocating its research resources. Enterprise research management data comes in many types and large volumes, spans business areas such as project management, integration of production and research, human resources, commercialization of research results, and finance, and produces large amounts of structured, semi-structured and unstructured professional data. Enterprise research management currently faces complex challenges: information asymmetry between business systems, difficulty interconnecting data, incomplete data, and wide domain spans. These seriously hinder the integrated sharing and innovative application of research data. Enterprises therefore need to break down the barriers between systems, promote data governance, data sharing and data accumulation, support decision making, and accelerate digital transformation.

Disclosure of Invention

Purpose of the invention: to provide a scientific research data management method and system based on federated transfer learning that improves model performance while keeping the data secure.

Technical solution: a scientific research data management method based on federated transfer learning comprises the following steps:

step 1: the data acquisition module acquires data from each department of the scientific research management business and transmits it to the data preprocessing module;

step 2: the data preprocessing module cleans and corrects the data acquired by the data acquisition module, removes outliers, and transmits the processed data to the data feature extraction module;

step 3: the data feature extraction module performs stratified sampling according to the key attributes of the database tables, extracts the length distribution features and character distribution features of character-type strings, extracts word vectors of the strings using natural language processing, and performs named entity recognition, forming the data feature values of each department of the scientific research management business;

step 4: constructing a data classification catalog system as a three-level catalog of domain, module and activity; grading the classified data according to the impact caused if its security attributes are compromised, and constructing a data sensitivity grading catalog system according to the preset content sensitivity of the different data sets;

step 5: based on a federated transfer learning distributed architecture and using a residual convolutional neural network, taking the data feature values of the scientific research management business systems as input, taking the data category in the data classification catalog as the output of the data classification model and the data sensitivity level in the data sensitivity grading catalog as the output of the data grading model, training the data classification model and the data grading model, and adding noise to the models during training with a differential privacy algorithm that adaptively allocates the privacy budget, so that data privacy is protected when the models are used for prediction;

step 6: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to those results, adding noise to the data sets with different privacy budgets before publishing them, obtaining desensitized and declassified data sets.

Further, step 3 includes:

step 3.1: extracting the character pattern distribution features of the strings by testing whether each string matches preset regular expressions;

step 3.2: segmenting the strings into words using natural language processing, extracting word vectors using One-hot, TF-IDF or Word2Vec, and constructing the text feature vector of the field.
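As an illustration of the length, character and pattern features described in step 3 and step 3.1, the following is a minimal sketch in Python. The three regular expressions, the function names `char_class` and `column_features`, and the feature names are assumptions made for this example; the source only states that preset regular expressions are used and does not give the patterns.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; real rules would be stricter.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "cn_mobile": re.compile(r"^1[3-9]\d{9}$"),
    "cn_id_card": re.compile(r"^\d{17}[\dXx]$"),
}


def char_class(ch):
    """Map one character to a coarse class used for the character distribution."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def column_features(values):
    """Build a feature dict for one sampled, non-empty column of strings."""
    n = len(values)
    lengths = [len(v) for v in values]
    classes = Counter(char_class(ch) for v in values for ch in v)
    total_chars = sum(classes.values()) or 1
    features = {
        "mean_len": sum(lengths) / n,
        "min_len": min(lengths),
        "max_len": max(lengths),
    }
    # Character distribution: share of digits, letters and other characters.
    for name in ("digit", "letter", "other"):
        features["frac_" + name] = classes[name] / total_chars
    # Pattern distribution: share of values matching each preset expression.
    for name, pattern in PATTERNS.items():
        features["match_" + name] = sum(1 for v in values if pattern.match(v)) / n
    return features


sample = ["13812345678", "15900001111", "alice@example.com"]
feats = column_features(sample)
```

A column whose `match_cn_mobile` share is close to 1 would be a strong hint that the field holds phone numbers, which is the kind of signal the classification and grading models can consume.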

Further, step 5 includes:

step 5.1: data normalization

Dividing the data in the data set into a training set and a test set, and normalizing the data in both sets by the min-max method:

$$X_k = \frac{X_k - X_{\min}}{X_{\max} - X_{\min}}$$

where the difference between the data value $X_k$ and the minimum value $X_{\min}$ of its column is divided by the range $X_{\max} - X_{\min}$, so that all data are mapped to $[0,1]$ and order-of-magnitude differences between the dimensions are eliminated;

step 5.2: construction of the neural network model

The neuron activation function of the BP neural network is the rectified linear unit ReLU, $\varphi(x) = \max(0, x)$, and the loss function is the distance loss MSE, which computes the sum of squared distances between the predicted value $y_i$ and the true value:

$$L(y, v^{(m)}) = \lVert y - v^{(m)} \rVert^2$$

The number of input-layer nodes of the data classification model is the number of data dimensions used for classification, and the number of output-layer nodes is the number of data categories; the number of input-layer nodes of the data grading model is the number of data dimensions used for grading, and the number of output-layer nodes is the number of sensitivity levels;

step 5.3: inputting the training set obtained in step 5.1 into the neural network model constructed in step 5.2 for training, and iteratively updating the network with the gradient descent algorithm:

$$\theta_{t+1} = \theta_t - l_r \nabla J(\theta_t)$$

where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration, $l_r$ is the learning rate of the network, and $J(\theta_t)$ is the loss function; after iterative training is finished, the test set is input into the trained neural network, and whether each classification is correct is judged by whether the predicted output matches the actual label;

step 5.4: the differential privacy algorithm that adaptively allocates the privacy budget is as follows: after the data are normalized as in step 5.1, privacy budgets of different strengths are set for the training model of each scientific research management department according to the size of its data set. A data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a smaller privacy budget is set. The data set size $\sigma$ and the privacy budget $\varepsilon$ are related by $\varepsilon = 2^{-\sigma}$.
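The relation between data set size and privacy budget in step 5.4 can be sketched as follows. This is a minimal illustration that assumes the size σ is the min-max normalized row count of each department, so that σ lies in [0, 1]; the department names, the row counts, and the function names `normalize_sizes` and `privacy_budget` are hypothetical.

```python
def normalize_sizes(sizes):
    """Min-max normalize raw data set sizes to [0, 1]."""
    lo, hi = min(sizes), max(sizes)
    if hi == lo:
        # All departments have the same size; treat them all as the smallest.
        return [0.0 for _ in sizes]
    return [(s - lo) / (hi - lo) for s in sizes]


def privacy_budget(sigma):
    """Privacy budget epsilon = 2 ** (-sigma) for a normalized size sigma."""
    return 2.0 ** (-sigma)


# Hypothetical row counts per department.
dept_sizes = {"project": 1000, "hr": 5000, "finance": 9000}
sigmas = normalize_sizes(list(dept_sizes.values()))
budgets = {dept: privacy_budget(s) for dept, s in zip(dept_sizes, sigmas)}
```

Under this assumption the smallest department receives the largest budget (ε = 1) and the largest department the smallest (ε = 0.5), which matches the direction described in step 5.4.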

Further, the step 6 includes:

step 6.1: initializing the loss function, noise scale, learning rate and other parameters of each department's training model, including: loss function $L$, data set $S(j)$, auxiliary model $h^j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$;

step 6.2: the initial model for training the $j$-th department model is $W^j$, with $W^j = h^j$; the model is trained on the $j$-th department's data set $I$ with batch size $B_j$, and the gradient is updated by stochastic gradient descent:

$$W_{k+1}^j = W_k^j - \gamma_k \nabla f_j(W_k^j)$$

where $\gamma_k$ is the learning rate, $W_k^j$ denotes the training parameters of the $j$-th department in the $k$-th training round, and $f_j$ is the loss function of the $j$-th department;

step 6.3: after training, the gradient is normalized, where $W_k$ denotes the initial weight parameters with which each department model enters training, and $W_k^j$ denotes the training parameters of the $j$-th department model in the $k$-th training round;

step 6.4: noise is added to the gradient, where the noised quantity is the training parameters of the $j$-th client in the $k$-th training round, the added term is zero-mean normally distributed noise, $r$ is the number of department data sets participating in training, $\sigma$ is the privacy budget, and $I$ is the department data set;

step 6.5: after each department model completes the current training round, it uploads its neural network parameters to the central server; the central server aggregates the updated parameters received from all clients participating in the round, updates the global parameters, and sends them to each participating department;

step 6.6: classifying and grading the data sets with the trained neural network models: data of different sensitivity levels are matched with corresponding privacy budgets and published after noise is added. The privacy budgets for the ultra-high, high, medium and low sensitivity attribute groups are 0.1, 0.3, 0.5 and 0.7, respectively.

A scientific research data management system based on federated transfer learning comprises a data acquisition module, a data preprocessing module, a data feature extraction module, a data classification and grading module, and a data desensitization and declassification module;

the data acquisition module is used for: acquiring data from each scientific research management business system, including project management, scientific and technological achievement management, human resource management and financial management, and transmitting the acquired multi-source heterogeneous data (structured, semi-structured and unstructured) to the data preprocessing module;

the data preprocessing module is used for: converting the business system data acquired by the data acquisition module into the required format, correcting or eliminating abnormal data, and transmitting the processed data to the data feature extraction module;

the data feature extraction module is used for: extracting the length distribution features and character distribution features of character-type strings; extracting word vectors of the strings using natural language processing, and performing named entity recognition, forming the data feature values of each department of the scientific research management business;

the data classification and grading module is used for: constructing the data classification and grading models based on a federated transfer learning architecture; constructing a data classification catalog system as a three-level catalog of domain, module and activity; grading the classified data according to the impact caused if its security attributes are compromised, and constructing a data sensitivity grading catalog system according to the preset content sensitivity of the different data sets; to protect the data privacy of each department, based on a federated transfer learning distributed architecture and using a residual convolutional neural network, taking the data feature values of the scientific research management business systems as input, taking the data category in the data classification catalog as the output of the data classification model and the data sensitivity level in the data sensitivity grading catalog as the output of the data grading model, training the data classification model and the data grading model, and adding noise to the models during training with a differential privacy algorithm that adaptively allocates the privacy budget, so that data privacy is protected when the models are used for prediction;

the data desensitization and declassification module is used for: inputting the data to be tested into the data classification model and the data grading model to obtain the corresponding classification and grading results, and, according to those results, adding noise to the data sets with different privacy budgets before publishing them, obtaining desensitized and declassified data sets.

Further, the data classification and grading module comprises:

a. data normalization

b. construction of the neural network model

The loss function computes the sum of squared distances between the predicted value and the true value. The number of input-layer nodes of the data classification model is the number of data dimensions used for classification, and the number of output-layer nodes is the number of data categories. A data grading model based on a BP neural network is constructed in the same way: the number of input-layer nodes is the number of data dimensions used for grading, and the output-layer nodes correspond to the sensitivity levels, which are divided into four levels: ultra-high, high, medium and low.

c. Inputting the training set obtained in step a into the neural network model constructed in step b for training, and iteratively updating the network with the gradient descent algorithm.

d. After the data set of each department is normalized, privacy budgets of different strengths are set for the training model of each scientific research management department according to the size of its data set. A data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a smaller privacy budget is set.

Further, the data desensitization and declassification module performs the following steps:

(1) Initializing the loss function, noise scale, learning rate and other parameters of each department's training model, including: loss function $L$, data set $S(j)$, auxiliary model $h^j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$;

(2) The initial model for training the $j$-th department model is $W^j$, with $W^j = h^j$; the model is trained on the $j$-th department's data set $I$ with batch size $B_j$, and the gradient is updated by stochastic gradient descent, where $\gamma_k$ is the learning rate;

(3) After training, the gradient is normalized, using the training parameters of the $j$-th department model in the $k$-th training round;

(4) Noise is added to the gradient;

(5) After each department model completes the current training round, it uploads its neural network parameters to the central server; the central server aggregates the updated parameters received from all clients participating in the round, updates the global parameters, and sends them to each participating department;

(6) Classifying and grading the data sets with the trained neural network models: data of different sensitivity levels are matched with corresponding privacy budgets and published after noise is added.

Beneficial effects:

1. A neural network is used to classify and grade scientific research management data intelligently, which solves the problem that classifying and grading sensitive research management data currently relies on manual work and consumes large amounts of labor, effectively protects data privacy, and improves working efficiency.

2. Based on the federated transfer learning architecture, a differential privacy technique based on stochastic gradient descent is used while training the data classification and grading models, which protects the privacy of the training data and reduces the risk of leaking network model parameters.

3. The privacy budget is allocated according to the size of each department's data set, which reduces the influence of noise on the model and further improves model performance while still providing an effective differential privacy guarantee.

4. Different privacy budgets are allocated to data sets of different sensitivity levels, providing privacy protection of different strengths; this addresses the low data availability and insufficient protection of sensitive attributes that result from allocating the privacy budget evenly.

Drawings

FIG. 1 is a schematic block diagram of the system according to the present invention.

FIG. 2 is a flowchart of the scientific research data management algorithm based on federated transfer learning.

FIG. 3 is a schematic diagram of the BP neural network topology for data classification and grading.

FIG. 4 shows the federated transfer learning fusion architecture for the department models.

Detailed Description

The technical scheme of the invention is further described below with reference to the accompanying drawings.

The invention mainly comprises the following:

1. Based on scientific research management business systems such as research project management, scientific and technological achievement management, human resource management and financial management, a data classification catalog system is constructed as a three-level catalog of domain, module and activity, and a data grading catalog system is constructed according to the preset content sensitivity of the different data resources.

2. Based on the federated transfer learning architecture, neural network models for classifying and grading data across departments are constructed. The models are trained jointly on the premise that each department's important private data never leaves that department, which improves model performance while keeping the data secure.

3. To fully protect the data privacy of each department, noise is added to the neural network model during training using an adaptive gradient descent method based on differential privacy, and the privacy budget is allocated according to the size of each department's data set, so that data privacy is effectively protected when the model is used to predict data classes and grades.

4. The data feature vector of each data set is extracted using methods such as natural language processing. After normalization, the feature vectors are input into the trained data classification neural network and the trained data grading neural network, respectively, to classify and grade the business data of each department intelligently.

5. Differential privacy is applied to allocate privacy budgets of different strengths to data of different levels according to how each department's data resources are graded, protecting the privacy of the original data and generating a noised public data set.

As shown in FIG. 1, the scientific research data management method and system based on federated transfer learning provided by the invention comprise the following steps:

Step 1: acquiring data from each scientific research management business system, such as project management, human resources and finance, where the data comprise structured, semi-structured and unstructured data.

Step 2: converting the business system data acquired by the data acquisition module into the required format, cleaning and correcting it, and removing outliers.

Step 3: performing stratified sampling according to the key attributes of the database tables, and computing the feature vector of each attribute field from the metadata of the data field, the statistical features of the data content, and similar information.

Step 3.1: extracting the character pattern distribution features of the strings by testing whether each string matches preset regular expressions, such as those for email addresses, mobile phone numbers and identity card numbers.

Step 3.2: segmenting the strings into words using natural language processing, extracting word vectors using techniques such as One-hot, TF-IDF and Word2Vec, and constructing the text feature vector of the field.
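The TF-IDF option named in Step 3.2 can be illustrated with a small pure-Python sketch. It assumes the strings have already been segmented into tokens, uses the unsmoothed form idf = log(N / df), and expects every document to be non-empty; the function name `tfidf_vectors` and the sample tokens are hypothetical and are not taken from the source.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of non-empty token lists. Returns (vocabulary, TF-IDF vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency (count / document length) times inverse document frequency.
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, vectors


docs = [
    ["project", "budget", "report"],
    ["budget", "invoice"],
    ["project", "plan"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is one field's text feature vector over the shared vocabulary; tokens that appear in every document would receive weight 0 under this unsmoothed variant.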

Step 4: constructing a data classification catalog system as a three-level catalog of domain, module and activity.

Table 1 scientific research data classification identification table

Step 5: on the basis of the data classification, grading the classified data according to the impact caused if its security attributes are compromised, forming a unified grading standard.

Table 2 scientific research data grading identification table

Step 6: to fully protect the data privacy of each department, a federated transfer learning distributed architecture and differential privacy are adopted; based on a residual convolutional neural network, the data feature values of each research management business scenario are taken as input and the data category and data sensitivity level as outputs, and a data classification model and a data grading model are trained. Noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy is protected when the model is used for prediction.

Step 6.1: training data classification model

Step 6.1.1: data normalization

The data in the data set are divided into a training set and a test set, and the data in both sets are normalized by the min-max method, $x_k = (x_k - x_{\min})/(x_{\max} - x_{\min})$, so that all data are mapped to $[0,1]$ and order-of-magnitude differences between the dimensions are eliminated.
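The min-max normalization above can be sketched column by column as follows. One assumption is added here that the source does not state: the minimum and maximum are computed on the training set and then reused for the test set, so test values can fall slightly outside [0, 1]. The function names and sample rows are hypothetical.

```python
def fit_min_max(rows):
    """Return per-column minimums and maximums of a list of numeric rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def apply_min_max(rows, mins, maxs):
    """Scale each column with (x - min) / (max - min); constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


train = [[1.0, 200.0], [3.0, 400.0], [5.0, 1000.0]]
test = [[2.0, 600.0]]

mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
```

After scaling, both columns span the same range even though the raw second column is two orders of magnitude larger than the first.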

Step 6.1.2: construction of neural network model

The number of input-layer nodes of the network is the number of data dimensions used for classification, and the number of output-layer nodes is the number of data categories. The neuron activation function of the BP neural network is the rectified linear unit ReLU, $\varphi(x) = \max(0, x)$, and the loss function is the distance loss MSE, $L(y, v^{(m)}) = \lVert y - v^{(m)} \rVert^2$.

A data grading model based on a BP neural network is constructed in the same way: the number of input-layer nodes is the number of data dimensions used for grading, and the number of output-layer nodes is the number of sensitivity levels.

Step 6.1.3: the training set obtained in step 6.1.1 is input into the neural network model constructed in step 6.1.2 for training, and the network is updated iteratively with the gradient descent algorithm:

$$\theta_{t+1} = \theta_t - l_r \nabla J(\theta_t)$$

where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration, $l_r$ is the learning rate of the network, and $J(\theta_t)$ is the loss function.

After iterative training is finished, the test set is input into the trained neural network, and whether each classification is correct is judged by whether the predicted output matches the actual label.
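To make the training loop of Step 6.1.3 concrete, here is a deliberately tiny sketch: a single ReLU unit trained with the MSE loss and the standard update "parameters minus learning rate times gradient". The one-unit model, the toy data, the learning rate and the step count are all assumptions chosen so the example runs in pure Python; the network described in the source is a multi-layer model.

```python
def relu(z):
    """Rectified linear unit, max(0, z)."""
    return max(0.0, z)


def mse(params, data):
    """Mean squared error of the unit y = relu(w * x + b) on (x, y) pairs."""
    w, b = params
    return sum((relu(w * x + b) - y) ** 2 for x, y in data) / len(data)


def grad(params, data):
    """Gradient of the MSE with respect to (w, b)."""
    w, b = params
    gw = gb = 0.0
    for x, y in data:
        z = w * x + b
        if z > 0:  # ReLU passes gradient only where its input is positive
            err = z - y
            gw += 2 * err * x
            gb += 2 * err
    n = len(data)
    return gw / n, gb / n


def gradient_descent(params, data, lr=0.05, steps=200):
    """Repeat: params <- params - lr * gradient."""
    for _ in range(steps):
        g = grad(params, data)
        params = tuple(p - lr * gi for p, gi in zip(params, g))
    return params


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # toy targets following y = 2x + 1
start = (0.5, 0.5)
fitted = gradient_descent(start, data)
```

Comparing `mse(start, data)` with `mse(fitted, data)` shows the loss falling as the iterations proceed, which is the behavior the iterative update is meant to produce.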

Step 6.1.4: the data sets of different business areas, such as project management, human resources, finance, and science and technology quality, differ in size. For data sets of unequal size, if every department keeps the same degree of privacy protection, i.e. the same privacy budget ε, the noise added to each department's training model differs: a department with a small data volume adds more noise than a department with a large data volume, which inevitably harms the training of the federated learning model.

Therefore, after the data set sizes of the departments are normalized, privacy budgets of different strengths are set for each department's training model according to its data set size. A data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a relatively small privacy budget can be set. The correspondence between data set size and privacy budget is shown in Table 3.

Table 3 data set size and privacy budget correspondence table

Step 6.1.5: based on the federated transfer learning distributed architecture, a differential privacy algorithm adds noise to the model parameters during model training.

(1) Initialize the loss function, noise scale, learning rate and other parameters of each department's training model, including: loss function $L$, data set $S(j)$, auxiliary model $h^j$, batch size $B_j$, noise scale $\sigma_j$, and learning rate $\gamma_j$.

(2) The initial model for training the $j$-th department model is $W^j$, with $W^j = h^j$. The model is trained on the $j$-th department's data set $I$ with batch size $B_j$, and the gradient is updated by stochastic gradient descent, where $\gamma_k$ is the learning rate.

(3) After training, the gradient is normalized, using the training parameters of the $j$-th department model in the $k$-th training round.

(4) Noise is added to the gradient, where the noised quantity is the training parameters of the $j$-th client in the $k$-th training round, $r$ is the number of department data sets participating in training, $\sigma$ is the privacy budget, and $I$ is the department data set.

(5) After each department model completes the current training round, it uploads its neural network parameters to the central server. The central server aggregates the updated parameters received from all clients participating in the round, updates the global parameters, and sends them to each participating department.
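Steps (2) to (5) above can be pictured with the following sketch of one federated round. Several choices here are assumptions rather than the patent's stated formulas: gradient "normalization" is interpreted as clipping to a maximum L2 norm, the noise is Gaussian with a per-department standard deviation, and the server aggregates by plain averaging. All function names and numbers are hypothetical.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def add_gaussian_noise(vec, noise_std, rng):
    """Add independent zero-mean Gaussian noise to every coordinate."""
    return [v + rng.gauss(0.0, noise_std) for v in vec]


def federated_round(global_w, client_grads, lr, max_norm, noise_stds, rng):
    """One round: each client clips and noises its gradient, takes an SGD
    step from the global weights, and the server averages the results."""
    client_ws = []
    for g, std in zip(client_grads, noise_stds):
        noisy = add_gaussian_noise(clip(g, max_norm), std, rng)
        client_ws.append([w - lr * gi for w, gi in zip(global_w, noisy)])
    n = len(client_ws)
    return [sum(ws) / n for ws in zip(*client_ws)]


rng = random.Random(0)
global_w = [0.0, 0.0]
client_grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical gradients from two departments

# Noise switched off, to make the clipping and averaging easy to follow.
new_w = federated_round(global_w, client_grads, 0.1, 1.0, [0.0, 0.0], rng)
# Noise switched on, with a larger scale for the first department.
noisy_w = federated_round(global_w, client_grads, 0.1, 1.0, [0.5, 0.2], rng)
```

In the noise-free call the first gradient is clipped from norm 5 to norm 1 while the second is left unchanged, so no single department dominates the aggregate; the per-department noise scale is where an adaptive privacy budget would enter.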

Step 6.2: training a data grading model:

As in step 6.1, a federated transfer learning distributed architecture and differential privacy are adopted; based on a residual convolutional neural network, the data feature values of each research management business scenario are taken as input and the data sensitivity level as output, and the data grading model is trained. Noise is added during training to obtain a deep learning model with differential privacy protection, so that data privacy is protected when the model is used for prediction.

Data sets of different levels differ in sensitivity and in the strength of protection they need, so data of different levels must be matched with corresponding privacy budgets. The privacy budgets for the ultra-high, high, medium and low sensitivity attribute groups are 0.1, 0.3, 0.5 and 0.7, respectively.

For example: if financial statements belong to the ultra-high sensitivity group, they need stronger protection and more added noise, so they are allocated the smaller privacy budget of 0.1; the department organization structure belongs to the low sensitivity group and needs little added noise, so it is allocated the larger privacy budget of 0.7.
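The example above can be sketched as a noised release in which the noise scale grows as the privacy budget shrinks. The sketch assumes the Laplace mechanism (noise scale = sensitivity / ε), which the source does not name; it also assumes the budgets 0.3 and 0.5 for the high and medium groups by following the ordering of the example, in which more sensitive data receives the smaller budget. The function names, the sensitivity value and the sample values are hypothetical.

```python
import random

# Budgets per sensitivity level; the ordering follows the example above.
BUDGETS = {"ultra_high": 0.1, "high": 0.3, "medium": 0.5, "low": 0.7}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials
    that each have mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_scale(level, sensitivity):
    """Laplace scale for a level: a smaller budget gives a larger scale."""
    return sensitivity / BUDGETS[level]


def release(values, level, sensitivity, rng):
    """Return a noised copy of values for publication."""
    scale = noise_scale(level, sensitivity)
    return [v + laplace_noise(scale, rng) for v in values]


rng = random.Random(42)
values = [0.2, 0.5, 0.9]  # hypothetical normalized attribute values
noisy_finance = release(values, "ultra_high", sensitivity=1.0, rng=rng)
noisy_org_chart = release(values, "low", sensitivity=1.0, rng=rng)
```

With a sensitivity of 1, the ultra-high group is noised at scale 10 and the low group at a scale of roughly 1.43, so the financial data is perturbed far more heavily than the organization structure.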

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A scientific research data management method based on federated transfer learning, characterized by comprising the following steps:

2. The scientific research data management method based on federated transfer learning according to claim 1, wherein step 3 includes:

3. The scientific research data management method based on federated transfer learning according to claim 1, wherein step 5 includes:

step 5.1: data normalization

step 5.2: construction of the neural network model

the loss function computes the sum of squared distances between the predicted value and the true value;

step 5.3: iteratively updating the network with the gradient descent algorithm, where $\theta_t$ is the parameter set of the neural network at the $t$-th iteration and $l_r$ is the learning rate of the network;

step 5.4: the differential privacy algorithm that adaptively allocates the privacy budget is as follows: after the data are normalized as in step 5.1, privacy budgets of different strengths are set for the training model of each scientific research management department according to the size of its data set. A data set with a small data volume has high global sensitivity, so a larger privacy budget is set to reduce the noise level; a data set with a large data volume has low global sensitivity, so a smaller privacy budget is set.

4. The scientific research data management method based on federated transfer learning according to claim 1, wherein step 6 includes:

updating the gradient by stochastic gradient descent, where $\gamma_k$ is the learning rate;

step 6.3: after training, the gradient is normalized, using the training parameters of the $j$-th department model in the $k$-th training round;

step 6.6: classifying and grading the data sets with the trained neural network models: data of different sensitivity levels are matched with corresponding privacy budgets and published after noise is added.

5. A scientific research data management system based on federated transfer learning, characterized by comprising a data acquisition module, a data preprocessing module, a data feature extraction module, a data classification and grading module, and a data desensitization and declassification module;

6. The scientific research data management system based on federated transfer learning according to claim 5, wherein the data classification and grading module comprises:

a. data normalization

b. construction of the neural network model

the loss function computes the sum of squared distances between the predicted value and the true value;

7. The scientific research data management system based on federated transfer learning according to claim 5, wherein the data desensitization and declassification module performs the following steps:

updating the gradient by stochastic gradient descent, where $\gamma_k$ is the learning rate;

(3) After training, the gradient is normalized, using the training parameters of the $j$-th department model in the $k$-th training round;

(4) Noise is added to the gradient.