CN111062056A - Private data protection modeling method, system and device based on transfer learning - Google Patents

Private data protection modeling method, system and device based on transfer learning

Info

Publication number
CN111062056A
CN111062056A (application CN201911284099.1A)
Authority
CN
China
Prior art keywords
data
model
data set
storage device
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911284099.1A
Other languages
Chinese (zh)
Other versions
CN111062056B (en)
Inventor
方文静
王力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911284099.1A
Publication of CN111062056A
Application granted
Publication of CN111062056B
Active legal status
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 — Protecting data
    • G06F21/62 — Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 — Protecting access to data via a platform, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 — Protecting personal data, e.g. for financial or medical purposes
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 — Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 — Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2107 — File encryption

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification disclose a modeling method, system, and device for private data protection based on transfer learning. The method may be performed by one or more processors and comprises: retrieving an intermediate model from an intermediate storage device, the intermediate model having been obtained based on a first data set in a first data domain and stored in the intermediate storage device, the first data set comprising text data, voice data, or image data; and updating the intermediate model based on a second data set in a second data domain to obtain a target model. The second data domain is isolated from the first data domain, and the second data set comprises data of a type corresponding to the first data set. The method disclosed in this specification can protect the security of each party's private data in secure multi-party computation.

Description

Private data protection modeling method, system and device based on transfer learning
Technical Field
The present application relates to the field of secure multi-party computation, and in particular to a modeling method, system, and apparatus for private data protection based on transfer learning.
Background
In the big data era, data mining and machine learning make it possible to use data effectively, supporting services such as personalized recommendation and risk control. While these services offer convenience and security to users, the privacy of the data they use is drawing growing attention. In data modeling, whether the data volume is sufficient is often decisive for success or failure. Combining data from multiple parties yields more comprehensive information and better model performance. How to obtain the best modeling result while protecting each party's private data is therefore an urgent problem in practical scenarios. A method that prevents privacy leakage while enabling secure multi-party computation would alleviate the data scarcity faced by machine learning models in many fields and greatly reduce the difficulty of data acquisition in real scenarios.
Disclosure of Invention
One of the embodiments of the present specification provides a privacy protection modeling method based on transfer learning. The transfer-learning-based privacy protection modeling method is executed by one or more processors located in a second data domain and comprises the following steps: acquiring an intermediate model from an intermediate storage device, the intermediate model having been obtained by training an initial model in a first data domain based on a first data set and stored in the intermediate storage device, wherein the first data set comprises text data, voice data, or image data; and updating the intermediate model based on a second data set in the second data domain to obtain a target model; wherein the second data domain is data-isolated from the first data domain, and the type of data contained in the second data set corresponds to that of the first data set.
One of the embodiments of the present specification provides a modeling system for private data protection based on transfer learning. The system includes a model acquisition module and a model update module. The model acquisition module is configured to acquire an intermediate model from an intermediate storage device, the intermediate model having been obtained by training an initial model in a first data domain based on a first data set and stored in the intermediate storage device. The model update module is configured to update the intermediate model based on a second data set in a second data domain to obtain a target model; the second data domain is data-isolated from the first data domain.
One of the embodiments of the present specification provides a modeling system for implementing private data protection based on transfer learning, where the system includes a first processing device located in a first data domain, a second processing device located in a second data domain, and an intermediate storage device; the first processing device is configured to train an initial model based on a first data set in the first data domain to obtain one or more intermediate models, and to transmit the intermediate models to the intermediate storage device; the second processing device is used for acquiring an intermediate model from the intermediate storage device and updating the intermediate model to acquire a target model based on a second data set in the second data domain; wherein the first data domain is data isolated from the second data domain.
One of the embodiments of the present specification provides a modeling apparatus for private data protection based on transfer learning, including a processor, where the processor is configured to execute the modeling method for private data protection based on transfer learning.
Drawings
The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:
FIG. 1 is a schematic diagram of an application scenario of a modeling system implementing private data protection based on transfer learning, according to some embodiments of the present description;
FIG. 2 is an exemplary flow diagram of a modeling method for private data protection based on transfer learning, according to some embodiments of the present description;
FIG. 3 is an exemplary flow diagram of a method of model training in accordance with some embodiments of the present description;
FIG. 4 is a block diagram of a modeling apparatus for private data protection based on transfer learning, according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate inclusion of the explicitly identified steps and elements, which do not form an exclusive list; a method or apparatus may also include other steps or elements.
Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the operations are not necessarily performed in the exact order shown. Instead, the steps may be processed in reverse order or simultaneously. Moreover, other operations may be added to these processes, or one or more steps may be removed from them.
In some embodiments, a data protection scheme for multi-party joint modeling may take the form of data isolation. Multi-party modeling in data isolation scenarios generally uses techniques related to secure multi-party computation, such as secret sharing or garbled circuits. Schemes based on secure multi-party computation have a high security level, but for lightweight protection requirements they waste resources. Their deployment cost is also high: the parts of the computation carried out jointly across parties require basic operator support, and, depending on the algorithm, the demands on interactive transmission bandwidth can likewise be high. In addition, such schemes involve time-consuming steps such as encryption, decryption, and repeated transmission of data, so the running time is often long. The modeling method provided by some alternative embodiments in this specification requires only lightweight interaction, reduces the requirements on deployment and transmission, and protects the original private data.
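To make the contrast concrete, below is a minimal sketch of additive secret sharing, one of the secure multi-party computation primitives mentioned above (the modulus and all names are illustrative, not from the patent). It hints at why such schemes are heavyweight: every value must be split into shares, and every operation on shared values requires coordinated computation and communication among all parties.

```python
import random

MODULUS = 2**31 - 1  # illustrative modulus for share arithmetic

def share(secret: int, n_parties: int) -> list:
    """Split `secret` into n_parties random shares that sum to it
    modulo MODULUS; any subset of fewer shares reveals nothing."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares: list) -> int:
    """All parties must pool their shares to recover the value."""
    return sum(shares) % MODULUS

shares = share(42, 3)
assert reconstruct(shares) == 42
```

By contrast, the method of this specification transmits only a trained model, so no per-value share coordination is needed.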
FIG. 1 is a schematic diagram of an application scenario of a modeling system implementing private data protection based on transfer learning, according to some embodiments of the present description. As shown in FIG. 1, the application scenario 100 may include a first data domain 110, a second data domain 120, a storage device 130, and a network 140.
A data domain may refer to a device or cluster of devices belonging to one party, such as a service provider or a government agency. In some embodiments, a data domain may also be a device or cluster of devices that implements some particular computing functionality. In some embodiments, data can be exchanged between different data domains, realizing data sharing. In other embodiments, data between different data domains is isolated from each other; data isolation is understood to mean that no data domain lets its own data resources leave the domain. The data resources owned by each data domain may be data that is valuable to use and relates to personal or business privacy, such as data that can be used for model training to obtain a machine learning model with some particular predictive function. For example, for a commodity sales platform, data such as the personal profiles and consumption records of all consumers on the platform can be regarded as data resources of the platform's data domain. In some embodiments, a data domain may treat its data resources as private data and protect them for information security.
A first processing device 110-1 and a first storage device 110-2 may be present in the first data domain 110. In some embodiments, the first processing device 110-1 may retrieve data and/or instructions from other components in the first data domain 110, such as the first storage device 110-2, to implement at least one function described herein. For example, the first processing device 110-1 may train an initial model with private data belonging to the first data domain 110 stored in the first storage device 110-2 to obtain an intermediate model. The initial model may be a machine learning model, e.g., a decision tree or a neural network. The resulting intermediate model may be a partially or completely trained model. As another example, the first processing device 110-1 may transmit the intermediate model directly or indirectly to the storage device 130 for storage. As indicated by the dashed arrow between the first data domain 110 and the storage device 130 shown in FIG. 1, the first processing device 110-1 may communicate directly with the storage device 130, transferring the intermediate model to it for storage. As indicated by the solid arrows between the first data domain 110, the storage device 130, and the network 140 shown in FIG. 1, the first processing device 110-1 and the storage device 130 may each connect to the network 140 for data and/or information interaction; for example, the first processing device 110-1 may transmit the intermediate model to the storage device 130 via the network 140. In some embodiments, the first processing device 110-1 may encrypt the intermediate model before transmitting it to the storage device 130 via the network 140.
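As an illustration of this step, the following sketch shows how the first processing device might serialize an intermediate model and encrypt it before upload. The XOR stream cipher is a toy placeholder for whatever real scheme (e.g., an authenticated cipher such as AES-GCM) a deployment would use, and the model contents and function names are hypothetical.

```python
import pickle
import secrets

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy XOR stream cipher, standing in for a real authenticated
    encryption scheme. XORing twice with the same key restores the
    original bytes."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def prepare_intermediate_model(params: dict, key: bytes) -> bytes:
    """Inside the first data domain: serialize and encrypt the
    intermediate model before pushing it to intermediate storage."""
    return xor_cipher(pickle.dumps(params), key)

def recover_intermediate_model(blob: bytes, key: bytes) -> dict:
    """Inside the second data domain: decrypt and deserialize."""
    return pickle.loads(xor_cipher(blob, key))

# Hypothetical intermediate model: a few trained weights plus an id.
intermediate = {"w": [0.4, -1.2, 0.7], "b": 0.1, "model_id": "risk_v1"}
key = secrets.token_bytes(32)

blob = prepare_intermediate_model(intermediate, key)
assert recover_intermediate_model(blob, key) == intermediate
```

Only `blob` ever leaves the first data domain; the raw training data never does.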
A second processing device 120-1 and a second storage device 120-2 may be present in the second data domain 120. In some embodiments, the second processing device 120-1 may retrieve data and/or instructions from other components in the second data domain 120, such as the second storage device 120-2, to implement at least one function described herein. For example, the second processing device 120-1 may retrieve the intermediate model from the storage device 130 based on instructions. As indicated by the dashed arrow between the second data domain 120 and the storage device 130 shown in FIG. 1, the second processing device 120-1 may communicate directly with the storage device 130 to obtain the intermediate model. As indicated by the solid arrows between the second data domain 120, the storage device 130, and the network 140 shown in FIG. 1, the second processing device 120-1 and the storage device 130 may each connect to the network 140 for data and/or information interaction; for example, the second processing device 120-1 may retrieve the intermediate model from the storage device 130 via the network 140. In some embodiments, the intermediate model obtained by the second processing device 120-1 from the storage device 130 over the network 140 is encrypted, and the second processing device 120-1 may decrypt it within the second data domain. As another example, the second processing device 120-1 may update the intermediate model with private data belonging to the second data domain 120 stored in the second storage device 120-2.
It should be noted that there is only model interaction between the first data domain 110 and the second data domain 120; the private data of both parties are isolated, and there is no data interaction. For example, the first data domain 110 cannot obtain the second data set in the second data domain 120, and vice versa. In some embodiments, there is no direct communication between the two. The model interaction is unidirectional: an intermediate model is obtained in the first data domain 110 and transferred to the storage device 130, and the second data domain 120 then retrieves the intermediate model from the storage device 130.
In some embodiments, the first processing device 110-1 and the second processing device 120-1 may be a single server or a group of servers. The server group may be a centralized server group connected to the network 140 via an access point, or a distributed server group respectively connected to the network 140 via at least one access point. In some embodiments, the first processing device 110-1 and the second processing device 120-1 may be connected locally to the network 140 or remotely to the network 140. For example, first processing device 110-1 and second processing device 120-1 may each communicate with storage device 130 via network 140.
In some embodiments, the first processing device 110-1 and the second processing device 120-1 may include at least one processing unit (e.g., a single-core or multi-core processing engine). Exemplary processing devices may include Central Processing Units (CPUs), Application-Specific Integrated Circuits (ASICs), Application-Specific Instruction-set Processors (ASIPs), Graphics Processing Units (GPUs), Physics Processing Units (PPUs), Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs), Programmable Logic Devices (PLDs), controllers, microcontroller units, Reduced Instruction Set Computers (RISCs), microprocessors, the like, or any combination thereof.
First storage device 110-2, second storage device 120-2, and storage device 130 may store data and/or instructions. For example, the first storage device 110-2 may store private data belonging solely to the first data domain 110, and instructions for execution by the first processing device 110-1. As another example, the second storage device 120-2 may store private data belonging solely to the second data domain 120, and instructions for execution by the second processing device 120-1. As a further example, the storage device 130 may store the intermediate model.
In some embodiments, first storage device 110-2, second storage device 120-2, and storage device 130 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), and the like, or any combination thereof. Exemplary mass storage devices may include magnetic disks, optical disks, solid state disks, and the like. Exemplary removable memory may include flash drives, floppy disks, optical disks, memory cards, compact disks, magnetic tape, and the like. Exemplary volatile read-write memory may include Random Access Memory (RAM). Exemplary RAM may include Dynamic Random Access Memory (DRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Static Random Access Memory (SRAM), Thyristor Random Access Memory (T-RAM), Zero-capacitor Random Access Memory (Z-RAM), and the like. Exemplary read-only memory may include Mask Read-Only Memory (MROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM), digital versatile disc read-only memory, and the like. In some embodiments, first storage device 110-2, second storage device 120-2, and storage device 130 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tiered cloud, and the like, or any combination thereof.
The network 140 connects the components of the application scenario 100 such that communications may occur between the components to facilitate the exchange of information and/or data. In some embodiments, at least one component (e.g., first processing device 110-1, second processing device 120-1, storage device 130) in the application scenario 100 may send information and/or data to other components in the application scenario 100 via the network 140. For example, the first processing device 110-1 may send the intermediate model to the storage device 130 via the network 140 for storage, and the second processing device 120-1 may retrieve the intermediate model from the storage device 130 via the network 140.
In some embodiments, the network 140 may be any one or more of a wired network or a wireless network. For example, the network 140 may include a cable network, a wireline network, a fiber optic network, a telecommunications network, an intranet, the Internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth™ network, a ZigBee™ network, Near Field Communication (NFC), an intra-device bus, an intra-device line, a cable connection, and the like, or any combination thereof. The network connection between any two components may use one of the above ways, or several of them.
The modeling system for private data protection based on transfer learning disclosed in this specification may be a system that includes one or more components in the second data domain 120. It may retrieve the intermediate model from the storage device 130 and train the model with its own private data to obtain the final target model. No data leaves its domain during the whole process, effectively protecting the security of private data. The modeling system may also be a system including one or more components in the first data domain 110, one or more components in the second data domain 120, and the storage device 130. The initial model may be trained in the first data domain 110 using its own private data to obtain an intermediate model, which is uploaded to the storage device 130. Thereafter, the intermediate model may be downloaded into the second data domain 120 and its training adjusted, using private data owned by the second data domain 120, to obtain the final target model. Throughout the modeling process, private data remain isolated between the first data domain 110 and the second data domain 120; only lightweight model transmission takes place. This reduces transmission requirements on the one hand and protects the security of private data on the other.
It should be noted that the above description of the application scenario 100 is for illustration and explanation only and does not limit the scope of applicability of the present description. Various modifications and alterations to the application scenario 100 will be apparent to those skilled in the art in light of this description; such modifications and variations are nonetheless intended to remain within its scope.
FIG. 2 is an exemplary flow diagram of a modeling method for private data protection based on transfer learning, according to some embodiments of the present description. In some embodiments, flow 200 may be performed by a processing device, such as processing device 400 (or the second processing device 120-1 in the second data domain 120 as shown in FIG. 1). For example, the process 200 may be stored in a storage device in the form of a program or instructions that, when executed, implement the process 200. As shown in FIG. 2, the process 200 may include the following operations.
At step 210, an intermediate model is retrieved from an intermediate storage device. This step may be performed by the acquisition module 410.
In some embodiments, the intermediate model may refer to a model that has undergone a certain amount of training, meaning the model has been trained either completely or partially, for example, for a predetermined number of iterations. In some embodiments, the intermediate model may be obtained by training the initial model in a first data domain based on a first data set and stored in an intermediate storage device. The first data domain may refer to a device or cluster of devices that stores private data of one party participating in multi-party secure modeling. For example, in a scenario in which a network consumption platform jointly models with a merchant residing on the platform, the first data domain may be the device or cluster of devices of the network consumption platform that stores and/or processes its own private data. The first data set may refer to a training data set obtained by preprocessing private data within the first data domain to facilitate model training. The first data set may comprise text data, voice data, or image data, among others. Exemplary preprocessing may include removing unique attributes, handling missing values, attribute encoding, data normalization and regularization, feature selection, principal component analysis, and the like, or any combination thereof.
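A minimal sketch of two of the preprocessing steps named above, missing-value handling and normalization, assuming a single numeric feature column (the values and function names are hypothetical):

```python
def fill_missing(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_normalize(column):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

raw = [3.0, None, 5.0, 7.0]         # hypothetical private feature column
filled = fill_missing(raw)          # None replaced by the mean, 5.0
scaled = min_max_normalize(filled)  # all values now in [0, 1]
assert scaled == [0.0, 0.5, 0.5, 1.0]
```

The resulting columns, not the raw private records, would form the first data set used for training.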
In some embodiments, the initial model may include a machine learning model, such as a regression model (linear or logistic regression), naive Bayes, a decision tree, a random forest, GBDT, SVM, KNN, a neural network, and the like. Exemplary neural network models include AlexNet, VGG Net, GoogLeNet, ResNet, ResNeXt, CNN, R-CNN, FCN, RNN, YOLO, SqueezeNet, SegNet, GAN, and the like. In some embodiments, the training of the initial model may be performed by a processing device in the first data domain, e.g., the first processing device 110-1 in FIG. 1. The first processing device 110-1 may obtain private data of the first data domain 110 stored in the first storage device 110-2 and train the initial model with the private data to obtain the intermediate model. Thereafter, the processing device in the first data domain may transfer the intermediate model to the intermediate storage device. The intermediate storage device may be a general storage device that can store various types of data, or a dedicated storage device used only for storing models. By way of example, the intermediate storage device may be the storage device 130 shown in FIG. 1. The first processing device 110-1 may transmit the intermediate model to the storage device 130 for storage. In some embodiments, the first processing device 110-1 may encrypt the model before transmitting it.
In some embodiments, the intermediate storage device may store multiple kinds of intermediate models, and different kinds may be distinguished using identification data. In some embodiments, the intermediate model is encrypted. The processing device 400 (e.g., the second processing device 120-1) may determine an appropriate intermediate model based on its subsequent purpose of use. When the acquisition module 410 communicates with the intermediate storage device (e.g., via the network 140), the intermediate model stored therein may be read according to the identification data.
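One possible layout for such identification data, purely illustrative, is a key-value store mapping a model identifier to the encrypted model blob:

```python
# Hypothetical contents of the intermediate storage device: encrypted
# model blobs keyed by identification data (the ids and blobs below
# are placeholders, not a real wire format).
model_store = {
    "cls_text_v1":  b"<encrypted text-classification model>",
    "cls_image_v2": b"<encrypted image-classification model>",
}

def fetch_intermediate_model(store: dict, model_id: str) -> bytes:
    """Select the intermediate model matching the downstream task."""
    if model_id not in store:
        raise KeyError(f"no intermediate model with id {model_id!r}")
    return store[model_id]

blob = fetch_intermediate_model(model_store, "cls_text_v1")
```

The second data domain would then decrypt `blob` locally before updating the model.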
At step 220, the intermediate model is updated based on a second data set in the second data domain to obtain a target model. This step may be performed by the update module 420.
In some embodiments, the second data domain may refer to a device or cluster of devices that stores private data of another party participating in the multi-party secure modeling. Continuing the network consumption platform and resident merchant example, the second data domain may be the device or cluster of devices of the resident merchant that stores and/or processes its own private data. The second data set may refer to a training data set obtained by preprocessing private data within the second data domain to facilitate model training. The type of data contained in the second data set corresponds to that of the first data set; the correspondence may mean that the data type of the first data set used to train the initial model is the same as the data type used to update the intermediate model, e.g., text data, voice data, or image data. Exemplary preprocessing may include removing unique attributes, handling missing values, attribute encoding, data normalization and regularization, feature selection, principal component analysis, and the like, or any combination thereof.
Updating the intermediate model may include adjusting the structure of the intermediate model, or directly retraining it, for example, training the intermediate model with the second data set as training data to obtain the target model. In some embodiments, after obtaining the intermediate model, the update module 420 may first adjust the intermediate model based on the training task and/or the second data set, and then train the adjusted intermediate model based on the second data set to obtain the target model. Adjusting the intermediate model makes it better adapted to the final model task and more amenable to training with the second data set. The training task may refer to the intended use of the finally trained model, e.g., classification or probability prediction. For instance, if the intermediate model is a classification model and the finally trained model needs to be used for probability prediction, the structure of the model needs to be adjusted.
In some embodiments, adjusting the structure of the intermediate model may include increasing or decreasing the operation nodes of the intermediate model, or adjusting the parameters of the intermediate model, based on the training task and/or the second data set. An operation node may refer to a structural unit for data processing in the intermediate model, such as an intermediate node of a tree model or a neuron of a neural network. As an example, assuming the intermediate model is a tree model, such as a decision tree, a gradient boosting tree, or a random forest, increasing or decreasing operation nodes may mean adding or removing tree branches, and parameter adjustment may mean adjusting the split feature values of tree branches or the weight values of leaf nodes (e.g., adjusting weights, pruning, expanding, etc.). As another example, assuming the intermediate model is a neural network, increasing or decreasing operation nodes may mean adding or removing neural network layers, or neurons within a layer. For example, if the intermediate model is a model for risk level classification and a model for predicting risk probability is needed in the second data domain, a softmax layer may be added after the output layer of the intermediate model so that the adjusted intermediate model is suitable for risk probability prediction. Parameter adjustment may mean replacing a neural network layer, or adjusting the computational parameters of neurons.
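The softmax adjustment described above can be sketched in a framework-free way (this is a hypothetical illustration, not the patent's implementation): raw class scores from the intermediate model's output layer are passed through a softmax so the adjusted model yields risk probabilities.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical raw scores (logits) from the intermediate model's output
# layer for three risk levels; softmax maps them to probabilities that
# sum to 1, making the adjusted model usable for risk probability
# prediction.
logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)
```

In a deep-learning framework the same adjustment would amount to appending a softmax layer (or applying a softmax activation) after the existing output layer.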
In some embodiments, after the intermediate model is adjusted, the update module 420 may train the adjusted intermediate model based on the second data set to obtain the target model. The target model may be a model required by the private data owner corresponding to the second data domain to perform a particular operation, such as user classification or user risk assessment. During training, the update module 420 may divide the second data set into training data and testing data. The training data may be used to train the adjusted intermediate model, for example, to continue growing tree models such as gradient boosting trees and random forests, or to fine-tune neural networks. For a detailed description of the process of training the adjusted intermediate model based on the second data set, reference may be made to FIG. 3 of the present specification, and details are not repeated here.
It should be noted that the above description related to the flow 200 is only for illustration and description, and does not limit the applicable scope of the present specification. Various modifications and alterations to flow 200 will be apparent to those skilled in the art in light of this description. However, such modifications and variations are intended to be within the scope of the present description.
FIG. 3 is an exemplary flow diagram of a model training method in accordance with some embodiments shown herein. In some embodiments, flow 300 may be performed by a processing device, such as processing device 400 (or the second processing device 120-1 in the second data domain 120 as shown in FIG. 1). For example, the process 300 may be stored in a storage device in the form of a program or instructions that, when executed, implement the process 300. In some embodiments, the flow 300 may be performed by the update module 420. As shown in FIG. 3, the process 300 may include the following operations.
Step 310, the second data set is divided into a training subset and a testing subset.
In some embodiments, the second data set may be divided into a training subset and a testing subset. The training subset may be used to train the adjusted intermediate model, and the testing subset may be used to verify the trained adjusted intermediate model to ensure the accuracy of the model. In some embodiments, the partitioning of the second data set may be a proportional partitioning. For example, the ratio between the amount of data contained in the training subset and the test subset may be 8:2, 9:1, or any other ratio.
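A minimal sketch of such a proportional split follows; the helper name and the 8:2 default are illustrative, and in practice a library routine such as scikit-learn's `train_test_split` could be used instead.

```python
import numpy as np

def split_dataset(X, y, train_ratio=0.8, seed=0):
    """Shuffle and split a data set into training and testing subsets
    by the given ratio (8:2 by default)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * train_ratio)
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]

# Toy stand-in for the second data set: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, y_tr, X_te, y_te = split_dataset(X, y)
```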
Step 320, training the adjusted intermediate model based on the training subset to obtain a candidate model.
In some embodiments, the training of the adjusted intermediate model may use any model training method, and the present specification does not limit this. For example, for tree models such as gradient boosting trees or random forests, the training subset may be used to train the tree model to increase the number of trees. For another example, for a neural network, fine-tuning may be performed on the network parameters based on the training subset, so that the model can more accurately complete the predictions associated with the model task of the second data domain. In some embodiments, when a training condition is satisfied, such as the number of training iterations reaching a preset count or the value of the loss function falling below a training threshold, training may be stopped and the resulting model may be designated as the candidate model.
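The stopping conditions just mentioned (an iteration budget or a loss threshold) can be sketched with a toy gradient-descent loop; the linear model, learning rate, and data here are hypothetical stand-ins for the adjusted intermediate model and the training subset.

```python
import numpy as np

def fine_tune(w, X, y, lr=0.1, loss_threshold=1e-3, max_iters=1000):
    """Continue training a toy linear model from transferred weights,
    stopping when the loss drops below a threshold or the iteration
    budget is exhausted."""
    for it in range(max_iters):
        pred = X @ w
        loss = np.mean((pred - y) ** 2)  # mean squared error
        if loss < loss_threshold:        # training condition satisfied
            break
        grad = 2 * X.T @ (pred - y) / len(y)
        w = w - lr * grad
    return w, loss, it

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w0 = np.zeros(2)  # stands in for the transferred intermediate weights
w, final_loss, iters = fine_tune(w0, X, y)
```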
Step 330, evaluating the candidate model based on the test subset, and determining whether the candidate model meets a preset condition.
In some embodiments, evaluating the candidate model based on the test subset may refer to processing data in the test subset using the candidate model and determining whether a model output result meets a requirement. For example, whether the prediction accuracy exceeds an accuracy threshold. If the output result meets the requirement, the candidate model can be determined to meet the preset condition. Otherwise, it may be determined that the candidate model does not satisfy a preset condition.
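For illustration, the acceptance check might be sketched as follows; the toy model, test data, and 0.9 accuracy threshold are all hypothetical.

```python
import numpy as np

def meets_preset_condition(model_predict, X_test, y_test, acc_threshold=0.9):
    """Evaluate a candidate model on the test subset: accept it only if
    prediction accuracy reaches the threshold."""
    acc = float(np.mean(model_predict(X_test) == y_test))
    return acc, acc >= acc_threshold

# A toy candidate "model": thresholding the first feature.
X_test = np.array([[0.1], [0.9], [0.8], [0.2]])
y_test = np.array([0, 1, 1, 0])
acc, ok = meets_preset_condition(lambda X: (X[:, 0] > 0.5).astype(int),
                                 X_test, y_test)
```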
In some embodiments, when the candidate model does not satisfy the predetermined condition, the process 300 may return to step 320 to continue training the candidate model based on the training subset. In some embodiments, the process 300 may proceed to step 340 when the candidate model satisfies a predetermined condition.
Step 340, designating the candidate model as the target model.
In some embodiments, when the candidate model satisfies the preset condition, the candidate model may be determined as the target model to perform a specific processing operation. In some embodiments, after the target model is designated, model parameters can be adjusted according to the online performance of the target model so as to further optimize it. In some embodiments, model tuning may include weight decay, regularization methods, dropout, batch normalization, local response normalization (LRN), and the like.
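As a sketch of one of the tuning techniques listed above, a single SGD update with weight decay (L2 shrinkage) can be written as follows; the learning rate, decay coefficient, and toy vectors are illustrative, not values from the patent.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.01, decay=1e-4):
    """One SGD update with weight decay: the decay term shrinks the
    weights toward zero on every step, discouraging overfitting."""
    return w - lr * (grad + decay * w)

w = np.array([1.0, -2.0])   # hypothetical current weights
g = np.array([0.5, 0.5])    # hypothetical gradient
w_new = sgd_step_with_weight_decay(w, g)
```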
It should be noted that the above description of the process 300 is for illustration and description only and is not intended to limit the scope of the present disclosure. Various modifications and changes to flow 300 will be apparent to those skilled in the art in light of this description. However, such modifications and variations are intended to be within the scope of the present description.
The modeling method for private data protection based on transfer learning disclosed in the embodiments of this specification isolates data among all parties participating in multi-party modeling, effectively protecting the security of each party's private data. Only lightweight models are shared during transmission, which reduces transmission requirements. Meanwhile, the choice of model is flexible, and various kinds of models can be trained.
FIG. 4 is a block diagram of a modeling apparatus for private data protection based on transfer learning, according to some embodiments of the present description.
As shown in FIG. 4, the modeling apparatus 400 for private data protection based on transfer learning may include an obtaining module 410 and an updating module 420.
The obtaining module 410 may be used to obtain the intermediate model from the intermediate storage device. In some embodiments, the intermediate model may be obtained by training the initial model in the first data domain based on the first data set. It may be a fully trained model or a partially trained model, e.g., one trained for a predetermined number of iterations. The intermediate model is stored in an intermediate storage device. In some embodiments, the training of the initial model may be performed by a processing device in the first data domain, e.g., the first processing device 110-1 in FIG. 1. The first processing device 110-1 may obtain the private data of the first data domain 110 stored in the first storage device 110-2 and train the initial model with the private data to obtain the intermediate model. Thereafter, the processing device in the first data domain may transfer the intermediate model to the intermediate storage device. The intermediate storage device may be a general storage device that can store various types of data, or a dedicated storage device used only for storing models. The obtaining module 410 may communicate with the intermediate storage device to obtain the intermediate model.
The update module 420 may be configured to update the intermediate model based on a second data set in the second data domain to obtain a target model. The second data domain is data-isolated from the first data domain. In some embodiments, updating the intermediate model may include adjusting the structure of the intermediate model, or directly retraining the intermediate model, for example, training the intermediate model with the second data set as training data to obtain the target model. In some embodiments, after obtaining the intermediate model, the update module 420 may first adjust the intermediate model based on the training task and/or the second data set. Subsequently, the update module 420 may train the adjusted intermediate model based on the second data set to obtain the target model. Updating the intermediate model in this way makes the adjusted intermediate model better suited to the final model task and easier to train with the second data set. In some embodiments, after the intermediate model is adjusted, the update module 420 may train the adjusted intermediate model based on the second data set to obtain the target model. The update module 420 may divide the second data set into a training subset and a testing subset. The training subset may be used to train the adjusted intermediate model, and the testing subset may be used to verify the trained model to ensure its accuracy. After training, prediction, and evaluation, the target model may be determined.
For a detailed description of the modules in the device 400, reference may be made to the flow chart portion of this specification.
It should be understood that the system and its modules shown in FIG. 4 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor, or by specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD- or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, or by a combination of the above hardware circuits and software (e.g., firmware).
It should be noted that the above descriptions of the modeling apparatus and its modules are only for convenience of description, and do not limit the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the teachings of the present system, modules may be combined or connected to other modules in any configuration without departing from such teachings. For example, the obtaining module 410 and the updating module 420 disclosed in FIG. 4 may be different modules in a system, or may be a single module that implements the functions of two or more of the modules described above. For another example, the updating module 420 may be divided into an adjusting unit and a training unit for adjusting the intermediate model and training the adjusted intermediate model, respectively. As another example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of the present disclosure.
Some embodiments of the present specification disclose a modeling system for private data protection based on transfer learning. The system may include a first processing device located in a first data domain, a second processing device located in a second data domain, and an intermediate storage device.
The system may be more conveniently described in connection with fig. 1. The first data field may refer to a device or cluster of devices that store private data of a party participating in multi-party security modeling, such as the first data field 110 shown in FIG. 1. The first processing device (e.g., first processing device 110-1) may be configured to train an initial model based on a first data set in a first data domain to obtain one or more intermediate models, and to transmit the intermediate models to the intermediate storage device. The first data set may be stored in a storage device within the first data domain, such as the first storage device 110-2 described in FIG. 1. The initial model may include a machine learning model such as a regression model (linear or logistic regression), naive bayes, decision trees, random forests, GBDTs, SVMs, KNNs, neural networks, and the like. Exemplary neural network models can include AlexNet, VGG Net, GoogleNet, ResNet, ResNeXt, CNN, R-CNN, FCN, RNN, YOLO, SqueezeNet, SegNet, GAN, and the like. The intermediate storage device may be a general storage device that can store various types of data. The intermediate storage device may also be a dedicated storage device, used only for storing models. By way of example, the intermediate storage device may be the storage device 130 as shown in FIG. 1. The first processing device 110-1 may transmit the intermediate model to the storage device 130 for storage, e.g., via the network 140.
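A toy sketch of the handoff through the intermediate storage device follows; the file paths and the dict-as-model are hypothetical stand-ins, and the key point is that only the model, never the private training data, crosses the domain boundary.

```python
import os
import pickle
import tempfile

# The "intermediate model" here is just a dict of parameters; in
# practice it could be any serializable trained model.
intermediate_model = {"weights": [0.5, -1.2], "bias": 0.1}

# A shared directory stands in for the intermediate storage device
# (e.g., storage device 130 in FIG. 1).
shared_dir = tempfile.mkdtemp()
model_path = os.path.join(shared_dir, "intermediate_model.pkl")

# First data domain: persist the trained intermediate model.
with open(model_path, "wb") as f:
    pickle.dump(intermediate_model, f)

# Second data domain: retrieve the model from intermediate storage,
# without ever accessing the first domain's private data.
with open(model_path, "rb") as f:
    retrieved = pickle.load(f)
```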
In some embodiments, the second data domain may refer to a device or cluster of devices that stores private data of another party participating in the multi-party security modeling, such as the second data domain 120 shown in FIG. 1. The second processing device (e.g., second processing device 120-1) may be configured to obtain an intermediate model from the intermediate storage device and update the intermediate model based on a second data set in the second data domain to obtain a target model. The second data set may be stored in a storage device within the second data domain, such as the second storage device 120-2 depicted in FIG. 1. The second processing device may first adjust the intermediate model based on a training task and/or the second data set, and then train the adjusted intermediate model based on the second data set to obtain the target model.
In some embodiments, the first data domain is data-isolated from the second data domain. Thus, in the system, the first processing device and the second processing device are isolated from each other and access the model only via the intermediate storage device; the first data set and the second data set are likewise isolated from each other. Therefore, the risk of private data leakage by any party in the multi-party modeling process can be effectively avoided.
The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: (1) a modeling method for private data protection based on transfer learning is designed, and data of each party are isolated during multi-party modeling, so that the protection requirement on private data can be met; (2) on the premise of avoiding private data leakage, requirements on deployment and transmission in the modeling process are greatly reduced through lightweight model sharing; (3) the selection of the model is flexible, and various models can be selected. It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, this specification uses specific words to describe its embodiments. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the specification. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufactures, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer-readable program code, embodied in one or more computer-readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service such as software as a service (SaaS).
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features than are expressly recited in a claim. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
Numerals describing quantities of components, attributes, and the like are used in some embodiments; it should be understood that such numerals used in the description of the embodiments are modified in some instances by the terms "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and employ a general digit-preserving approach. Notwithstanding that the numerical ranges and parameters used to define the broad scope in some embodiments of this specification are approximations, in specific examples such numerical values are set forth as precisely as practicable.
For each patent, patent application, patent application publication, and other material cited in this specification, such as articles, books, specifications, publications, and documents, the entire contents are hereby incorporated by reference into this specification, excepting any prosecution history that is inconsistent with or in conflict with the contents of this specification, as well as any document that would limit the broadest scope of the claims now or later associated with this specification. It is to be understood that if there is any inconsistency or conflict between the descriptions, definitions, and/or use of terms in the materials accompanying this specification and those stated in this specification, the descriptions, definitions, and/or use of terms in this specification shall prevail.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (10)

1. A modeling method for private data protection based on transfer learning, wherein the method is performed by one or more processors and comprises:
obtaining an intermediate model from an intermediate storage device, the intermediate model being trained based on a first data set in a first data domain and stored in the intermediate storage device, the first data set comprising text data, speech data, or image data;
updating the intermediate model based on a second data set in a second data domain to obtain a target model; wherein the second data domain is data-isolated from the first data domain, and the type of data contained in the second data set corresponds to that of the first data set.
2. The method of claim 1, wherein the initial model is a machine learning model, the method further comprising:
based on a training task and/or the second data set, adjusting a structure of the intermediate model to obtain an adjusted intermediate model.
3. The method of claim 2, wherein the adjusting the structure of the intermediate model based on the training task and/or the second data set comprises: increasing or decreasing the operation nodes of the intermediate model.
4. The method of claim 2, wherein said updating the intermediate model to obtain the target model based on the second data set in the second data domain comprises:
training the adjusted intermediate model to obtain a target model based on the second data set.
5. A modeling system for private data protection based on transfer learning, comprising an acquisition module and an updating module;
the acquisition module is used for acquiring an intermediate model from an intermediate storage device, the intermediate model being obtained by training an initial model based on a first data set in a first data domain and stored in the intermediate storage device;
the updating module is used for updating the intermediate model based on a second data set in a second data domain to obtain a target model; at least a portion of the data in the second data domain and the first data domain is isolated from each other.
6. The system of claim 5, wherein the update module is further to:
adjusting a structure of the intermediate model based on a training task and/or the second data set.
7. The system of claim 6, wherein to adjust the structure of the intermediate model based on a training task and/or the second data set, the update module is to:
and increasing or decreasing the operation nodes of the intermediate model.
8. The system of claim 6, wherein to obtain a target model, the update module is to:
training the adjusted intermediate model to obtain a target model based on the second data set.
9. A modeling system for private data protection based on transfer learning, the system comprising a first processing device located in a first data domain, a second processing device located in a second data domain, and an intermediate storage device;
the first processing device is configured to train an initial model based on a first data set in the first data domain to obtain one or more intermediate models, and to transmit the intermediate models to the intermediate storage device, the first data set including text data, voice data, or image data;
the second processing device is used for acquiring an intermediate model from the intermediate storage device and updating the intermediate model to acquire a target model based on a second data set in the second data domain;
wherein at least part of the data in the first data domain and the second data domain is isolated from each other, and the type of data included in the second data set corresponds to that of the first data set.
10. A modeling device for private data protection based on transfer learning, comprising a processor, wherein the processor is configured to execute the modeling method for private data protection based on transfer learning according to any one of claims 1 to 4.
CN201911284099.1A 2019-12-13 2019-12-13 Private data protection modeling method, system and device based on transfer learning Active CN111062056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911284099.1A CN111062056B (en) 2019-12-13 2019-12-13 Private data protection modeling method, system and device based on transfer learning


Publications (2)

Publication Number Publication Date
CN111062056A true CN111062056A (en) 2020-04-24
CN111062056B CN111062056B (en) 2022-03-15

Family

ID=70301514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911284099.1A Active CN111062056B (en) 2019-12-13 2019-12-13 Private data protection modeling method, system and device based on transfer learning

Country Status (1)

Country Link
CN (1) CN111062056B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635462A (en) * 2018-12-17 2019-04-16 深圳前海微众银行股份有限公司 Model parameter training method, device, equipment and medium based on federation's study
US20190228099A1 (en) * 2018-01-21 2019-07-25 Microsoft Technology Licensing, Llc. Question and answer pair generation using machine learning
CN110399742A (en) * 2019-07-29 2019-11-01 深圳前海微众银行股份有限公司 A kind of training, prediction technique and the device of federation's transfer learning model
CN110503204A (en) * 2018-05-17 2019-11-26 国际商业机器公司 Identification is used for the migration models of machine learning task


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800466A (en) * 2021-02-10 2021-05-14 支付宝(杭州)信息技术有限公司 Data processing method and device based on privacy protection and server
CN112800466B (en) * 2021-02-10 2022-04-22 支付宝(杭州)信息技术有限公司 Data processing method and device based on privacy protection and server

Also Published As

Publication number Publication date
CN111062056B (en) 2022-03-15

Similar Documents

Publication Publication Date Title
US20220382564A1 (en) Aggregate features for machine learning
CN110084377B (en) Method and device for constructing decision tree
CN111537945B (en) Intelligent ammeter fault diagnosis method and equipment based on federal learning
WO2021047535A1 (en) Method, apparatus and system for secure vertical federated learning
US20220083690A1 (en) Obtaining jointly trained model based on privacy protection
DE112021004908T5 Computer-based systems, computation components and computation objects set up to implement dynamic outlier bias reduction in machine learning models
CN111737749A (en) Measuring device alarm prediction method and device based on federal learning
CN109478263A (en) System and equipment for architecture assessment and strategy execution
Wu et al. Federated unlearning: Guarantee the right of clients to forget
CN112529101B (en) Classification model training method and device, electronic equipment and storage medium
Galtier et al. Substra: a framework for privacy-preserving, traceable and collaborative machine learning
CN113033825B (en) Model training method, system and device for privacy protection
CN113011895B (en) Associated account sample screening method, device and equipment and computer storage medium
CN115659408B (en) Method, system and storage medium for sharing sensitive data of power system
US20230032848A1 (en) Leveraging Blockchain Based Machine Learning Modeling For Expense Categorization
CN113032835B (en) Model training method, system and device for privacy protection
CN111062056B (en) Private data protection modeling method, system and device based on transfer learning
CN116150663A (en) Data classification method, device, computer equipment and storage medium
US20220253544A1 (en) System for secure obfuscation of electronic data with data format preservation
Mageshkumar et al. An improved secure file deduplication avoidance using CKHO based deep learning model in a cloud environment
Bhowmik et al. mTrust: call behavioral trust predictive analytics using unsupervised learning in mobile cloud computing
US20230196136A1 (en) Machine learning model predictions via augmenting time series observations
CN111737319B (en) User cluster prediction method, device, computer equipment and storage medium
CN114723012A (en) Computing method and device based on distributed training system
CN112784990A (en) Training method of member inference model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40028526
Country of ref document: HK

GR01 Patent grant