CN117034000A - Modeling method and device for longitudinal federated learning, storage medium and electronic equipment


Info

Publication number: CN117034000A
Application number: CN202310310963.0A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN117034000B
Inventors: 高雅, 潘峰, 赵立超
Current and original assignee: Zhejiang Mingri Data Intelligence Co ltd
Application filed by Zhejiang Mingri Data Intelligence Co ltd
Priority: CN202310310963.0A
Publications: CN117034000A (application), CN117034000B (grant)
Legal status: Granted; Active
Prior art keywords: encryption, encrypted, provider, common, positive sample

Classifications

    • G06F18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/27: Pattern recognition; Analysing; Regression, e.g. linear or logistic regression
    • G06N3/08: Computing arrangements based on biological models; Neural networks; Learning methods


Abstract

The invention discloses a modeling method and device for longitudinal federated learning, a storage medium, and an electronic device. The method comprises the following steps: sending a public key and partial encrypted positive sample IDs to each provider, so that each provider matches the partial encrypted positive sample IDs against its own full set of IDs, obtains an encryption candidate set, and sends the encryption candidate set to a coordinator; determining a common encryption candidate set from all the encryption candidate sets; determining common encrypted positive sample IDs from the encrypted positive sample IDs and the common encryption candidate set; and deriving common encrypted negative sample IDs from the common encryption candidate set and the common encrypted positive sample IDs, then sending the common encrypted negative sample IDs to each provider. The invention solves the technical problem that a business party cannot build a logistic regression model when negative samples are missing.

Description

Modeling method and device for longitudinal federated learning, storage medium and electronic equipment
Technical Field
The invention relates to the field of computers, and in particular to a modeling method and device for longitudinal federated learning (also known as vertical federated learning), a storage medium, and an electronic device.
Background
When a business party builds a classification model in longitudinal federated learning, the common practice is as follows: the business party hosts its positive and negative sample ID data with a coordinator; the coordinator generates a public key and a private key and sends the public key to each data provider; each data provider encrypts its local ID data with the public key and uploads it to the coordinator; the coordinator encrypts the positive and negative sample data hosted by the business party with the public key and then performs a private set intersection with the collected encrypted IDs to obtain the aligned common encrypted IDs; the common encrypted ID data is returned to each data provider, and each data provider performs feature matching on the data and builds the model. This approach has two drawbacks. First, it requires the business party to supply both positive and negative sample data when building a classification model, so that the different data providers can match features against them; in many industries, however, the business party holds only positive samples and has no negative sample data, and aligning the positive and negative samples of different data providers while preserving security becomes a problem when negative samples are missing. Second, each data provider must encrypt its full local ID data and upload it to the coordinator for multiparty data exchange. When the data volume is large, uploading is very slow, and if anything goes wrong during transmission, the entire data set may be leaked, which is a serious risk.
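To make the conventional flow concrete, here is a minimal sketch of the encrypted-ID intersection it describes. The deterministic enc stand-in (HMAC keyed with the public key) and all names are assumptions for illustration only; the point is that every provider must encrypt and upload its entire ID set.

```python
import hashlib
import hmac

def enc(pk: bytes, sample_id: str) -> str:
    # Deterministic stand-in for encrypting an ID with the public key (assumption).
    return hmac.new(pk, sample_id.encode(), hashlib.sha256).hexdigest()

def conventional_align(host_ids, provider_id_sets, pk):
    """host_ids: the business party's positive AND negative sample IDs;
    provider_id_sets: each provider's FULL local ID set, all uploaded."""
    common = {enc(pk, i) for i in host_ids}
    for ids in provider_id_sets:
        common &= {enc(pk, i) for i in ids}  # private set intersection, in spirit
    return common  # aligned common encrypted IDs, returned to every provider
```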
Disclosure of Invention
The embodiments of the invention provide a modeling method and device for longitudinal federated learning, a storage medium, and an electronic device, so as to at least solve the technical problem that a business party cannot build a logistic regression model when negative samples are missing.
According to one aspect of an embodiment of the present invention, there is provided a modeling method for longitudinal federated learning, including: sending a public key and partial encrypted positive sample IDs to each provider, so that each provider matches the partial encrypted positive sample IDs against its own full set of IDs, obtains an encryption candidate set, and sends the encryption candidate set to the coordinator; determining a common encryption candidate set from all the encryption candidate sets; determining common encrypted positive sample IDs from the encrypted positive sample IDs and the common encryption candidate set; and deriving common encrypted negative sample IDs from the common encryption candidate set and the common encrypted positive sample IDs, then sending the common encrypted negative sample IDs to each provider.
According to another aspect of an embodiment of the present invention, there is provided a modeling apparatus for longitudinal federated learning, including: a sending module, configured to send a public key and partial encrypted positive sample IDs to each provider, so that each provider matches the partial encrypted positive sample IDs against its own full set of IDs, obtains an encryption candidate set, and sends the encryption candidate set to the coordinator; a first processing module, configured to determine a common encryption candidate set from all the encryption candidate sets; a second processing module, configured to determine common encrypted positive sample IDs from the encrypted positive sample IDs and the common encryption candidate set; and a third processing module, configured to derive common encrypted negative sample IDs from the common encryption candidate set and the common encrypted positive sample IDs and send the common encrypted negative sample IDs to each provider.
As an alternative example, the above apparatus further includes: an obtaining module, configured to obtain a positive sample ID of a service party before sending the public key and the partially encrypted positive sample ID to each provider; the generation module is used for generating the public key and the private key; the encryption module is used for encrypting the positive sample ID by using the public key to obtain the encrypted positive sample ID; and the determining module is used for determining the first N bits of the encrypted positive sample ID as the partially encrypted positive sample ID.
As an optional example, the third processing module includes: and the first processing unit is used for removing the common encryption positive sample ID from the common encryption candidate set to obtain the common encryption negative sample ID.
As an alternative example, the above apparatus further includes: the creation module is used for initializing a logistic regression model; the training module is used for training the logistic regression model and comprises the following steps: the following steps are executed until the recognition rate of the logistic regression model reaches the target threshold value: receiving an encryption inner product sent by each provider, wherein the encryption inner product is obtained by calculating by each provider according to local data and parameters of the logistic regression model and encrypting by using the public key; processing each encryption inner product by using a semi-homomorphic encryption technology to obtain encryption residual data; transmitting the encrypted residual data to each provider, so that each provider calculates a first encryption gradient according to the encrypted residual data, and adds the first encryption gradient to the random number to obtain a second encryption gradient; after receiving the second encryption gradients sent by each provider, decrypting each second encryption gradient through a private key to obtain a first gradient corresponding to each second encryption gradient; and sending each first gradient to a provider corresponding to the first gradient, so that each provider subtracts the respective random number from the corresponding first gradient to obtain a second gradient, and updating the parameters according to the second gradient.
As an alternative example, the training module includes: the second processing unit is used for processing each encryption inner product by using a semi-homomorphic encryption technology to obtain encryption residual data; and a third processing unit, configured to calculate the sum of all the encrypted inner products, the common encrypted positive sample ID, and the common encrypted negative sample ID by using the semi-homomorphic encryption technique, so as to obtain the encrypted residual data.
As an alternative example, the above apparatus further includes: a receiving module, configured to receive, after the logistic regression model training is completed, encrypted score data sent by each provider, where the encrypted score data includes encryption scores and encryption IDs; each provider obtains the encryption scores by predicting on its own full set of IDs with the logistic regression model and encrypting the resulting scores with the public key, and obtains the encryption IDs by encrypting its full set of IDs with the public key; and a fourth processing module, configured to process all the encrypted score data using the semi-homomorphic encryption technique to obtain a fusion score corresponding to each provider.
As an optional example, the fourth processing module includes: a fourth processing unit, configured to perform semi-homomorphic encryption summation on each encryption score in the encryption score data to obtain a final encryption score corresponding to each encryption score; and the decryption unit is used for decrypting each final encryption score by using the private key to obtain a fusion score corresponding to each final encryption score.
According to yet another aspect of an embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, performs the modeling method of longitudinal federated learning described above.
According to yet another aspect of an embodiment of the present invention, there is also provided an electronic device including a memory and a processor, the memory storing a computer program, and the processor being configured to execute the modeling method of longitudinal federated learning described above by means of the computer program.
The modeling method for longitudinal federated learning can be used in the federated learning stage of privacy-preserving computation. In the embodiment of the invention, the public key and the partial encrypted positive sample IDs are sent to each provider, so that each provider matches them against its own full set of IDs, obtains an encryption candidate set, and sends it to the coordinator; a common encryption candidate set is determined from all the encryption candidate sets; common encrypted positive sample IDs are determined from the encrypted positive sample IDs and the common encryption candidate set; and common encrypted negative sample IDs are derived from the common encryption candidate set and the common encrypted positive sample IDs and sent to each provider. In this method, when the business party provides only positive sample IDs, the partial encrypted positive sample IDs derived from those IDs are matched against the full ID sets of the providers, and the common encrypted negative sample IDs are obtained algorithmically, so that a logistic regression model can be created. This achieves the goal of uploading only a small amount of data when the business party has only positive samples while still completing the alignment of positive and negative sample IDs across different providers, thereby solving the technical problem that a business party cannot create a logistic regression model when negative samples are missing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an alternative modeling method for longitudinal federated learning according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative modeling apparatus for longitudinal federated learning according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to a first aspect of an embodiment of the present invention, there is provided a modeling method for longitudinal federated learning. Optionally, as shown in FIG. 1, the method includes:
s102, sending the public key and the partial encryption positive sample ID to each provider so that each provider can be matched with the respective full quantity ID of each provider according to the partial encryption positive sample ID, obtaining an encryption candidate set and sending the encryption candidate set to a coordinator;
s104, determining a common encryption candidate set according to all the encryption candidate sets;
s106, determining a common encrypted positive sample ID according to the encrypted positive sample ID and the common encrypted candidate set;
s108, obtaining a common encryption negative sample ID according to the common encryption candidate set and the common encryption positive sample ID, and sending the common encryption negative sample ID to each provider.
Optionally, in this embodiment, federated learning is a distributed machine learning technique whose core idea is to perform distributed model training among multiple data sources holding local data and to construct a global model based on virtually fused data while exchanging only model parameters or intermediate results, never the local sample data themselves. Longitudinal (vertical) federated learning applies to scenarios in which the participants' data sets share the same sample space but have different feature spaces. The logistic regression model is a classification model for linearly separable problems that is easy to implement and performs well; since the core of logistic regression is in fact a linear regression, it is very well suited to federated learning. The public key and the private key are a key pair generated by an algorithm: the one disclosed to the outside is called the public key, and the one kept by its owner is called the private key.
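The linearity that makes logistic regression convenient here can be made explicit. In a vertical split across, say, two providers A and B (a minimal illustration; the notation is not taken from the patent), the linear score decomposes additively, so each provider can compute its share locally:

```latex
\hat{y} = \sigma(z), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
z = \mathbf{w}^{\top}\mathbf{x}
  = \underbrace{\mathbf{w}_A^{\top}\mathbf{x}_A}_{\text{computed at provider }A}
  + \underbrace{\mathbf{w}_B^{\top}\mathbf{x}_B}_{\text{computed at provider }B}
```

This per-provider share is exactly the "inner product" that each provider computes and encrypts in the training steps described below.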
Optionally, in this embodiment, encryption means encrypting the local full data with the public key to produce encrypted data; the same data encrypted with the same public key yields the same encrypted value. The candidate sets exist for security: a data provider never learns the specific value of a positive sample ID. It receives only the first N bits of each encrypted positive sample ID, and such an N-bit prefix may match several IDs in its own data whose encrypted values share that prefix; those IDs form a set, the candidate set. For example, suppose an encrypted positive sample ID is abcdefg and N is 3. Each data provider then receives only abc, the first three bits, and matches abc against the first three bits of its own locally encrypted data, possibly matching many entries, such as abcccc, abcddd, and abcdefg. Each data provider thus has a candidate set; the candidate sets of the different data providers are intersected, and the IDs appearing in all candidate sets are the common ones.
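A minimal sketch of the provider-side prefix matching just described. The deterministic enc stand-in (HMAC-SHA256 keyed with the public key) is only an assumption used to make the stated property, that the same key and the same ID yield the same ciphertext, concrete; the patent names no encryption scheme, and all function and variable names are illustrative.

```python
import hashlib
import hmac

def enc(pk: bytes, sample_id: str) -> str:
    # Deterministic stand-in for public-key encryption of an ID (assumption).
    return hmac.new(pk, sample_id.encode(), hashlib.sha256).hexdigest()

def candidate_set(local_ids, positive_prefixes, pk, n=3):
    """Keep every locally encrypted ID whose first n characters match one of
    the n-character prefixes of the encrypted positive sample IDs."""
    prefixes = set(positive_prefixes)
    return {c for c in (enc(pk, i) for i in local_ids) if c[:n] in prefixes}

# Mirroring the example in the text: a provider that receives the prefix "abc"
# would keep encrypted IDs such as "abcccc", "abcddd" and "abcdefg".
```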
Optionally, in this embodiment, the business party hosts the positive sample IDs with the coordinator; the coordinator generates the public key and the private key, encrypts the positive sample IDs with the public key to obtain the encrypted positive sample IDs, and transmits the first N bits of each encrypted positive sample ID (that is, the partial encrypted positive sample IDs) to each data provider. After receiving the partial encrypted positive sample IDs, each data provider encrypts its full local ID data with the public key, matches the first N bits of all its encrypted IDs against the identical first N bits of the encrypted positive sample IDs to obtain an encryption candidate set, and uploads the encryption candidate set to the coordinator. The coordinator intersects the encryption candidate sets uploaded by all data providers to obtain the common encryption candidate set, and intersects the encrypted positive sample IDs with the common encryption candidate set to obtain the common encrypted positive sample IDs shared by all data parties. Finally, the matched encrypted positive sample ID data is removed from the common encryption candidate set, a number of IDs of comparable magnitude is sampled from what remains to obtain the common encrypted negative sample IDs, and the common encrypted negative sample IDs are returned to each data provider. Each data provider locally performs feature matching on the common encrypted negative sample IDs and the common encrypted positive sample IDs, matches out the corresponding features, and initializes a logistic regression model.
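The coordinator-side set logic in the paragraph above can be sketched in a few lines. The negative-to-positive sampling ratio and all names are assumptions for illustration; the text only says that a quantity of comparable magnitude is sampled.

```python
import random

def coordinator_align(candidate_sets, encrypted_positive_ids, neg_ratio=1.0, seed=0):
    """Intersect the providers' encrypted candidate sets, split out the common
    positives, and sample common negatives from what remains."""
    common = set.intersection(*candidate_sets)             # common encryption candidate set
    common_pos = common & set(encrypted_positive_ids)      # common encrypted positive sample IDs
    pool = list(common - common_pos)                       # candidates with positives removed
    k = min(len(pool), int(neg_ratio * len(common_pos)))   # comparable magnitude (assumption)
    common_neg = set(random.Random(seed).sample(pool, k))  # common encrypted negative sample IDs
    return common_pos, common_neg
```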
Optionally, in this embodiment, when the business party provides only positive sample IDs, the partial encrypted positive sample IDs derived from those IDs are matched against the full ID set of each provider, and the common encrypted negative sample IDs are obtained algorithmically, so that a logistic regression model can be created. This makes it possible to upload only a small amount of data when the business party has only positive samples while still completing the alignment of positive and negative sample IDs across different providers, thereby solving the technical problem that a business party cannot create a logistic regression model when negative samples are missing.
As an alternative example, before sending the public key and the partially encrypted positive sample ID to each provider, the method further comprises:
acquiring a positive sample ID of a service party;
generating a public key and a private key;
encrypting the positive sample ID by using the public key to obtain an encrypted positive sample ID;
the first N bits of the encrypted positive sample ID are determined to be the partially encrypted positive sample ID.
Optionally, in this embodiment, the service party hosts the positive sample ID to the coordinator, the coordinator generates the public key and the private key, the coordinator encrypts the positive sample ID using the public key to obtain an encrypted positive sample ID, and transmits the first N bits of the encrypted positive sample ID, that is, a part of the encrypted positive sample ID, to each data provider.
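A sketch of the coordinator-side preparation listed above. The key pair and the deterministic encryption are loud placeholders (secrets and HMAC), assumed rather than taken from the patent, which names no concrete scheme; N = 3 follows the earlier example.

```python
import hashlib
import hmac
import secrets

def generate_keys():
    # Placeholder key pair; a real deployment would pair an additively
    # homomorphic scheme (e.g. Paillier) for the training phase with a
    # deterministic scheme for ID matching. Both choices are assumptions.
    sk = secrets.token_bytes(32)
    pk = hashlib.sha256(sk).digest()
    return pk, sk

def prepare_positive_ids(positive_ids, pk, n=3):
    """Encrypt the business party's positive sample IDs and keep the first
    n characters of each ciphertext as the partial encrypted positive IDs."""
    encrypt = lambda i: hmac.new(pk, i.encode(), hashlib.sha256).hexdigest()
    encrypted = [encrypt(i) for i in positive_ids]
    return encrypted, [c[:n] for c in encrypted]
```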
As an alternative example, deriving the common encrypted negative sample ID from the common encrypted candidate set and the common encrypted positive sample ID comprises:
and removing the common encrypted positive sample ID from the common encryption candidate set to obtain the common encrypted negative sample ID.
Optionally, in this embodiment, the matched encrypted positive sample ID data is removed from the common encryption candidate set, and a number of IDs of comparable magnitude is sampled from the remainder to obtain the common encrypted negative sample IDs.
As an alternative example, after sending the common encrypted negative sample ID to each provider, the method further includes:
initializing a logistic regression model;
training the logistic regression model, including: the following steps are executed until the recognition rate of the logistic regression model reaches a target threshold:
receiving an encryption inner product sent by each provider, wherein the encryption inner product is obtained by calculating by each provider according to local data and parameters of a logistic regression model and encrypting by using a public key;
processing each encryption inner product by using a semi-homomorphic encryption technology to obtain encryption residual data;
sending the encrypted residual data to each provider, so that each provider calculates a first encryption gradient according to the encrypted residual data, and adding the first encryption gradient and the random number to obtain a second encryption gradient;
After receiving the second encryption gradients sent by each provider, decrypting each second encryption gradient through a private key to obtain a first gradient corresponding to each second encryption gradient;
and sending each first gradient to a provider corresponding to the first gradient, so that each provider subtracts the respective random number from the corresponding first gradient to obtain a second gradient, and updating parameters according to the second gradient.
Optionally, in this embodiment, each provider performs feature matching on the common encrypted negative sample IDs, initializes a logistic regression model, and trains it by repeating the following steps. Each data provider computes an inner product from its local data and its share of the model parameters, encrypts the inner product with the public key, and uploads it to the coordinator. The coordinator collects all the encrypted inner products, combines them using the semi-homomorphic encryption technique, and computes with the sum of all the encrypted inner products, the common encrypted positive sample IDs, and the common encrypted negative sample IDs to obtain the encrypted residual data. The coordinator returns the encrypted residual data to each data provider; each data provider computes a first encryption gradient for its own parameters from the encrypted residual data, adds a random number to it to obtain a second encryption gradient, and transmits the second encryption gradient to the coordinator. The coordinator decrypts each second encryption gradient with the private key to obtain the corresponding first gradient and returns each first gradient to the corresponding data provider. Each data provider subtracts its random number from the first gradient to obtain the decrypted second gradient and updates the parameters of the logistic regression model accordingly. One round of training is thereby completed, and the steps are repeated until the recognition rate of the logistic regression model reaches the target threshold.
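The round just described, sketched end to end. Everything is computed in the clear for readability; comments mark what would be ciphertext, and the first-order Taylor approximation sigma(z) ≈ 0.25z + 0.5 used to form the residual under additive homomorphism is an assumption of this sketch (a common device in vertical federated logistic regression), not something the patent states.

```python
import numpy as np

def training_round(providers, labels, lr=0.1, rng=None):
    """One round of the protocol above. Each provider is a dict with its local
    feature block "X" (m x d_k) and parameter share "w" (d_k,); labels are 1
    for common positive IDs and 0 for common negative IDs, in aligned order."""
    rng = rng or np.random.default_rng(0)
    # 1. each provider computes its local inner product and uploads it encrypted
    inner = [p["X"] @ p["w"] for p in providers]
    # 2. coordinator sums the encrypted inner products (homomorphic addition)
    #    and forms the encrypted residual via the Taylor-approximated sigmoid
    z = np.sum(inner, axis=0)
    residual = (0.25 * z + 0.5) - labels          # encrypted residual data
    for p in providers:
        # 3. provider: first encryption gradient, masked with a random number
        grad = p["X"].T @ residual / len(labels)  # first encryption gradient
        mask = rng.normal(size=grad.shape)        # provider's random number
        masked = grad + mask                      # second encryption gradient
        # 4. coordinator decrypts the masked gradient (a no-op in this sketch);
        # 5. provider removes its mask and updates its parameter share
        p["w"] -= lr * (masked - mask)
    return providers
```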
As an alternative example, processing each encrypted inner product using a semi-homomorphic encryption technique, resulting in encrypted residual data includes:
and calculating the sum of all the encrypted inner products, the common encrypted positive sample ID and the common encrypted negative sample ID by using a semi-homomorphic encryption technology to obtain encrypted residual data.
Optionally, in this embodiment, the coordinator collects all the encrypted inner products, combines them using the semi-homomorphic encryption technique, and computes with the common encrypted positive sample IDs and the common encrypted negative sample IDs to obtain the encrypted residual data.
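One way to read this computation, written out under additive homomorphism with [[.]] denoting ciphertexts and circled operators denoting homomorphic addition and subtraction; the Taylor-approximated sigmoid is again this sketch's assumption, not the patent's wording:

```latex
[\![r]\!] = \Big( 0.25 \cdot \bigoplus_{k} [\![\mathbf{X}_k \mathbf{w}_k]\!] \;\oplus\; 0.5 \Big) \ominus y,
\qquad
y_i = \begin{cases} 1, & i \in \text{common encrypted positive sample IDs} \\ 0, & i \in \text{common encrypted negative sample IDs} \end{cases}
```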
As an alternative example, after the logistic regression model training is completed, the method further includes:
receiving the encrypted score data sent by each provider, wherein the encrypted score data includes encryption scores and encryption IDs; each provider obtains the encryption scores by predicting on its full set of IDs with the logistic regression model and encrypting the resulting scores with the public key, and obtains the encryption IDs by encrypting its full set of IDs with the public key;
and processing all the encrypted fraction data by using a semi-homomorphic encryption technology to obtain a fusion fraction corresponding to each provider.
Optionally, in this embodiment, each data provider holds part of the parameters of the overall logistic regression model. Each provider predicts on its own full ID data with the trained logistic regression model to obtain scores for the different IDs, encrypts the full IDs and the scores with the public key to obtain encryption scores keyed by encryption IDs, and uploads them to the coordinator. The coordinator aggregates the data of all data providers, sums the scores for each ID using semi-homomorphic encryption to obtain a final encryption score, and decrypts the final encryption score with the private key, yielding the fusion score corresponding to each ID.
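A compact sketch of this aggregation; plain floats stand in for ciphertexts and the decrypt callback stands in for the coordinator's private-key decryption (both assumptions, as are the names):

```python
from collections import defaultdict

def fuse_scores(provider_score_data, decrypt=lambda c: c):
    """provider_score_data: one {encryption_id: encryption_score} dict per
    provider. Scores for the same encrypted ID are summed homomorphically,
    then decrypted once to give the fusion score per ID."""
    totals = defaultdict(float)
    for scores in provider_score_data:
        for enc_id, enc_score in scores.items():
            totals[enc_id] += enc_score   # semi-homomorphic summation
    return {enc_id: decrypt(total) for enc_id, total in totals.items()}

# Example: fuse_scores([{"ab12": 0.4}, {"ab12": 0.3}]) -> {"ab12": 0.7}
```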
As an alternative example, processing all the encrypted score data using the semi-homomorphic encryption technique to obtain the fusion score corresponding to each provider includes:
carrying out semi-homomorphic encryption summation on each encryption score in the encryption score data to obtain a final encryption score corresponding to each encryption score;
and decrypting each final encryption score by using the private key to obtain a fusion score corresponding to each final encryption score.
Optionally, in this embodiment, the coordinator aggregates the data of each data provider, sums each score with semi-homomorphic encryption, and then decrypts the score with the private key, so as to obtain the fusion score corresponding to each ID.
Optionally, when the business party lacks negative samples, the present application not only reduces the cost of data transmission by matching on the first N bits of the encrypted IDs, but also matches positive samples accurately and selects the required negative samples from the data that did not match any positive sample, thereby achieving fast alignment of the positive and negative samples of different data providers. Moreover, all data remains encrypted throughout, avoiding the risk of data leakage.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
According to another aspect of the embodiments of the present application, there is also provided a modeling apparatus for longitudinal federated learning, as shown in FIG. 2, including:
a sending module 202, configured to send the public key and the partial encrypted positive sample ID to each provider, so that each provider matches the partial encrypted positive sample ID with the respective full ID of each provider, and obtains an encryption candidate set and sends the encryption candidate set to the coordinator;
A first processing module 204, configured to determine a common encryption candidate set according to all the encryption candidate sets;
a second processing module 206, configured to determine a common encrypted positive sample ID according to the encrypted positive sample ID and the common encryption candidate set;
the third processing module 208 is configured to obtain a common encrypted negative sample ID according to the common encrypted candidate set and the common encrypted positive sample ID, and send the common encrypted negative sample ID to each provider.
Optionally, in this embodiment, federated learning is a distributed machine learning technique whose core idea is to perform distributed model training among multiple data sources holding local data and to construct a global model based on virtually fused data while exchanging only model parameters or intermediate results, never the local sample data themselves. Longitudinal (vertical) federated learning applies to scenarios in which the participants' data sets share the same sample space but have different feature spaces. The logistic regression model is a classification model for linearly separable problems that is easy to implement and performs well; since the core of logistic regression is in fact a linear regression, it is very well suited to federated learning. The public key and the private key are a key pair generated by an algorithm: the one disclosed to the outside is called the public key, and the one kept by its owner is called the private key.
Optionally, in this embodiment, encryption means encrypting the local full data with the public key to produce encrypted data; the same data encrypted with the same public key yields the same encrypted value. The candidate sets exist for security: a data provider never learns the specific value of a positive sample ID. It receives only the first N bits of each encrypted positive sample ID, and such an N-bit prefix may match several IDs in its own data whose encrypted values share that prefix; those IDs form a set, the candidate set. For example, suppose an encrypted positive sample ID is abcdefg and N is 3. Each data provider then receives only abc, the first three bits, and matches abc against the first three bits of its own locally encrypted data, possibly matching many entries, such as abcccc, abcddd, and abcdefg. Each data provider thus has a candidate set; the candidate sets of the different data providers are intersected, and the IDs appearing in all candidate sets are the common ones.
Optionally, in this embodiment, the business party hosts the positive sample IDs with the coordinator; the coordinator generates the public key and the private key, encrypts the positive sample IDs with the public key to obtain the encrypted positive sample IDs, and transmits the first N bits of each encrypted positive sample ID (that is, the partial encrypted positive sample IDs) to each data provider. After receiving the partial encrypted positive sample IDs, each data provider encrypts its full local ID data with the public key, matches the first N bits of all its encrypted IDs against the identical first N bits of the encrypted positive sample IDs to obtain an encryption candidate set, and uploads the encryption candidate set to the coordinator. The coordinator intersects the encryption candidate sets uploaded by all data providers to obtain the common encryption candidate set, and intersects the encrypted positive sample IDs with the common encryption candidate set to obtain the common encrypted positive sample IDs shared by all data parties. Finally, the matched encrypted positive sample ID data is removed from the common encryption candidate set, a number of IDs of comparable magnitude is sampled from what remains to obtain the common encrypted negative sample IDs, and the common encrypted negative sample IDs are returned to each data provider. Each data provider locally performs feature matching on the common encrypted negative sample IDs and the common encrypted positive sample IDs, matches out the corresponding features, and initializes a logistic regression model.
Optionally, in this embodiment, when the business party provides only positive sample IDs, the partial encrypted positive sample IDs derived from those IDs are matched against the full ID set of each provider, and the common encrypted negative sample IDs are obtained algorithmically, so that a logistic regression model can be created. This makes it possible to upload only a small amount of data when the business party has only positive samples while still completing the alignment of positive and negative sample IDs across different providers, thereby solving the technical problem that a business party cannot create a logistic regression model when negative samples are missing.
As an alternative example, the above apparatus further includes:
an obtaining module, configured to obtain a positive sample ID of a service party before sending the public key and the partially encrypted positive sample ID to each provider;
the generation module is used for generating a public key and a private key;
the encryption module is used for encrypting the positive sample ID by using the public key to obtain an encrypted positive sample ID;
and the determining module is used for determining the first N bits of the encrypted positive sample ID as the partial encrypted positive sample ID.
Optionally, in this embodiment, the service party hosts the positive sample ID to the coordinator, the coordinator generates the public key and the private key, the coordinator encrypts the positive sample ID using the public key to obtain an encrypted positive sample ID, and transmits the first N bits of the encrypted positive sample ID, that is, a part of the encrypted positive sample ID, to each data provider.
As an alternative example, the third processing module includes:
and the first processing unit is used for removing the common encryption positive sample ID from the common encryption candidate set to obtain the common encryption negative sample ID.
Optionally, in this embodiment, the matched encrypted positive sample ID data is removed from the common encryption candidate set, and a number of IDs of comparable magnitude is sampled from the remainder to obtain the common encrypted negative sample IDs.
As an alternative example, the above apparatus further includes:
the creation module is used for initializing a logistic regression model;
the training module is used for training the logistic regression model and comprises the following steps:
the following steps are executed until the recognition rate of the logistic regression model reaches a target threshold:
receiving an encryption inner product sent by each provider, wherein the encryption inner product is obtained by calculating by each provider according to local data and parameters of a logistic regression model and encrypting by using a public key;
processing each encryption inner product by using a semi-homomorphic encryption technology to obtain encryption residual data;
sending the encrypted residual data to each provider, so that each provider calculates a first encryption gradient according to the encrypted residual data, and adding the first encryption gradient and the random number to obtain a second encryption gradient;
After receiving the second encryption gradients sent by each provider, decrypting each second encryption gradient through a private key to obtain a first gradient corresponding to each second encryption gradient;
and sending each first gradient to a provider corresponding to the first gradient, so that each provider subtracts the respective random number from the corresponding first gradient to obtain a second gradient, and updating parameters according to the second gradient.
Optionally, in this embodiment, each provider performs feature matching on the common encrypted negative sample IDs, initializes a logistic regression model, and trains it by repeating the following steps. Each data provider computes an inner product from its local data and its share of the model parameters, encrypts the inner product with the public key, and uploads it to the coordinator. The coordinator collects all the encrypted inner products, combines them using the semi-homomorphic encryption technique, and computes with the sum of all the encrypted inner products, the common encrypted positive sample IDs, and the common encrypted negative sample IDs to obtain the encrypted residual data. The coordinator returns the encrypted residual data to each data provider; each data provider computes a first encryption gradient for its own parameters from the encrypted residual data, adds a random number to it to obtain a second encryption gradient, and transmits the second encryption gradient to the coordinator. The coordinator decrypts each second encryption gradient with the private key to obtain the corresponding first gradient and returns each first gradient to the corresponding data provider. Each data provider subtracts its random number from the first gradient to obtain the decrypted second gradient and updates the parameters of the logistic regression model accordingly. One round of training is thereby completed, and the steps are repeated until the recognition rate of the logistic regression model reaches the target threshold.
As an alternative example, the training module includes:
the second processing unit is used for processing each encryption inner product by using a semi-homomorphic encryption technology to obtain encryption residual data;
and the third processing unit is used for calculating the sum of all the encrypted inner products, the shared encrypted positive sample ID and the shared encrypted negative sample ID by using a semi-homomorphic encryption technology to obtain encrypted residual data.
Optionally, in this embodiment, the coordinator collects all the encrypted inner products, combines them using the semi-homomorphic encryption technique, and computes with the common encrypted positive sample IDs and the common encrypted negative sample IDs to obtain the encrypted residual data.
As an alternative example, the above apparatus further includes:
the receiving module is used for receiving the encrypted score data sent by each provider after the logistic regression model training is completed, wherein the encrypted score data includes encryption scores and encryption IDs; each provider obtains the encryption scores by predicting on its full set of IDs with the logistic regression model and encrypting the resulting scores with the public key, and obtains the encryption IDs by encrypting its full set of IDs with the public key;
and the fourth processing module is used for processing all the encrypted fraction data by using a semi-homomorphic encryption technology to obtain the fusion fraction corresponding to each provider.
Optionally, in this embodiment, each data provider holds part of the parameters of the overall logistic regression model. Each provider predicts on its own full ID data with the trained logistic regression model to obtain scores for the different IDs, encrypts the full IDs and the scores with the public key to obtain encryption scores keyed by encryption IDs, and uploads them to the coordinator. The coordinator aggregates the data of all data providers, sums the scores for each ID using semi-homomorphic encryption to obtain a final encryption score, and decrypts the final encryption score with the private key, yielding the fusion score corresponding to each ID.
As an alternative example, the fourth processing module includes:
the fourth processing unit is used for carrying out semi-homomorphic encryption summation on each encryption score in the encryption score data to obtain a final encryption score corresponding to each encryption score;
and the decryption unit is used for decrypting each final encryption score by using the private key to obtain a fusion score corresponding to each final encryption score.
Optionally, in this embodiment, the coordinator aggregates the data of each data provider, sums each score with semi-homomorphic encryption, and then decrypts the score with the private key, so as to obtain the fusion score corresponding to each ID.
Optionally, when the business party lacks negative samples, the present application not only reduces the cost of data transmission by matching on the first N bits of the encrypted IDs, but also matches positive samples accurately and selects the required negative samples from the data that did not match any positive sample, thereby achieving fast alignment of the positive and negative samples of different data providers. Moreover, all data remains encrypted throughout, avoiding the risk of data leakage.
For other examples of this embodiment, please refer to the examples above; they are not repeated here.
FIG. 3 is a schematic diagram of an alternative electronic device according to an embodiment of the application. As shown in FIG. 3, the electronic device includes a processor 302, a communication interface 304, a memory 306, and a communication bus 308, where the processor 302, the communication interface 304, and the memory 306 communicate with each other via the communication bus 308.
a memory 306 for storing a computer program;
the processor 302 is configured to execute the computer program stored in the memory 306, and implement the following steps:
the public key and the partial encryption positive sample ID are sent to each provider, so that each provider can be matched according to the partial encryption positive sample ID and the respective full ID of each provider, an encryption candidate set is obtained, and the encryption candidate set is sent to a coordinator;
Determining a common encryption candidate set according to all the encryption candidate sets;
determining a common encrypted positive sample ID according to the encrypted positive sample ID and the common encryption candidate set;
and obtaining a common encrypted negative sample ID according to the common encryption candidate set and the common encrypted positive sample ID, and sending the common encrypted negative sample ID to each provider.
Optionally, in this embodiment, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 3, but this does not mean there is only one bus or only one type of bus. The communication interface is used for communication between the electronic device and other devices.
The memory may include RAM or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
As an example, the memory 306 may include, but is not limited to, the sending module 202, the first processing module 204, the second processing module 206, and the third processing module 208 of the modeling apparatus for longitudinal federated learning described above. In addition, other module units of the modeling apparatus for longitudinal federated learning may also be included, but are not limited to those listed, and are not described in detail in this example.
The processor may be a general-purpose processor and may include, but is not limited to: a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the structure shown in FIG. 3 is only illustrative. The device implementing the modeling method of longitudinal federated learning may be a terminal device, such as a smart phone (for example, an Android or iOS phone), a tablet computer, a palmtop computer, a mobile internet device (MID), a PAD, and the like. FIG. 3 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 3, or have a different configuration than shown in FIG. 3.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device to execute in association with hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, etc.
According to yet another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, performs the steps in the modeling method of longitudinal federated learning described above.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division into units is merely a logical function division, and another division may be used in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling, direct coupling, or communication connection shown or discussed between components may be through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention; such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (10)

1. A modeling method for longitudinal federated learning, applied to a coordinator, comprising:
sending the public key and the partial encrypted positive sample ID to each provider, so that each provider matches the partial encrypted positive sample ID against its respective full set of IDs, obtains an encryption candidate set, and sends the encryption candidate set to the coordinator;
Determining a common encryption candidate set according to all the encryption candidate sets;
determining a common encrypted positive sample ID according to the encrypted positive sample ID and the common encryption candidate set;
and obtaining a common encryption negative sample ID according to the common encryption candidate set and the common encryption positive sample ID, and sending the common encryption negative sample ID to each provider.
2. The method of claim 1, wherein prior to sending the public key and the partially encrypted positive sample ID to each provider, the method further comprises:
acquiring a positive sample ID of a service party;
generating the public key and the private key;
encrypting the positive sample ID by using the public key to obtain the encrypted positive sample ID;
the first N bits of the encrypted positive sample ID are determined to be the partially encrypted positive sample ID.
3. The method of claim 1, wherein deriving the common encrypted negative sample ID from the common encrypted candidate set and the common encrypted positive sample ID comprises:
and removing the common encryption positive sample ID from the common encryption candidate set to obtain the common encryption negative sample ID.
4. The method of claim 1, wherein after sending the common encrypted negative sample ID to each provider, the method further comprises:
Initializing a logistic regression model;
training the logistic regression model, including: the following steps are executed until the recognition rate of the logistic regression model reaches a target threshold value:
receiving an encryption inner product sent by each provider, wherein the encryption inner product is obtained by calculating by each provider according to local data and parameters of the logistic regression model and encrypting by using the public key;
processing each encryption inner product by using a semi-homomorphic encryption technology to obtain encryption residual data;
sending the encrypted residual data to each provider, so that each provider calculates a first encryption gradient according to the encrypted residual data, and adding the first encryption gradient and the random number to obtain a second encryption gradient;
after receiving the second encryption gradients sent by each provider, decrypting each second encryption gradient through a private key to obtain a first gradient corresponding to each second encryption gradient;
and sending each first gradient to a provider corresponding to the first gradient, so that each provider subtracts the corresponding first gradient by the respective random number to obtain a second gradient, and updating the parameters according to the second gradient.
5. The method of claim 4, wherein processing each encrypted inner product using the semi-homomorphic encryption technique to obtain the encrypted residual data comprises:
calculating, using the semi-homomorphic encryption technique, a sum over all the encrypted inner products together with the common encrypted positive sample ID and the common encrypted negative sample ID to obtain the encrypted residual data.
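
The training round of claims 4 and 5 can be sketched end to end with the same python-paillier keys. Provider-side steps are inlined as plain functions, the sigmoid is linearised as 0.25*z + 0.5 (a common device under additive homomorphic encryption, not something the claims prescribe), and all values are toy data; the function names are hypothetical.

    import random

    def provider_inner_product(weights, features):
        # Each provider computes w . x over its own feature slice and
        # encrypts the partial inner product with the coordinator's key.
        z = sum(w * x for w, x in zip(weights, features))
        return public_key.encrypt(z)

    def coordinator_residual(enc_inner_products, label):
        # Additive homomorphism: sum the encrypted partial inner products,
        # then form an encrypted residual against the label (1 for common
        # positive IDs, 0 for common negative IDs, per claim 5).
        enc_z = sum(enc_inner_products[1:], enc_inner_products[0])
        return enc_z * 0.25 + 0.5 - label  # linearised sigmoid minus label

    def provider_masked_gradient(enc_residual, features):
        # First encrypted gradient per feature, blinded with random masks
        # so the coordinator never sees the true gradient.
        masks = [random.uniform(-1.0, 1.0) for _ in features]
        enc_grads = [enc_residual * x + m for x, m in zip(features, masks)]
        return enc_grads, masks

    def coordinator_decrypt(enc_grads):
        # The coordinator only ever decrypts the masked (second) gradients.
        return [private_key.decrypt(g) for g in enc_grads]

    def provider_update(weights, masked_grads, masks, lr=0.1):
        # Remove the mask, then take an ordinary gradient step.
        return [w - lr * (g - m) for w, g, m in zip(weights, masked_grads, masks)]

The random mask plays the role of a one-time pad here: the coordinator, although it holds the private key, learns only masked gradients, while each provider can strip its own mask without ever seeing the key.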
6. The method of claim 4, wherein, after training of the logistic regression model is completed, the method further comprises:
receiving encrypted score data sent by each provider, wherein the encrypted score data comprises an encrypted score and an encrypted ID, the encrypted score is obtained by each provider by predicting on its full set of IDs with the logistic regression model and encrypting the result with the public key, and the encrypted ID is obtained by each provider by encrypting its full set of IDs with the public key;
and processing all the encrypted score data using the semi-homomorphic encryption technique to obtain a fusion score corresponding to each provider.
7. The method of claim 6, wherein processing all the encrypted score data using the semi-homomorphic encryption technique to obtain the fusion score corresponding to each provider comprises:
performing semi-homomorphic encrypted summation on each encrypted score in the encrypted score data to obtain a final encrypted score corresponding to each encrypted score;
and decrypting each final encrypted score with the private key to obtain the fusion score corresponding to each final encrypted score.
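
The scoring phase of claims 6 and 7 reduces to a per-ID homomorphic sum. The sketch below assumes the deterministic ID encryption from earlier, so that the same underlying ID yields the same dictionary key across providers; fuse_scores is a hypothetical name.

    def fuse_scores(per_provider_scores):
        # per_provider_scores: one dict {encrypted_id: EncryptedNumber}
        # per provider (the encrypted score data of claim 6).
        fused = {}
        for scores in per_provider_scores:
            for enc_id, enc_score in scores.items():
                # Semi-homomorphic summation: ciphertexts add directly.
                fused[enc_id] = fused[enc_id] + enc_score if enc_id in fused else enc_score
        # Only the final per-ID sums are decrypted (claim 7).
        return {enc_id: private_key.decrypt(c) for enc_id, c in fused.items()}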
8. A modeling apparatus for longitudinal federal learning, applied to a coordinator, comprising:
a sending module, configured to send a public key and a partially encrypted positive sample ID to each provider, so that each provider matches the partially encrypted positive sample ID against its own full set of IDs to obtain an encrypted candidate set and sends the encrypted candidate set to the coordinator;
a first processing module, configured to determine a common encrypted candidate set according to all the encrypted candidate sets;
a second processing module, configured to determine a common encrypted positive sample ID according to the encrypted positive sample ID and the common encrypted candidate set;
and a third processing module, configured to obtain a common encrypted negative sample ID according to the common encrypted candidate set and the common encrypted positive sample ID, and to send the common encrypted negative sample ID to each provider.
9. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when executed by a processor, performs the method of any of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has a computer program stored therein, and the processor is arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN202310310963.0A 2023-03-22 2023-03-22 Modeling method and device for longitudinal federal learning, storage medium and electronic equipment Active CN117034000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310310963.0A CN117034000B (en) 2023-03-22 2023-03-22 Modeling method and device for longitudinal federal learning, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117034000A true CN117034000A (en) 2023-11-10
CN117034000B CN117034000B (en) 2024-06-25

Family

ID=88621340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310310963.0A Active CN117034000B (en) 2023-03-22 2023-03-22 Modeling method and device for longitudinal federal learning, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117034000B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299728A (en) * 2018-08-10 2019-02-01 深圳前海微众银行股份有限公司 Federal learning method, system and readable storage medium storing program for executing
WO2021022717A1 (en) * 2019-08-02 2021-02-11 深圳前海微众银行股份有限公司 Method and apparatus for analyzing feature correlation in federated learning, and readable storage medium
CN111523134A (en) * 2020-07-03 2020-08-11 支付宝(杭州)信息技术有限公司 Homomorphic encryption-based model training method, device and system
WO2022121026A1 (en) * 2020-12-10 2022-06-16 广州广电运通金融电子股份有限公司 Collaborative learning method that updates central party, storage medium, terminal and system
CN113051586A (en) * 2021-03-10 2021-06-29 北京沃东天骏信息技术有限公司 Federal modeling system and method, and federal model prediction method, medium, and device
CN113239391A (en) * 2021-07-13 2021-08-10 深圳市洞见智慧科技有限公司 Third-party-free logistic regression federal learning model training system and method
CN114021017A (en) * 2021-11-05 2022-02-08 光大科技有限公司 Information pushing method and device and storage medium
CN114239863A (en) * 2022-02-24 2022-03-25 腾讯科技(深圳)有限公司 Training method of machine learning model, prediction method and device thereof, and electronic equipment
CN115630713A (en) * 2022-08-31 2023-01-20 暨南大学 Longitudinal federated learning method, device and medium under condition of different sample identifiers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANG, KUIHE et al.: "Model Optimization Method Based on Vertical Federated Learning", 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1-5 *
HE, Wen; BAI, Hanru; LI, Chao: "Discussion on Enterprise Data Sharing Based on Federated Learning", Information & Computer (Theory Edition) (信息与电脑(理论版)), no. 08, pages 177-180 *

Also Published As

Publication number Publication date
CN117034000B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
US11038679B2 (en) Secure multi-party computation method and apparatus, and electronic device
CN113159327B (en) Model training method and device based on federal learning system and electronic equipment
CN109002861B (en) Federal modeling method, device and storage medium
US20200068394A1 (en) Authentication of phone caller identity
CN110610093B (en) Methods, systems, and media for distributed training in parameter data sets
CN110874646B (en) Exception handling method and device for federated learning and electronic equipment
CN110826420B (en) Training method and device of face recognition model
KR20150123823A (en) Privacy-preserving ridge regression using masks
CN111428887B (en) Model training control method, device and system based on multiple computing nodes
CN110912682B (en) Data processing method, device and system
CN111931241B (en) Linear regression feature significance testing method and device based on privacy protection
CN108306891B (en) Method, apparatus and system for performing machine learning using data to be exchanged
CN112818369A (en) Combined modeling method and device
CN112929349A (en) Method and device for sharing private data based on block chain and electronic equipment
CN110213202A (en) Mark encryption matching process and device, identification processing method and device
CN112668016B (en) Model training method and device and electronic equipment
CN117034000B (en) Modeling method and device for longitudinal federal learning, storage medium and electronic equipment
CN114726524B (en) Target data sorting method and device, electronic equipment and storage medium
CN113254989B (en) Fusion method and device of target data and server
CN115277031B (en) Data processing method and device
CN112019642B (en) Audio uploading method, device, equipment and storage medium
CN112506881B (en) Method and device for processing bid evaluation expert information based on block chain
CN114117428A (en) Method and device for generating detection model
CN112434064A (en) Data processing method, device, medium and electronic equipment
Narayana et al. Medical image cryptanalysis using adaptive, lightweight neural network based algorithm for IoT based secured cloud storage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant