CN114880383A - Data alignment method, system and related equipment in multi-party federal learning - Google Patents

Info

Publication number
CN114880383A
CN114880383A
Authority
CN
China
Prior art keywords
data
self
model
vector
ids
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210801464.7A
Other languages
Chinese (zh)
Inventor
黄一珉
王湾湾
何浩
姚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dongjian Intelligent Technology Co ltd
Original Assignee
Shenzhen Dongjian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dongjian Intelligent Technology Co ltd filed Critical Shenzhen Dongjian Intelligent Technology Co ltd
Priority to CN202210801464.7A
Publication of CN114880383A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the application discloses a data alignment method, system and related equipment in multi-party federal learning. The system comprises a model initiator and k data providers, and the method comprises the following steps: determining, by the model initiator, the ID intersection between the model initiator and each of the k data providers to obtain k ID intersections; sending the model initiator's local serial numbers corresponding to the IDs in the k ID intersections to the respective data providers; generating, by each data provider, a self-increasing sequence ID to obtain k first self-increasing sequence IDs; performing a replacement operation on each first self-increasing sequence ID according to the received local serial numbers to obtain k second self-increasing sequence IDs; generating marker vectors according to the k second self-increasing sequence IDs; determining a Boolean vector according to the marker vectors; and determining, by each data provider according to the Boolean vector, the reserved samples and corresponding features for that data provider. By adopting the embodiment of the application, the data value can be improved.

Description

Data alignment method, system and related equipment in multi-party federal learning
Technical Field
The application relates to the technical field of privacy computation and the technical field of computers, in particular to a data alignment method, a system and related equipment in multi-party federal learning.
Background
With the development of artificial intelligence, the value of data is receiving more and more attention. Data from different fields are often highly complementary, and different organizations therefore have a strong need for data fusion. However, because of factors such as privacy protection, self-interest and regulatory policy, it is difficult to aggregate data directly across organizations. This data-island problem poses a significant challenge to artificial intelligence researchers.
In recent years, academia and industry have begun to use federal learning to address such problems. Federal learning aims to improve the practical effect of AI models by applying techniques such as cryptography and machine learning, on the premise of ensuring data privacy, security and legal compliance. In vertical (longitudinal) federal learning, as the number of data providers participating in modeling increases, the intersection of samples usually becomes smaller and smaller during repeated data intersection, because the sample users differ among the data providers. In addition, because the label data of the model initiator is scarce, expanding the feature dimension comes at the cost of shrinking the sample dimension, which wastes a great deal of data value; how to improve the data value is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a data alignment method, a system and related equipment in multi-party federal learning, and the data value can be improved.
In a first aspect, an embodiment of the present application provides a data alignment method in multi-party federal learning, which is applied to a multi-party computing system, where the multi-party computing system includes a model initiator and k data providers, the model initiator is a party with a tag, and k is a positive integer, and the method includes:
determining, by the model initiator, an ID intersection between the model initiator and each of the k data providers to obtain k ID intersections;
sending, by the model initiator, the model initiator's local serial numbers corresponding to the IDs in each of the k ID intersections to the respective data providers;
generating, by each data provider, a self-increasing sequence ID to obtain k first self-increasing sequence IDs;
performing, by each data provider, a replacement operation on its first self-increasing sequence ID according to the received local serial numbers of the model initiator to obtain k second self-increasing sequence IDs;
generating, by each data provider, a marker vector from the k second self-increasing sequence IDs;
determining, by the model initiator, a Boolean vector from the marker vectors;
and determining, by each data provider according to the Boolean vector, the reserved samples corresponding to that data provider and their corresponding features.
In a second aspect, embodiments of the present application provide a multi-party computing system, which includes a model initiator and k data providers, where the model initiator is a tagged party and k is a positive integer, and where,
the model initiator is used for determining the ID intersection between the model initiator and each data provider in the k data providers to obtain k ID intersections; respectively sending the local serial numbers of the model initiators corresponding to the IDs in the k ID intersections to each data provider;
each data provider is used for generating self-increasing sequence IDs to obtain k first self-increasing sequence IDs; replacing the corresponding first self-increasing sequence IDs according to the local sequence numbers of the respective model initiators to obtain k second self-increasing sequence IDs; and generating a marker vector from the k second self-increasing IDs;
the model initiator is used for determining a Boolean vector according to the mark vector;
and each data provider is used for determining the reserved sample corresponding to each data provider and the corresponding characteristics of the reserved sample according to the Boolean vector.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing the steps in the first aspect of the embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program makes a computer perform some or all of the steps described in the first aspect of the embodiment of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
The embodiment of the application has the following beneficial effects:
it can be seen that the data alignment method, system and related device in multi-party federal learning described in the embodiments of the present application are applied to a multi-party computing system, where the multi-party computing system includes a model initiator and k data providers, the model initiator is a party with a label, k is a positive integer, and an ID intersection between the model initiator and each data provider in the k data providers is determined by the model initiator to obtain k ID intersections; respectively sending the local serial numbers of the model initiators corresponding to the IDs in the k ID intersections to each data provider through the model initiators; generating self-increasing sequence IDs through each data provider to obtain k first self-increasing sequence IDs; replacing the corresponding first self-increasing sequence ID by each data provider according to the local serial number of each model initiator to obtain k second self-increasing sequence IDs; generating a tag vector according to the k second self-increasing IDs by each data provider; determining a Boolean vector according to the marked vector by the model initiator; and determining the reserved sample and the corresponding characteristics of each data provider by each data provider according to the Boolean vector, and abstracting a Boolean vector according to the alignment result of the model initiator and each data provider. Based on the Boolean vector, the final model entering sample and the corresponding characteristics thereof can be flexibly calculated, when the method is applied to the federal learning, the problem that the number of samples is obviously reduced after data alignment when the method is applied to multi-party combined modeling can be well solved, the model effect of the federal learning in practical application can be greatly improved, and the data value is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a structural diagram of federated learning of a two-party computing system provided by an embodiment of the present application;
FIG. 2 is an architectural diagram of a multi-party computing system for implementing a data alignment method in multi-party federated learning provided by an embodiment of the present application;
FIG. 3 is a flow chart illustrating a data alignment method in multi-party federated learning according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a data alignment method in multi-party federated learning according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The computing node described in this embodiment of the application may be an electronic device, and the electronic device may include a smart Phone (e.g., an Android Phone, an iOS Phone, a Windows Phone, etc.), a tablet computer, a palm computer, a vehicle data recorder, a server, a notebook computer, a Mobile Internet device (MID, Mobile Internet Devices), or a wearable device (e.g., a smart watch, a bluetooth headset), which are merely examples, but are not exhaustive, and include but are not limited to the foregoing electronic device, and the electronic device may also be a cloud server, or the electronic device may also be a computer cluster. In the embodiment of the application, both the result side and the sender side can be the electronic device.
In the embodiment of the present application, a sample may be understood as one row of data in the raw data. Each sample contains an ID, features and a label. The identification (ID) is a unique identifier of each row of data and is different for every row, for example an identity card number, a mobile phone number or a self-increasing serial number. The label is the object of modeling, i.e. the result that the sample is used to predict, for example one borrower (identified by an identity card number) has a label of default while another borrower has a label of no default. In other words, a sample comprises an ID and a label, and the label is a description of the ID and is the object of one round of modeling.
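As a minimal illustration of this structure (the dictionary layout and field values below are hypothetical and only sketch the idea, they are not prescribed by the application), one row of raw data could be represented in Python as:

    # Illustrative only: a toy "sample" with the three parts described above.
    sample = {
        "id": "13800000001",                       # unique identifier, e.g. a mobile phone number
        "features": {"age": 35, "income": 12000},  # feature columns
        "label": 1,                                # modeling target, e.g. default / no default
    }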
The data alignment in the embodiment of the application refers to "sample" alignment, that is, the IDs are aligned, sample intersections of the participants are obtained after the IDs are aligned, and corresponding features and labels are taken out according to the intersections for modeling.
The following describes embodiments of the present application in detail.
In the related art, sample alignment in vertical federal learning is usually performed between organizations using what is academically called private set intersection (PSI). PSI is a secure multi-party computation technique based on cryptography that allows participants each holding a set to compare encrypted versions of their sample IDs and compute the intersection. Each participant only obtains the final correct intersection and learns nothing about the other sets outside the intersection. After sample alignment, each participant can perform subsequent federal feature engineering and federal learning model training based on the aligned IDs, and finally obtain a federal learning model. As shown in fig. 1, which illustrates the flow of two-party federal learning modeling, the model initiator extracts sample features and labels and goes through sample alignment, feature engineering, model training and model output, while the data provider supplies sample features and likewise goes through sample alignment, feature engineering, model training and model output.
However, as the number of data providers increases, the number of samples remaining after multi-party sample alignment may become small, so that much of the data outside the intersection cannot be used and adding data sources may actually reduce the model effect. On the one hand, the information outside the intersection is usually important, and in particular the label data of the model initiator is usually precious (scarce or requiring a large amount of manual labeling); on the other hand, when the intersection is small, the effect and generalization ability of the trained model are not ideal. Faced with this problem, modelers usually relax the intersection condition, for example modeling with a sample set of (FA) ∪ (FB) ∪ (FC), where FA, FB and FC are three different sample sets and ∪ denotes the union; however, existing federal learning solutions cannot support flexibly configuring the modeling data set in this way.
Based on the above-mentioned drawbacks of the related art, please refer to fig. 2, and fig. 2 is a schematic structural diagram of a multi-party computing system for implementing a data alignment method in multi-party federal learning according to an embodiment of the present application, as shown in the drawing, the multi-party computing system includes a model initiator and at least one data provider (e.g., k data providers), where the model initiator is a party with a label, and k is a positive integer, and based on the multi-party computing system, the following functions may be implemented:
the model initiator is used for determining the ID intersection between the model initiator and each data provider in the k data providers to obtain k ID intersections; respectively sending the local serial numbers of the model initiators corresponding to the IDs in the k ID intersections to each data provider;
each data provider is used for generating self-increasing sequence IDs to obtain k first self-increasing sequence IDs; performing replacement operation on the corresponding first self-increasing sequence ID according to the local sequence number of each model initiator to obtain k second self-increasing sequence IDs; and generating a tag vector from the k second self-increasing IDs;
the model initiator is used for determining a Boolean vector according to the mark vector;
and each data provider is used for determining the reserved sample corresponding to each data provider and the corresponding characteristics of the reserved sample according to the Boolean vector.
Optionally, the system is further specifically configured to:
configuring the number of participants on the minimum alignment required, wherein the number of the participants is less than or equal to k;
in the aspect of determining, by the model initiator, a boolean vector from the token vector, the method includes:
summing the marked vectors by the model initiator to obtain a sum vector;
and determining the Boolean vector according to the sum vector and the number of the participants.
Optionally, the obtaining k second self-increasing sequence IDs by performing, by the data providers, a replacement operation on the corresponding first self-increasing sequence ID according to the local sequence number of the respective model initiator includes:
and replacing the corresponding position sequence in the corresponding first self-increasing sequence ID by elements in the corresponding ID intersection through each data provider according to the local sequence number of each model initiator, and keeping other positions unchanged to obtain the k second self-increasing sequence IDs.
Optionally, the system is further specifically configured to:
determining the data quality evaluation value of each data provider to obtain k data quality evaluation values;
and when the target data quality evaluation value is smaller than a preset threshold value, setting a flag vector of at least one data provider except the target data provider corresponding to the target data quality evaluation value to be 1, and limiting the Boolean vector, wherein the target data quality evaluation value is any one of the k data quality evaluation values.
Optionally, the system is further specifically configured to:
filling, by the respective data providers, the features of the non-real IDs among the k second self-increasing sequence IDs for which the Boolean vector is True with null values.
Optionally, the system is further specifically configured to:
synchronously sending the local sample number of the model initiator to each data provider;
the obtaining k first self-increasing sequence IDs by generating the self-increasing sequence IDs by the respective data providers includes:
and generating self-increasing sequence IDs by the data providers according to the sample number to obtain k first self-increasing sequence IDs.
For federal learning, the embodiment of the present application provides a flexible data alignment mode in the federal learning modeling process. By configuring the minimum number of participants that need to be aligned, the sample set finally used by the initiator for modeling can be calculated flexibly, and the sample features of participants that are not in the sample set are filled with null values, so that a complete multi-party federal learning model can be established. In the prediction stage, the features of prediction samples at each party are obtained through the same filling logic and fed into the model to complete prediction.
The embodiment of the present application can well solve the problem that the number of samples drops significantly after multi-party data alignment, and greatly improves the practical effect of federal learning. At the same time, the final modeling sample set can be configured flexibly, which significantly improves the usability of federal learning.
Referring to fig. 3, fig. 3 is a schematic flowchart of a data alignment method in multi-party federal learning according to an embodiment of the present application, and is applied to the multi-party computing system shown in fig. 2, where the multi-party computing system includes a model initiator and k data providers, the model initiator is a party with a label, and k is a positive integer, and as shown in the figure, the data alignment method in multi-party federal learning includes:
301. and determining the ID intersection between the model initiator and each data provider in the k data providers by the model initiator to obtain k ID intersections.
In a specific implementation, k is a positive integer, for example, k may be 1, or k may be 2, or k may be 3, and so on. Specifically, the ID intersection between the model initiator and each of the k data providers may be determined according to the PSI algorithm, so as to obtain k ID intersections.
Specifically, for example, assuming that F is the model initiator and A, B and C are the data providers, the ID intersections of F with A, F with B and F with C can be calculated by the PSI technique and denoted I_FA, I_FB and I_FC, respectively.
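For illustration only, the following Python sketch computes the k ID intersections on toy data. A plain-text set intersection stands in for the PSI protocol (real PSI compares encrypted IDs so that nothing outside the intersection is revealed), all IDs shown are hypothetical, and the intersections dict plays the role of I_FA, I_FB and I_FC.

    # Toy data: F is the model initiator, A/B/C are data providers (hypothetical IDs).
    F_ids = ["u1", "u2", "u3", "u4", "u5"]
    provider_ids = {
        "A": ["u2", "u3", "u9"],
        "B": ["u1", "u4", "u8"],
        "C": ["u3", "u5", "u7"],
    }

    # Stand-in for PSI: each intersection keeps F's local ordering.
    intersections = {p: [i for i in F_ids if i in set(ids)]
                     for p, ids in provider_ids.items()}
    # intersections == {"A": ["u2", "u3"], "B": ["u1", "u4"], "C": ["u3", "u5"]}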
For example, in a two-party scenario, a bank holds financial data for a group of people together with a label indicating whether each person has defaulted, and an e-commerce platform holds the online shopping data of the same group. The bank can then build a machine learning model that combines its own data with the e-commerce platform's data to evaluate user credit; in this process, the two parties need to align their users before using both parties' data for modeling.
As another example, in a three-party scenario, a bank holds financial data for a group of people together with default labels, an e-commerce platform holds the online shopping data of this group, and a mobile phone manufacturer holds user data of this group, for example APP activity levels. The bank can build a machine learning model that combines its own data with the data of the e-commerce platform and the mobile phone manufacturer to evaluate user credit; in this process, the three parties need to align their users before using their respective data for modeling.
302. And respectively sending the local serial numbers of the model initiators corresponding to the IDs in the k ID intersections to each data provider through the model initiators.
In a specific implementation, the model initiator may send the local serial numbers of the model initiators corresponding to the IDs in the k ID intersections to each data provider, respectively.
Specifically, the model initiator F may send the model initiator's local serial numbers corresponding to the IDs in the intersections I_FA, I_FB and I_FC, denoted S_A, S_B and S_C, to A, B and C, respectively.
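Continuing the toy data above, a sketch of this step: the local serial number of an intersection ID is taken here to be its 1-based position in F's local sample order (an assumption for illustration; the application only requires that F can send such positions), and the serials dict plays the role of S_A, S_B and S_C.

    # Position of every local ID in F's sample order (1-based).
    F_index = {id_: pos for pos, id_ in enumerate(F_ids, start=1)}

    # Serial numbers sent to the respective providers.
    serials = {p: [F_index[i] for i in inter] for p, inter in intersections.items()}
    # serials == {"A": [2, 3], "B": [1, 4], "C": [3, 5]}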
303. And generating self-increasing sequence IDs by the data providers to obtain k first self-increasing sequence IDs.
In a specific implementation, each data provider may obtain the sample number, which may be written as fn, generate a self-increasing sequence ID based on it, and thereby obtain k first self-increasing sequence IDs.
For example, the three parties A, B and C may each generate a self-increasing sequence ID from 1 to fn, i.e. (1, 2, 3, …, fn), denoted ID_inc.
Optionally, the method may further include the following steps:
synchronously sending the local sample number of the model initiator to each data provider;
the step 303, generating self-increment IDs by the data providers to obtain k first self-increment IDs, may be implemented as follows:
and generating self-increasing sequence IDs by the data providers according to the sample number to obtain k first self-increasing sequence IDs.
In a specific implementation, the model initiator may synchronize the local sample number fn to each data provider, and of course, each data provider may already know the sample number of the model initiator in most PSI algorithm execution processes. Further, k first self-increasing sequence IDs may be obtained by each data provider generating a self-increasing sequence ID according to the number of samples.
304. And replacing the corresponding first self-increasing sequence ID by each data provider according to the local sequence number of each model initiator to obtain k second self-increasing sequence IDs.
In a specific implementation, according to the ID intersection corresponding to each data provider, the elements at the positions indicated by the model initiator's local serial numbers in the corresponding first self-increasing sequence ID are replaced with the elements in that ID intersection, so as to obtain the k second self-increasing sequence IDs.
Optionally, in step 304, the data providers perform a replacement operation on the corresponding first self-increasing sequential IDs according to the local serial numbers of the respective model initiators to obtain k second self-increasing sequential IDs, which may be implemented as follows:
and replacing the corresponding position sequence in the corresponding first self-increasing sequence ID by elements in the corresponding ID intersection through each data provider according to the local sequence number of each model initiator, and keeping other positions unchanged to obtain the k second self-increasing sequence IDs.
In a specific implementation, each data provider can replace the positions in its first self-increasing sequence ID indicated by the model initiator's local serial numbers with the elements in the corresponding ID intersection, keeping the other positions unchanged, to obtain the k second self-increasing sequence IDs. In the federal learning algorithm, once the real sample IDs have been substituted in, the corresponding features and labels can be indexed by the real sample ID for subsequent federal learning modeling, which improves federal learning efficiency; if a position held only a plain number, the corresponding features would be difficult to find and federal learning would be slowed down.
For example, parties A, B and C may each, according to the received serial numbers S_A, S_B and S_C, replace the positions of ID_inc indicated by those serial numbers with the elements of I_FA, I_FB and I_FC respectively, keeping the remaining positions unchanged. The three parties thereby generate new sample IDs, denoted ID_A, ID_B and ID_C. These are characterized in that their size is fn, the values at positions aligned with the initiator's IDs are each party's local real IDs, and the values at unaligned positions are the corresponding values of the self-increasing sequence ID ID_inc.
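A sketch of steps 303 and 304 under the same toy data (the helper name build_new_ids is ours, not from the application): each provider builds the self-increasing sequence (1, 2, …, fn) and overwrites the positions named by the received serial numbers with the real IDs from its intersection.

    fn = len(F_ids)  # the initiator's sample number, here 5

    def build_new_ids(serial_numbers, intersection, fn):
        new_ids = list(range(1, fn + 1))      # first self-increasing sequence ID
        for pos, real_id in zip(serial_numbers, intersection):
            new_ids[pos - 1] = real_id        # replace aligned positions, keep the rest
        return new_ids                        # second self-increasing sequence ID

    ID_A = build_new_ids(serials["A"], intersections["A"], fn)
    ID_B = build_new_ids(serials["B"], intersections["B"], fn)
    ID_C = build_new_ids(serials["C"], intersections["C"], fn)
    # ID_A == [1, "u2", "u3", 4, 5]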
305. Generating, by the respective data provider, a marker vector from the k second self-increasing sequence IDs.
In a specific implementation, each data provider may generate a column of marker vectors according to its generated new ID column. The value at an aligned ID position is 1, and the value at an unaligned position is 0.
In a specific implementation, the value at an aligned sample position is 1 and the value at an unaligned position is 0; that is, a position where a replacement was made is 1 and a position where no replacement was made is 0. For example, the three parties A, B and C each generate a column of marker vectors, denoted M_A, M_B and M_C, from the generated new ID columns. In a marker vector, the value at an aligned ID position (where ID_A, ID_B or ID_C holds a local real ID, i.e. a locally reserved sample) is 1, and the value at an unaligned position (where ID_A, ID_B or ID_C holds a value of ID_inc) is 0. Each data provider then sends its marker vector to the model initiator.
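A sketch of step 305 on the same toy data: a position is marked 1 when it was replaced with a real local ID, and 0 when it still holds the self-increasing value (here the replaced values happen to be strings, so a simple type test suffices; in general the provider just remembers which positions it replaced).

    def build_marker_vector(new_ids):
        # 1 where the position holds a real (replaced) ID, 0 otherwise.
        return [0 if isinstance(v, int) else 1 for v in new_ids]

    M_A = build_marker_vector(ID_A)   # [0, 1, 1, 0, 0]
    M_B = build_marker_vector(ID_B)   # [1, 0, 0, 1, 0]
    M_C = build_marker_vector(ID_C)   # [0, 0, 1, 0, 1]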
306. Determining, by the model initiator, a Boolean vector from the marker vectors.
In a specific implementation, the model initiator may sum the marker vectors to obtain a sum vector, and determine the Boolean vector from the sum vector and the configured number of participants. A Boolean vector may also be referred to as a bool vector.
Optionally, the method may further include the following steps:
configuring the number of participants on the minimum alignment required, wherein the number of the participants is less than or equal to k;
the step 306, determining, by the model initiator, a boolean vector according to the token vector, may include the following steps:
61. summing the marked vectors by the model initiator to obtain a sum vector;
62. and determining the Boolean vector according to the sum vector and the number of the participants.
In this embodiment, the minimum number of participants that need to be aligned may be configured as a, where a may range from 0 to the number of data providers, for example a may be 3. Taking the three data providers A, B and C as an example, the model initiator may sum the received marker vectors M_A, M_B and M_C and compare the result with a to obtain the Boolean vector align: a position whose summed value is greater than or equal to a takes the value True, otherwise it takes the value False. The initiator then synchronizes the Boolean vector align to the other participants A, B and C.
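A sketch of step 306 on the toy data, with the minimum number of aligned participants configured as a = 1 purely for illustration (any value in 0..3 works the same way):

    a = 1  # minimum number of data providers that must be aligned on a position

    sum_vector = [sum(vals) for vals in zip(M_A, M_B, M_C)]   # [1, 1, 2, 1, 1]
    align = [s >= a for s in sum_vector]                      # Boolean vector
    # with a = 1: [True, True, True, True, True]
    # with a = 2: [False, False, True, False, False]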
307. Determining, by each data provider according to the Boolean vector, the reserved samples corresponding to that data provider and their corresponding features.
In a specific implementation, each data provider can determine, according to the Boolean vector, the reserved samples corresponding to itself and their corresponding features, so that the features and labels can be selected flexibly for model training.
According to the method and the device, the problem that the number of samples is obviously reduced after data alignment during multi-party combined modeling can be well solved, the using effect of federal learning in practical application is greatly improved, and further, when the method and the device are applied to federal learning, a final modeling sample set can be flexibly configured, and the usability of federal learning is obviously improved.
By way of example, each participant (F, A, B, C) locally selects, based on the Boolean vector align, the real IDs corresponding to the positions where align is True and the features corresponding to those real IDs. In addition, the model initiator F needs to select the label data corresponding to the real IDs.
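A sketch of step 307, continuing the toy data: every party keeps only the positions where align is True, and the initiator F additionally keeps the labels at those positions (the helper name select_retained is ours, for illustration only).

    def select_retained(column, align_vector):
        # Keep the entries whose position is True in the Boolean vector.
        return [v for v, keep in zip(column, align_vector) if keep]

    retained_ids_A = select_retained(ID_A, align)
    # provider A then looks up its local features for the real IDs in retained_ids_A;
    # F selects its own features and labels for the same positions in its local order.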
Optionally, the method may further include the following steps:
a1, determining the data quality evaluation value of each data provider to obtain k data quality evaluation values;
and A2, when the target data quality evaluation value is smaller than a preset threshold value, setting the flag vector of at least one data provider except the target data provider corresponding to the target data quality evaluation value to be 1, and limiting the Boolean vector, wherein the target data quality evaluation value is any one of the k data quality evaluation values.
Wherein, the preset threshold value can be preset or default by the system.
In the embodiment of the present application, data quality evaluation may cover many dimensions and may generally be determined empirically, for example the feature contribution of a data provider's data in past modeling, the model effect, and the public credibility of the data provider. Of course, quantitative indicators may also be used, for example requiring that the feature count of the data exceeds a certain dimension (e.g., 200), that the number of users exceeds a certain quantity (e.g., 100 million), and that the missing-value proportion of the data is below a certain threshold (e.g., 50%).
In a specific implementation, the data quality evaluation value of each data provider may be determined, and k data quality evaluation values are obtained, and when the target data quality evaluation value is smaller than a preset threshold, the flag vector of at least one data provider other than the target data provider corresponding to the target data quality evaluation value may be set to 1, and a boolean vector may be restricted, where the target data quality evaluation value is any one of the k data quality evaluation values.
In a specific implementation, if the data quality of some data providers is considered to be poor, a model built only on the features of such a data source cannot meet the requirements, and the Boolean vector align described above can be restricted further. For example, if the quality of party C is considered poor and a is 1, then in addition to requiring align to be True, it is further required that at least one of the marker vectors M_A and M_B is True at that position, i.e. at least one high-quality data provider's marker vector has the value True. In other words, a logical OR is performed over the marker vectors of the high-quality data providers, in order to ensure that every sample that is finally retained has features from a high-quality data provider.
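A sketch of this optional restriction on the toy data, assuming C is judged to be the low-quality provider: in addition to align, a kept position must have the marker of at least one high-quality provider (A or B) set, i.e. a logical OR over M_A and M_B.

    align_restricted = [keep and (ma == 1 or mb == 1)
                        for keep, ma, mb in zip(align, M_A, M_B)]
    # with a = 1: [True, True, True, True, False] -- the last sample is dropped
    # because only C (the low-quality provider) holds features for it.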
Optionally, the method may further include the following steps:
filling, by the respective data providers, the features of the non-real IDs among the k second self-increasing sequence IDs for which the Boolean vector is True with null values.
In a specific implementation, taking data providers A, B and C as an example, for the positions where the Boolean vector is True but the corresponding value of ID_A, ID_B or ID_C is not a real ID, the data provider fills the features with null values.
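A sketch of the null-value filling on the toy data (the helper name fill_features and the feature values are assumptions for illustration): for retained positions whose new ID is not a real local ID, the provider has no features for that row and fills them with nan.

    import math

    def fill_features(new_ids, align_vector, local_features, n_cols):
        rows = []
        for v, keep in zip(new_ids, align_vector):
            if not keep:
                continue                                  # position not retained at all
            if isinstance(v, int):                        # retained but not a real local ID
                rows.append([math.nan] * n_cols)          # null-filled feature row
            else:
                rows.append(local_features[v])            # real features for this ID
        return rows

    rows_A = fill_features(ID_A, align, {"u2": [35, 1.0], "u3": [42, 0.5]}, n_cols=2)
    # rows_A == [[nan, nan], [35, 1.0], [42, 0.5], [nan, nan], [nan, nan]]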
In the embodiment of the application, in the longitudinal federal learning, as the number of data providers participating in modeling increases, the samples at the intersection are generally less and less in the process of constantly performing data intersection because of sample user differences among the data providers. In addition, due to the fact that the tag data of the model initiator is precious, the characteristic dimension is expanded, the sample dimension is reduced, and great data value waste is brought.
Aiming at the problem, the embodiment of the application provides a flexible data alignment mode, a sample set finally used for modeling by a model initiator can be flexibly calculated, the dimensionality is expanded, modeling samples are reserved as far as possible, the existing data value is utilized to the maximum extent, and the actual application effect of federal learning is improved.
The specific implementation process comprises the following steps:
for example, as shown in fig. 4, taking a model initiator F (a tagged party), and data providers a, B, and C as examples, where XF, XA, XB, and XC represent different features, Y represents a tag, and nan represents a null value, the implementation flow is as follows:
1. configuring the number of the participants needing to be aligned at least as a (the value range is from 0 to the number of the data providers, which is 3 in this example);
2. Using the conventional PSI technique, calculate the ID intersections of F with A, F with B and F with C, recorded as I_FA, I_FB and I_FC respectively;
3. Initiator F sends the initiator's local serial numbers corresponding to the IDs in the intersections I_FA, I_FB and I_FC, i.e. S_A, S_B and S_C, to A, B and C respectively;
4. Optionally, the initiator F also synchronizes its local sample number fn to A, B and C; this step is optional because A, B and C usually already learn the sample number of F during most PSI algorithm executions;
5. A, B and C each locally generate a self-increasing sequence ID from 1 to fn, i.e. (1, 2, 3, …, fn), denoted ID_inc;
6. According to the received serial numbers S_A, S_B and S_C, the three parties A, B and C replace the positions of ID_inc indicated by those serial numbers with the elements of I_FA, I_FB and I_FC respectively, keeping the remaining positions unchanged; the three parties thereby generate new sample IDs, recorded as ID_A, ID_B and ID_C. These are characterized in that their size is fn, the ID values at positions aligned with the initiator's IDs are each party's local real IDs, and the values at unaligned positions are the corresponding values of the self-increasing sequence ID ID_inc;
7. The three parties A, B and C each generate a column of marker vectors, denoted M_A, M_B and M_C, from the generated new ID columns. In a marker vector, the value at an aligned ID position (where ID_A, ID_B or ID_C holds a local real ID) is 1, and the value at an unaligned position (where it holds a value of ID_inc) is 0; each party then sends its marker vector to initiator F;
8. The initiator sums the received marker vectors M_A, M_B and M_C and compares the result with the a configured in step 1 to obtain the Boolean vector align: a position whose summed value is greater than or equal to a takes the value True, otherwise it takes the value False. The initiator synchronizes the Boolean vector align to the other participants A, B and C;
9. Each participant (F, A, B, C) locally selects, based on the Boolean vector align, the real IDs corresponding to the positions where align is True and the features corresponding to those real IDs; in addition, the initiator F selects the label data corresponding to the real IDs;
10. Optionally, if the data quality of some data providers is considered poor and a model built only on the features of such a data source cannot meet the requirements, the Boolean vector align can be restricted further: for example, if the quality of party C is considered poor and a is 1, then in addition to requiring align to be True, it is further required that at least one of M_A and M_B is 1 at that position;
11. For the positions where the Boolean vector is True but the corresponding value of ID_A, ID_B or ID_C is not a real ID, the data providers (A, B, C) fill the features with null values;
12. based on the aligned feature data of each participant, subsequent federal feature engineering, modeling of a federal learning algorithm and the like can be carried out, and after modeling is completed, the feature data can be directly used for federal prediction.
In a specific implementation, in the configuration of step 1, when a is configured as 0 the procedure reduces to local modeling by the model initiator F, and when a is configured as 3 it is traditional four-party federal learning modeling. The number of samples that finally enter the model can be selected flexibly through the configuration a, reducing the waste of data value.
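On the toy data used above, the trade-off controlled by a can be seen directly (the counts follow from sum_vector = [1, 1, 2, 1, 1]):

    # Number of samples entering the model for each configuration of a.
    for a_cfg in (0, 1, 2, 3):
        retained = sum(s >= a_cfg for s in sum_vector)
        print(a_cfg, retained)
    # a = 0 -> 5 samples (local modeling by F alone)
    # a = 1 -> 5, a = 2 -> 1, a = 3 -> 0 (the full intersection of all parties is empty here)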
To sum up, the embodiment of the present application designs a data alignment method that only requires configuring the minimum number of participants that need to be aligned on the model initiator's sample IDs, and abstracts a Boolean vector from the alignment result between the model initiator and each data provider. Based on the Boolean vector, the samples that finally enter the model and their corresponding features can be calculated flexibly. The embodiment of the present application can well solve the problem that the number of samples drops significantly after data alignment in multi-party joint modeling, and can greatly improve the model effect of federal learning in practical applications.
For example, in a two-dimensional data set, the columns can generally be divided into three types: the sample ID (ID in fig. 4, different for every person, for example an identity card number or mobile phone number), the features corresponding to each sample (XF, XA, XB in the figure, for example age, gender or income), and the label corresponding to the sample (Y in the figure, i.e. the modeling target value, for example whether the user defaults, is interested, or makes a purchase; this column is available only to the initiator).
It can be seen that the data alignment method in multi-party federal learning described in the embodiment of the present application is applied to a multi-party computing system comprising a model initiator and k data providers, where the model initiator is the party with labels and k is a positive integer. The model initiator determines the ID intersection between itself and each of the k data providers to obtain k ID intersections, and sends the model initiator's local serial numbers corresponding to the IDs in each ID intersection to the respective data providers. Each data provider generates a self-increasing sequence ID to obtain k first self-increasing sequence IDs, performs a replacement operation on its first self-increasing sequence ID according to the received local serial numbers to obtain k second self-increasing sequence IDs, and generates a marker vector from its second self-increasing sequence ID. The model initiator determines a Boolean vector from the marker vectors, and each data provider determines, according to the Boolean vector, the reserved samples and corresponding features for that data provider; that is, a Boolean vector is abstracted from the alignment result between the model initiator and each data provider. Based on the Boolean vector, the samples that finally enter the model and their corresponding features can be calculated flexibly. When applied to federal learning, this well solves the problem that the number of samples drops significantly after data alignment in multi-party joint modeling, greatly improves the model effect of federal learning in practical applications, and thereby improves the data value.
In keeping with the above embodiments, please refer to fig. 5, where fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, and as shown in the drawing, the electronic device includes a processor, a memory, a communication interface, and one or more programs, and is used in a multi-party computing system, the multi-party computing system includes a model initiator and k data providers, the model initiator is a tagged party, k is a positive integer, and the one or more programs are stored in the memory and configured to be executed by the processor, where in an embodiment of the present application, the programs include instructions for performing the following steps:
determining, by the model initiator, an ID intersection between the model initiator and each of the k data providers to obtain k ID intersections;
sending, by the model initiator, the model initiator's local serial numbers corresponding to the IDs in each of the k ID intersections to the respective data providers;
generating, by each data provider, a self-increasing sequence ID to obtain k first self-increasing sequence IDs;
performing, by each data provider, a replacement operation on its first self-increasing sequence ID according to the received local serial numbers of the model initiator to obtain k second self-increasing sequence IDs;
generating, by each data provider, a marker vector from the k second self-increasing sequence IDs;
determining, by the model initiator, a Boolean vector from the marker vectors;
and determining, by each data provider according to the Boolean vector, the reserved samples corresponding to that data provider and their corresponding features.
Optionally, the program further includes instructions for performing the following steps:
configuring the number of participants on the minimum alignment required, wherein the number of the participants is less than or equal to k;
in said determining, by said model initiator, a boolean vector from said token vector, the above program comprising instructions for performing the steps of:
summing the marked vectors by the model initiator to obtain a sum vector;
and determining the Boolean vector according to the sum vector and the number of the participants.
Optionally, in the aspect that the k second self-increasing sequential IDs are obtained by performing, by the data providers, a replacement operation on the corresponding first self-increasing sequential IDs according to the local serial numbers of the respective model initiators, the program includes instructions for executing the following steps:
and replacing, by each data provider according to the model initiator's local serial numbers, the corresponding positions in its first self-increasing sequence ID with the elements in the corresponding ID intersection, keeping the other positions unchanged, to obtain the k second self-increasing sequence IDs.
Optionally, the program further includes instructions for performing the following steps:
determining the data quality evaluation value of each data provider to obtain k data quality evaluation values;
and when the target data quality evaluation value is smaller than a preset threshold value, setting a flag vector of at least one data provider except the target data provider corresponding to the target data quality evaluation value to be 1, and limiting the Boolean vector, wherein the target data quality evaluation value is any one of the k data quality evaluation values.
Optionally, the program further includes instructions for performing the following steps:
filling, by the respective data providers, the features of the non-real IDs among the k second self-increasing sequence IDs for which the Boolean vector is True with null values.
Optionally, the program further includes instructions for performing the following steps:
synchronously sending the local sample number of the model initiator to each data provider;
in the generating of the self-incrementing IDs by the respective data providers to obtain the k first self-incrementing IDs, the program includes instructions for:
and generating self-increasing sequence IDs by the data providers according to the sample number to obtain k first self-increasing sequence IDs.
Embodiments of the present application further provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enables a computer to execute part or all of the steps of any one of the methods as described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-mentioned method of the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A data alignment method in multi-party federal learning, characterized in that the method is applied to a multi-party computing system, the multi-party computing system comprises a model initiator and k data providers, the model initiator is the party holding labels, and k is a positive integer; the method comprises the following steps:
determining, by the model initiator, the ID intersection between the model initiator and each of the k data providers to obtain k ID intersections;
sending, by the model initiator, to each data provider the model initiator's local serial numbers corresponding to the IDs in the respective ID intersection;
generating, by each data provider, a self-increasing sequence ID to obtain k first self-increasing sequence IDs;
performing, by each data provider, a replacement operation on the corresponding first self-increasing sequence ID according to the model initiator's local serial numbers to obtain k second self-increasing sequence IDs;
generating, by each data provider, a marker vector according to the k second self-increasing sequence IDs;
determining, by the model initiator, a Boolean vector according to the marker vectors;
and determining, by each data provider according to the Boolean vector, the reserved samples corresponding to that data provider and their corresponding features.
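The following is a minimal, single-process Python sketch of the flow recited in claim 1, intended only as an illustration: it simulates the model initiator and the k data providers in one function, elides the inter-party communication and any privacy-preserving intersection protocol, and all names (align, initiator_ids, provider_id_sets, the auto_* placeholders) are assumptions rather than the patent's reference implementation.

    # Single-process simulation of the alignment flow of claim 1 (illustrative only).
    # In a real deployment the initiator and provider steps run on separate parties
    # and the ID intersection would be computed with a privacy-preserving protocol.
    def align(initiator_ids, provider_id_sets, min_parties=None):
        """initiator_ids: ordered list of the model initiator's sample IDs;
        provider_id_sets: list of k ID sets, one per data provider."""
        n = len(initiator_ids)
        k = len(provider_id_sets)
        min_parties = k if min_parties is None else min_parties

        marker_vectors, second_sequences = [], []
        for provider_ids in provider_id_sets:
            # ID intersection between the initiator and this provider.
            intersection = [x for x in initiator_ids if x in provider_ids]
            # Local serial numbers of the intersection IDs on the initiator side.
            local_serials = [initiator_ids.index(x) for x in intersection]
            # First self-increasing sequence ID: one placeholder per initiator sample.
            first_seq = [f"auto_{i}" for i in range(n)]
            # Replacement operation: write the real IDs at the positions named by
            # the initiator's local serial numbers; other positions stay unchanged.
            second_seq = list(first_seq)
            for serial, real_id in zip(local_serials, intersection):
                second_seq[serial] = real_id
            second_sequences.append(second_seq)
            # Marker vector: 1 where a real ID was written, 0 elsewhere.
            marker_vectors.append([int(a != b) for a, b in zip(second_seq, first_seq)])

        # The initiator derives a Boolean vector from the marker vectors
        # (here: keep a position marked by at least min_parties providers).
        sum_vector = [sum(col) for col in zip(*marker_vectors)]
        bool_vector = [s >= min_parties for s in sum_vector]

        # Each provider keeps the samples (and, in practice, their features)
        # at the positions where the Boolean vector is True.
        reserved = [[seq[i] for i, keep in enumerate(bool_vector) if keep]
                    for seq in second_sequences]
        return bool_vector, reserved

For example, under these assumptions, align(["u1", "u2", "u3"], [{"u1", "u3"}, {"u2", "u3"}], min_parties=2) returns the Boolean vector [False, False, True], and each provider's reserved list contains only "u3".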
2. The method of claim 1, further comprising:
configuring a minimum required number of aligned participants, wherein the number of participants is less than or equal to k;
wherein determining, by the model initiator, the Boolean vector according to the marker vectors comprises:
summing, by the model initiator, the marker vectors to obtain a sum vector;
and determining the Boolean vector according to the sum vector and the number of participants.
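A small numeric illustration of the sum-and-threshold rule in claim 2, with hypothetical marker vectors:

    # Marker vectors reported by k = 3 data providers over 4 positions (hypothetical values).
    marker_vectors = [
        [1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
    ]
    min_parties = 2  # configured minimum number of aligned participants

    # Element-wise sum of the marker vectors.
    sum_vector = [sum(col) for col in zip(*marker_vectors)]   # [2, 1, 2, 3]
    # A position is kept when at least min_parties providers marked it.
    bool_vector = [s >= min_parties for s in sum_vector]      # [True, False, True, True]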
3. The method according to claim 1 or 2, wherein performing, by each data provider, the replacement operation on the corresponding first self-increasing sequence ID according to the model initiator's local serial numbers to obtain the k second self-increasing sequence IDs comprises:
replacing, by each data provider according to the model initiator's local serial numbers, the corresponding positions in its first self-increasing sequence ID with the elements of the corresponding ID intersection, while keeping the other positions unchanged, to obtain the k second self-increasing sequence IDs.
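A concrete worked example of this replacement operation, using hypothetical IDs and serial numbers:

    # The model initiator holds 6 local samples; its ID intersection with this
    # provider is ["u7", "u9"], located at local serial numbers 2 and 5.
    intersection = ["u7", "u9"]
    local_serials = [2, 5]

    first_seq = [f"auto_{i}" for i in range(6)]
    # -> ['auto_0', 'auto_1', 'auto_2', 'auto_3', 'auto_4', 'auto_5']

    second_seq = list(first_seq)
    for serial, real_id in zip(local_serials, intersection):
        second_seq[serial] = real_id   # overwrite only the named positions
    # -> ['auto_0', 'auto_1', 'u7', 'auto_3', 'auto_4', 'u9']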
4. The method according to claim 1 or 2, characterized in that the method further comprises:
determining the data quality evaluation value of each data provider to obtain k data quality evaluation values;
and when a target data quality evaluation value is smaller than a preset threshold, setting the marker vector of at least one data provider other than the target data provider corresponding to the target data quality evaluation value to 1, so as to constrain the Boolean vector, wherein the target data quality evaluation value is any one of the k data quality evaluation values.
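One possible reading of this quality rule, sketched under the assumptions that "setting a marker vector to 1" means filling it with ones before the Boolean vector is computed and that the quality evaluation values are supplied externally; the function name and score scale are assumptions, not the patent's:

    def apply_quality_rule(marker_vectors, quality_scores, threshold):
        # quality_scores[j] is the data quality evaluation value of provider j;
        # threshold is the preset threshold of claim 4 (both hypothetical scales).
        n = len(marker_vectors[0])
        for j, score in enumerate(quality_scores):
            if score < threshold:
                # Set the marker vector of at least one provider other than the
                # low-quality provider j to all ones, which constrains the Boolean
                # vector subsequently derived from the summed marker vectors.
                for i in range(len(marker_vectors)):
                    if i != j:
                        marker_vectors[i] = [1] * n
                        break
        return marker_vectors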
5. The method according to claim 1 or 2, characterized in that the method further comprises:
filling, by each data provider, null values into the features of any non-real ID among the k second self-increasing sequence IDs at positions where the Boolean vector is true.
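A minimal sketch of this null-filling step, assuming each provider keeps its features in a mapping from real sample IDs to feature rows and that placeholder positions are recognizable (the auto_* convention used above); the helper name and data layout are assumptions:

    import math

    def build_feature_rows(second_seq, bool_vector, feature_table, n_features):
        """feature_table maps a real sample ID to its feature row (assumed layout)."""
        rows = []
        for pos_id, keep in zip(second_seq, bool_vector):
            if not keep:
                continue                    # position dropped by the Boolean vector
            if pos_id in feature_table:
                rows.append(feature_table[pos_id])
            else:
                # Position kept overall but holding a non-real (placeholder) ID for
                # this provider: fill its features with null values.
                rows.append([math.nan] * n_features)
        return rows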
6. The method according to claim 1 or 2, characterized in that the method further comprises:
synchronously sending, by the model initiator, its local sample count to each data provider;
wherein generating, by each data provider, a self-increasing sequence ID to obtain the k first self-increasing sequence IDs comprises:
generating, by each data provider, a self-increasing sequence ID according to the sample count to obtain the k first self-increasing sequence IDs.
7. A multi-party computing system, comprising a model initiator and k data providers, the model initiator being the party holding labels and k being a positive integer, wherein:
the model initiator is configured to determine the ID intersection between the model initiator and each of the k data providers to obtain k ID intersections, and to send to each data provider the model initiator's local serial numbers corresponding to the IDs in the respective ID intersection;
each data provider is configured to generate a self-increasing sequence ID to obtain k first self-increasing sequence IDs, to perform a replacement operation on the corresponding first self-increasing sequence ID according to the model initiator's local serial numbers to obtain k second self-increasing sequence IDs, and to generate a marker vector according to the k second self-increasing sequence IDs;
the model initiator is further configured to determine a Boolean vector according to the marker vectors;
and each data provider is further configured to determine, according to the Boolean vector, the reserved samples corresponding to that data provider and their corresponding features.
8. The system of claim 7, wherein the system is further configured to:
configure a minimum required number of aligned participants, wherein the number of participants is less than or equal to k;
wherein, in determining the Boolean vector according to the marker vectors, the model initiator is configured to:
sum the marker vectors to obtain a sum vector;
and determine the Boolean vector according to the sum vector and the number of participants.
9. An electronic device, comprising a processor and a memory, wherein the memory is configured to store one or more programs to be executed by the processor, the programs comprising instructions for performing the steps of the method of any one of claims 1-6.
10. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method of any one of claims 1-6.
CN202210801464.7A 2022-07-08 2022-07-08 Data alignment method, system and related equipment in multi-party federal learning Pending CN114880383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210801464.7A CN114880383A (en) 2022-07-08 2022-07-08 Data alignment method, system and related equipment in multi-party federal learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210801464.7A CN114880383A (en) 2022-07-08 2022-07-08 Data alignment method, system and related equipment in multi-party federal learning

Publications (1)

Publication Number Publication Date
CN114880383A true CN114880383A (en) 2022-08-09

Family

ID=82682611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210801464.7A Pending CN114880383A (en) 2022-07-08 2022-07-08 Data alignment method, system and related equipment in multi-party federal learning

Country Status (1)

Country Link
CN (1) CN114880383A (en)

Similar Documents

Publication Publication Date Title
Aledhari et al. Federated learning: A survey on enabling technologies, protocols, and applications
CN110189192B (en) Information recommendation model generation method and device
CN113159327B (en) Model training method and device based on federal learning system and electronic equipment
US10891161B2 (en) Method and device for virtual resource allocation, modeling, and data prediction
JP2021121922A (en) Multi-model training method and apparatus based on feature extraction, electronic device, and medium
CN111784001A (en) Model training method and device and computer readable storage medium
TWI722746B (en) Information reading and writing method and device based on blockchain
CN109003192A (en) A kind of insurance underwriting method and relevant device based on block chain
CN114818000B (en) Privacy protection set confusion intersection method, system and related equipment
US11500992B2 (en) Trusted execution environment-based model training methods and apparatuses
CN111553744A (en) Federal product recommendation method, device, equipment and computer storage medium
CN112465627A (en) Financial loan auditing method and system based on block chain and machine learning
CN111709051A (en) Data processing method, device and system, computer storage medium and electronic equipment
CN112184444A (en) Method, apparatus, device and medium for processing information based on information characteristics
CN113591097A (en) Service data processing method and device, electronic equipment and storage medium
CN115913537A (en) Data intersection method and system based on privacy protection and related equipment
CN114880383A (en) Data alignment method, system and related equipment in multi-party federal learning
CN115203487A (en) Data processing method based on multi-party security graph and related device
CN114493850A (en) Artificial intelligence-based online notarization method, system and storage medium
CN115796305B (en) Tree model training method and device for longitudinal federal learning
US20230031624A1 (en) Methods, systems, apparatuses and devices for facilitating capitalizing on a portfolio of pre-selected multi level marketing companies
KR102455933B1 (en) Method, device and system for providing block chain/nft-based freelance history information management platform service
CN111797126B (en) Data processing method, device and equipment
US20240121080A1 (en) Cryptographic key generation using machine learning
Singh et al. Introduction to the Special Issue on Integrity of Multimedia and Multimodal Data in Internet of Things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination