CN113095514A - Data processing method, device, equipment, storage medium and program product - Google Patents


Info

Publication number
CN113095514A
CN113095514A (application CN202110454684.2A)
Authority
CN
China
Prior art keywords
feature
data
matrix
sample
participant
Prior art date
Legal status
Pending
Application number
CN202110454684.2A
Other languages
Chinese (zh)
Inventor
魏文斌
范涛
陈天健
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202110454684.2A
Publication of CN113095514A
Priority to PCT/CN2021/140955 (WO2022227644A1)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/20: Ensemble learning

Abstract

The application provides a data processing method, apparatus, device, storage medium and program product, wherein the method comprises the following steps: constructing a virtual feature correlation matrix based on first sample feature data and a pre-trained secure computation model; determining a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix; determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantification factor; and deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training, wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features. Thus, on the premise of protecting data privacy, data exhibiting collinearity in the feature data held by each participant is screened out and eliminated, which can improve the accuracy and stability of the federated model obtained by joint training and improve its modeling effect.

Description

Data processing method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence technology, and relates to, but is not limited to, a data processing method, apparatus, device, storage medium, and program product.
Background
Machine learning is the science of how to use computers to simulate or implement human learning activities; it is one of the core intelligent features of artificial intelligence and one of its frontier research fields. Research on machine learning falls mainly into two directions: the first is traditional machine learning, which mainly studies learning mechanisms and focuses on exploring and simulating human learning mechanisms; the second is machine learning in the big-data environment, which mainly studies how to use information effectively and focuses on extracting hidden, effective and understandable knowledge from massive data.
Federated learning is a novel privacy-preserving technology that can effectively combine the data of all parties for model training on the premise that the data does not leave its local domain. Many business problems in the big-data field can be solved with a corresponding machine learning model, and eliminating collinear data is key to training a good model. In the related art, under the constraint of protecting data privacy, the collinearity of multi-party data in federated learning cannot be quantified, and training data exhibiting collinearity cannot be efficiently screened out and eliminated, so the trained model has low accuracy and poor stability.
Disclosure of Invention
Embodiments of the present application provide a data processing method, an apparatus, a device, a computer-readable storage medium, and a computer program product, which can remove data with collinearity in linear federal modeling, improve accuracy and stability of a federal model, and improve modeling effect of the model.
The technical scheme of the embodiment of the application is realized as follows:
An embodiment of the application provides a data processing method, applied to a first participant of federated learning, the method comprising the following steps:
constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, wherein the secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation;
determining a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantification factor;
deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training of the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including the features held by the first participant other than the target feature and the features held by the other participants.
An embodiment of the application provides a data processing apparatus, applied to a first participant of federated learning, the apparatus comprising:
a construction module, configured to construct a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, wherein the secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation;
a first determining module, configured to determine a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
a second determining module, configured to determine, based on the collinearity quantification factor, a target feature from the features corresponding to the first sample feature data;
a deleting module, configured to delete the feature data of the target feature from the first sample feature data to obtain first training data for joint training between the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including the features held by the first participant other than the target feature and the features held by the other participants.
An embodiment of the present application provides a data processing apparatus, where the apparatus includes:
a memory for storing executable instructions;
and a processor, configured to implement the method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
Embodiments of the present application provide a computer-readable storage medium having executable instructions stored thereon; when the executable instructions are executed by a processor, the method provided by the embodiments of the present application is implemented.
Embodiments of the present application provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
in the data processing method provided by the embodiment of the application, when data processing is performed, a first participant and the other participants of federated learning are first pre-trained based on secure multi-party computation to obtain a secure computation model. The first participant then acquires the first sample feature data it holds and constructs a virtual feature correlation matrix based on the first sample feature data and the pre-trained secure computation model; next, a collinearity quantification factor of each feature corresponding to the first sample feature data is determined based on the feature correlation matrix, and a target feature is determined among the features corresponding to the first sample feature data using the collinearity quantification factor, wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including not only the features, other than the target feature, held by the first participant but also all the features held by the other participants. After the target feature is determined, its feature data is deleted from the first sample feature data to obtain first training data for the first participant to perform joint training with the other participants. Thus, on the premise of protecting data privacy, data exhibiting collinearity in the feature data held by each participant can be screened out and eliminated to obtain training data without linear relationships, so that each participant performs joint training with such training data, which improves the accuracy and stability of the federated model and improves its modeling effect.
Drawings
Fig. 1 is a schematic network architecture diagram of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a component structure of a data processing device according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of an implementation of a data processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another implementation of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a calculation flow of a variance inflation factor in a vertical federated setting according to an embodiment of the present application;
fig. 6 is a schematic diagram of a calculation process of determinant of correlation matrix according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second" and "third" are only used to distinguish similar objects and do not denote a particular order; it is understood that, where permissible, the specific order or sequence may be interchanged so that the embodiments of the present application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Vertical Federated Learning: in the case where the users of two data sets overlap substantially while their user features overlap little, the data sets are split in the vertical direction (i.e., the feature dimension), and the portion of data where the users are the same for both parties but the user features are not entirely the same is taken out for machine learning training.
2) Variance Inflation Factor (VIF), also called the variance expansion coefficient: a value characterizing the degree of multicollinearity among the observed values of the independent variables, used to measure the severity of multicollinearity in a multiple linear regression model.
3) Homomorphic Encryption: a cryptographic technique based on the computational-complexity theory of mathematical problems. When homomorphically encrypted data is processed to produce an output and that output is decrypted, the result is the same as the output obtained by processing the unencrypted original data in the same way.
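A toy additively homomorphic scheme makes the property above concrete. The sketch below is a textbook Paillier cryptosystem with tiny, insecure parameters (our choice, purely for illustration), showing that multiplying two ciphertexts yields an encryption of the sum of the plaintexts:

```python
from math import gcd
import random

# Toy Paillier cryptosystem (tiny primes, illustration only -- not secure).
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = n + 1
mu = pow(lam, -1, n)                            # since L(g^lam mod n^2) = lam

def encrypt(m):
    """Enc(m) = g^m * r^n mod n^2 for a random r coprime to n."""
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

c1, c2 = encrypt(7), encrypt(35)
assert decrypt((c1 * c2) % n2) == 42   # Enc(7) * Enc(35) decrypts to 7 + 35
```

In a real deployment the primes would be large enough that factoring n is infeasible; the homomorphic product-of-ciphertexts property is what lets a party add values it cannot read.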
An exemplary application of the apparatus implementing the embodiments of the present application is described below. The apparatus provided in the embodiments of the present application may be implemented as a terminal device, and exemplary applications covering terminal devices are explained in the following.
Fig. 1 is a schematic diagram of a network architecture of a data processing method according to an embodiment of the present application. As shown in Fig. 1, the network architecture includes at least a first participant 100, a second participant 200, and a network 300. To support an exemplary application, the first participant 100 and the second participant 200 may be participants in vertical federated learning that jointly train a machine learning model. The first participant 100 and the second participant 200 may be clients, for example devices of participants such as banks or hospitals that store user feature data, and a client may be a device with model training capability such as a laptop, a tablet computer, a desktop computer, or a dedicated training device. The first participant 100 is connected to the second participant 200 via the network 300, which may be a wide area network or a local area network, or a combination of the two, with data transmission over wireless or wired links.
The first participant 100 first obtains first sample feature data from its own data, and processes the first sample feature data to obtain processed first sample feature data. The second participant 200 obtains second sample feature data from its own data, and processes the second sample feature data to obtain processed second sample feature data. The first sample characteristic data and the second sample characteristic data have the same identification, that is, the first sample characteristic data and the second sample characteristic data are data of different characteristics of the same batch of samples held by the first participant 100 and the second participant 200, respectively. The first participant 100 and the second participant 200 determine a first matrix E using the processed first sample feature data and the processed second sample feature data based on the secure multi-party computation, the first matrix E being held by both the first participant 100 and the second participant 200. Based on secure multiparty computation, the first participant 100 can only obtain the first matrix E, and cannot know the processed second sample feature data held by the second participant 200; similarly, the second participant 200 can only obtain the first matrix E, and cannot know the processed first sample feature data held by the first participant 100.
After the first participant 100 obtains the first matrix E, it constructs a virtual feature correlation matrix according to the first matrix E and the processed first sample feature data. It then calculates the determinant of the feature correlation matrix and the determinants of the corresponding minors of the feature correlation matrix, and determines a collinearity quantification factor of each feature corresponding to the first sample feature data based on these determinants. The collinearity quantification factor may be a variance inflation factor used to quantify the collinearity of each feature with all the other features, where all the other features include not only the features of the first participant 100 other than the feature being quantified, but also all the features of the second participant 200. The first participant 100 determines, according to the collinearity quantification factor of each feature, which features' feature data exhibit collinearity, determines those features as target features, and finally deletes the feature data of the target features from the first sample feature data to obtain first training data for the first participant 100 and the second participant 200 to perform joint training. In the same way, the second participant 200 obtains second training data for the joint training. Thus, when joint training is performed, the first participant 100 and the second participant 200 train with the first training data and the second training data, which are free of collinearity, and a federated model with high accuracy and good stability can be obtained.
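The determinant route just described can be illustrated in the clear (in the patent these quantities are computed under secure multi-party computation). For a feature correlation matrix R, the variance inflation factor of feature i equals the determinant of the minor obtained by deleting row and column i, divided by the determinant of R, which coincides with the i-th diagonal element of the inverse of R:

```python
import numpy as np

def vif_via_determinants(R):
    """VIF_i = det(minor_ii(R)) / det(R) for a feature correlation matrix R."""
    d = np.linalg.det(R)
    k = R.shape[0]
    vifs = []
    for i in range(k):
        keep = [j for j in range(k) if j != i]
        minor = R[np.ix_(keep, keep)]          # delete row i and column i
        vifs.append(np.linalg.det(minor) / d)
    return np.array(vifs)

# The determinant route agrees with the diagonal of the inverse correlation matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=200)  # collinear column
R = np.corrcoef(X, rowvar=False)
assert np.allclose(vif_via_determinants(R), np.diag(np.linalg.inv(R)))
```

The determinant formulation matters here because determinants of the shared correlation matrix can be computed cooperatively without either party revealing its raw columns.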
With the method provided by the embodiment of the present application, on the premise of protecting data privacy, collinear data in the feature data held by each participant can be screened out and eliminated to obtain training data without linear relationships, so that during joint training each participant uses such training data, which improves the accuracy and stability of the federated model and improves its modeling effect.
The apparatus provided in the embodiments of the present application may be implemented as hardware or a combination of hardware and software, and various exemplary implementations of the apparatus provided in the embodiments of the present application are described below.
Fig. 2 shows an exemplary structure of the data processing device 10, illustrated here as a device applied to a first participant of federated learning. Other exemplary structures of the data processing device 10 are possible, so the structure described here should not be seen as a limitation; for example, some components described below may be omitted, or components not described below may be added to suit the special needs of some applications.
The data processing apparatus 10 shown in Fig. 2 includes: at least one processor 110, a memory 140, at least one network interface 120, and a user interface 130. The components in the data processing device 10 are coupled together by a bus system 150, which is used to enable connection and communication among these components. In addition to a data bus, the bus system 150 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 150 in Fig. 2.
The user interface 130 may include a display, a keyboard, a mouse, a touch-sensitive pad, a touch screen, and the like.
The memory 140 may be volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 140 described in the embodiments herein is intended to comprise any suitable type of memory.
The memory 140 in embodiments of the present application is capable of storing data to support the operation of the data processing apparatus 10. Examples of such data include: any computer program for operation on data processing device 10, such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
As an example of the method provided by the embodiment of the present application being implemented in software, the method may be directly embodied as a combination of software modules executed by the processor 110; the software modules may be located in a storage medium within the memory 140, and the processor 110 reads the executable instructions included in the software modules in the memory 140 and, in combination with necessary hardware (for example, the processor 110 and other components connected to the bus system 150), completes the method provided by the embodiment of the present application.
By way of example, the processor 110 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The data processing method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the device provided by the embodiment of the present application.
Fig. 3 is a schematic implementation flow diagram of a data processing method provided in an embodiment of the present application, which is applied to a first participant of the network architecture shown in fig. 1, and will be described with reference to the steps shown in fig. 3.
Step S301, a virtual feature correlation matrix is constructed based on first sample feature data held by a first participant and a pre-trained secure computation model.
In the embodiment of the application, among the features corresponding to the first sample feature data, the feature data of different features may have linear relationships, and if data exhibiting collinearity is used to train a model, the trained model has low accuracy and poor stability. In order to remove data with linear relationships, the first participant constructs a virtual feature correlation matrix, denoted H, according to the first sample feature data and a pre-trained secure computation model.
The secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation, and it supports privacy-preserving matrix addition, subtraction, and multiplication. In the embodiment of the present application, the data processing method is described by taking federated learning between two participants as an example; in practical applications, the method can also be applied to federated learning among three or more participants. On the premise that the data held by each participant does not leave its local domain and data privacy is guaranteed, the first participant and the second participant train on the basis of a privacy-preserving technique to obtain a trained secure computation model, which is held by each participant.
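A plaintext analogue may clarify the shape of the matrix being built. Under our illustrative assumption that each party standardizes its own columns and that the cross-party block E = Xa^T Xb / n is the quantity obtained via secure multi-party computation, the virtual feature correlation matrix H assembles each party's local block with the jointly computed block:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance columns, so that X.T @ X / n is a correlation matrix."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Plaintext sketch: in the patent, the cross block E would be computed under
# secure multi-party computation so neither party sees the other's raw data.
rng = np.random.default_rng(0)
Xa = standardize(rng.normal(size=(100, 3)))   # first participant's features
Xb = standardize(rng.normal(size=(100, 2)))   # second participant's features
n = Xa.shape[0]

E = Xa.T @ Xb / n                              # cross-party correlation block
H = np.block([[Xa.T @ Xa / n, E],
              [E.T,           Xb.T @ Xb / n]]) # virtual feature correlation matrix
```

Each party can fill in its own diagonal block locally; only the off-diagonal block requires the joint, privacy-preserving matrix multiplication the secure computation model provides.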
In this embodiment of the present application, because the first participant and the second participant perform joint training, the training data needs to come from the same users; that is, the first sample feature data used by the first participant and the second sample feature data used by the second participant must correspond to the same user Identifiers (IDs). Therefore, before the first participant acquires the first sample feature data, it is first necessary to determine the common users held by all participants, then screen out from the common users the target users participating in the training, and use the feature data of the target users as the first sample feature data.
Step S302, based on the feature correlation matrix, a collinearity quantification factor of each feature corresponding to the first sample feature data is determined.
In practical applications, methods for quantifying the collinearity among the feature data of multiple features include the variance inflation factor method, the eigenvalue (characteristic root) analysis method, and the condition number method. In addition, there is an intuitive judgment method that can qualitatively assess the degree of collinearity among the feature data of multiple features; it is generally used for preliminary judgment and cannot be used for quantitative analysis.
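Of these, the variance inflation factor method is the one this embodiment builds on; in the clear it amounts to taking the diagonal of the inverse of the feature correlation matrix. A minimal numpy sketch (illustrative only, not the patent's secure computation):

```python
import numpy as np

def vif_from_data(X):
    """Variance inflation factor of each column of X.

    VIF_i is the i-th diagonal element of the inverse of the
    correlation matrix of the columns of X.
    """
    R = np.corrcoef(X, rowvar=False)      # feature correlation matrix
    return np.diag(np.linalg.inv(R))

# Two nearly collinear features plus an independent one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=200)  # almost a multiple of x1
x3 = rng.normal(size=200)
vifs = vif_from_data(np.column_stack([x1, x2, x3]))
# vifs[0] and vifs[1] are very large; vifs[2] stays close to 1.
```

The two collinear columns inflate each other's factor, while the independent column's factor remains near the no-collinearity baseline of 1.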
The first sample feature data held by the first participant and the second sample feature data held by the second participant correspond to different columns of the feature correlation matrix; by determining whether each column of the feature correlation matrix is collinear with the other columns, it can be determined whether the feature data of each feature in the first sample feature data is collinear with the feature data of the other features. Here, collinearity between the feature data of different features includes the following possibilities: the feature data of one feature is a multiple of the feature data of another feature; the feature data of one feature equals the feature data of another feature plus a constant term; or the feature data of one feature equals the sum of the feature data of two other features.
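Each of the three possibilities just listed makes the feature correlation matrix (nearly) singular, which is exactly what the quantification factors pick up. A short sketch with made-up columns:

```python
import numpy as np

rng = np.random.default_rng(1)
f1 = rng.normal(size=100)
f2 = rng.normal(size=100)

g_multiple = 3.0 * f1        # case 1: a multiple of another feature
g_shifted  = f1 + 5.0        # case 2: another feature plus a constant term
g_sum      = f1 + f2         # case 3: the sum of two other features

# Any of the three collinear columns drives det(R) to (numerically) zero.
for g in (g_multiple, g_shifted, g_sum):
    R = np.corrcoef(np.column_stack([f1, f2, g]), rowvar=False)
    print(abs(np.linalg.det(R)))
```

Note that case 2 already has correlation exactly 1 with the original feature: adding a constant does not change a correlation, which is why correlation matrices catch shifted copies as well as scaled ones.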
In the embodiment of the application, the collinearity quantification factor of each feature corresponding to the first sample feature data is determined according to the feature correlation matrix, and from these factors it can be determined which features have feature data that is collinear with the feature data of other features.
Step S303, a target feature is determined from the features corresponding to the first sample feature data based on the collinearity quantification factor.
The target features exhibiting collinearity are determined according to the collinearity quantification factor. Taking the variance inflation factor as an example, when there is no multicollinearity the variance inflation factor is close to 1, and the stronger the multicollinearity, the larger the variance inflation factor. In practice, some degree of multicollinearity almost always exists between data, so using a variance inflation factor equal to 1 as the criterion for judging collinearity is impractical; generally, a boundary value may be preset according to the actual application scenario, and in the embodiment of the present application the preset boundary value may be 10. It is judged whether the variance inflation factor of each feature is larger than 10; when the variance inflation factor of a feature is larger than 10, the feature data of that feature is considered to have strong collinearity with the feature data of other features, and the feature is determined as a target feature.
Step S304, the feature data of the target feature is deleted from the first sample feature data to obtain first training data for the first participant to perform joint training with the other participants.
The feature data of the target feature has a linear relationship with the feature data of at least one of the other features, where the other features include the features held by the first participant other than the target feature and the features held by the other participants.
In the embodiment of the application, the variance inflation factor of each feature in the first sample feature data of the first participant can be calculated based on the feature correlation matrix, the target features whose variance inflation factor is larger than the preset boundary value are determined, and the feature data of those target features is then deleted from the first sample feature data to obtain feature data without collinearity as the first training data. Thus, when joint training is performed, the first participant trains with the first training data, which is free of collinearity, and a federated model with high accuracy and good stability can be obtained.
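Steps S302 to S304 can be sketched in the clear as follows. The patent performs the correlation computations under secure multi-party computation; removing the worst feature one at a time and recomputing is our illustrative design choice, since deleting one collinear column changes the remaining factors:

```python
import numpy as np

VIF_THRESHOLD = 10.0   # the preset boundary value used in the text

def drop_collinear_features(X, threshold=VIF_THRESHOLD):
    """Repeatedly delete the column with the largest VIF above the threshold."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        R = np.corrcoef(X[:, cols], rowvar=False)
        vifs = np.diag(np.linalg.inv(R))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        del cols[worst]                # this column is a "target feature"
    return X[:, cols], cols

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=300)   # collinear with x1
x3 = rng.normal(size=300)                           # independent feature
training_data, kept = drop_collinear_features(np.column_stack([x1, x2, x3]))
# One of x1/x2 is removed as a target feature; the independent x3 is kept.
```

After the loop terminates, every remaining feature has a variance inflation factor at or below the boundary value, which is the condition the first training data is required to satisfy.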
The data processing method provided by the embodiment of the application is applied to a first participant of federated learning and comprises: constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, wherein the secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation; determining a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix; determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantification factor; and deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training of the first participant and the other participants, wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including the features held by the first participant other than the target feature and the features held by the other participants. By this method, on the premise of protecting data privacy, collinear data in the feature data held by each participant can be screened out and eliminated to obtain training data without linear relationships, so that each participant performs joint training with such training data, which improves the accuracy and stability of the federated model and improves its modeling effect.
In some embodiments, before step S301 in the embodiment shown in fig. 3, the first participant obtains the first sample feature data from the feature data held by the first participant.
Because the first participant and the second participant perform joint training, the training data need to come from the same users, i.e. the first sample feature data used by the first participant for joint training and the second sample feature data used by the second participant for joint training must have the same IDs. When the first sample feature data is obtained, the common users held by all participants first need to be determined; then the target users participating in this training are screened from the common users, and the feature data of the target users is used as the first sample feature data.
The following illustrates the process of the first party obtaining the first sample characteristic data.
The first party and the second party respectively hold feature data of different features of a batch of users; for example, the first party holds feature data for the features birthday (AA), age (BB), weight (CC) and deposit (DD) of the batch of users, and the second party holds feature data for the features age (BB), consumption capacity (EE) and hobby (FF) of the batch of users. The database tables of the first party and the second party are shown in Tables 1 and 2, respectively:
table 1 database table of a first party
ID AA BB CC DD
01 a1 b1 c1 d1
02 a2 b2 c2 d2
03 a3 b3 c3 d3
04 a4 b4 c4 d4
06 a6 b6 c6 d6
Table 2 database table of second participant
ID BB EE FF
01 b1 e1 f1
03 b3 e3 f3
04 b4 e4 f4
05 b5 e5 f5
06 b6 e6 f6
From the database table of the first party, the feature data held by the first party is obtained (one row per user 01, 02, 03, 04, 06; one column per feature AA, BB, CC, DD):

[ a1  b1  c1  d1
  a2  b2  c2  d2
  a3  b3  c3  d3
  a4  b4  c4  d4
  a6  b6  c6  d6 ]
From the database table of the second party, the feature data held by the second party is obtained (one row per user 01, 03, 04, 05, 06; one column per feature BB, EE, FF):

[ b1  e1  f1
  b3  e3  f3
  b4  e4  f4
  b5  e5  f5
  b6  e6  f6 ]
On the premise of privacy protection, since each participant cannot know which users' data is stored by the other participants, the first participant and the second participant determine the common users based on a privacy protection technique in the embodiment of the application. Determining the common users based on the privacy protection technique may be implemented as follows: the first participant and the second participant each obtain the identifications of their own users; then, based on the privacy protection technique, the intersection of the user identifications held by the first participant and those held by the second participant is computed, and the result is the set of common users. In this way it is determined that the common users held by the first participant and the second participant are the users with IDs 01, 03, 04 and 06, recorded as S = {01, 03, 04, 06}.
After the common user set S is obtained, the first participant determines the target users participating in this training, for example by randomly selecting a number of users from the common users as the target users, and then obtains the feature data of the target users from its own stored data as the first sample feature data. The first participant sends the IDs of the screened target users to the second participant, and the second participant obtains the feature data of these target users from its own stored data as the second sample feature data. Alternatively, the second participant may determine the target users participating in the training and send the IDs of the determined target users to the first participant. For example, the first participant randomly selects the 3 users with IDs 01, 03 and 04 from S as target users, denoted S1 = {01, 03, 04}, and screens the feature data of the target users S1 out of the feature data it holds, obtaining the first sample feature data

A = [ a1  b1  c1  d1
      a3  b3  c3  d3
      a4  b4  c4  d4 ]
On the second party's side, the second party screens the feature data of the target users S1 out of the feature data it holds, obtaining the second sample feature data

B = [ b1  e1  f1
      b3  e3  f3
      b4  e4  f4 ]
Through the above steps, on the premise that the data of each participant does not leave its local side and privacy protection is achieved, the first participant obtains the first sample feature data and the second participant obtains the second sample feature data.
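The alignment steps above can be sketched in plaintext Python. This is only an illustration of the data flow: a real deployment would use a private-set-intersection protocol so that neither side reveals its full ID list, and the table contents and variable names here are illustrative stand-ins.

```python
import random

# IDs held by each side (from Tables 1 and 2).
party_a_ids = ["01", "02", "03", "04", "06"]
party_b_ids = ["01", "03", "04", "05", "06"]

# Step 1: common users S. In production this intersection would be computed
# with a privacy-preserving PSI protocol, not by exchanging raw ID lists.
common = sorted(set(party_a_ids) & set(party_b_ids))

# Step 2: the first participant samples target users S1 from S.
random.seed(0)
target = sorted(random.sample(common, 3))

# Step 3: each side keeps only its own rows for the target users.
table_a = {"01": ("a1", "b1", "c1", "d1"), "02": ("a2", "b2", "c2", "d2"),
           "03": ("a3", "b3", "c3", "d3"), "04": ("a4", "b4", "c4", "d4"),
           "06": ("a6", "b6", "c6", "d6")}
sample_a = [table_a[uid] for uid in target]  # first sample feature data
```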
In some embodiments, the step S301 "building a virtual feature correlation matrix based on the first sample feature data held by the first participant and the pre-trained security computation model" in the embodiment shown in fig. 3 can be implemented by:
step S3011 is to determine feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data.
Still referring to the above exemplary data, as shown in table 3, the first sample feature data corresponds to the following features: AA, BB, CC and DD. The feature data of feature AA are {a1, a3, a4}, those of feature BB are {b1, b3, b4}, those of feature CC are {c1, c3, c4}, and those of feature DD are {d1, d3, d4}; that is, the feature data of each feature corresponds to one column of A.
The samples corresponding to the first sample feature data are {a1, b1, c1, d1}, {a3, b3, c3, d3} and {a4, b4, c4, d4}, so the number of samples n is determined to be 3.
Step S3012, respectively calculate the mean and standard deviation corresponding to the feature data of each feature.
According to

x̄ = (1/n) · Σᵢ xᵢ

the mean corresponding to the feature data of features AA, BB, CC and DD is calculated, and according to

σ_x = √( (1/n) · Σᵢ (xᵢ − x̄)² )

the standard deviation corresponding to the feature data of features AA, BB, CC and DD is calculated, where x = a, b, c, d and n = 3.
Step S3013, determining processed first sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature and the number of samples;
in the embodiment of the present application, the mean x̄ corresponding to the feature data of each feature, the standard deviation σ_x corresponding to the feature data of each feature, and the number n of samples are used to update the feature data x of each feature, obtaining the processed feature data. The processing formula here may be

x' = (x − x̄) / (√n · σ_x)

Applying this to every entry of A yields the processed first sample feature data

C = [ a1'  b1'  c1'  d1'
      a3'  b3'  c3'  d3'
      a4'  b4'  c4'  d4' ]

where x = a, b, c, d and n = 3. With this scaling, each column of C has unit Euclidean norm.
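With numeric stand-in values for A (the symbolic a, b, c, d entries are not computable), the standardization step can be checked: dividing each centered column by √n·σ makes CᵀC exactly the Pearson correlation matrix. A minimal sketch, assuming the population standard deviation is used:

```python
import numpy as np

# Numeric stand-in for the 3 x 4 first sample feature data A (values made up).
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.5, 1.0, 3.0],
              [4.0, 6.0, 2.0, 8.0]])
n = A.shape[0]

mu = A.mean(axis=0)                    # mean of each feature
sigma = A.std(axis=0)                  # population standard deviation of each feature
C = (A - mu) / (np.sqrt(n) * sigma)    # processed first sample feature data

# Every column of C now has unit Euclidean norm, so C.T @ C is the
# Pearson correlation matrix of the features.
corr = C.T @ C
```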
step S3014, input the processed first sample feature data to the secure computation model to obtain a first matrix.
In the embodiment of the application, each participant trains based on a secure multi-party computing protocol, such as the SPDZ protocol or the MASCOT protocol, to obtain a trained security computation model. The trained model can perform matrix addition, subtraction and multiplication under privacy. Secure Multi-Party Computation (MPC) mainly studies how to securely compute an agreed function without a trusted third party. If m participants participate in the federated learning training, the trained security computation model can be expressed as y_{1,…,m} = f(x_{1,…,m}), where m ≥ 2. After the training is completed, all m participants hold the security computation model; each participant inputs its own private input value and obtains the corresponding output value. For example, the i-th participant inputs x_i and obtains the output y_i, where 1 ≤ i ≤ m.
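The patent only names SPDZ and MASCOT. As a much-simplified illustration of how a product such as CᵀD can be evaluated without either side revealing its input, the following toy sketch uses additive secret sharing with a trusted-dealer Beaver triple; this is an assumption-laden stand-in, not the protocol the patent trains.

```python
import numpy as np

rng = np.random.default_rng(42)

def share(m):
    """Additively secret-share a matrix into two random-looking pieces."""
    r = rng.normal(size=m.shape)
    return r, m - r

# Private inputs: here X plays the role of C^T (first party) and Y of D (second party).
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(3, 3))

# Trusted dealer distributes a Beaver triple (A, B, A @ B) in shared form.
A = rng.normal(size=X.shape)
B = rng.normal(size=Y.shape)
A0, A1 = share(A); B0, B1 = share(B); AB0, AB1 = share(A @ B)
X0, X1 = share(X); Y0, Y1 = share(Y)

# Both parties open the masked differences; they reveal nothing about X, Y
# because A and B act as one-time masks.
eps = (X0 - A0) + (X1 - A1)      # opened X - A
delta = (Y0 - B0) + (Y1 - B1)    # opened Y - B

# Each party computes a local share of the product; party 0 adds eps @ delta once.
P0 = AB0 + eps @ B0 + A0 @ delta + eps @ delta
P1 = AB1 + eps @ B1 + A1 @ delta
product = P0 + P1                # reconstruction: equals X @ Y
```

The identity used is X·Y = (A+eps)(B+delta) = AB + eps·B + A·delta + eps·delta, which is why the reconstructed shares sum to the true product.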
In the embodiment of the application, taking two participants as an example, the trained security computation model is represented as y_{1,2} = f(x_{1,2}); both the first participant and the second participant hold the model. The first participant inputs the processed first sample feature data C into y_{1,2} = f(x_{1,2}) to obtain a first matrix E, where E = Cᵀ D and D is the processed second sample feature data. Here, E = Cᵀ D merely indicates that the value of E equals the value of Cᵀ D; it does not indicate that E is calculated directly from Cᵀ D. The first participant computes E according to y_{1,2} = f(x_{1,2}) and does not need to acquire the processed second sample feature data D from the second participant. From Cᵀ D, the first matrix E is a 4 × 3 matrix.
Correspondingly, on the second participant's side, the second participant inputs the processed second sample feature data D into y_{1,2} = f(x_{1,2}) to obtain the first matrix E, where E = Cᵀ D and Cᵀ is the transposed matrix of the processed first sample feature data. Again, E = Cᵀ D merely indicates that the value of E equals the value of Cᵀ D; it does not indicate that E is calculated directly from Cᵀ D. The second participant computes E according to y_{1,2} = f(x_{1,2}) and does not need to acquire the transposed matrix Cᵀ of the processed first sample feature data from the first participant.
In the embodiment of the application, the first matrix E is determined based on a secure multiparty computing protocol, and each participant cannot acquire private data of other participants, so that privacy and security of data of each participant can be ensured.
Step S3015, a virtual feature correlation matrix is constructed according to the processed first sample feature data and the first matrix.
In the embodiment of the present application, "constructing a virtual feature correlation matrix" may be implemented by the following steps:
in step S30151, a first symmetric matrix is determined according to the processed first sample feature data.
A symmetric matrix is a square matrix that is symmetric about its main diagonal, i.e. equal to its own transpose. In this embodiment of the application, the first symmetric matrix F is constructed from the processed first sample feature data C and its transposed matrix Cᵀ as

F = Cᵀ C

and the dimension of F is 4 × 4.
In step S30152, an empty matrix is generated, where the number of rows and the number of columns are equal to the number of columns of the obtained first matrix.
When the feature correlation matrix is constructed from the features held by the first participant and those held by the second participant, the first participant can only know which features the second participant holds; on the premise that each participant's data does not leave its local side, it cannot obtain the feature data of the features held by the second participant.
From the above step S3014, the first matrix E is a 4 × 3 matrix with 3 columns, so the generated empty matrix is a 3 × 3 matrix, denoted G'. A second symmetric matrix G, corresponding to the empty matrix G', is determined by the second participant from the second sample feature data B (by symmetry with F, G = Dᵀ D, where D is the processed second sample feature data).
Step S30153, a virtual feature correlation matrix is constructed according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix, and the null matrix.
According to the first symmetric matrix F, the first matrix E, the transposed matrix Eᵀ of the first matrix and the empty matrix G', the virtual feature correlation matrix H is constructed as

H = [ F    E
      Eᵀ   G' ]

Since the first symmetric matrix F has dimension 4 × 4, the first matrix E has dimension 4 × 3, the transposed matrix Eᵀ has dimension 3 × 4, and the empty matrix G' has dimension 3 × 3, the feature correlation matrix H has dimension 7 × 7.
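The block construction of H can be sketched in plaintext with numeric stand-ins (in the protocol, E comes out of the security computation model and the lower-right block stays empty on the first participant's side; all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Numeric stand-ins: C is the first party's processed data (3 samples x 4
# features), D the second party's (3 samples x 3 features).
C = rng.normal(size=(3, 4))
D = rng.normal(size=(3, 3))

F = C.T @ C            # first symmetric matrix, 4 x 4
E = C.T @ D            # first matrix (obtained via MPC in the protocol), 4 x 3
G = D.T @ D            # second symmetric matrix, known only to the second party
Gp = np.zeros((3, 3))  # the 3 x 3 "empty matrix" G' used by the first party

# Virtual feature correlation matrix: H = [[F, E], [E^T, G']].
H_virtual = np.block([[F, E], [E.T, Gp]])
# For reference, the true joint correlation-structure matrix would be:
H_true = np.block([[F, E], [E.T, G]])
```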
According to the method provided by the embodiment of the application, the first participant trains the security computation model together with the other participants based on a secure multi-party computing protocol. On the premise that the data held by each participant does not leave its local side and data privacy is guaranteed, the problem that the collinearity of multi-party data in federated learning cannot be quantified is converted into a tractable block-matrix determinant problem, so that the collinearity factor of multi-party data in federated learning can be quantified based on the feature correlation matrix, providing a basis for eliminating training data with collinearity and obtaining an accurate and stable federated model.
In some embodiments, the step S302 "determining the co-linear quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix" in the embodiment shown in fig. 3 may be implemented by:
step S3021, determining a determinant of the feature correlation matrix.
The determinant of the feature correlation matrix H is denoted by | H |, and in the embodiment of the present application, determining the determinant of the feature correlation matrix may be implemented by steps S30211 to S30214:
step S30211, a first random matrix with a determinant as a preset value and a dimension the same as that of the empty matrix is generated.
When the determinant of the feature correlation matrix is calculated, since the data of each participant cannot leave its local side, in order to ensure that private data is not leaked, a first random matrix for masking the original data is generated in the embodiment of the application, so that privacy protection is realized.
In the embodiment of the present application, the first random matrix generated by the first participant satisfies: its determinant is a preset value and its dimension is the same as that of the empty matrix G'. For convenience of calculation, the preset value may be set to 1, and it is known from step S30152 that the empty matrix G' is a 3 × 3 matrix, so the generated first random matrix is a matrix with determinant 1 and dimension 3 × 3, denoted R1.
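The patent does not say how R1 is generated. One standard way to obtain a random matrix with determinant exactly 1 (an assumption for illustration, not the patent's stated method) is to multiply random unit-triangular matrices, each of which has determinant 1:

```python
import numpy as np

def random_det1(dim, rng):
    """Random matrix with determinant exactly 1, built as a product of
    unit-triangular matrices (each has an all-ones diagonal, hence det 1)."""
    L = np.tril(rng.normal(size=(dim, dim)), k=-1) + np.eye(dim)
    U = np.triu(rng.normal(size=(dim, dim)), k=1) + np.eye(dim)
    return L @ U

rng = np.random.default_rng(7)
R1 = random_det1(3, rng)   # same dimension as the 3 x 3 empty matrix G'
```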
Step S30212, inputting the feature correlation matrix and the first random matrix into the security computation model to obtain a second matrix.
The trained security computation model of the first and second participants is represented as y_{1,2} = f(x_{1,2}). The first participant inputs the feature correlation matrix H and the first random matrix R1 into the model to obtain a second matrix J, where

J = R1 (G − Eᵀ F⁻¹ E) R2

Here F⁻¹ is the inverse of the first symmetric matrix F, G is the second symmetric matrix held by the second participant, and R2 is a random matrix generated by the second participant with determinant 1 and the same dimension as the empty matrix G' (3 × 3). As before, J = R1 (G − Eᵀ F⁻¹ E) R2 merely indicates that the value of J equals the value of that expression; it does not indicate that J is calculated from it directly. The first participant computes J according to y_{1,2} = f(x_{1,2}) and does not need to acquire the second symmetric matrix G, the processed second sample feature data D, or the second random matrix R2 from the second participant.
In the embodiment of the application, the determinants of R1 and R2 are both 1; the case where the determinant of R1 or R2 is not 1 is handled by the correction described in step S30213 below.
in the embodiment of the application, the second matrix J is determined based on a secure multiparty computing protocol, and each participant uses the generated random matrix to confuse respective data, so that each participant cannot know private data of other participants, and therefore privacy and security of data of each participant can be ensured.
Step S30213, determinants of the first symmetric matrix and the second matrix are calculated, respectively.
The first symmetric matrix F is obtained in step S30151 and the second matrix J in step S30212; their determinants are calculated respectively, giving the determinant |F| of the first symmetric matrix and the determinant |J| of the second matrix.
Here, when the determinant |R1| of the generated first random matrix or the determinant |R2| of the generated second random matrix is not 1, the computed determinant must be corrected by the factor 1/(|R1| · |R2|) when calculating |H|, since

|J| = |R1 (G − Eᵀ F⁻¹ E) R2| = |R1| · |G − Eᵀ F⁻¹ E| · |R2|
Step S30214, multiplying the determinant of the first symmetric matrix by the determinant of the second matrix to obtain the determinant of the feature correlation matrix.
The determinant |F| of the first symmetric matrix is multiplied by the determinant |J| of the second matrix to obtain the determinant of the feature correlation matrix: |H| = |F| · |J|.
In the embodiment of the application, each participant calculates the determinant of the virtual characteristic correlation matrix based on the safety calculation model, and the whole calculation process does not need to acquire the private data of other participants, so that the data safety of each participant can be ensured.
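Assuming J equals R1 (G − Eᵀ F⁻¹ E) R2 with determinant-1 masks, the block-determinant identity |H| = |F| · |J| used above can be checked in plaintext with numeric stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_det1(dim):
    """Random matrix with determinant exactly 1 (product of unit-triangular factors)."""
    L = np.tril(rng.normal(size=(dim, dim)), k=-1) + np.eye(dim)
    U = np.triu(rng.normal(size=(dim, dim)), k=1) + np.eye(dim)
    return L @ U

# Numeric stand-ins: 10 samples so that H is nonsingular and F invertible.
C = rng.normal(size=(10, 4))           # first party's processed data
D = rng.normal(size=(10, 3))           # second party's processed data
F, E, G = C.T @ C, C.T @ D, D.T @ D
H = np.block([[F, E], [E.T, G]])       # full 7 x 7 correlation-structure matrix

# Masked Schur complement: the det-1 random matrices hide G - E^T F^{-1} E.
R1, R2 = random_det1(3), random_det1(3)
J = R1 @ (G - E.T @ np.linalg.inv(F) @ E) @ R2

# Block-determinant identity: |H| = |F| * |G - E^T F^{-1} E| = |F| * |J|.
det_H = np.linalg.det(F) * np.linalg.det(J)
```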
Step S3022, deleting the data of the i-th row and i-th column of the feature correlation matrix to obtain each residue (i.e., principal minor) corresponding to the feature correlation matrix.
Here i = 1, 2, …, m₁, where m₁ is the number of features corresponding to the first sample feature data. As described above, the first sample feature data A has 4 features, so m₁ = 4. Since the first participant can only determine the collinearity quantization factors of the features corresponding to its own first sample feature data, only the 1st, 2nd, 3rd and 4th residues corresponding to the feature correlation matrix need to be obtained here.
Deleting the i-th row and i-th column of the 7 × 7 feature correlation matrix H and, without changing the original order, forming the remaining data into a matrix of dimension 6 × 6 yields the i-th residue H_ii corresponding to the i-th feature of the first sample feature data. For example, when i = 2, the obtained matrix is the 2nd residue H_22 corresponding to feature BB.
In step S3023, the determinant of each residue is determined.
In the embodiment of the present application, the determinant of each residue may be determined in the same manner as the determinant of the feature correlation matrix in step S3021. For example, determining the determinant of the residue corresponding to the i-th feature may be implemented as: generating a random matrix whose determinant is the preset value and whose dimension is the same as that of the empty matrix; inputting the i-th residue and the random matrix into the security computation model to obtain a matrix corresponding to the i-th residue; calculating the determinants of the first symmetric matrix and of the matrix corresponding to the i-th residue respectively; and multiplying these two determinants to obtain the determinant |H_ii| of the i-th residue.
Step S3024, determining a co-linear quantization factor of each feature corresponding to the first sample feature data based on the determinant of the feature correlation matrix and the determinants of the respective residue.
In the embodiment of the present application, taking the variance expansion factor as the collinearity quantization factor as an example, the calculation formula of the variance expansion factor VIF_i of the i-th feature can be determined from the determinant |H| of the feature correlation matrix and the determinant |H_ii| of the i-th residue, as shown in formula (1):

VIF_i = |H_ii| / |H|    (1)

When i = 1 the resulting VIF₁ is the collinearity quantization factor of feature AA; i = 2 gives VIF₂ for feature BB; i = 3 gives VIF₃ for feature CC; and i = 4 gives VIF₄ for feature DD.
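Formula (1) can be cross-checked numerically: for a correlation matrix H, |H_ii|/|H| equals the i-th diagonal entry of H⁻¹, the usual closed form of the variance expansion factor. A plaintext sketch with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in joint standardized data: 30 samples, 7 features.
Z = rng.normal(size=(30, 7))
Z = (Z - Z.mean(axis=0)) / (np.sqrt(Z.shape[0]) * Z.std(axis=0))
H = Z.T @ Z                            # 7 x 7 feature correlation matrix

def vif(H, i):
    """VIF_i = |H_ii| / |H|, where H_ii deletes row i and column i of H."""
    Hii = np.delete(np.delete(H, i, axis=0), i, axis=1)
    return np.linalg.det(Hii) / np.linalg.det(H)

vifs = np.array([vif(H, i) for i in range(H.shape[0])])
```

The same values appear on the diagonal of H⁻¹, and each factor is at least 1, its value for a feature uncorrelated with all others.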
According to the method provided by the embodiment of the application, the first participant calculates, based on the secure multi-party computing protocol, the determinant of the feature correlation matrix and the determinant of the residue of each feature corresponding to the first sample feature data, thereby obtaining the variance expansion factor of each feature corresponding to the first sample feature data and providing a basis for eliminating training data with collinearity and obtaining an accurate and stable federated model.
In some embodiments, the step S303 "of determining the target feature from the features corresponding to the first sample feature data based on the co-linear quantization factor" in the embodiment shown in fig. 3 may be implemented by:
step S3031, determining whether the collinearity quantization factor of each feature corresponding to the first sample feature data is greater than a preset boundary value.
Here, the collinearity quantization factor of each feature corresponding to the first sample feature data is compared with the preset boundary value, i.e. the magnitudes of VIF₁, VIF₂, VIF₃ and VIF₄ are compared with the preset boundary value.
Step S3032, determining the feature of the collinearity quantization factor greater than the preset boundary value as the target feature.
Taking a preset boundary value of 10 as an example, it is determined that VIF₁ and VIF₂ are greater than 10, so the feature AA corresponding to VIF₁ and the feature BB corresponding to VIF₂ are determined as target features.
In combination with the above example, the target feature AA of the first participant is birthday, the target feature BB of the first participant is age, and the feature BB of the second participant is age. There is a linear relationship between birthday and age (age equals the current year minus the birth year), and between age and age (they are equal, i.e. the multiple is 1). That is, the feature data of the target feature AA of the first participant has a linear relationship with the feature data of the feature BB of the first participant and with the feature data of the feature BB of the second participant, and the feature data of the target feature BB of the first participant has a linear relationship with the feature data of the feature AA of the first participant and with the feature data of the feature BB of the second participant.
After the target features are obtained, step S304 is executed to delete the feature data of the target features from the first sample feature data, i.e. to delete the 1st column (feature data of feature AA) and the 2nd column (feature data of feature BB) from the first sample feature data A, obtaining the first training data

A' = [ c1  d1
       c3  d3
       c4  d4 ]
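Steps S3031 to S304 can be sketched in plaintext: inject an artificial near-linear dependency (mimicking the birthday/age relationship), compute the variance expansion factors, and drop the columns exceeding the boundary value of 10. The data and seed here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(9)

# 20 samples, 4 features; feature 0 is (nearly) a linear function of feature 1,
# mimicking the birthday/age relationship in the example.
X = rng.normal(size=(20, 4))
X[:, 0] = 2.0 * X[:, 1] + 0.01 * rng.normal(size=20)

Z = (X - X.mean(axis=0)) / (np.sqrt(len(X)) * X.std(axis=0))
H = Z.T @ Z
vifs = np.diag(np.linalg.inv(H))       # variance expansion factors

threshold = 10.0                       # preset boundary value
target = vifs > threshold              # target features to delete
X_train = X[:, ~target]                # training data with collinear columns removed
```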
In the same manner as the above step of determining the target features among the features corresponding to the first sample feature data held by the first participant, the target feature among the features corresponding to the second sample feature data held by the second participant is determined to be the feature BB of the second participant, and the feature data of feature BB is deleted from the second sample feature data, i.e. the 1st column (feature data of feature BB) is deleted from the second sample feature data B, obtaining the second training data

B' = [ e1  f1
       e3  f3
       e4  f4 ]
Then, the first participant and the second participant perform joint training according to the first training data A' and the second training data B'. Since the feature data with collinear relationships has been deleted from A' and B', the joint training is performed with training data that has no linear relationships, so the obtained federated model has high accuracy and good stability, and the modeling effect of the federated model can be improved.
Based on the foregoing embodiments, a data processing method is further provided in an embodiment of the present application, and fig. 4 is a schematic flow chart of a further implementation of the data processing method provided in the embodiment of the present application, which is applied to the network architecture shown in fig. 1, and as shown in fig. 4, the data processing method includes the following steps:
in step S401, the first party and the second party determine a common user held by the first party and the second party based on a secure multiparty computing protocol.
Because the first participant and the second participant perform joint training, the training data need to come from the same users, i.e. the first sample feature data used by the first participant for joint training and the second sample feature data used by the second participant for joint training must have the same IDs. When the first sample feature data is obtained, the common users held by all participants first need to be determined; then the target users participating in this training are screened from the common users, and the feature data of the target users is used as the first sample feature data.
On the premise of privacy protection, since each participant cannot know which users' data is stored by the other participants, the first participant and the second participant determine the common users based on a privacy protection technique in the embodiment of the application. Determining the common users based on the privacy protection technique may be implemented as follows: the first participant and the second participant each obtain the identifications of their own users; then, based on the privacy protection technique, the intersection of the user identifications held by the first participant and those held by the second participant is computed, and the result is the set of common users.
Step S402, the first participant determines the target users participating in the training.
After obtaining the common users, the first participant determines the target users participating in the training, and for example, randomly selects a random number of users from the common users as the target users participating in the training.
In step S403, the first party sends the identifier of the target user to the second party.
In other embodiments, the second participant may also determine the target user participating in the training, and in this case, step S402 and step S403 may be replaced with:
step S402', the second participant determines the target users participating in the training.
Step S403', the second party sends the identification of the target user to the first party.
In step S404, the first participant acquires first sample feature data owned by the first participant.
The first participant acquires the characteristic data of the target user from the data stored by the first participant as first sample characteristic data.
Step S405, the first participant constructs a virtual first feature correlation matrix based on the first sample feature data and a pre-trained safety calculation model.
In some embodiments, the first participant constructing the first feature correlation matrix may be implemented as: determining feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data; respectively calculating the mean value and the standard deviation corresponding to the characteristic data of each characteristic; determining processed first sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature and the number of samples; inputting the processed first sample characteristic data into a safety calculation model to obtain a first matrix; determining a first symmetric matrix according to the processed first sample characteristic data; generating a first empty matrix having a number of rows and columns equal to the number of columns of the first matrix; and constructing a virtual first characteristic correlation matrix according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix and the first empty matrix.
For example, the first sample feature data is A. Each feature datum x in A is processed according to

x' = (x − x̄) / (√n · σ_x)

to obtain the processed first sample feature data C. The processed first sample feature data C is input into the pre-trained security computation model y_{1,…,m} = f(x_{1,…,m}) to obtain the first matrix E. From the transposed matrix Cᵀ of the processed first sample feature data and the processed first sample feature data C, the first symmetric matrix F = Cᵀ C is determined. The first empty matrix generated by the first participant is G'. The first feature correlation matrix constructed by the first participant is

H = [ F    E
      Eᵀ   G' ]
In step S406, the first participant determines a co-linear quantization factor of each feature corresponding to the first sample feature data based on the first feature correlation matrix.
In some embodiments, the first participant determining the collinearity quantization factor of each feature corresponding to the first sample feature data may be implemented as: determining the determinant of the first feature correlation matrix; deleting the i-th row and i-th column of the first feature correlation matrix to obtain each residue corresponding to the first feature correlation matrix, where i = 1, 2, …, m₁ and m₁ is the number of features corresponding to the first sample feature data; determining the determinant of each residue; and determining the collinearity quantization factor of each feature corresponding to the first sample feature data based on the determinant of the first feature correlation matrix and the determinants of the residues.

Deleting the data of the i-th row and i-th column of the first feature correlation matrix H yields the i-th residue H_ii corresponding to the i-th feature of the first sample feature data. The determinant |H| of the first feature correlation matrix and the determinant |H_ii| of the i-th residue are calculated, and the collinearity quantization factor (taking the variance expansion factor as an example) of each feature corresponding to the first sample feature data is determined as

VIF_i = |H_ii| / |H|
Determining the determinant |H| of the first feature correlation matrix may be implemented as: generating a first random matrix whose determinant is a preset value and whose dimension is the same as that of the first empty matrix; inputting the first feature correlation matrix and the first random matrix into the security computation model to obtain a second matrix; calculating the determinants of the first symmetric matrix and of the second matrix respectively; and multiplying the determinant of the first symmetric matrix by the determinant of the second matrix to obtain the determinant of the first feature correlation matrix.
Similarly, determining the determinant |Hii| of the ith minor corresponding to the ith feature may be implemented as: generating a random matrix whose determinant is a preset value and whose dimension is the same as that of the first empty matrix; inputting the ith minor and the random matrix into the safety calculation model to obtain a matrix corresponding to the ith minor; respectively calculating the determinants of the first symmetric matrix and of the matrix corresponding to the ith minor; and multiplying the determinant of the first symmetric matrix by the determinant of the matrix corresponding to the ith minor to obtain the determinant |Hii| of the ith minor.
In step S407, the first participant determines a target feature from the features corresponding to the first sample feature data based on the collinear quantization factor.
In some embodiments, the determining, by the first participant, of the target feature from the co-linear quantization factors may be implemented as: judging whether the co-linear quantization factor of each feature corresponding to the first sample feature data is larger than a preset boundary value; and determining a feature whose co-linear quantization factor is larger than the preset boundary value as the target feature.
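As a minimal sketch of the screening step (the boundary value 10 is a common rule of thumb for the VIF and is an assumption here, not a value fixed by the application):

```python
import numpy as np

vifs = np.array([1.2, 35.0, 2.1, 14.8])   # co-linear quantization factors per feature
boundary = 10.0                            # preset boundary value (illustrative)

# Features whose factor exceeds the boundary are the target features to delete.
target_features = np.flatnonzero(vifs > boundary)
print(target_features)  # [1 3]
```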
Step S408, the first participant deletes the feature data of the target feature from the first sample feature data to obtain first training data for the first participant to perform joint training with other participants.
The feature data of the target feature in the first participant has a linear relationship with the feature data of at least one of the other features, where the other features include the features held by the first participant other than the target feature and all the features held by the second participant.
In step S409, the second participant obtains second sample feature data owned by the second participant.
And the second participant acquires the characteristic data of the target user from the data stored by the second participant as second sample characteristic data.
And step S410, the second participant constructs a virtual second feature correlation matrix based on the second sample feature data and a pre-trained safety calculation model.
In some embodiments, the second participant constructing the second feature correlation matrix may be implemented as: determining feature data of each feature corresponding to the second sample feature data and the number of samples corresponding to the second sample feature data based on the second sample feature data; respectively calculating the mean value and the standard deviation corresponding to the feature data of each feature; determining processed second sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature, and the number of samples; inputting the processed second sample feature data into the safety calculation model to obtain a first matrix; determining a second symmetric matrix according to the processed second sample feature data; generating a second empty matrix having a number of rows and columns equal to the number of columns of the first matrix; and constructing a virtual second feature correlation matrix according to the second symmetric matrix, the first matrix, the transposed matrix of the first matrix, and the second empty matrix.
For example, the second sample feature data is B. Each column of feature data x in B is processed according to

x' = (x − x̄) / (√n · σ_x)

to obtain processed second sample feature data D. The processed second sample feature data D is input into the pre-trained safety calculation model y1,…,m = f(x1,…,m) to obtain a first matrix E. A second symmetric matrix G = D^T D is determined from the transpose D^T of the processed second sample feature data and the processed second sample feature data D. The second empty matrix generated by the second participant is F'. The second feature correlation matrix constructed by the second participant is then obtained as

H' = [[F', E], [E^T, G]]
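A plaintext sketch of this block layout (the dimensions, the data matrices, and the zero block standing in for the empty matrix F' are all illustrative assumptions; in the protocol the first matrix E comes from the safety calculation model rather than from the other party's data directly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m1, m2 = 100, 3, 2
C = rng.normal(size=(n, m1))    # first party's processed data (not visible to Bob)
D = rng.normal(size=(n, m2))    # second party's processed data

E = C.T @ D                     # first matrix (obtained via secure computation)
G = D.T @ D                     # second symmetric matrix, computed locally
F_blank = np.zeros((m1, m1))    # empty matrix F' standing in for the unknown block

# Virtual second feature correlation matrix: known blocks plus a placeholder.
H_prime = np.block([[F_blank, E], [E.T, G]])
print(H_prime.shape)  # (5, 5)
```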
In step S411, the second participant determines a co-linear quantization factor of each feature corresponding to the second sample feature data based on the feature correlation matrix.
In some embodiments, the determining, by the second participant, of the co-linear quantization factor of each feature corresponding to the second sample feature data may be implemented as: determining a determinant of the second feature correlation matrix; deleting the (i+m1)th row and (i+m1)th column of the second feature correlation matrix (because the first m1 rows and columns of the second feature correlation matrix H' correspond to the features of the first sample feature data, m1 needs to be added to the index when determining the co-linear quantization factor of each feature corresponding to the second sample feature data) to obtain each minor corresponding to the second feature correlation matrix, where i = 1, 2, …, m2, and m2 is the number of features corresponding to the second sample feature data; determining a determinant of each minor; and determining a co-linear quantization factor of each feature corresponding to the second sample feature data based on the determinant of the second feature correlation matrix and the determinants of the respective minors.
Deleting the (i+m1)th row and (i+m1)th column of the second feature correlation matrix H' yields the ith minor H'_(i+m1,i+m1) corresponding to the ith feature of the second sample feature data. The determinant |H'| of the second feature correlation matrix and the determinant |H'_(i+m1,i+m1)| of the ith minor are then calculated, and the co-linear quantization factor of each feature corresponding to the second sample feature data (taking the variance inflation factor as an example) is determined as

VIF_(i+m1) = |H'_(i+m1,i+m1)| / |H'|
Determining the determinant |H'| of the second feature correlation matrix may be implemented as: generating a second random matrix whose determinant is a preset value and whose dimension is the same as that of the second empty matrix; inputting the second feature correlation matrix and the second random matrix into the safety calculation model to obtain a second matrix; calculating the determinants of the second symmetric matrix and the second matrix respectively; and multiplying the determinant of the second symmetric matrix by the determinant of the second matrix to obtain the determinant of the second feature correlation matrix.
Similarly, determining the determinant |H'_(i+m1,i+m1)| of the ith minor corresponding to the ith feature may be implemented as: generating a random matrix whose determinant is a preset value and whose dimension is the same as that of the second empty matrix; inputting the ith minor and the random matrix into the safety calculation model to obtain a matrix corresponding to the ith minor; respectively calculating the determinants of the second symmetric matrix and of the matrix corresponding to the ith minor; and multiplying the determinant of the second symmetric matrix by the determinant of the matrix corresponding to the ith minor to obtain the determinant |H'_(i+m1,i+m1)| of the ith minor.
In step S412, the second participant determines a target feature from the features corresponding to the second sample feature data based on the collinearity quantization factor.
In some embodiments, the determining, by the second participant, of the target feature from the co-linear quantization factors may be implemented as: judging whether the co-linear quantization factor of each feature corresponding to the second sample feature data is larger than a preset boundary value; and determining a feature whose co-linear quantization factor is larger than the preset boundary value as the target feature.
In step S413, the second participant deletes the feature data of the target feature from the second sample feature data to obtain second training data for performing joint training between the second participant and other participants.
The feature data of the target feature in the second participant has a linear relationship with the feature data of at least one of the other features, where the other features include the features held by the second participant other than the target feature and all the features held by the first participant.
In other embodiments, after step S413, the first participant and the second participant jointly train a model by using the first training data and the second training data based on the secure multiparty computation protocol. Since neither the first training data nor the second training data contains feature data with a linear relationship, the accuracy and stability of the federated model obtained by training can be improved, thereby improving the modeling effect of the federated model.
According to the data processing method provided by the embodiment of the application, a first participant and a second participant are trained in advance based on safe multiparty computation to obtain a safe computation model; then a first participant and a second participant respectively obtain first sample characteristic data and second sample characteristic data, the first participant constructs a virtual first characteristic correlation matrix based on the first sample characteristic data and a safety calculation model, and the second participant constructs a virtual second characteristic correlation matrix based on the second sample characteristic data and the safety calculation model; then the first participant determines the co-linear quantization factor of each feature corresponding to the first sample feature data based on the first feature correlation matrix, and the second participant determines the co-linear quantization factor of each feature corresponding to the second sample feature data based on the second feature correlation matrix; the first participant determines a target feature in the features corresponding to the first sample feature data according to the co-linear quantization factor of the features corresponding to the first sample feature data, the second participant determines a target feature in the features corresponding to the second sample feature data according to the co-linear quantization factor of the features corresponding to the second sample feature data, the feature data of the target features has a linear relation with the feature data of at least one feature in other features, and the other features comprise all features except the target feature held by the first participant and the second participant; after the target features are determined, the first participant deletes the feature data of the target features from the first sample feature data to obtain first training data; and the second participant deletes the feature 
data of the target feature from the second sample feature data to obtain second training data. Therefore, on the premise of protecting data privacy, co-linear data in feature data held by the first participant and the second participant can be screened and eliminated to obtain training data without linear relationship, so that when joint training is carried out, the first participant and the second participant carry out joint training by using the training data without linear relationship, the accuracy and stability of the federated model can be improved, and the modeling effect of the federated model is improved.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Longitudinal (vertical) federated learning typically involves joint training of machine learning models by different participants. In the modeling process of a linear model, the existence of collinearity can significantly affect the stability and effect of the linear model, so collinearity needs to be eliminated in the modeling process. The Variance Inflation Factor (VIF) quantifies well the collinearity between a certain feature and all other features and is very common in practical modeling. In the related art, calculating the VIF generally requires gathering the data of each party in one place. But the data of the parties (e.g., banks and enterprises) may involve personal privacy or business secrets, and directly opening it to other parties may cause information leakage. The feature screening schemes for linear federated modeling in the related art mainly target single-column data, and a method such as the VIF for describing the collinearity of multi-column data is lacking. The related art cannot combine data of multiple parties to calculate the VIF while protecting data privacy.
The embodiment of the application targets a scenario with two independent participants; for example, it can be assumed that joint data modeling is carried out between two companies. Due to compliance, privacy, business confidentiality, and similar requirements, the original data of each company cannot leave that company, and the intermediate data exchanged in the modeling process must not allow unnecessary raw data information to be deduced or revealed. The two parties respectively hold different feature data of a number of users with the same IDs; collinearity may exist among these features and may affect the modeling effect of a subsequent linear model. The embodiment of the application calculates the variance inflation factor VIF using a two-party privacy-preserving technical scheme and performs feature screening based on the VIF.
Fig. 5 is a schematic flow chart of the calculation process of the variance inflation factor in the longitudinal federated setting provided in the embodiment of the present application, and Fig. 6 is a schematic flow chart of the calculation process of the determinant of the correlation matrix provided in the embodiment of the present application. The method for calculating the variance inflation factor in the longitudinal federated setting provided in the embodiment of the present application is described in detail below with reference to Fig. 5 and Fig. 6.
1) The embodiment of the application provides two independent participants (denoted Alice and Bob, respectively), which respectively hold data with different features of the same IDs, denoted as matrices A and B (where A and B have the same number of rows n, A has m1 columns, and B has m2 columns).
2) Alice and Bob locally normalize the features (the columns of the matrices) of A and B respectively, obtaining matrices C and D. The normalization is:

x' = (x − x̄) / (√n · σ_x)

where x represents a column of features, x̄ represents the mean of the feature, and σ_x represents the standard deviation of the feature.
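Assuming the scaling denominator includes √n (so that C^T C directly yields the Pearson correlation matrix, as the next steps require), the normalization can be checked locally:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))

n = A.shape[0]
# Column-wise normalization: subtract the mean, divide by sqrt(n) * std.
C = (A - A.mean(axis=0)) / (np.sqrt(n) * A.std(axis=0))

# With this scaling, C^T C reproduces the Pearson correlation of A's columns.
print(np.allclose(C.T @ C, np.corrcoef(A, rowvar=False)))  # True
```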
3) Using the extended SPDZ protocol, matrix addition/subtraction and matrix multiplication are carried out under privacy, and the matrix product C^T D = E is calculated; the result of the calculation is held by both Alice and Bob.
In the embodiment of the application, matrix addition/subtraction and matrix multiplication under privacy protection are realized through the secure multi-party computation protocol SPDZ. Each participant inputs its own private data and obtains the computation result without being able to obtain the private data of the other participants, so no additional information other than the result is leaked in the whole computation process, achieving high security and practical computational efficiency.
4) Alice locally computes the matrix product C^T C = F, and Bob locally computes the matrix product D^T D = G.
5) From the previous steps, the Pearson correlation matrix is obtained:

H = [[F, E], [E^T, G]]

Note 1: here Alice holds F, E, E^T, and Bob holds E, E^T, G.
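The block assembly can be verified in plaintext (here E is computed directly from C and D for checking; in the protocol it is produced under SPDZ, and the data matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
A = rng.normal(size=(n, 3))     # Alice's raw features
B = rng.normal(size=(n, 2))     # Bob's raw features

def normalize(X):
    return (X - X.mean(axis=0)) / (np.sqrt(len(X)) * X.std(axis=0))

C, D = normalize(A), normalize(B)
F, G = C.T @ C, D.T @ D         # computed locally by Alice and Bob
E = C.T @ D                     # computed under SPDZ in the protocol

# The assembled block matrix is the Pearson correlation of the joined data.
H = np.block([[F, E], [E.T, G]])
print(np.allclose(H, np.corrcoef(np.hstack([A, B]), rowvar=False)))  # True
```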
According to the definition of the VIF, for the ith feature:

VIF_i = |H_ii| / |H|

where H_ii is the minor of the matrix H (i.e., the matrix left after deleting the ith row and ith column of H), and |H| represents the determinant of the matrix H.
6) Suppose the ith feature is on Alice's side (on Bob's side, the index i needs to be increased by m1, the number of columns of F). Then Alice can delete the ith row and ith column of the matrix F to obtain the matrix F_ii. Alice and Bob can delete the ith row of the matrix E to obtain E_(i), and correspondingly delete the ith column of the matrix E^T to obtain E_(i)^T. The minor is then

H_ii = [[F_ii, E_(i)], [E_(i)^T, G]]

Note 2: here Alice holds F_ii, E_(i), E_(i)^T, and Bob holds E_(i), E_(i)^T, G.
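That the block form equals the directly deleted minor can be checked in plaintext (the data and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m1, m2 = 60, 3, 2
C = rng.normal(size=(n, m1))
D = rng.normal(size=(n, m2))
F, E, G = C.T @ C, C.T @ D, D.T @ D
H = np.block([[F, E], [E.T, G]])

i = 1                                        # a feature index on Alice's side (i < m1)
F_ii = np.delete(np.delete(F, i, 0), i, 1)   # F with row i and column i removed
E_i = np.delete(E, i, 0)                     # E with row i removed

# Assembling the surviving blocks reproduces the minor of H.
H_ii_blocks = np.block([[F_ii, E_i], [E_i.T, G]])
H_ii_direct = np.delete(np.delete(H, i, 0), i, 1)
print(np.allclose(H_ii_blocks, H_ii_direct))  # True
```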
In the embodiment of the application, the determinant computation problem of the original matrix H, which cannot be processed directly, is converted through matrix transformation into the determinant computation problem of the block matrices F_ii, E_(i), E_(i)^T, and G, which can be processed. The following solves this determinant computation problem based on these block matrices.
7) From Note 1 and Note 2, calculating |H_ii| and |H| amounts to solving the same problem. When calculating |H|, the following M1, M2, M3, M4 correspond to F, E, E^T, G respectively; when calculating |H_ii|, they correspond to F_ii, E_(i), E_(i)^T, G respectively. Alice holds the matrices M1, M2, M3, and Bob holds the matrices M2, M3, M4. The determinant to be computed is

|M| = |[[M1, M2], [M3, M4]]|
8) The determinant in 7) cannot be computed directly, since Alice cannot obtain M4 and Bob cannot obtain M1. When M1 is invertible,

|M| = |M1| · |M4 − M3 · M1^(-1) · M2|
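This is the standard block-determinant (Schur complement) identity; a quick numerical check with random (almost surely invertible) blocks:

```python
import numpy as np

rng = np.random.default_rng(5)
M1 = rng.normal(size=(3, 3))   # Alice's block (almost surely invertible)
M2 = rng.normal(size=(3, 2))
M3 = rng.normal(size=(2, 3))
M4 = rng.normal(size=(2, 2))

M = np.block([[M1, M2], [M3, M4]])
schur = M4 - M3 @ np.linalg.inv(M1) @ M2   # Schur complement of M1 in M

print(np.allclose(np.linalg.det(M),
                  np.linalg.det(M1) * np.linalg.det(schur)))  # True
```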
9) Alice locally computes |M1| and M3 · M1^(-1) · M2, and locally generates a random matrix R1 whose determinant is 1 and whose dimension is consistent with M4. Bob locally generates a random matrix R2 whose determinant is 1 and whose dimension is consistent with M4.
10) Using the extended SPDZ protocol again, matrix subtraction and multiplication are computed and the result is recovered as the matrix

J = R1 · (M4 − M3 · M1^(-1) · M2) · R2

Due to the random matrices, neither Alice nor Bob can recover the other party's original matrix from J.
In the embodiment of the application, the original matrix is confused by constructing a special random matrix, and the determinant problem of the matrix related to the original data information is converted into the determinant calculation problem of the random matrix with the same determinant. The original matrix is subjected to confusion processing through the random matrix, each participant can not recover the data of other participants, privacy protection can be performed on the data of each participant, and data leakage is avoided.
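One way to generate a random matrix with determinant exactly 1 (an illustrative construction, not necessarily the one used in the application) is to multiply unit-triangular factors; masking with such matrices hides the Schur complement entrywise while preserving its determinant:

```python
import numpy as np

def unit_det_random(k, rng):
    """Random matrix with determinant 1: product of unit-triangular factors."""
    L = np.tril(rng.normal(size=(k, k)), -1) + np.eye(k)   # unit lower-triangular
    U = np.triu(rng.normal(size=(k, k)), 1) + np.eye(k)    # unit upper-triangular
    return L @ U

rng = np.random.default_rng(6)
S = rng.normal(size=(4, 4))     # stands in for M4 - M3 M1^(-1) M2
R1 = unit_det_random(4, rng)    # Alice's mask
R2 = unit_det_random(4, rng)    # Bob's mask

J = R1 @ S @ R2                 # the value recovered from the SPDZ computation
# Since |R1| = |R2| = 1, the revealed J has the same determinant as S.
print(np.isclose(np.linalg.det(J), np.linalg.det(S)))  # True
```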
11) Alice locally computes |J|. Since |R1| = |R2| = 1, |J| = |M4 − M3 · M1^(-1) · M2|, thereby obtaining the determinant

|M| = |M1| · |J|

which solves the problem of 7).
12) According to the methods described in 7) to 11), |H_11|, |H_22|, …, |H_(m1+m2,m1+m2)| and |H| are calculated, giving VIF_1, VIF_2, …, VIF_(m1+m2).
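Putting the steps together, a plaintext reference run (no privacy protection; all data is synthetic, and Bob's second feature is constructed to be nearly collinear with Alice's first, so its VIF should be large):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
A = rng.normal(size=(n, 3))                        # Alice's features
b1 = rng.normal(size=(n, 1))
b2 = A[:, :1] + 0.05 * rng.normal(size=(n, 1))     # nearly collinear with A[:, 0]
B = np.hstack([b1, b2])                            # Bob's features

H = np.corrcoef(np.hstack([A, B]), rowvar=False)   # the Pearson correlation matrix
det_H = np.linalg.det(H)
vifs = np.array([np.linalg.det(np.delete(np.delete(H, i, 0), i, 1)) / det_H
                 for i in range(H.shape[0])])

print(np.allclose(vifs, np.diag(np.linalg.inv(H))))  # True
print(vifs[0] > 10 and vifs[4] > 10)                 # the collinear pair stands out
```

Both features of the collinear pair exceed a boundary value of 10 and would be screened out as target features.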
According to the method provided by the embodiment of the application, the originally intractable determinant computation problem is converted, through matrix transformation, into a tractable block-matrix determinant problem; the original matrix is obfuscated by constructing special random matrices, converting the determinant problem of a matrix involving original data information into the determinant computation of a random matrix with the same determinant; and through coordination with the SPDZ protocol, no additional information other than the results is leaked in the whole computation process, achieving high security and practical computational efficiency. The method provided by the embodiment of the application makes secure computation of the variance inflation factor VIF possible, so that efficient feature screening can be performed and the overall effect of a subsequent linear model is improved. In addition, high security and practicality are balanced: no unnecessary data information is leaked apart from the results, and the overall computational cost is kept within a range practical for production.
Continuing with the exemplary structure of the data processing apparatus implemented as software modules provided in the embodiments of the present application, in some embodiments, as shown in fig. 2, the data processing apparatus 70 stored in the memory 140 is applied to a second participant performing joint training on a model, and the software modules in the data processing apparatus 70 may include:
a building module 71, configured to build a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained security computation model, where the security computation model is obtained by the first participant and other participants of federal learning through pre-training based on secure multiparty computation;
a first determining module 72, configured to determine a co-linear quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
a second determining module 73, configured to determine, based on the collinear quantization factor, a target feature from features corresponding to the first sample feature data;
a deleting module 74, configured to delete the feature data of the target feature from the first sample feature data, so as to obtain first training data for performing joint training between the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including features held by the first party other than the target feature and features held by the other parties.
In some embodiments, the building module 71 further includes:
the first determining submodule is used for determining the feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data;
the calculation submodule is used for respectively calculating the mean value and the standard deviation corresponding to the characteristic data of each characteristic;
a second determining submodule, configured to determine processed first sample feature data based on the feature data of each feature, a mean value corresponding to the feature data of each feature, a standard deviation corresponding to the feature data of each feature, and the number of samples;
the input submodule is used for inputting the processed first sample characteristic data into the safety calculation model to obtain a first matrix;
and the constructing submodule is used for constructing a virtual characteristic correlation matrix according to the processed first sample characteristic data and the first matrix.
In some embodiments, the building module further comprises:
the determining unit is used for determining a first symmetric matrix according to the processed first sample characteristic data;
the first generating unit is used for generating a null matrix with the number of rows and the number of columns equal to the number of columns of the first matrix;
and the constructing unit is used for constructing a virtual characteristic correlation matrix according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix and the empty matrix.
In some embodiments, the first determining module 72 further includes:
a third determining submodule for determining a determinant of the feature correlation matrix;
a deleting submodule, configured to delete the ith row and ith column of the feature correlation matrix to obtain each minor corresponding to the feature correlation matrix, where i = 1, 2, …, m1, and m1 is the number of features corresponding to the first sample feature data;
a fourth determining submodule, configured to determine the determinant of each minor;
and a fifth determining submodule, configured to determine a co-linear quantization factor of each feature corresponding to the first sample feature data based on the determinant of the feature correlation matrix and the determinants of the respective minors.
In some embodiments, the third determining sub-module further comprises:
the second generating unit is used for generating a first random matrix with a determinant as a preset value and the dimension being the same as that of the empty matrix;
the input unit is used for inputting the characteristic correlation matrix and the first random matrix into the safety calculation model to obtain a second matrix;
a first calculation unit configured to calculate determinants of the first symmetric matrix and the second matrix, respectively;
and the second calculation unit is used for multiplying the determinant of the first symmetric matrix and the determinant of the second matrix to obtain the determinant of the characteristic correlation matrix.
In some embodiments, the second determining module 73 further includes:
the judgment submodule is used for judging whether the colinearity quantization factor of each feature corresponding to the first sample feature data is larger than a preset boundary value;
and the sixth determining submodule is used for determining the characteristic of the colinearity quantization factor larger than the preset boundary value as the target characteristic.
Here, it should be noted that: the above description of the data processing apparatus embodiment is similar to the above description of the method, with the same advantageous effects as the method embodiment. For technical details not disclosed in the embodiments of the data processing device of the present application, a person skilled in the art should understand with reference to the description of the embodiments of the method of the present application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method described in the embodiment of the present application.
Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform the methods provided by embodiments of the present application, for example, the methods as illustrated in fig. 3 to 6.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; or may be various devices including one of, or any combination of, the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A data processing method for use by a first party to federal learning, the method comprising:
constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained safety calculation model, wherein the safety calculation model is obtained by pre-training the first participant and other participants of federal learning based on safety multi-party calculation;
determining a co-linear quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
determining a target feature from the features corresponding to the first sample feature data based on the co-linear quantization factor;
deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training of the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including features held by the first party other than the target feature and features held by the other parties.
2. The method of claim 1, wherein constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained security computation model comprises:
determining feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data;
respectively calculating the mean value and the standard deviation corresponding to the characteristic data of each characteristic;
determining processed first sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature and the number of samples;
inputting the processed first sample characteristic data into the safety calculation model to obtain a first matrix;
and constructing a virtual characteristic correlation matrix according to the processed first sample characteristic data and the first matrix.
3. The method of claim 2, wherein constructing a virtual feature correlation matrix from the processed first sample feature data and the first matrix comprises:
determining a first symmetric matrix according to the processed first sample characteristic data;
generating a null matrix having a number of rows and columns equal to the number of columns of the first matrix;
and constructing a virtual characteristic correlation matrix according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix and the null matrix.
4. The method according to any one of claims 1 to 3, wherein determining the collinearity quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix comprises:
determining the determinant of the feature correlation matrix;
deleting the i-th row and the i-th column of the feature correlation matrix to obtain the minor matrices of the feature correlation matrix, where i = 1, 2, …, m1, and m1 is the number of features corresponding to the first sample feature data;
determining the determinant of each minor matrix;
and determining the collinearity quantization factor of each feature corresponding to the first sample feature data based on the determinant of the feature correlation matrix and the determinants of the minor matrices.
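For a correlation matrix, the ratio det(M_i) / det(R) — where M_i deletes row i and column i — equals the i-th diagonal entry of the inverse, i.e. the classical variance inflation factor (VIF). A sketch of that construction; whether the patent combines the two determinants in exactly this ratio is an assumption:

```python
import numpy as np

def collinearity_factors(R):
    """Compute a per-feature collinearity quantization factor as
    det(M_i) / det(R), where M_i is the matrix obtained by deleting the
    i-th row and i-th column of R. For a correlation matrix this equals
    diag(inv(R)), the variance inflation factor of feature i."""
    d = np.linalg.det(R)
    factors = []
    for i in range(R.shape[0]):
        Mi = np.delete(np.delete(R, i, axis=0), i, axis=1)  # principal minor matrix
        factors.append(np.linalg.det(Mi) / d)
    return np.array(factors)

# Example on a small correlation matrix.
R3 = np.array([[1.0, 0.5, 0.2],
               [0.5, 1.0, 0.1],
               [0.2, 0.1, 1.0]])
factors = collinearity_factors(R3)       # agrees with np.diag(np.linalg.inv(R3))
```

A factor far above 1 means the corresponding feature is nearly a linear combination of the others, which is exactly the redundancy the method screens out.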
5. The method of claim 4, wherein determining the determinant of the feature correlation matrix comprises:
generating a first random matrix whose determinant equals a preset value and whose dimensions are the same as those of the null matrix;
inputting the feature correlation matrix and the first random matrix into the secure computation model to obtain a second matrix;
calculating the determinants of the first symmetric matrix and of the second matrix respectively;
and multiplying the determinant of the first symmetric matrix by the determinant of the second matrix to obtain the determinant of the feature correlation matrix.
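Claim 5 turns on the multiplicativity of the determinant, det(R·Q) = det(R)·det(Q): a party can mask a matrix with a random matrix of known (preset) determinant, reveal only the product, and still let the determinant of the original be recovered. A minimal sketch of that masking identity (the protocol's actual message flow through the secure computation model is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_matrix_with_unit_det(k, rng):
    """Generate a random k x k matrix whose determinant equals a preset
    value (here 1): scaling one row of an invertible random matrix by
    1/det rescales the determinant to exactly 1."""
    Q = rng.standard_normal((k, k))
    Q[0, :] /= np.linalg.det(Q)          # now det(Q) == 1
    return Q

# Masking preserves the determinant: det(R @ Q) == det(R) * det(Q) == det(R).
R = rng.standard_normal((3, 3))
Q = random_matrix_with_unit_det(3, rng)
masked = R @ Q
```

With a preset determinant other than 1, the recipient would divide the revealed determinant by the preset value instead.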
6. The method of claim 1, wherein determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantization factor comprises:
determining whether the collinearity quantization factor of each feature corresponding to the first sample feature data is greater than a preset boundary value;
and determining a feature whose collinearity quantization factor is greater than the preset boundary value as the target feature.
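The selection step of claim 6 is a simple threshold. A sketch, using 10 as the boundary value only because it is the conventional VIF cut-off — the patent leaves the preset boundary unspecified:

```python
def select_collinear_features(factors, boundary=10.0):
    """Return the indices of features whose collinearity quantization
    factor exceeds the preset boundary value; these are the target
    features whose data is deleted before joint training."""
    return [i for i, f in enumerate(factors) if f > boundary]

# Features 1 and 3 exceed the boundary and would be removed.
flagged = select_collinear_features([1.2, 25.0, 3.0, 11.7])
```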
7. A data processing apparatus applied to a first participant in federated learning, the apparatus comprising:
a construction module, configured to construct a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, the secure computation model being obtained by pre-training by the first participant and the other participants in federated learning based on secure multi-party computation;
a first determining module, configured to determine a collinearity quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
a second determining module, configured to determine, based on the collinearity quantization factor, a target feature from the features corresponding to the first sample feature data;
a deleting module, configured to delete the feature data of the target feature from the first sample feature data to obtain first training data for joint training between the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with feature data of at least one other feature, the other features comprising the features held by the first participant other than the target feature and the features held by the other participants.
8. A data processing device, characterized in that the device comprises:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 6 when executing the executable instructions stored in the memory.
9. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
CN202110454684.2A 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product Pending CN113095514A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110454684.2A CN113095514A (en) 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product
PCT/CN2021/140955 WO2022227644A1 (en) 2021-04-26 2021-12-23 Data processing method and apparatus, and device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110454684.2A CN113095514A (en) 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN113095514A 2021-07-09

Family

ID=76679959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110454684.2A Pending CN113095514A (en) 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product

Country Status (2)

Country Link
CN (1) CN113095514A (en)
WO (1) WO2022227644A1 (en)


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395180B2 (en) * 2015-03-24 2019-08-27 International Business Machines Corporation Privacy and modeling preserved data sharing
US20200285984A1 (en) * 2019-03-06 2020-09-10 Hcl Technologies Limited System and method for generating a predictive model
CN111062487B (en) * 2019-11-28 2021-04-20 支付宝(杭州)信息技术有限公司 Machine learning model feature screening method and device based on data privacy protection
CN110909216B (en) * 2019-12-04 2023-06-20 支付宝(杭州)信息技术有限公司 Method and device for detecting relevance between user attributes
CN111160573B (en) * 2020-04-01 2020-06-30 支付宝(杭州)信息技术有限公司 Method and device for protecting business prediction model of data privacy joint training by two parties
CN111966473B (en) * 2020-07-24 2024-02-06 支付宝(杭州)信息技术有限公司 Operation method and device of linear regression task and electronic equipment
CN111654853B (en) * 2020-08-04 2020-11-10 索信达(北京)数据技术有限公司 Data analysis method based on user information
CN112597540B (en) * 2021-01-28 2021-10-01 支付宝(杭州)信息技术有限公司 Multiple collinearity detection method, device and system based on privacy protection
CN113095514A (en) * 2021-04-26 2021-07-09 深圳前海微众银行股份有限公司 Data processing method, device, equipment, storage medium and program product

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227644A1 (en) * 2021-04-26 2022-11-03 深圳前海微众银行股份有限公司 Data processing method and apparatus, and device, storage medium and program product
CN113345597A (en) * 2021-07-15 2021-09-03 中国平安人寿保险股份有限公司 Federal learning method and device of infectious disease probability prediction model and related equipment
CN114692201A (en) * 2022-03-31 2022-07-01 北京九章云极科技有限公司 Multi-party security calculation method and system
WO2024022082A1 (en) * 2022-07-29 2024-02-01 脸萌有限公司 Information classification method and apparatus, device, and medium
CN114996749A (en) * 2022-08-05 2022-09-02 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning
CN114996749B (en) * 2022-08-05 2022-11-25 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning
CN115545216A (en) * 2022-10-19 2022-12-30 上海零数众合信息科技有限公司 Service index prediction method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2022227644A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
CN113095514A (en) Data processing method, device, equipment, storage medium and program product
Patra et al. BLAZE: blazing fast privacy-preserving machine learning
CN111814985B (en) Model training method under federal learning network and related equipment thereof
WO2021179720A1 (en) Federated-learning-based user data classification method and apparatus, and device and medium
Chen et al. Secure computation for machine learning with SPDZ
Cock et al. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data
CN111143894B (en) Method and system for improving safe multi-party computing efficiency
CN112085159B (en) User tag data prediction system, method and device and electronic equipment
Zhang et al. Secure distributed genome analysis for GWAS and sequence comparison computation
CN111931950A (en) Method and system for updating model parameters based on federal learning
Fan et al. High-dimensional adaptive function-on-scalar regression
CN112508118B (en) Target object behavior prediction method aiming at data offset and related equipment thereof
Wang et al. Differentially private SGD with non-smooth losses
Christ et al. Differential privacy and swapping: Examining de-identification’s impact on minority representation and privacy preservation in the us census
Miller et al. Universal security for randomness expansion from the spot-checking protocol
Guo et al. The improved split‐step θ methods for stochastic differential equation
Chen et al. Fed-eini: An efficient and interpretable inference framework for decision tree ensembles in vertical federated learning
Zhang et al. Joint intelligence ranking by federated multiplicative update
WO2023096571A2 (en) Data processing for release while protecting individual privacy
CN114547658A (en) Data processing method, device, equipment and computer readable storage medium
Negri et al. Z-process method for change point problems with applications to discretely observed diffusion processes
Chen et al. Fed-EINI: an efficient and interpretable inference framework for decision tree ensembles in federated learning
Taufiq et al. Robust Crypto-Governance Graduate Document Storage and Fraud Avoidance Certificate in Indonesian Private University
Li et al. Consistent estimation in generalized linear mixed models with measurement error
Lv et al. On the sign consistency of the Lasso for the high-dimensional Cox model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination