CN113095514A - Data processing method, device, equipment, storage medium and program product - Google Patents


Info

Publication number
CN113095514A
CN113095514A (application CN202110454684.2A)
Authority
CN
China
Prior art keywords
feature
data
matrix
sample
participant
Prior art date
Legal status
Pending
Application number
CN202110454684.2A
Other languages
Chinese (zh)
Inventor
魏文斌
范涛
陈天健
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202110454684.2A
Publication of CN113095514A
Priority to PCT/CN2021/140955 (WO2022227644A1)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/20: Ensemble learning

Abstract

The application provides a data processing method, apparatus, device, storage medium and program product, wherein the method comprises the following steps: constructing a virtual feature correlation matrix based on first sample feature data and a pre-trained secure computation model; determining a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix; determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantification factor; and deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training, wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features. Thus, on the premise of protecting data privacy, data exhibiting collinearity in the feature data held by each participant is screened out and eliminated, which can improve the accuracy and stability of the federated model obtained by joint training and improve its modeling effect.

Description

Data processing method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence technology, and relates to, but is not limited to, a data processing method, apparatus, device, storage medium, and program product.
Background
Machine learning is the science of how to use computers to simulate or implement human learning activities; it is one of the core intelligent features of artificial intelligence and one of its frontier research fields. Research on machine learning falls mainly into two directions: the first is traditional machine learning, which mainly studies learning mechanisms and focuses on exploring and simulating human learning mechanisms; the second is machine learning in the big-data environment, which mainly studies how to use information effectively and focuses on extracting hidden, effective and understandable knowledge from massive data.
Federated learning is a novel privacy-preserving technology that can effectively combine the data of all parties for model training on the premise that the data does not leave its local domain. Many business problems in the big-data field can be solved with a corresponding machine learning model, and eliminating collinear data is key to training a good model. In the related art, under the constraint of protecting data privacy, the collinearity of multi-party data in federated learning cannot be quantified, and training data exhibiting collinearity cannot be efficiently screened out and eliminated, so the trained model has low accuracy and poor stability.
Disclosure of Invention
Embodiments of the present application provide a data processing method, an apparatus, a device, a computer-readable storage medium, and a computer program product, which can remove data with collinearity in linear federal modeling, improve accuracy and stability of a federal model, and improve modeling effect of the model.
The technical scheme of the embodiment of the application is realized as follows:
An embodiment of the application provides a data processing method, applied to a first participant of federated learning, the method comprising the following steps:
constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, wherein the secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation;
determining a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantification factor;
deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training of the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including the features held by the first participant other than the target feature and the features held by the other participants.
An embodiment of the application provides a data processing apparatus, applied to a first participant of federated learning, the apparatus comprising:
a construction module, configured to construct a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, wherein the secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation;
a first determining module, configured to determine a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
a second determining module, configured to determine, based on the collinearity quantification factor, a target feature from the features corresponding to the first sample feature data;
a deleting module, configured to delete the feature data of the target feature from the first sample feature data to obtain first training data for joint training between the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including the features held by the first participant other than the target feature and the features held by the other participants.
An embodiment of the present application provides a data processing apparatus, where the apparatus includes:
a memory for storing executable instructions;
and a processor, configured to implement the method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
Embodiments of the present application provide a computer-readable storage medium having executable instructions stored thereon; when the executable instructions are executed by a processor, the method provided by the embodiments of the present application is implemented.
Embodiments of the present application provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
in the data processing method provided by the embodiment of the application, when data processing is performed, a first participant and the other participants of federated learning are first pre-trained based on secure multi-party computation to obtain a secure computation model. The first participant then acquires the first sample feature data it holds and constructs a virtual feature correlation matrix based on the first sample feature data and the pre-trained secure computation model; next, a collinearity quantification factor of each feature corresponding to the first sample feature data is determined based on the feature correlation matrix, and a target feature is determined among the features corresponding to the first sample feature data using the collinearity quantification factor, wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including not only the features, other than the target feature, held by the first participant but also all the features held by the other participants. After the target feature is determined, its feature data is deleted from the first sample feature data to obtain first training data for the first participant to perform joint training with the other participants. Thus, on the premise of protecting data privacy, data exhibiting collinearity in the feature data held by each participant can be screened out and eliminated to obtain training data without linear relationships, so that each participant performs joint training with such training data, which improves the accuracy and stability of the federated model and improves its modeling effect.
Drawings
Fig. 1 is a schematic network architecture diagram of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a component structure of a data processing device according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of an implementation of a data processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another implementation of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a calculation flow of a variance inflation factor in a vertical federated setting according to an embodiment of the present application;
fig. 6 is a schematic diagram of a calculation process of determinant of correlation matrix according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second" and "third" are only used to distinguish similar objects and do not denote a particular order; it is understood that, where permissible, the specific order or sequence may be interchanged so that the embodiments of the present application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Vertical Federated Learning: in the case where the users of two data sets overlap substantially while their user features overlap little, the data sets are split in the vertical direction (i.e., the feature dimension), and the portion of data where the users are the same for both parties but the user features are not entirely the same is taken out for machine learning training.
2) Variance Inflation Factor (VIF), also called the variance expansion coefficient: a value characterizing the degree of multicollinearity among the observed values of the independent variables, used to measure the severity of multicollinearity in a multiple linear regression model.
3) Homomorphic Encryption: a cryptographic technique based on the computational-complexity theory of mathematical problems. When homomorphically encrypted data is processed to produce an output and that output is decrypted, the result is the same as the output obtained by processing the unencrypted original data in the same way.
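A toy additively homomorphic scheme makes the property above concrete. The sketch below is a textbook Paillier cryptosystem with tiny, insecure parameters (our choice, purely for illustration), showing that multiplying two ciphertexts yields an encryption of the sum of the plaintexts:

```python
from math import gcd
import random

# Toy Paillier cryptosystem (tiny primes, illustration only -- not secure).
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = n + 1
mu = pow(lam, -1, n)                            # since L(g^lam mod n^2) = lam

def encrypt(m):
    """Enc(m) = g^m * r^n mod n^2 for a random r coprime to n."""
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

c1, c2 = encrypt(7), encrypt(35)
assert decrypt((c1 * c2) % n2) == 42   # Enc(7) * Enc(35) decrypts to 7 + 35
```

In a real deployment the primes would be large enough that factoring n is infeasible; the homomorphic product-of-ciphertexts property is what lets a party add values it cannot read.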
An exemplary application of the apparatus implementing the embodiments of the present application is described below. The apparatus provided in the embodiments of the present application may be implemented as a terminal device, and exemplary applications covering terminal devices are explained in the following.
Fig. 1 is a schematic diagram of a network architecture of a data processing method according to an embodiment of the present application. As shown in Fig. 1, the network architecture includes at least a first participant 100, a second participant 200, and a network 300. To support an exemplary application, the first participant 100 and the second participant 200 may be participants in vertical federated learning that jointly train a machine learning model. The first participant 100 and the second participant 200 may be clients, for example devices of participants such as banks or hospitals that store user feature data, and a client may be a device with model training capability such as a laptop, a tablet computer, a desktop computer, or a dedicated training device. The first participant 100 is connected to the second participant 200 via the network 300, which may be a wide area network or a local area network, or a combination of the two, with data transmission over wireless or wired links.
The first participant 100 first obtains first sample feature data from its own data, and processes the first sample feature data to obtain processed first sample feature data. The second participant 200 obtains second sample feature data from its own data, and processes the second sample feature data to obtain processed second sample feature data. The first sample characteristic data and the second sample characteristic data have the same identification, that is, the first sample characteristic data and the second sample characteristic data are data of different characteristics of the same batch of samples held by the first participant 100 and the second participant 200, respectively. The first participant 100 and the second participant 200 determine a first matrix E using the processed first sample feature data and the processed second sample feature data based on the secure multi-party computation, the first matrix E being held by both the first participant 100 and the second participant 200. Based on secure multiparty computation, the first participant 100 can only obtain the first matrix E, and cannot know the processed second sample feature data held by the second participant 200; similarly, the second participant 200 can only obtain the first matrix E, and cannot know the processed first sample feature data held by the first participant 100.
After the first participant 100 obtains the first matrix E, it constructs a virtual feature correlation matrix according to the first matrix E and the processed first sample feature data. It then calculates the determinant of the feature correlation matrix and the determinants of the corresponding minors of the feature correlation matrix, and determines a collinearity quantification factor of each feature corresponding to the first sample feature data based on these determinants. The collinearity quantification factor may be a variance inflation factor used to quantify the collinearity of each feature with all the other features, where all the other features include not only the features of the first participant 100 other than the feature being quantified, but also all the features of the second participant 200. The first participant 100 determines, according to the collinearity quantification factor of each feature, which features' feature data exhibit collinearity, determines those features as target features, and finally deletes the feature data of the target features from the first sample feature data to obtain first training data for the first participant 100 and the second participant 200 to perform joint training. In the same way, the second participant 200 obtains second training data for the joint training. Thus, when joint training is performed, the first participant 100 and the second participant 200 train with the first training data and the second training data, which are free of collinearity, and a federated model with high accuracy and good stability can be obtained.
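The determinant route just described can be illustrated in the clear (in the patent these quantities are computed under secure multi-party computation). For a feature correlation matrix R, the variance inflation factor of feature i equals the determinant of the minor obtained by deleting row and column i, divided by the determinant of R, which coincides with the i-th diagonal element of the inverse of R:

```python
import numpy as np

def vif_via_determinants(R):
    """VIF_i = det(minor_ii(R)) / det(R) for a feature correlation matrix R."""
    d = np.linalg.det(R)
    k = R.shape[0]
    vifs = []
    for i in range(k):
        keep = [j for j in range(k) if j != i]
        minor = R[np.ix_(keep, keep)]          # delete row i and column i
        vifs.append(np.linalg.det(minor) / d)
    return np.array(vifs)

# The determinant route agrees with the diagonal of the inverse correlation matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=200)  # collinear column
R = np.corrcoef(X, rowvar=False)
assert np.allclose(vif_via_determinants(R), np.diag(np.linalg.inv(R)))
```

The determinant formulation matters here because determinants of the shared correlation matrix can be computed cooperatively without either party revealing its raw columns.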
With the method provided by the embodiment of the present application, on the premise of protecting data privacy, collinear data in the feature data held by each participant can be screened out and eliminated to obtain training data without linear relationships, so that during joint training each participant uses such training data, which improves the accuracy and stability of the federated model and improves its modeling effect.
The apparatus provided in the embodiments of the present application may be implemented as hardware or a combination of hardware and software, and various exemplary implementations of the apparatus provided in the embodiments of the present application are described below.
Fig. 2 shows an exemplary structure of the data processing device 10, illustrated here as a device applied to a first participant of federated learning. Other exemplary structures of the data processing device 10 are possible, so the structure described here should not be seen as a limitation; for example, some components described below may be omitted, or components not described below may be added to suit the special needs of some applications.
The data processing apparatus 10 shown in Fig. 2 includes: at least one processor 110, a memory 140, at least one network interface 120, and a user interface 130. The components in the data processing device 10 are coupled together by a bus system 150, which is used to enable connection and communication among these components. In addition to a data bus, the bus system 150 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 150 in Fig. 2.
The user interface 130 may include a display, a keyboard, a mouse, a touch-sensitive pad, a touch screen, and the like.
The memory 140 may be volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 140 described in the embodiments herein is intended to comprise any suitable type of memory.
The memory 140 in embodiments of the present application is capable of storing data to support the operation of the data processing apparatus 10. Examples of such data include: any computer program for operation on data processing device 10, such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
As an example of the method provided by the embodiment of the present application being implemented in software, the method may be directly embodied as a combination of software modules executed by the processor 110; the software modules may be located in a storage medium within the memory 140, and the processor 110 reads the executable instructions included in the software modules in the memory 140 and, in combination with necessary hardware (for example, the processor 110 and other components connected to the bus system 150), completes the method provided by the embodiment of the present application.
By way of example, the processor 110 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The data processing method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the device provided by the embodiment of the present application.
Fig. 3 is a schematic implementation flow diagram of a data processing method provided in an embodiment of the present application, which is applied to a first participant of the network architecture shown in fig. 1, and will be described with reference to the steps shown in fig. 3.
Step S301, a virtual feature correlation matrix is constructed based on first sample feature data held by a first participant and a pre-trained secure computation model.
In the embodiment of the application, among the features corresponding to the first sample feature data, the feature data of different features may have linear relationships, and if data exhibiting collinearity is used to train a model, the trained model has low accuracy and poor stability. In order to remove data with linear relationships, the first participant constructs a virtual feature correlation matrix, denoted H, according to the first sample feature data and a pre-trained secure computation model.
The secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation, and it supports privacy-preserving matrix addition, subtraction, and multiplication. In the embodiment of the present application, the data processing method is described by taking federated learning between two participants as an example; in practical applications, the method can also be applied to federated learning among three or more participants. On the premise that the data held by each participant does not leave its local domain and data privacy is guaranteed, the first participant and the second participant train on the basis of a privacy-preserving technique to obtain a trained secure computation model, which is held by each participant.
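A plaintext analogue may clarify the shape of the matrix being built. Under our illustrative assumption that each party standardizes its own columns and that the cross-party block E = Xa^T Xb / n is the quantity obtained via secure multi-party computation, the virtual feature correlation matrix H assembles each party's local block with the jointly computed block:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance columns, so that X.T @ X / n is a correlation matrix."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Plaintext sketch: in the patent, the cross block E would be computed under
# secure multi-party computation so neither party sees the other's raw data.
rng = np.random.default_rng(0)
Xa = standardize(rng.normal(size=(100, 3)))   # first participant's features
Xb = standardize(rng.normal(size=(100, 2)))   # second participant's features
n = Xa.shape[0]

E = Xa.T @ Xb / n                              # cross-party correlation block
H = np.block([[Xa.T @ Xa / n, E],
              [E.T,           Xb.T @ Xb / n]]) # virtual feature correlation matrix
```

Each party can fill in its own diagonal block locally; only the off-diagonal block requires the joint, privacy-preserving matrix multiplication the secure computation model provides.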
In this embodiment of the present application, because the first participant and the second participant perform joint training, the training data needs to come from the same users; that is, the first sample feature data used by the first participant and the second sample feature data used by the second participant must correspond to the same user Identifiers (IDs). Therefore, before the first participant acquires the first sample feature data, it is first necessary to determine the common users held by all participants, then screen out from the common users the target users participating in the training, and use the feature data of the target users as the first sample feature data.
Step S302, based on the feature correlation matrix, a collinearity quantification factor of each feature corresponding to the first sample feature data is determined.
In practical applications, methods for quantifying the collinearity among the feature data of multiple features include the variance inflation factor method, the eigenvalue (characteristic root) analysis method, and the condition number method. In addition, there is an intuitive judgment method that can qualitatively assess the degree of collinearity among the feature data of multiple features; it is generally used for preliminary judgment and cannot be used for quantitative analysis.
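Of these, the variance inflation factor method is the one this embodiment builds on; in the clear it amounts to taking the diagonal of the inverse of the feature correlation matrix. A minimal numpy sketch (illustrative only, not the patent's secure computation):

```python
import numpy as np

def vif_from_data(X):
    """Variance inflation factor of each column of X.

    VIF_i is the i-th diagonal element of the inverse of the
    correlation matrix of the columns of X.
    """
    R = np.corrcoef(X, rowvar=False)      # feature correlation matrix
    return np.diag(np.linalg.inv(R))

# Two nearly collinear features plus an independent one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=200)  # almost a multiple of x1
x3 = rng.normal(size=200)
vifs = vif_from_data(np.column_stack([x1, x2, x3]))
# vifs[0] and vifs[1] are very large; vifs[2] stays close to 1.
```

The two collinear columns inflate each other's factor, while the independent column's factor remains near the no-collinearity baseline of 1.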
The first sample feature data held by the first participant and the second sample feature data held by the second participant correspond to different columns of the feature correlation matrix; by determining whether each column of the feature correlation matrix is collinear with the other columns, it can be determined whether the feature data of each feature in the first sample feature data is collinear with the feature data of the other features. Here, collinearity between the feature data of different features includes the following possibilities: the feature data of one feature is a multiple of the feature data of another feature; the feature data of one feature equals the feature data of another feature plus a constant term; or the feature data of one feature equals the sum of the feature data of two other features.
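Each of the three possibilities just listed makes the feature correlation matrix (nearly) singular, which is exactly what the quantification factors pick up. A short sketch with made-up columns:

```python
import numpy as np

rng = np.random.default_rng(1)
f1 = rng.normal(size=100)
f2 = rng.normal(size=100)

g_multiple = 3.0 * f1        # case 1: a multiple of another feature
g_shifted  = f1 + 5.0        # case 2: another feature plus a constant term
g_sum      = f1 + f2         # case 3: the sum of two other features

# Any of the three collinear columns drives det(R) to (numerically) zero.
for g in (g_multiple, g_shifted, g_sum):
    R = np.corrcoef(np.column_stack([f1, f2, g]), rowvar=False)
    print(abs(np.linalg.det(R)))
```

Note that case 2 already has correlation exactly 1 with the original feature: adding a constant does not change a correlation, which is why correlation matrices catch shifted copies as well as scaled ones.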
In the embodiment of the application, the collinearity quantification factor of each feature corresponding to the first sample feature data is determined according to the feature correlation matrix, and from these factors it can be determined which features have feature data that is collinear with the feature data of other features.
Step S303, a target feature is determined from the features corresponding to the first sample feature data based on the collinearity quantification factor.
The target features exhibiting collinearity are determined according to the collinearity quantification factor. Taking the variance inflation factor as an example, when there is no multicollinearity the variance inflation factor is close to 1, and the stronger the multicollinearity, the larger the variance inflation factor. In practice, some degree of multicollinearity almost always exists between data, so using a variance inflation factor equal to 1 as the criterion for judging collinearity is impractical; generally, a boundary value may be preset according to the actual application scenario, and in the embodiment of the present application the preset boundary value may be 10. It is judged whether the variance inflation factor of each feature is larger than 10; when the variance inflation factor of a feature is larger than 10, the feature data of that feature is considered to have strong collinearity with the feature data of other features, and the feature is determined as a target feature.
Step S304, the feature data of the target feature is deleted from the first sample feature data to obtain first training data for the first participant to perform joint training with the other participants.
The feature data of the target feature has a linear relationship with the feature data of at least one of the other features, where the other features include the features held by the first participant other than the target feature and the features held by the other participants.
In the embodiment of the application, the variance inflation factor of each feature in the first sample feature data of the first participant can be calculated based on the feature correlation matrix, the target features whose variance inflation factor is larger than the preset boundary value are determined, and the feature data of those target features is then deleted from the first sample feature data to obtain feature data without collinearity as the first training data. Thus, when joint training is performed, the first participant trains with the first training data, which is free of collinearity, and a federated model with high accuracy and good stability can be obtained.
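Steps S302 to S304 can be sketched in the clear as follows. The patent performs the correlation computations under secure multi-party computation; removing the worst feature one at a time and recomputing is our illustrative design choice, since deleting one collinear column changes the remaining factors:

```python
import numpy as np

VIF_THRESHOLD = 10.0   # the preset boundary value used in the text

def drop_collinear_features(X, threshold=VIF_THRESHOLD):
    """Repeatedly delete the column with the largest VIF above the threshold."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        R = np.corrcoef(X[:, cols], rowvar=False)
        vifs = np.diag(np.linalg.inv(R))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        del cols[worst]                # this column is a "target feature"
    return X[:, cols], cols

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=300)   # collinear with x1
x3 = rng.normal(size=300)                           # independent feature
training_data, kept = drop_collinear_features(np.column_stack([x1, x2, x3]))
# One of x1/x2 is removed as a target feature; the independent x3 is kept.
```

After the loop terminates, every remaining feature has a variance inflation factor at or below the boundary value, which is the condition the first training data is required to satisfy.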
The data processing method provided by the embodiment of the application is applied to a first participant of federated learning and comprises: constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, wherein the secure computation model is obtained by the first participant and the other participants of federated learning through pre-training based on secure multi-party computation; determining a collinearity quantification factor of each feature corresponding to the first sample feature data based on the feature correlation matrix; determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantification factor; and deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training of the first participant and the other participants, wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including the features held by the first participant other than the target feature and the features held by the other participants. By this method, on the premise of protecting data privacy, collinear data in the feature data held by each participant can be screened out and eliminated to obtain training data without linear relationships, so that each participant performs joint training with such training data, which improves the accuracy and stability of the federated model and improves its modeling effect.
In some embodiments, before step S301 in the embodiment shown in fig. 3, the first participant obtains the first sample feature data from the feature data held by the first participant.
Because the first participant and the second participant perform joint training, the training data need to come from the same users, i.e. the first sample feature data used by the first participant for joint training and the second sample feature data used by the second participant for joint training must have the same IDs. When the first sample feature data is obtained, the common users held by all participants first need to be determined; then the target users participating in this training are screened from the common users, and the feature data of the target users is used as the first sample feature data.
The following illustrates the process of the first party obtaining the first sample characteristic data.
The first party and the second party respectively hold feature data of different features of a batch of users; for example, the first party holds feature data for the features birthday (AA), age (BB), weight (CC) and deposit (DD) of the batch of users, and the second party holds feature data for the features age (BB), consumption capacity (EE) and hobby (FF) of the batch of users. The database tables of the first party and the second party are shown in Tables 1 and 2, respectively:
table 1 database table of a first party
ID AA BB CC DD
01 a1 b1 c1 d1
02 a2 b2 c2 d2
03 a3 b3 c3 d3
04 a4 b4 c4 d4
06 a6 b6 c6 d6
Table 2 database table of second participant
ID BB EE FF
01 b1 e1 f1
03 b3 e3 f3
04 b4 e4 f4
05 b5 e5 f5
06 b6 e6 f6
From the database table of the first party, the feature data held by the first party is obtained (one row per user 01, 02, 03, 04, 06; one column per feature AA, BB, CC, DD):

[ a1  b1  c1  d1
  a2  b2  c2  d2
  a3  b3  c3  d3
  a4  b4  c4  d4
  a6  b6  c6  d6 ]
From the database table of the second party, the feature data held by the second party is obtained (one row per user 01, 03, 04, 05, 06; one column per feature BB, EE, FF):

[ b1  e1  f1
  b3  e3  f3
  b4  e4  f4
  b5  e5  f5
  b6  e6  f6 ]
On the premise of privacy protection, since each participant cannot know which users' data is stored by the other participants, the first participant and the second participant determine the common users based on a privacy protection technique in the embodiment of the application. Determining the common users based on the privacy protection technique may be implemented as follows: the first participant and the second participant each obtain the identifications of their own users; then, based on the privacy protection technique, the intersection of the user identifications held by the first participant and those held by the second participant is computed, and the result is the set of common users. In this way it is determined that the common users held by the first participant and the second participant are the users with IDs 01, 03, 04 and 06, recorded as S = {01, 03, 04, 06}.
After the common user set S is obtained, the first participant determines the target users participating in this training, for example by randomly selecting a number of users from the common users as the target users, and then obtains the feature data of the target users from its own stored data as the first sample feature data. The first participant sends the IDs of the screened target users to the second participant, and the second participant obtains the feature data of these target users from its own stored data as the second sample feature data. Alternatively, the second participant may determine the target users participating in the training and send the IDs of the determined target users to the first participant. For example, the first participant randomly selects the 3 users with IDs 01, 03 and 04 from S as target users, denoted S1 = {01, 03, 04}, and screens the feature data of the target users S1 out of the feature data it holds, obtaining the first sample feature data

A = [ a1  b1  c1  d1
      a3  b3  c3  d3
      a4  b4  c4  d4 ]
On the second party's side, the second party screens the feature data of the target users S1 out of the feature data it holds, obtaining the second sample feature data

B = [ b1  e1  f1
      b3  e3  f3
      b4  e4  f4 ]
Through the above steps, on the premise that the data of each participant does not leave its local side and privacy protection is achieved, the first participant obtains the first sample feature data and the second participant obtains the second sample feature data.
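The alignment steps above can be sketched in plaintext Python. This is only an illustration of the data flow: a real deployment would use a private-set-intersection protocol so that neither side reveals its full ID list, and the table contents and variable names here are illustrative stand-ins.

```python
import random

# IDs held by each side (from Tables 1 and 2).
party_a_ids = ["01", "02", "03", "04", "06"]
party_b_ids = ["01", "03", "04", "05", "06"]

# Step 1: common users S. In production this intersection would be computed
# with a privacy-preserving PSI protocol, not by exchanging raw ID lists.
common = sorted(set(party_a_ids) & set(party_b_ids))

# Step 2: the first participant samples target users S1 from S.
random.seed(0)
target = sorted(random.sample(common, 3))

# Step 3: each side keeps only its own rows for the target users.
table_a = {"01": ("a1", "b1", "c1", "d1"), "02": ("a2", "b2", "c2", "d2"),
           "03": ("a3", "b3", "c3", "d3"), "04": ("a4", "b4", "c4", "d4"),
           "06": ("a6", "b6", "c6", "d6")}
sample_a = [table_a[uid] for uid in target]  # first sample feature data
```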
In some embodiments, the step S301 "building a virtual feature correlation matrix based on the first sample feature data held by the first participant and the pre-trained security computation model" in the embodiment shown in fig. 3 can be implemented by:
step S3011 is to determine feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data.
Still referring to the above exemplary data, as shown in table 3, the first sample feature data corresponds to the following features: AA, BB, CC and DD. The feature data of feature AA are {a1, a3, a4}, those of feature BB are {b1, b3, b4}, those of feature CC are {c1, c3, c4}, and those of feature DD are {d1, d3, d4}; that is, the feature data of each feature corresponds to one column of A.
The samples corresponding to the first sample feature data are {a1, b1, c1, d1}, {a3, b3, c3, d3} and {a4, b4, c4, d4}, so the number of samples n is determined to be 3.
Step S3012, respectively calculate the mean and standard deviation corresponding to the feature data of each feature.
According to

x̄ = (1/n) · Σᵢ xᵢ

the mean corresponding to the feature data of features AA, BB, CC and DD is calculated, and according to

σ_x = √( (1/n) · Σᵢ (xᵢ − x̄)² )

the standard deviation corresponding to the feature data of features AA, BB, CC and DD is calculated, where x = a, b, c, d and n = 3.
Step S3013, determining processed first sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature and the number of samples;
in the embodiment of the present application, the mean x̄ corresponding to the feature data of each feature, the standard deviation σ_x corresponding to the feature data of each feature, and the number n of samples are used to update the feature data x of each feature, obtaining the processed feature data. The processing formula here may be

x' = (x − x̄) / (√n · σ_x)

Applying this to every entry of A yields the processed first sample feature data

C = [ a1'  b1'  c1'  d1'
      a3'  b3'  c3'  d3'
      a4'  b4'  c4'  d4' ]

where x = a, b, c, d and n = 3. With this scaling, each column of C has unit Euclidean norm.
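With numeric stand-in values for A (the symbolic a, b, c, d entries are not computable), the standardization step can be checked: dividing each centered column by √n·σ makes CᵀC exactly the Pearson correlation matrix. A minimal sketch, assuming the population standard deviation is used:

```python
import numpy as np

# Numeric stand-in for the 3 x 4 first sample feature data A (values made up).
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.5, 1.0, 3.0],
              [4.0, 6.0, 2.0, 8.0]])
n = A.shape[0]

mu = A.mean(axis=0)                    # mean of each feature
sigma = A.std(axis=0)                  # population standard deviation of each feature
C = (A - mu) / (np.sqrt(n) * sigma)    # processed first sample feature data

# Every column of C now has unit Euclidean norm, so C.T @ C is the
# Pearson correlation matrix of the features.
corr = C.T @ C
```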
step S3014, input the processed first sample feature data to the secure computation model to obtain a first matrix.
In the embodiment of the application, each participant trains based on a secure multi-party computing protocol, such as the SPDZ protocol or the MASCOT protocol, to obtain a trained security computation model. The trained model can perform matrix addition, subtraction and multiplication under privacy. Secure Multi-Party Computation (MPC) mainly studies how to securely compute an agreed function without a trusted third party. If m participants participate in the federated learning training, the trained security computation model can be expressed as y_{1,…,m} = f(x_{1,…,m}), where m ≥ 2. After the training is completed, all m participants hold the security computation model; each participant inputs its own private input value and obtains the corresponding output value. For example, the i-th participant inputs x_i and obtains the output y_i, where 1 ≤ i ≤ m.
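The patent only names SPDZ and MASCOT. As a much-simplified illustration of how a product such as CᵀD can be evaluated without either side revealing its input, the following toy sketch uses additive secret sharing with a trusted-dealer Beaver triple; this is an assumption-laden stand-in, not the protocol the patent trains.

```python
import numpy as np

rng = np.random.default_rng(42)

def share(m):
    """Additively secret-share a matrix into two random-looking pieces."""
    r = rng.normal(size=m.shape)
    return r, m - r

# Private inputs: here X plays the role of C^T (first party) and Y of D (second party).
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(3, 3))

# Trusted dealer distributes a Beaver triple (A, B, A @ B) in shared form.
A = rng.normal(size=X.shape)
B = rng.normal(size=Y.shape)
A0, A1 = share(A); B0, B1 = share(B); AB0, AB1 = share(A @ B)
X0, X1 = share(X); Y0, Y1 = share(Y)

# Both parties open the masked differences; they reveal nothing about X, Y
# because A and B act as one-time masks.
eps = (X0 - A0) + (X1 - A1)      # opened X - A
delta = (Y0 - B0) + (Y1 - B1)    # opened Y - B

# Each party computes a local share of the product; party 0 adds eps @ delta once.
P0 = AB0 + eps @ B0 + A0 @ delta + eps @ delta
P1 = AB1 + eps @ B1 + A1 @ delta
product = P0 + P1                # reconstruction: equals X @ Y
```

The identity used is X·Y = (A+eps)(B+delta) = AB + eps·B + A·delta + eps·delta, which is why the reconstructed shares sum to the true product.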
In the embodiment of the application, taking two participants as an example, the trained security computation model is represented as y_{1,2} = f(x_{1,2}); both the first participant and the second participant hold the model. The first participant inputs the processed first sample feature data C into y_{1,2} = f(x_{1,2}) to obtain a first matrix E, where E = Cᵀ D and D is the processed second sample feature data. Here, E = Cᵀ D merely indicates that the value of E equals the value of Cᵀ D; it does not indicate that E is calculated directly from Cᵀ D. The first participant computes E according to y_{1,2} = f(x_{1,2}) and does not need to acquire the processed second sample feature data D from the second participant. From Cᵀ D, the first matrix E is a 4 × 3 matrix.
Correspondingly, on the second participant's side, the second participant inputs the processed second sample feature data D into y_{1,2} = f(x_{1,2}) to obtain the first matrix E, where E = Cᵀ D and Cᵀ is the transposed matrix of the processed first sample feature data. Again, E = Cᵀ D merely indicates that the value of E equals the value of Cᵀ D; it does not indicate that E is calculated directly from Cᵀ D. The second participant computes E according to y_{1,2} = f(x_{1,2}) and does not need to acquire the transposed matrix Cᵀ of the processed first sample feature data from the first participant.
In the embodiment of the application, the first matrix E is determined based on a secure multiparty computing protocol, and each participant cannot acquire private data of other participants, so that privacy and security of data of each participant can be ensured.
Step S3015, a virtual feature correlation matrix is constructed according to the processed first sample feature data and the first matrix.
In the embodiment of the present application, "constructing a virtual feature correlation matrix" may be implemented by the following steps:
in step S30151, a first symmetric matrix is determined according to the processed first sample feature data.
A symmetric matrix is a square matrix that is symmetric about its main diagonal, i.e. equal to its own transpose. In this embodiment of the application, the first symmetric matrix F is constructed from the processed first sample feature data C and its transposed matrix Cᵀ as

F = Cᵀ C

and the dimension of F is 4 × 4.
In step S30152, an empty matrix is generated, where the number of rows and the number of columns are equal to the number of columns of the obtained first matrix.
When the feature correlation matrix is constructed from the features held by the first participant and those held by the second participant, the first participant can only know which features the second participant holds; on the premise that each participant's data does not leave its local side, it cannot obtain the feature data of the features held by the second participant.
From the above step S3014, the first matrix E is a 4 × 3 matrix with 3 columns, so the generated empty matrix is a 3 × 3 matrix, denoted G'. A second symmetric matrix G, corresponding to the empty matrix G', is determined by the second participant from the second sample feature data B (by symmetry with F, G = Dᵀ D, where D is the processed second sample feature data).
Step S30153, a virtual feature correlation matrix is constructed according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix, and the null matrix.
According to the first symmetric matrix F, the first matrix E, the transposed matrix Eᵀ of the first matrix and the empty matrix G', the virtual feature correlation matrix H is constructed as

H = [ F    E
      Eᵀ   G' ]

Since the first symmetric matrix F has dimension 4 × 4, the first matrix E has dimension 4 × 3, the transposed matrix Eᵀ has dimension 3 × 4, and the empty matrix G' has dimension 3 × 3, the feature correlation matrix H has dimension 7 × 7.
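The block construction of H can be sketched in plaintext with numeric stand-ins (in the protocol, E comes out of the security computation model and the lower-right block stays empty on the first participant's side; all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Numeric stand-ins: C is the first party's processed data (3 samples x 4
# features), D the second party's (3 samples x 3 features).
C = rng.normal(size=(3, 4))
D = rng.normal(size=(3, 3))

F = C.T @ C            # first symmetric matrix, 4 x 4
E = C.T @ D            # first matrix (obtained via MPC in the protocol), 4 x 3
G = D.T @ D            # second symmetric matrix, known only to the second party
Gp = np.zeros((3, 3))  # the 3 x 3 "empty matrix" G' used by the first party

# Virtual feature correlation matrix: H = [[F, E], [E^T, G']].
H_virtual = np.block([[F, E], [E.T, Gp]])
# For reference, the true joint correlation-structure matrix would be:
H_true = np.block([[F, E], [E.T, G]])
```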
According to the method provided by the embodiment of the application, the first participant trains the security computation model together with the other participants based on a secure multi-party computing protocol. On the premise that the data held by each participant does not leave its local side and data privacy is guaranteed, the problem that the collinearity of multi-party data in federated learning cannot be quantified is converted into a tractable block-matrix determinant problem, so that the collinearity factor of multi-party data in federated learning can be quantified based on the feature correlation matrix, providing a basis for eliminating training data with collinearity and obtaining an accurate and stable federated model.
In some embodiments, the step S302 "determining the co-linear quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix" in the embodiment shown in fig. 3 may be implemented by:
step S3021, determining a determinant of the feature correlation matrix.
The determinant of the feature correlation matrix H is denoted by | H |, and in the embodiment of the present application, determining the determinant of the feature correlation matrix may be implemented by steps S30211 to S30214:
step S30211, a first random matrix with a determinant as a preset value and a dimension the same as that of the empty matrix is generated.
When the determinant of the feature correlation matrix is calculated, since the data of each participant cannot leave its local side, in order to ensure that private data is not leaked, a first random matrix for masking the original data is generated in the embodiment of the application, so that privacy protection is realized.
In the embodiment of the present application, the first random matrix generated by the first participant satisfies: its determinant is a preset value and its dimension is the same as that of the empty matrix G'. For convenience of calculation, the preset value may be set to 1, and it is known from step S30152 that the empty matrix G' is a 3 × 3 matrix, so the generated first random matrix is a matrix with determinant 1 and dimension 3 × 3, denoted R1.
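The patent does not say how R1 is generated. One standard way to obtain a random matrix with determinant exactly 1 (an assumption for illustration, not the patent's stated method) is to multiply random unit-triangular matrices, each of which has determinant 1:

```python
import numpy as np

def random_det1(dim, rng):
    """Random matrix with determinant exactly 1, built as a product of
    unit-triangular matrices (each has an all-ones diagonal, hence det 1)."""
    L = np.tril(rng.normal(size=(dim, dim)), k=-1) + np.eye(dim)
    U = np.triu(rng.normal(size=(dim, dim)), k=1) + np.eye(dim)
    return L @ U

rng = np.random.default_rng(7)
R1 = random_det1(3, rng)   # same dimension as the 3 x 3 empty matrix G'
```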
Step S30212, inputting the feature correlation matrix and the first random matrix into the security computation model to obtain a second matrix.
The trained security computation model of the first and second participants is represented as y_{1,2} = f(x_{1,2}). The first participant inputs the feature correlation matrix H and the first random matrix R1 into the model to obtain a second matrix J, where

J = R1 (G − Eᵀ F⁻¹ E) R2

Here F⁻¹ is the inverse of the first symmetric matrix F, G is the second symmetric matrix held by the second participant, and R2 is a random matrix generated by the second participant with determinant 1 and the same dimension as the empty matrix G' (3 × 3). As before, J = R1 (G − Eᵀ F⁻¹ E) R2 merely indicates that the value of J equals the value of that expression; it does not indicate that J is calculated from it directly. The first participant computes J according to y_{1,2} = f(x_{1,2}) and does not need to acquire the second symmetric matrix G, the processed second sample feature data D, or the second random matrix R2 from the second participant.
In the embodiment of the application, the determinants of R1 and R2 are both 1; the case where the determinant of R1 or R2 is not 1 is handled by the correction described in step S30213 below.
in the embodiment of the application, the second matrix J is determined based on a secure multiparty computing protocol, and each participant uses the generated random matrix to confuse respective data, so that each participant cannot know private data of other participants, and therefore privacy and security of data of each participant can be ensured.
Step S30213, determinants of the first symmetric matrix and the second matrix are calculated, respectively.
The first symmetric matrix F is obtained in step S30151 and the second matrix J in step S30212; their determinants are calculated respectively, giving the determinant |F| of the first symmetric matrix and the determinant |J| of the second matrix.
Here, when the determinant |R1| of the generated first random matrix or the determinant |R2| of the generated second random matrix is not 1, the computed determinant must be corrected by the factor 1/(|R1| · |R2|) when calculating |H|, since

|J| = |R1 (G − Eᵀ F⁻¹ E) R2| = |R1| · |G − Eᵀ F⁻¹ E| · |R2|
Step S30214, multiplying the determinant of the first symmetric matrix by the determinant of the second matrix to obtain the determinant of the feature correlation matrix.
The determinant |F| of the first symmetric matrix is multiplied by the determinant |J| of the second matrix to obtain the determinant of the feature correlation matrix: |H| = |F| · |J|.
In the embodiment of the application, each participant calculates the determinant of the virtual characteristic correlation matrix based on the safety calculation model, and the whole calculation process does not need to acquire the private data of other participants, so that the data safety of each participant can be ensured.
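Assuming J equals R1 (G − Eᵀ F⁻¹ E) R2 with determinant-1 masks, the block-determinant identity |H| = |F| · |J| used above can be checked in plaintext with numeric stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_det1(dim):
    """Random matrix with determinant exactly 1 (product of unit-triangular factors)."""
    L = np.tril(rng.normal(size=(dim, dim)), k=-1) + np.eye(dim)
    U = np.triu(rng.normal(size=(dim, dim)), k=1) + np.eye(dim)
    return L @ U

# Numeric stand-ins: 10 samples so that H is nonsingular and F invertible.
C = rng.normal(size=(10, 4))           # first party's processed data
D = rng.normal(size=(10, 3))           # second party's processed data
F, E, G = C.T @ C, C.T @ D, D.T @ D
H = np.block([[F, E], [E.T, G]])       # full 7 x 7 correlation-structure matrix

# Masked Schur complement: the det-1 random matrices hide G - E^T F^{-1} E.
R1, R2 = random_det1(3), random_det1(3)
J = R1 @ (G - E.T @ np.linalg.inv(F) @ E) @ R2

# Block-determinant identity: |H| = |F| * |G - E^T F^{-1} E| = |F| * |J|.
det_H = np.linalg.det(F) * np.linalg.det(J)
```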
Step S3022, deleting the data of the i-th row and i-th column of the feature correlation matrix to obtain each residue (i.e., principal minor) corresponding to the feature correlation matrix.
Here i = 1, 2, …, m₁, where m₁ is the number of features corresponding to the first sample feature data. As described above, the first sample feature data A has 4 features, so m₁ = 4. Since the first participant can only determine the collinearity quantization factors of the features corresponding to its own first sample feature data, only the 1st, 2nd, 3rd and 4th residues corresponding to the feature correlation matrix need to be obtained here.
Deleting the i-th row and i-th column of the 7 × 7 feature correlation matrix H and, without changing the original order, forming the remaining data into a matrix of dimension 6 × 6 yields the i-th residue H_ii corresponding to the i-th feature of the first sample feature data. For example, when i = 2, the obtained matrix is the 2nd residue H_22 corresponding to feature BB.
In step S3023, the determinant of each residue is determined.
In the embodiment of the present application, the determinant of each residue may be determined in the same manner as the determinant of the feature correlation matrix in step S3021. For example, determining the determinant of the residue corresponding to the i-th feature may be implemented as: generating a random matrix whose determinant is the preset value and whose dimension is the same as that of the empty matrix; inputting the i-th residue and the random matrix into the security computation model to obtain a matrix corresponding to the i-th residue; calculating the determinants of the first symmetric matrix and of the matrix corresponding to the i-th residue respectively; and multiplying these two determinants to obtain the determinant |H_ii| of the i-th residue.
Step S3024, determining a co-linear quantization factor of each feature corresponding to the first sample feature data based on the determinant of the feature correlation matrix and the determinants of the respective residue.
In the embodiment of the present application, taking the variance expansion factor as the collinearity quantization factor as an example, the calculation formula of the variance expansion factor VIF_i of the i-th feature can be determined from the determinant |H| of the feature correlation matrix and the determinant |H_ii| of the i-th residue, as shown in formula (1):

VIF_i = |H_ii| / |H|    (1)

When i = 1 the resulting VIF₁ is the collinearity quantization factor of feature AA; i = 2 gives VIF₂ for feature BB; i = 3 gives VIF₃ for feature CC; and i = 4 gives VIF₄ for feature DD.
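Formula (1) can be cross-checked numerically: for a correlation matrix H, |H_ii|/|H| equals the i-th diagonal entry of H⁻¹, the usual closed form of the variance expansion factor. A plaintext sketch with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in joint standardized data: 30 samples, 7 features.
Z = rng.normal(size=(30, 7))
Z = (Z - Z.mean(axis=0)) / (np.sqrt(Z.shape[0]) * Z.std(axis=0))
H = Z.T @ Z                            # 7 x 7 feature correlation matrix

def vif(H, i):
    """VIF_i = |H_ii| / |H|, where H_ii deletes row i and column i of H."""
    Hii = np.delete(np.delete(H, i, axis=0), i, axis=1)
    return np.linalg.det(Hii) / np.linalg.det(H)

vifs = np.array([vif(H, i) for i in range(H.shape[0])])
```

The same values appear on the diagonal of H⁻¹, and each factor is at least 1, its value for a feature uncorrelated with all others.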
According to the method provided by the embodiment of the application, the first participant calculates, based on the secure multi-party computing protocol, the determinant of the feature correlation matrix and the determinant of the residue of each feature corresponding to the first sample feature data, thereby obtaining the variance expansion factor of each feature corresponding to the first sample feature data and providing a basis for eliminating training data with collinearity and obtaining an accurate and stable federated model.
In some embodiments, the step S303 "of determining the target feature from the features corresponding to the first sample feature data based on the co-linear quantization factor" in the embodiment shown in fig. 3 may be implemented by:
step S3031, determining whether the collinearity quantization factor of each feature corresponding to the first sample feature data is greater than a preset boundary value.
Here, the collinearity quantization factor of each feature corresponding to the first sample feature data is compared with the preset boundary value, i.e. the magnitudes of VIF₁, VIF₂, VIF₃ and VIF₄ are compared with the preset boundary value.
Step S3032, determining the feature of the collinearity quantization factor greater than the preset boundary value as the target feature.
Taking a preset boundary value of 10 as an example, it is determined that VIF₁ and VIF₂ are greater than 10, so the feature AA corresponding to VIF₁ and the feature BB corresponding to VIF₂ are determined as target features.
In combination with the above example, the target feature AA of the first participant is birthday, the target feature BB of the first participant is age, and the feature BB of the second participant is age. There is a linear relationship between birthday and age (age equals the current year minus the birth year), and between age and age (they are equal, i.e. the multiple is 1). That is, the feature data of the target feature AA of the first participant has a linear relationship with the feature data of the feature BB of the first participant and with the feature data of the feature BB of the second participant, and the feature data of the target feature BB of the first participant has a linear relationship with the feature data of the feature AA of the first participant and with the feature data of the feature BB of the second participant.
After the target features are obtained, step S304 is executed to delete the feature data of the target features from the first sample feature data, i.e. to delete the 1st column (feature data of feature AA) and the 2nd column (feature data of feature BB) from the first sample feature data A, obtaining the first training data

A' = [ c1  d1
       c3  d3
       c4  d4 ]
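Steps S3031 to S304 can be sketched in plaintext: inject an artificial near-linear dependency (mimicking the birthday/age relationship), compute the variance expansion factors, and drop the columns exceeding the boundary value of 10. The data and seed here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(9)

# 20 samples, 4 features; feature 0 is (nearly) a linear function of feature 1,
# mimicking the birthday/age relationship in the example.
X = rng.normal(size=(20, 4))
X[:, 0] = 2.0 * X[:, 1] + 0.01 * rng.normal(size=20)

Z = (X - X.mean(axis=0)) / (np.sqrt(len(X)) * X.std(axis=0))
H = Z.T @ Z
vifs = np.diag(np.linalg.inv(H))       # variance expansion factors

threshold = 10.0                       # preset boundary value
target = vifs > threshold              # target features to delete
X_train = X[:, ~target]                # training data with collinear columns removed
```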
In the same manner as the above step of determining the target features among the features corresponding to the first sample feature data held by the first participant, the target feature among the features corresponding to the second sample feature data held by the second participant is determined to be the feature BB of the second participant, and the feature data of feature BB is deleted from the second sample feature data, i.e. the 1st column (feature data of feature BB) is deleted from the second sample feature data B, obtaining the second training data

B' = [ e1  f1
       e3  f3
       e4  f4 ]
Then, the first participant and the second participant perform joint training according to the first training data A' and the second training data B'. Since the feature data with collinear relationships has been deleted from A' and B', the joint training is performed with training data that has no linear relationships, so the obtained federated model has high accuracy and good stability, and the modeling effect of the federated model can be improved.
Based on the foregoing embodiments, a data processing method is further provided in an embodiment of the present application, and fig. 4 is a schematic flow chart of a further implementation of the data processing method provided in the embodiment of the present application, which is applied to the network architecture shown in fig. 1, and as shown in fig. 4, the data processing method includes the following steps:
in step S401, the first party and the second party determine a common user held by the first party and the second party based on a secure multiparty computing protocol.
Because the first participant and the second participant perform joint training, the training data need to come from the same users, i.e. the first sample feature data used by the first participant for joint training and the second sample feature data used by the second participant for joint training must have the same IDs. When the first sample feature data is obtained, the common users held by all participants first need to be determined; then the target users participating in this training are screened from the common users, and the feature data of the target users is used as the first sample feature data.
On the premise of privacy protection, since each participant cannot know which users' data is stored by the other participants, the first participant and the second participant determine the common users based on a privacy protection technique in the embodiment of the application. Determining the common users based on the privacy protection technique may be implemented as follows: the first participant and the second participant each obtain the identifications of their own users; then, based on the privacy protection technique, the intersection of the user identifications held by the first participant and those held by the second participant is computed, and the result is the set of common users.
Step S402, the first participant determines the target users participating in the training.
After obtaining the common users, the first participant determines the target users participating in the training, and for example, randomly selects a random number of users from the common users as the target users participating in the training.
In step S403, the first party sends the identifier of the target user to the second party.
In other embodiments, the second participant may also determine the target user participating in the training, and in this case, step S402 and step S403 may be replaced with:
step S402', the second participant determines the target users participating in the training.
Step S403', the second party sends the identification of the target user to the first party.
In step S404, the first participant acquires first sample feature data owned by the first participant.
The first participant acquires the characteristic data of the target user from the data stored by the first participant as first sample characteristic data.
Step S405, the first participant constructs a virtual first feature correlation matrix based on the first sample feature data and a pre-trained safety calculation model.
In some embodiments, the first participant constructing the first feature correlation matrix may be implemented as: determining feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data; respectively calculating the mean value and the standard deviation corresponding to the characteristic data of each characteristic; determining processed first sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature and the number of samples; inputting the processed first sample characteristic data into a safety calculation model to obtain a first matrix; determining a first symmetric matrix according to the processed first sample characteristic data; generating a first empty matrix having a number of rows and columns equal to the number of columns of the first matrix; and constructing a virtual first characteristic correlation matrix according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix and the first empty matrix.
For example, the first sample feature data is A. Each feature datum x in A is processed according to

x' = (x − x̄) / (√n · σ_x)

to obtain the processed first sample feature data C. The processed first sample feature data C is input into the pre-trained security computation model y_{1,…,m} = f(x_{1,…,m}) to obtain the first matrix E. From the transposed matrix Cᵀ of the processed first sample feature data and the processed first sample feature data C, the first symmetric matrix F = Cᵀ C is determined. The first empty matrix generated by the first participant is G'. The first feature correlation matrix constructed by the first participant is

H = [ F    E
      Eᵀ   G' ]
In step S406, the first participant determines a co-linear quantization factor of each feature corresponding to the first sample feature data based on the first feature correlation matrix.
In some embodiments, the first participant determining the collinearity quantization factor of each feature corresponding to the first sample feature data may be implemented as: determining the determinant of the first feature correlation matrix; deleting the i-th row and i-th column of the first feature correlation matrix to obtain each residue corresponding to the first feature correlation matrix, where i = 1, 2, …, m₁ and m₁ is the number of features corresponding to the first sample feature data; determining the determinant of each residue; and determining the collinearity quantization factor of each feature corresponding to the first sample feature data based on the determinant of the first feature correlation matrix and the determinants of the residues.

Deleting the data of the i-th row and i-th column of the first feature correlation matrix H yields the i-th residue H_ii corresponding to the i-th feature of the first sample feature data. The determinant |H| of the first feature correlation matrix and the determinant |H_ii| of the i-th residue are calculated, and the collinearity quantization factor (taking the variance expansion factor as an example) of each feature corresponding to the first sample feature data is determined as

VIF_i = |H_ii| / |H|
Determining the determinant |H| of the first feature correlation matrix may be implemented as: generating a first random matrix whose determinant is a preset value and whose dimension is the same as that of the first empty matrix; inputting the first feature correlation matrix and the first random matrix into the security computation model to obtain a second matrix; calculating the determinants of the first symmetric matrix and of the second matrix respectively; and multiplying the determinant of the first symmetric matrix by the determinant of the second matrix to obtain the determinant of the first feature correlation matrix.
Similarly, determining the determinant |Hii| of the ith minor corresponding to the ith feature may be implemented as: generating a random matrix whose determinant is a preset value and whose dimension is the same as that of the first empty matrix; inputting the ith minor and the random matrix into the safety calculation model to obtain a matrix corresponding to the ith minor; respectively calculating the determinants of the first symmetric matrix and of the matrix corresponding to the ith minor; and multiplying the determinant of the first symmetric matrix by the determinant of the matrix corresponding to the ith minor to obtain the determinant |Hii| of the ith minor.
In step S407, the first participant determines a target feature from the features corresponding to the first sample feature data based on the collinear quantization factor.
In some embodiments, the determining, by the first participant, of the target feature from the co-linear quantization factors may be implemented as: judging whether the co-linear quantization factor of each feature corresponding to the first sample feature data is larger than a preset boundary value; and determining a feature whose co-linear quantization factor is larger than the preset boundary value as the target feature.
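As a minimal sketch of the screening step (the boundary value 10 is a common rule of thumb for the VIF and is an assumption here, not a value fixed by the application):

```python
import numpy as np

vifs = np.array([1.2, 35.0, 2.1, 14.8])   # co-linear quantization factors per feature
boundary = 10.0                            # preset boundary value (illustrative)

# Features whose factor exceeds the boundary are the target features to delete.
target_features = np.flatnonzero(vifs > boundary)
print(target_features)  # [1 3]
```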
Step S408, the first participant deletes the feature data of the target feature from the first sample feature data to obtain first training data for the first participant to perform joint training with other participants.
The feature data of the target feature in the first participant has a linear relationship with the feature data of at least one of the other features, where the other features include the features held by the first participant other than the target feature and all the features held by the second participant.
In step S409, the second participant obtains second sample feature data owned by the second participant.
And the second participant acquires the characteristic data of the target user from the data stored by the second participant as second sample characteristic data.
And step S410, the second participant constructs a virtual second feature correlation matrix based on the second sample feature data and a pre-trained safety calculation model.
In some embodiments, the second participant constructing the second feature correlation matrix may be implemented as: determining feature data of each feature corresponding to the second sample feature data and the number of samples corresponding to the second sample feature data based on the second sample feature data; respectively calculating the mean value and the standard deviation corresponding to the feature data of each feature; determining processed second sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature, and the number of samples; inputting the processed second sample feature data into the safety calculation model to obtain a first matrix; determining a second symmetric matrix according to the processed second sample feature data; generating a second empty matrix having a number of rows and columns equal to the number of columns of the first matrix; and constructing a virtual second feature correlation matrix according to the second symmetric matrix, the first matrix, the transposed matrix of the first matrix, and the second empty matrix.
For example, the second sample feature data is B. Each column of feature data x in B is processed according to

x' = (x − x̄) / (√n · σ_x)

to obtain processed second sample feature data D. The processed second sample feature data D is input into the pre-trained safety calculation model y1,…,m = f(x1,…,m) to obtain a first matrix E. A second symmetric matrix G = D^T D is determined from the transpose D^T of the processed second sample feature data and the processed second sample feature data D. The second empty matrix generated by the second participant is F'. The second feature correlation matrix constructed by the second participant is then obtained as

H' = [[F', E], [E^T, G]]
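A plaintext sketch of this block layout (the dimensions, the data matrices, and the zero block standing in for the empty matrix F' are all illustrative assumptions; in the protocol the first matrix E comes from the safety calculation model rather than from the other party's data directly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m1, m2 = 100, 3, 2
C = rng.normal(size=(n, m1))    # first party's processed data (not visible to Bob)
D = rng.normal(size=(n, m2))    # second party's processed data

E = C.T @ D                     # first matrix (obtained via secure computation)
G = D.T @ D                     # second symmetric matrix, computed locally
F_blank = np.zeros((m1, m1))    # empty matrix F' standing in for the unknown block

# Virtual second feature correlation matrix: known blocks plus a placeholder.
H_prime = np.block([[F_blank, E], [E.T, G]])
print(H_prime.shape)  # (5, 5)
```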
In step S411, the second participant determines a co-linear quantization factor of each feature corresponding to the second sample feature data based on the feature correlation matrix.
In some embodiments, the determining, by the second participant, of the co-linear quantization factor of each feature corresponding to the second sample feature data may be implemented as: determining a determinant of the second feature correlation matrix; deleting the (i+m1)th row and (i+m1)th column of the second feature correlation matrix (because the first m1 rows and columns of the second feature correlation matrix H' correspond to the features of the first sample feature data, m1 needs to be added to the index when determining the co-linear quantization factor of each feature corresponding to the second sample feature data) to obtain each minor corresponding to the second feature correlation matrix, where i = 1, 2, …, m2, and m2 is the number of features corresponding to the second sample feature data; determining a determinant of each minor; and determining a co-linear quantization factor of each feature corresponding to the second sample feature data based on the determinant of the second feature correlation matrix and the determinants of the respective minors.
Deleting the (i+m1)th row and (i+m1)th column of the second feature correlation matrix H' yields the ith minor H'_(i+m1,i+m1) corresponding to the ith feature of the second sample feature data. The determinant |H'| of the second feature correlation matrix and the determinant |H'_(i+m1,i+m1)| of the ith minor are then calculated, and the co-linear quantization factor of each feature corresponding to the second sample feature data (taking the variance inflation factor as an example) is determined as

VIF_(i+m1) = |H'_(i+m1,i+m1)| / |H'|
Determining the determinant |H'| of the second feature correlation matrix may be implemented as: generating a second random matrix whose determinant is a preset value and whose dimension is the same as that of the second empty matrix; inputting the second feature correlation matrix and the second random matrix into the safety calculation model to obtain a second matrix; calculating the determinants of the second symmetric matrix and the second matrix respectively; and multiplying the determinant of the second symmetric matrix by the determinant of the second matrix to obtain the determinant of the second feature correlation matrix.
Similarly, determining the determinant |H'_(i+m1,i+m1)| of the ith minor corresponding to the ith feature may be implemented as: generating a random matrix whose determinant is a preset value and whose dimension is the same as that of the second empty matrix; inputting the ith minor and the random matrix into the safety calculation model to obtain a matrix corresponding to the ith minor; respectively calculating the determinants of the second symmetric matrix and of the matrix corresponding to the ith minor; and multiplying the determinant of the second symmetric matrix by the determinant of the matrix corresponding to the ith minor to obtain the determinant |H'_(i+m1,i+m1)| of the ith minor.
In step S412, the second participant determines a target feature from the features corresponding to the second sample feature data based on the collinearity quantization factor.
In some embodiments, the determining, by the second participant, of the target feature from the co-linear quantization factors may be implemented as: judging whether the co-linear quantization factor of each feature corresponding to the second sample feature data is larger than a preset boundary value; and determining a feature whose co-linear quantization factor is larger than the preset boundary value as the target feature.
In step S413, the second participant deletes the feature data of the target feature from the second sample feature data to obtain second training data for performing joint training between the second participant and other participants.
The feature data of the target feature in the second participant has a linear relationship with the feature data of at least one of the other features, where the other features include the features held by the second participant other than the target feature and all the features held by the first participant.
In other embodiments, after step S413, the first participant and the second participant jointly train a model by using the first training data and the second training data based on the secure multiparty computation protocol. Since neither the first training data nor the second training data contains feature data with a linear relationship, the accuracy and stability of the federated model obtained by training can be improved, thereby improving the modeling effect of the federated model.
According to the data processing method provided by the embodiment of the application, a first participant and a second participant are trained in advance based on safe multiparty computation to obtain a safe computation model; then a first participant and a second participant respectively obtain first sample characteristic data and second sample characteristic data, the first participant constructs a virtual first characteristic correlation matrix based on the first sample characteristic data and a safety calculation model, and the second participant constructs a virtual second characteristic correlation matrix based on the second sample characteristic data and the safety calculation model; then the first participant determines the co-linear quantization factor of each feature corresponding to the first sample feature data based on the first feature correlation matrix, and the second participant determines the co-linear quantization factor of each feature corresponding to the second sample feature data based on the second feature correlation matrix; the first participant determines a target feature in the features corresponding to the first sample feature data according to the co-linear quantization factor of the features corresponding to the first sample feature data, the second participant determines a target feature in the features corresponding to the second sample feature data according to the co-linear quantization factor of the features corresponding to the second sample feature data, the feature data of the target features has a linear relation with the feature data of at least one feature in other features, and the other features comprise all features except the target feature held by the first participant and the second participant; after the target features are determined, the first participant deletes the feature data of the target features from the first sample feature data to obtain first training data; and the second participant deletes the feature 
data of the target feature from the second sample feature data to obtain second training data. Therefore, on the premise of protecting data privacy, co-linear data in feature data held by the first participant and the second participant can be screened and eliminated to obtain training data without linear relationship, so that when joint training is carried out, the first participant and the second participant carry out joint training by using the training data without linear relationship, the accuracy and stability of the federated model can be improved, and the modeling effect of the federated model is improved.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Longitudinal (vertical) federated learning typically involves joint training of machine learning models by different participants. In the modeling process of a linear model, the existence of collinearity can significantly affect the stability and effect of the linear model, so collinearity needs to be eliminated in the modeling process. The Variance Inflation Factor (VIF) quantifies well the collinearity between a certain feature and all other features and is very common in practical modeling. In the related art, calculating the VIF generally requires gathering the data of each party in one place. But the data of the parties (e.g., banks and enterprises) may involve personal privacy or business secrets, and directly opening it to other parties may cause information leakage. The feature screening schemes for linear federated modeling in the related art mainly target single-column data, and a method such as the VIF for describing the collinearity of multi-column data is lacking. The related art cannot combine data of multiple parties to calculate the VIF while protecting data privacy.
The embodiment of the application targets a scenario with two independent participants; for example, it can be assumed that joint data modeling is carried out between two companies. Due to compliance, privacy, business confidentiality, and similar requirements, the original data of each company cannot leave that company, and the intermediate data exchanged in the modeling process must not allow unnecessary raw data information to be deduced or revealed. The two parties respectively hold different feature data of a number of users with the same IDs; collinearity may exist among these features and may affect the modeling effect of a subsequent linear model. The embodiment of the application calculates the variance inflation factor VIF using a two-party privacy-preserving technical scheme and performs feature screening based on the VIF.
Fig. 5 is a schematic flow chart of the calculation process of the variance inflation factor in the longitudinal federated setting provided in the embodiment of the present application, and Fig. 6 is a schematic flow chart of the calculation process of the determinant of the correlation matrix provided in the embodiment of the present application. The method for calculating the variance inflation factor in the longitudinal federated setting provided in the embodiment of the present application is described in detail below with reference to Fig. 5 and Fig. 6.
1) The embodiment of the application provides two independent participants (denoted Alice and Bob, respectively), which respectively hold data with different features of the same IDs, denoted as matrices A and B (where A and B have the same number of rows n, A has m1 columns, and B has m2 columns).
2) Alice and Bob locally normalize the features (the columns of the matrices) of A and B respectively, obtaining matrices C and D. The normalization is:

x' = (x − x̄) / (√n · σ_x)

where x represents a column of features, x̄ represents the mean of the feature, and σ_x represents the standard deviation of the feature.
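Assuming the scaling denominator includes √n (so that C^T C directly yields the Pearson correlation matrix, as the next steps require), the normalization can be checked locally:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))

n = A.shape[0]
# Column-wise normalization: subtract the mean, divide by sqrt(n) * std.
C = (A - A.mean(axis=0)) / (np.sqrt(n) * A.std(axis=0))

# With this scaling, C^T C reproduces the Pearson correlation of A's columns.
print(np.allclose(C.T @ C, np.corrcoef(A, rowvar=False)))  # True
```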
3) Using the extended SPDZ protocol, matrix addition/subtraction and matrix multiplication are carried out under privacy, and the matrix product C^T D = E is calculated; the result of the calculation is held by both Alice and Bob.
In the embodiment of the application, matrix addition/subtraction and matrix multiplication under privacy protection are realized through the secure multi-party computation protocol SPDZ. Each participant inputs its own private data and obtains the computation result without being able to obtain the private data of the other participants, so no additional information other than the result is leaked in the whole computation process, achieving high security and practical computational efficiency.
4) Alice locally computes the matrix product C^T C = F, and Bob locally computes the matrix product D^T D = G.
5) From the previous steps, the Pearson correlation matrix is obtained:

H = [[F, E], [E^T, G]]

Note 1: here Alice holds F, E, E^T, and Bob holds E, E^T, G.
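The block assembly can be verified in plaintext (here E is computed directly from C and D for checking; in the protocol it is produced under SPDZ, and the data matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
A = rng.normal(size=(n, 3))     # Alice's raw features
B = rng.normal(size=(n, 2))     # Bob's raw features

def normalize(X):
    return (X - X.mean(axis=0)) / (np.sqrt(len(X)) * X.std(axis=0))

C, D = normalize(A), normalize(B)
F, G = C.T @ C, D.T @ D         # computed locally by Alice and Bob
E = C.T @ D                     # computed under SPDZ in the protocol

# The assembled block matrix is the Pearson correlation of the joined data.
H = np.block([[F, E], [E.T, G]])
print(np.allclose(H, np.corrcoef(np.hstack([A, B]), rowvar=False)))  # True
```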
According to the definition of the VIF, for the ith feature:

VIF_i = |H_ii| / |H|

where H_ii is the minor of the matrix H (i.e., the matrix left after deleting the ith row and ith column of H), and |H| represents the determinant of the matrix H.
6) Suppose the ith feature is on Alice's side (on Bob's side, the index i needs to be increased by m1, the number of columns of F). Then Alice can delete the ith row and ith column of the matrix F to obtain the matrix F_ii. Alice and Bob can delete the ith row of the matrix E to obtain E_(i), and correspondingly delete the ith column of the matrix E^T to obtain E_(i)^T. The minor is then

H_ii = [[F_ii, E_(i)], [E_(i)^T, G]]

Note 2: here Alice holds F_ii, E_(i), E_(i)^T, and Bob holds E_(i), E_(i)^T, G.
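That the block form equals the directly deleted minor can be checked in plaintext (the data and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m1, m2 = 60, 3, 2
C = rng.normal(size=(n, m1))
D = rng.normal(size=(n, m2))
F, E, G = C.T @ C, C.T @ D, D.T @ D
H = np.block([[F, E], [E.T, G]])

i = 1                                        # a feature index on Alice's side (i < m1)
F_ii = np.delete(np.delete(F, i, 0), i, 1)   # F with row i and column i removed
E_i = np.delete(E, i, 0)                     # E with row i removed

# Assembling the surviving blocks reproduces the minor of H.
H_ii_blocks = np.block([[F_ii, E_i], [E_i.T, G]])
H_ii_direct = np.delete(np.delete(H, i, 0), i, 1)
print(np.allclose(H_ii_blocks, H_ii_direct))  # True
```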
In the embodiment of the application, the determinant computation problem of the original matrix H, which cannot be processed directly, is converted through matrix transformation into the determinant computation problem of the block matrices F_ii, E_(i), E_(i)^T, and G, which can be processed. The following solves this determinant computation problem based on these block matrices.
7) From Note 1 and Note 2, calculating |H_ii| and |H| amounts to solving the same problem. When calculating |H|, the following M1, M2, M3, M4 correspond to F, E, E^T, G respectively; when calculating |H_ii|, they correspond to F_ii, E_(i), E_(i)^T, G respectively. Alice holds the matrices M1, M2, M3, and Bob holds the matrices M2, M3, M4. The determinant to be computed is

|M| = |[[M1, M2], [M3, M4]]|
8) The determinant in 7) cannot be computed directly, since Alice cannot obtain M4 and Bob cannot obtain M1. When M1 is invertible,

|M| = |M1| · |M4 − M3 · M1^(-1) · M2|
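This is the standard block-determinant (Schur complement) identity; a quick numerical check with random (almost surely invertible) blocks:

```python
import numpy as np

rng = np.random.default_rng(5)
M1 = rng.normal(size=(3, 3))   # Alice's block (almost surely invertible)
M2 = rng.normal(size=(3, 2))
M3 = rng.normal(size=(2, 3))
M4 = rng.normal(size=(2, 2))

M = np.block([[M1, M2], [M3, M4]])
schur = M4 - M3 @ np.linalg.inv(M1) @ M2   # Schur complement of M1 in M

print(np.allclose(np.linalg.det(M),
                  np.linalg.det(M1) * np.linalg.det(schur)))  # True
```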
9) Alice locally computes |M1| and M3 · M1^(-1) · M2, and locally generates a random matrix R1 whose determinant is 1 and whose dimension is consistent with M4. Bob locally generates a random matrix R2 whose determinant is 1 and whose dimension is consistent with M4.
10) Using the extended SPDZ protocol again, matrix subtraction and multiplication are computed and the result is recovered as the matrix

J = R1 · (M4 − M3 · M1^(-1) · M2) · R2

Due to the random matrices, neither Alice nor Bob can recover the other party's original matrix from J.
In the embodiment of the application, the original matrix is confused by constructing a special random matrix, and the determinant problem of the matrix related to the original data information is converted into the determinant calculation problem of the random matrix with the same determinant. The original matrix is subjected to confusion processing through the random matrix, each participant can not recover the data of other participants, privacy protection can be performed on the data of each participant, and data leakage is avoided.
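One way to generate a random matrix with determinant exactly 1 (an illustrative construction, not necessarily the one used in the application) is to multiply unit-triangular factors; masking with such matrices hides the Schur complement entrywise while preserving its determinant:

```python
import numpy as np

def unit_det_random(k, rng):
    """Random matrix with determinant 1: product of unit-triangular factors."""
    L = np.tril(rng.normal(size=(k, k)), -1) + np.eye(k)   # unit lower-triangular
    U = np.triu(rng.normal(size=(k, k)), 1) + np.eye(k)    # unit upper-triangular
    return L @ U

rng = np.random.default_rng(6)
S = rng.normal(size=(4, 4))     # stands in for M4 - M3 M1^(-1) M2
R1 = unit_det_random(4, rng)    # Alice's mask
R2 = unit_det_random(4, rng)    # Bob's mask

J = R1 @ S @ R2                 # the value recovered from the SPDZ computation
# Since |R1| = |R2| = 1, the revealed J has the same determinant as S.
print(np.isclose(np.linalg.det(J), np.linalg.det(S)))  # True
```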
11) Alice locally computes |J|. Since |R1| = |R2| = 1, |J| = |M4 − M3 · M1^(-1) · M2|, thereby obtaining the determinant

|M| = |M1| · |J|

which solves the problem of 7).
12) According to the methods described in 7) to 11), |H_11|, |H_22|, …, |H_(m1+m2,m1+m2)| and |H| are calculated, giving VIF_1, VIF_2, …, VIF_(m1+m2).
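Putting the steps together, a plaintext reference run (no privacy protection; all data is synthetic, and Bob's second feature is constructed to be nearly collinear with Alice's first, so its VIF should be large):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
A = rng.normal(size=(n, 3))                        # Alice's features
b1 = rng.normal(size=(n, 1))
b2 = A[:, :1] + 0.05 * rng.normal(size=(n, 1))     # nearly collinear with A[:, 0]
B = np.hstack([b1, b2])                            # Bob's features

H = np.corrcoef(np.hstack([A, B]), rowvar=False)   # the Pearson correlation matrix
det_H = np.linalg.det(H)
vifs = np.array([np.linalg.det(np.delete(np.delete(H, i, 0), i, 1)) / det_H
                 for i in range(H.shape[0])])

print(np.allclose(vifs, np.diag(np.linalg.inv(H))))  # True
print(vifs[0] > 10 and vifs[4] > 10)                 # the collinear pair stands out
```

Both features of the collinear pair exceed a boundary value of 10 and would be screened out as target features.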
According to the method provided by the embodiment of the application, the originally intractable determinant computation problem is converted, through matrix transformation, into a tractable block-matrix determinant problem; the original matrix is obfuscated by constructing special random matrices, converting the determinant problem of a matrix involving original data information into the determinant computation of a random matrix with the same determinant; and through coordination with the SPDZ protocol, no additional information other than the results is leaked in the whole computation process, achieving high security and practical computational efficiency. The method provided by the embodiment of the application makes secure computation of the variance inflation factor VIF possible, so that efficient feature screening can be performed and the overall effect of a subsequent linear model is improved. In addition, high security and practicality are balanced: no unnecessary data information is leaked apart from the results, and the overall computational cost is kept within a range practical for production.
Continuing with the exemplary structure of the data processing apparatus implemented as software modules provided in the embodiments of the present application, in some embodiments, as shown in fig. 2, the data processing apparatus 70 stored in the memory 140 is applied to a second participant performing joint training on a model, and the software modules in the data processing apparatus 70 may include:
a building module 71, configured to build a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained security computation model, where the security computation model is obtained by the first participant and other participants of federal learning through pre-training based on secure multiparty computation;
a first determining module 72, configured to determine a co-linear quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
a second determining module 73, configured to determine, based on the collinear quantization factor, a target feature from features corresponding to the first sample feature data;
a deleting module 74, configured to delete the feature data of the target feature from the first sample feature data, so as to obtain first training data for performing joint training between the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including features held by the first party other than the target feature and features held by the other parties.
In some embodiments, the building module 71 further includes:
the first determining submodule is used for determining the feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data;
the calculation submodule is used for respectively calculating the mean value and the standard deviation corresponding to the characteristic data of each characteristic;
a second determining submodule, configured to determine processed first sample feature data based on the feature data of each feature, a mean value corresponding to the feature data of each feature, a standard deviation corresponding to the feature data of each feature, and the number of samples;
the input submodule is used for inputting the processed first sample characteristic data into the safety calculation model to obtain a first matrix;
and the constructing submodule is used for constructing a virtual characteristic correlation matrix according to the processed first sample characteristic data and the first matrix.
In some embodiments, the building module further comprises:
the determining unit is used for determining a first symmetric matrix according to the processed first sample characteristic data;
the first generating unit is used for generating a null matrix with the number of rows and the number of columns equal to the number of columns of the first matrix;
and the constructing unit is used for constructing a virtual characteristic correlation matrix according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix and the empty matrix.
In some embodiments, the first determining module 72 further includes:
a third determining submodule for determining a determinant of the feature correlation matrix;
a deleting submodule, configured to delete the ith row and ith column of the feature correlation matrix to obtain each minor corresponding to the feature correlation matrix, where i = 1, 2, …, m1, and m1 is the number of features corresponding to the first sample feature data;
a fourth determining submodule, configured to determine the determinant of each minor;
and a fifth determining submodule, configured to determine a co-linear quantization factor of each feature corresponding to the first sample feature data based on the determinant of the feature correlation matrix and the determinants of the respective minors.
In some embodiments, the third determining sub-module further comprises:
the second generating unit is used for generating a first random matrix with a determinant as a preset value and the dimension being the same as that of the empty matrix;
the input unit is used for inputting the characteristic correlation matrix and the first random matrix into the safety calculation model to obtain a second matrix;
a first calculation unit configured to calculate determinants of the first symmetric matrix and the second matrix, respectively;
and the second calculation unit is used for multiplying the determinant of the first symmetric matrix and the determinant of the second matrix to obtain the determinant of the characteristic correlation matrix.
In some embodiments, the second determining module 73 further includes:
the judgment submodule is used for judging whether the colinearity quantization factor of each feature corresponding to the first sample feature data is larger than a preset boundary value;
and the sixth determining submodule is used for determining the characteristic of the colinearity quantization factor larger than the preset boundary value as the target characteristic.
Here, it should be noted that: the above description of the data processing apparatus embodiment is similar to the above description of the method, with the same advantageous effects as the method embodiment. For technical details not disclosed in the embodiments of the data processing device of the present application, a person skilled in the art should understand with reference to the description of the embodiments of the method of the present application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method described in the embodiment of the present application.
Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform the methods provided by embodiments of the present application, for example, the methods as illustrated in fig. 3 to 6.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; or may be various devices including one of, or any combination of, the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A data processing method for use by a first party to federal learning, the method comprising:
constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained safety calculation model, wherein the safety calculation model is obtained by pre-training the first participant and other participants of federal learning based on safety multi-party calculation;
determining a co-linear quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
determining a target feature from the features corresponding to the first sample feature data based on the co-linear quantization factor;
deleting the feature data of the target feature from the first sample feature data to obtain first training data for joint training of the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with the feature data of at least one of the other features, the other features including features held by the first party other than the target feature and features held by the other parties.
2. The method of claim 1, wherein constructing a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained security computation model comprises:
determining feature data of each feature corresponding to the first sample feature data and the number of samples corresponding to the first sample feature data based on the first sample feature data;
respectively calculating the mean value and the standard deviation corresponding to the characteristic data of each characteristic;
determining processed first sample feature data based on the feature data of each feature, the mean value corresponding to the feature data of each feature, the standard deviation corresponding to the feature data of each feature and the number of samples;
inputting the processed first sample characteristic data into the safety calculation model to obtain a first matrix;
and constructing a virtual characteristic correlation matrix according to the processed first sample characteristic data and the first matrix.
3. The method of claim 2, wherein constructing a virtual feature correlation matrix from the processed first sample feature data and the first matrix comprises:
determining a first symmetric matrix according to the processed first sample characteristic data;
generating a null matrix having a number of rows and columns equal to the number of columns of the first matrix;
and constructing a virtual characteristic correlation matrix according to the first symmetric matrix, the first matrix, the transposed matrix of the first matrix and the null matrix.
4. The method according to any one of claims 1 to 3, wherein determining the collinearity quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix comprises:
determining the determinant of the feature correlation matrix;
deleting the i-th row and the i-th column of the feature correlation matrix to obtain the minor matrices of the feature correlation matrix, where i = 1, 2, …, m1, and m1 is the number of features corresponding to the first sample feature data;
determining the determinant of each minor matrix;
and determining the collinearity quantization factor of each feature corresponding to the first sample feature data based on the determinant of the feature correlation matrix and the determinants of the minor matrices.
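For a correlation matrix, the ratio det(M_i) / det(R) — where M_i deletes row i and column i — equals the i-th diagonal entry of the inverse, i.e. the classical variance inflation factor (VIF). A sketch of that construction; whether the patent combines the two determinants in exactly this ratio is an assumption:

```python
import numpy as np

def collinearity_factors(R):
    """Compute a per-feature collinearity quantization factor as
    det(M_i) / det(R), where M_i is the matrix obtained by deleting the
    i-th row and i-th column of R. For a correlation matrix this equals
    diag(inv(R)), the variance inflation factor of feature i."""
    d = np.linalg.det(R)
    factors = []
    for i in range(R.shape[0]):
        Mi = np.delete(np.delete(R, i, axis=0), i, axis=1)  # principal minor matrix
        factors.append(np.linalg.det(Mi) / d)
    return np.array(factors)

# Example on a small correlation matrix.
R3 = np.array([[1.0, 0.5, 0.2],
               [0.5, 1.0, 0.1],
               [0.2, 0.1, 1.0]])
factors = collinearity_factors(R3)       # agrees with np.diag(np.linalg.inv(R3))
```

A factor far above 1 means the corresponding feature is nearly a linear combination of the others, which is exactly the redundancy the method screens out.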
5. The method of claim 4, wherein determining the determinant of the feature correlation matrix comprises:
generating a first random matrix whose determinant equals a preset value and whose dimensions are the same as those of the null matrix;
inputting the feature correlation matrix and the first random matrix into the secure computation model to obtain a second matrix;
calculating the determinants of the first symmetric matrix and of the second matrix respectively;
and multiplying the determinant of the first symmetric matrix by the determinant of the second matrix to obtain the determinant of the feature correlation matrix.
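Claim 5 turns on the multiplicativity of the determinant, det(R·Q) = det(R)·det(Q): a party can mask a matrix with a random matrix of known (preset) determinant, reveal only the product, and still let the determinant of the original be recovered. A minimal sketch of that masking identity (the protocol's actual message flow through the secure computation model is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_matrix_with_unit_det(k, rng):
    """Generate a random k x k matrix whose determinant equals a preset
    value (here 1): scaling one row of an invertible random matrix by
    1/det rescales the determinant to exactly 1."""
    Q = rng.standard_normal((k, k))
    Q[0, :] /= np.linalg.det(Q)          # now det(Q) == 1
    return Q

# Masking preserves the determinant: det(R @ Q) == det(R) * det(Q) == det(R).
R = rng.standard_normal((3, 3))
Q = random_matrix_with_unit_det(3, rng)
masked = R @ Q
```

With a preset determinant other than 1, the recipient would divide the revealed determinant by the preset value instead.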
6. The method of claim 1, wherein determining a target feature from the features corresponding to the first sample feature data based on the collinearity quantization factor comprises:
determining whether the collinearity quantization factor of each feature corresponding to the first sample feature data is greater than a preset boundary value;
and determining a feature whose collinearity quantization factor is greater than the preset boundary value as the target feature.
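The selection step of claim 6 is a simple threshold. A sketch, using 10 as the boundary value only because it is the conventional VIF cut-off — the patent leaves the preset boundary unspecified:

```python
def select_collinear_features(factors, boundary=10.0):
    """Return the indices of features whose collinearity quantization
    factor exceeds the preset boundary value; these are the target
    features whose data is deleted before joint training."""
    return [i for i, f in enumerate(factors) if f > boundary]

# Features 1 and 3 exceed the boundary and would be removed.
flagged = select_collinear_features([1.2, 25.0, 3.0, 11.7])
```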
7. A data processing apparatus applied to a first participant in federated learning, the apparatus comprising:
a construction module, configured to construct a virtual feature correlation matrix based on first sample feature data held by the first participant and a pre-trained secure computation model, the secure computation model being obtained by pre-training by the first participant and the other participants in federated learning based on secure multi-party computation;
a first determining module, configured to determine a collinearity quantization factor of each feature corresponding to the first sample feature data based on the feature correlation matrix;
a second determining module, configured to determine, based on the collinearity quantization factor, a target feature from the features corresponding to the first sample feature data;
a deleting module, configured to delete the feature data of the target feature from the first sample feature data to obtain first training data for joint training between the first participant and the other participants;
wherein the feature data of the target feature has a linear relationship with feature data of at least one other feature, the other features comprising the features held by the first participant other than the target feature and the features held by the other participants.
8. A data processing device, characterized in that the device comprises:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 6 when executing the executable instructions stored in the memory.
9. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
CN202110454684.2A 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product Pending CN113095514A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110454684.2A CN113095514A (en) 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product
PCT/CN2021/140955 WO2022227644A1 (en) 2021-04-26 2021-12-23 Data processing method and apparatus, and device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110454684.2A CN113095514A (en) 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN113095514A 2021-07-09

Family

ID=76679959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110454684.2A Pending CN113095514A (en) 2021-04-26 2021-04-26 Data processing method, device, equipment, storage medium and program product

Country Status (2)

Country Link
CN (1) CN113095514A (en)
WO (1) WO2022227644A1 (en)


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395180B2 (en) * 2015-03-24 2019-08-27 International Business Machines Corporation Privacy and modeling preserved data sharing
US20200285984A1 (en) * 2019-03-06 2020-09-10 Hcl Technologies Limited System and method for generating a predictive model
CN111062487B (en) * 2019-11-28 2021-04-20 支付宝(杭州)信息技术有限公司 Machine learning model feature screening method and device based on data privacy protection
CN110909216B (en) * 2019-12-04 2023-06-20 支付宝(杭州)信息技术有限公司 Method and device for detecting relevance between user attributes
CN111160573B (en) * 2020-04-01 2020-06-30 支付宝(杭州)信息技术有限公司 Method and device for protecting business prediction model of data privacy joint training by two parties
CN111966473B (en) * 2020-07-24 2024-02-06 支付宝(杭州)信息技术有限公司 Operation method and device of linear regression task and electronic equipment
CN111654853B (en) * 2020-08-04 2020-11-10 索信达(北京)数据技术有限公司 Data analysis method based on user information
CN112597540B (en) * 2021-01-28 2021-10-01 支付宝(杭州)信息技术有限公司 Multiple collinearity detection method, device and system based on privacy protection
CN113095514A (en) * 2021-04-26 2021-07-09 深圳前海微众银行股份有限公司 Data processing method, device, equipment, storage medium and program product

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227644A1 (en) * 2021-04-26 2022-11-03 深圳前海微众银行股份有限公司 Data processing method and apparatus, and device, storage medium and program product
CN113345597A (en) * 2021-07-15 2021-09-03 中国平安人寿保险股份有限公司 Federal learning method and device of infectious disease probability prediction model and related equipment
CN114692201A (en) * 2022-03-31 2022-07-01 北京九章云极科技有限公司 Multi-party security calculation method and system
WO2024022082A1 (en) * 2022-07-29 2024-02-01 脸萌有限公司 Information classification method and apparatus, device, and medium
CN114996749A (en) * 2022-08-05 2022-09-02 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning
CN114996749B (en) * 2022-08-05 2022-11-25 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning
CN115545216A (en) * 2022-10-19 2022-12-30 上海零数众合信息科技有限公司 Service index prediction method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2022227644A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
CN113095514A (en) Data processing method, device, equipment, storage medium and program product
Patra et al. BLAZE: blazing fast privacy-preserving machine learning
CN111814985B (en) Model training method under federal learning network and related equipment thereof
WO2021179720A1 (en) Federated-learning-based user data classification method and apparatus, and device and medium
Chen et al. Secure computation for machine learning with SPDZ
Cock et al. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data
CN111143894B (en) Method and system for improving safe multi-party computing efficiency
CN112085159B (en) User tag data prediction system, method and device and electronic equipment
Zhang et al. Secure distributed genome analysis for GWAS and sequence comparison computation
CN111931950A (en) Method and system for updating model parameters based on federal learning
Fan et al. High-dimensional adaptive function-on-scalar regression
CN112508118B (en) Target object behavior prediction method aiming at data offset and related equipment thereof
Wang et al. Differentially private SGD with non-smooth losses
Christ et al. Differential privacy and swapping: Examining de-identification’s impact on minority representation and privacy preservation in the us census
Miller et al. Universal security for randomness expansion from the spot-checking protocol
Guo et al. The improved split‐step θ methods for stochastic differential equation
Chen et al. Fed-eini: An efficient and interpretable inference framework for decision tree ensembles in vertical federated learning
Zhang et al. Joint intelligence ranking by federated multiplicative update
WO2023096571A2 (en) Data processing for release while protecting individual privacy
CN114547658A (en) Data processing method, device, equipment and computer readable storage medium
Negri et al. Z-process method for change point problems with applications to discretely observed diffusion processes
Chen et al. Fed-EINI: an efficient and interpretable inference framework for decision tree ensembles in federated learning
Taufiq et al. Robust Crypto-Governance Graduate Document Storage and Fraud Avoidance Certificate in Indonesian Private University
Li et al. Consistent estimation in generalized linear mixed models with measurement error
Lv et al. On the sign consistency of the Lasso for the high-dimensional Cox model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination