CN114692089A

CN114692089A - Single variable processing method and variable screening method

Info

Publication number: CN114692089A
Application number: CN202210418824.5A
Authority: CN
Inventors: 陈行; 张德; 彭南博
Original assignee: Jingdong Technology Holding Co Ltd
Current assignee: Jingdong Technology Holding Co Ltd
Priority date: 2022-04-20
Filing date: 2022-04-20
Publication date: 2022-07-01

Abstract

Embodiments of the invention relate to a univariate processing method and a variable screening method. A first data end obtains the difference value between a dependent variable and the dependent variable mean value and sends the difference value to a second data end; receives a third parameter and an encrypted fourth parameter sent by the second data end; obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value; obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the second data end for decryption; and receives the decrypted first correlation coefficient sent by the second data end and outputs the first correlation coefficient. The embodiments of the invention can thus effectively analyze the degree of linear correlation between the independent variable and the dependent variable in a federated scenario.

Description

Single variable processing method and variable screening method

Technical Field

The invention relates to the technical field of computers, in particular to a univariate processing method and a variable screening method.

Background

The federated learning framework is a distributed artificial intelligence model training framework that helps different data owners carry out federated modeling and federated training without sharing private data, and can effectively address the problems of data security and data silos.

Feature engineering is the most important link in machine learning modeling. It refers to the process of turning raw data into model training data and generally comprises three steps: feature preprocessing, feature selection and feature dimensionality reduction. During feature selection, univariate feature analysis is used to analyze the distribution of each feature and its ability to predict the label. In a federated scenario, univariate analysis includes Weight of Evidence (WOE) and Information Value (IV).

However, indicators such as WOE and IV cannot characterize the degree of linear correlation between an independent variable and a dependent variable in a federated scenario.

Disclosure of Invention

The invention provides a univariate processing method and a variable screening method, which aim to solve the problem that the prior art lacks a description of the degree of linear correlation between independent variables and dependent variables in a federated scenario.

In a first aspect, the present invention provides a univariate processing method, which is applied to a first data end in a univariate processing system, where the univariate processing system includes the first data end and a second data end, where the first data end stores a dependent variable, and the second data end stores an independent variable; the method comprises the following steps: obtaining a difference value between the dependent variable and the mean value of the dependent variable, and sending the difference value to a second data terminal, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable; receiving a third parameter and an encrypted fourth parameter sent by a second data end, wherein the third parameter is obtained by calculation according to the regression coefficient and the independent variable mean value, and the fourth parameter is obtained by calculation according to the regression coefficient and the independent variable; obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value; obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant item and the dependent variable, and obtaining a dispersion square sum according to a difference value of the dependent variable and the dependent variable mean value; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a second data terminal for decryption; and receiving the decrypted first correlation coefficient sent by the second data terminal, and outputting the first correlation coefficient.
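The arithmetic behind the first aspect can be sketched in plaintext as follows. This is only an illustration with encryption deliberately left out. It assumes, since this section does not spell out the formulas, that the third parameter is the regression coefficient times the independent variable mean, that the fourth parameter is the regression coefficient times each independent variable value, and that the first correlation coefficient is the coefficient of determination 1 - RSS/TSS. The function names are hypothetical.

```python
# Plaintext sketch of the two-party arithmetic; no encryption is performed here.
# Assumptions (not stated in this section): third = b * mean(x),
# fourth_i = b * x_i, first correlation coefficient = 1 - RSS / TSS.

def second_end_parameters(x, y_diff):
    """Second data end: holds x and receives y_i - mean(y)."""
    x_mean = sum(x) / len(x)
    x_diff = [xi - x_mean for xi in x]
    # Least-squares slope of the unary linear regression y = a + b * x.
    b = sum(dx * dy for dx, dy in zip(x_diff, y_diff)) / sum(dx * dx for dx in x_diff)
    third = b * x_mean                # assumed form of the third parameter
    fourth = [b * xi for xi in x]     # assumed form of the fourth parameter
    return third, fourth

def first_end_coefficient(y, third, fourth):
    """First data end: holds y and receives the third and fourth parameters."""
    y_mean = sum(y) / len(y)
    a = y_mean - third                                        # constant term
    rss = sum((yi - a - fi) ** 2 for yi, fi in zip(y, fourth))  # residual square sum
    tss = sum((yi - y_mean) ** 2 for yi in y)                   # dispersion square sum
    return a, 1.0 - rss / tss

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # independent variable (second data end)
y = [2.0, 4.1, 5.9, 8.2, 9.8]        # dependent variable (first data end)
y_mean = sum(y) / len(y)
third, fourth = second_end_parameters(x, [yi - y_mean for yi in y])
a, r2 = first_end_coefficient(y, third, fourth)
```

A value of `r2` close to 1 indicates that the independent variable explains most of the variation of the dependent variable through a straight line.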

As an optional embodiment, the obtaining a difference between the dependent variable and a mean of the dependent variable, and sending the difference to a second data end, where the difference is used to calculate a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable, includes: receiving an encrypted first parameter sent by a second data end, wherein the first parameter is obtained by calculation according to an independent variable and an independent variable mean value; and obtaining an encrypted second parameter according to the difference value of the encrypted first parameter, the dependent variable and the mean value of the dependent variable, and sending the encrypted second parameter to a second data terminal for decryption, wherein the decrypted second parameter is used for calculating a regression coefficient of a unitary linear regression model constructed by the independent variable and the dependent variable.

As an alternative embodiment, the second data terminal includes a second key pair, and the second key pair includes a second public key and a second private key; the encrypted fourth parameter and the encrypted first parameter are obtained by encrypting the second public key; and the decrypted first correlation coefficient and the decrypted second parameter are obtained by decrypting through the second private key.
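One way the encrypted first and second parameters could interact is through an additively homomorphic scheme. The sketch below uses a toy Paillier cryptosystem, which is an assumption: this section only states that a public/private key pair is used and does not name the scheme. It also assumes the first parameter is the centred independent variable and that all values have been scaled to integers. The primes are fixed and far too small for real use, and all function names are hypothetical.

```python
import math
import random

def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key pair from two fixed Mersenne primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt a signed integer m under the public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return ((1 + n * (m % n)) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Decrypt with the private key and map the result back to a signed integer."""
    lam, mu = priv
    m = ((pow(c, lam, n * n) - 1) // n * mu) % n
    return m - n if m > n // 2 else m

def encrypted_dot(n, enc_x, y):
    """Compute Enc(sum(x_i * y_i)) from ciphertexts Enc(x_i) and plaintext integers y_i."""
    n2 = n * n
    acc = encrypt(n, 0)
    for c, k in zip(enc_x, y):
        # Enc(a)^k = Enc(k * a); multiplying ciphertexts adds the plaintexts.
        acc = (acc * pow(c, k % n, n2)) % n2
    return acc

n, priv = keygen()                      # second data end: second key pair
x_diff = [-2, -1, 0, 1, 2]              # x_i - mean(x), held by the second data end
y_diff = [-3, -1, 0, 1, 3]              # y_i - mean(y), held by the first data end
enc_x = [encrypt(n, v) for v in x_diff]           # encrypted first parameter
enc_dot = encrypted_dot(n, enc_x, y_diff)         # encrypted second parameter
dot = decrypt(n, priv, enc_dot)                   # decrypted at the second data end
```

In this sketch the first data end never sees the independent variable values, and the second data end only learns the aggregated sum.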

As an alternative embodiment, the independent variables stored in the second data end are subjected to binning processing; before the obtaining of the difference value between the dependent variable and the dependent variable mean value, the method further includes: sending the sample identifier and the encrypted sample label value corresponding to the sample identifier to the second data end; receiving an encrypted sample label statistic sent by the second data end, wherein the encrypted sample label statistic is obtained by the second data end by aggregating, according to the sample identifier, the encrypted sample label values in each bin; and decrypting the encrypted sample label statistic to obtain the dependent variable.

As an optional embodiment, before sending the sample identifier and the encrypted sample label value corresponding to the sample identifier to the second data end, the method further includes: generating a first key pair, the first key pair comprising a first public key and a first private key; and encrypting a sample label value with the first public key to obtain the encrypted sample label value; the decrypting the encrypted sample label statistic includes: decrypting the encrypted sample label statistic with the first private key.
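The binning embodiment can be illustrated with a plaintext sketch of the per-bin label aggregation. Encryption is omitted here; in the described flow the additions would be performed on label values encrypted with the first public key. Equal-width binning, the use of a per-bin sum as the statistic, and all names are assumptions made for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (assumed binning rule)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid a zero width when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def per_bin_label_sums(feature_by_id, label_by_id, n_bins):
    """Second data end: aggregate label values per bin, matched by sample identifier."""
    ids = sorted(feature_by_id.keys() & label_by_id.keys())
    bins = equal_width_bins([feature_by_id[i] for i in ids], n_bins)
    sums = [0] * n_bins
    counts = [0] * n_bins
    for sample_id, b in zip(ids, bins):
        # With encrypted labels this addition would be a ciphertext operation.
        sums[b] += label_by_id[sample_id]
        counts[b] += 1
    return sums, counts

features = {"u1": 18, "u2": 25, "u3": 40, "u4": 58, "u5": 33}   # second data end
labels = {"u1": 1, "u2": 0, "u3": 1, "u4": 0, "u5": 1}          # first data end
sums, counts = per_bin_label_sums(features, labels, n_bins=2)
```

The first data end would then decrypt the per-bin statistic and use it as the dependent variable for the regression steps.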

As an optional embodiment, if the second data end includes a plurality of independent variables, the step of obtaining the difference between the dependent variable and the dependent variable mean is performed iteratively until a first correlation coefficient corresponding to each independent variable is output.

As an optional embodiment, after outputting the first correlation coefficient corresponding to each independent variable, the method further includes: selecting the independent variables whose first correlation coefficients meet a first preset condition to form a candidate data set.

In a second aspect, the present invention provides another univariate processing method, which is applied to a second data end in a univariate processing system, where the univariate processing system includes a first data end and the second data end, the first data end stores a dependent variable, and the second data end stores an independent variable; the method comprises the following steps: receiving a difference value between the dependent variable and the dependent variable mean value sent by the first data end, and calculating, according to the difference value, a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable; obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter; sending the third parameter and the encrypted fourth parameter to the first data end, wherein the third parameter and the encrypted fourth parameter are used for calculating an encrypted first correlation coefficient corresponding to the independent variable; and receiving the encrypted first correlation coefficient sent by the first data end, decrypting the encrypted first correlation coefficient, sending the decrypted first correlation coefficient to the first data end, and outputting the first correlation coefficient.

As an optional embodiment, the receiving a difference between a dependent variable sent by the first data end and a mean value of the dependent variable, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the difference includes: calculating according to the independent variable and the independent variable mean value to obtain a first parameter, encrypting the first parameter, and sending the encrypted first parameter to a first data terminal; receiving an encrypted second parameter sent by a first data end, wherein the encrypted second parameter is obtained by calculation according to the encrypted first parameter, a difference value of a dependent variable and a mean value of the dependent variable; and decrypting the encrypted second parameter, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the decrypted second parameter.

As an optional embodiment, before receiving the difference value between the dependent variable and the dependent variable mean value sent by the first data end, the method further includes: performing binning processing on the independent variable; receiving a sample identifier sent by the first data end and an encrypted sample label value corresponding to the sample identifier; and aggregating, according to the sample identifier, the encrypted sample label values in each bin to obtain an encrypted sample label statistic, and sending the encrypted sample label statistic to the first data end for decryption to obtain the dependent variable.

In a third aspect, the present invention provides a first data end, including a first processing module, a first sending module, and a first receiving module; the first processing module is used for obtaining a difference value between a dependent variable and a dependent variable mean value, and sending the difference value to a second data end through the first sending module, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by an independent variable and the dependent variable; the first receiving module is configured to receive a third parameter and an encrypted fourth parameter sent by the second data end, where the third parameter is obtained through calculation according to the regression coefficient and an independent variable mean value, and the fourth parameter is obtained through calculation according to the regression coefficient and the independent variable; the first processing module is further used for obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value; obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtaining a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to the second data end through the first sending module for decryption; the first receiving module is further configured to receive the decrypted first correlation coefficient sent by the second data end, and output the first correlation coefficient.

In a fourth aspect, the present invention provides a second data end, including a second processing module, a second sending module, and a second receiving module; the second receiving module is used for receiving a difference value between the dependent variable and the dependent variable mean value sent by the first data end; the second processing module is used for calculating, according to the difference value, a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable; obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter; the second sending module is configured to send the third parameter and the encrypted fourth parameter to the first data end, where the third parameter and the encrypted fourth parameter are used to calculate an encrypted first correlation coefficient corresponding to the independent variable; the second receiving module is further configured to receive the encrypted first correlation coefficient sent by the first data end, decrypt the encrypted first correlation coefficient through the second processing module, and send the decrypted first correlation coefficient to the first data end through the second sending module.

In a fifth aspect, the present invention provides a univariate processing system, which includes a first data terminal and a second data terminal; wherein the first data terminal is configured to perform the method according to any of the first aspects, and the second data terminal is configured to perform the method according to any of the second aspects.

In a sixth aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; a memory for storing a computer program; a processor configured to implement the method of any one of the first aspect when executing a program stored in a memory.

In a seventh aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any of the first aspect.

In an eighth aspect, the present invention provides a variable screening method, which is applied to a variable screening system, where the variable screening system includes a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set; the method comprises the following steps: the fifth data end acquires a first correlation coefficient between each independent variable in the first independent variable set and the dependent variable; the fifth data end obtains the difference value between the dependent variable and the dependent variable mean value and sends the difference value to the sixth data end; the sixth data end receives the difference value, and calculates, according to the difference value, a regression coefficient of a unary linear regression model constructed by any independent variable in the second independent variable set and the dependent variable; the sixth data end obtains a third parameter according to the regression coefficient and the independent variable mean value, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end; the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end; the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end; the fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient; iteratively executing the step of obtaining the difference value between the dependent variable and the dependent variable mean value until a first correlation coefficient corresponding to each independent variable in the second independent variable set is output; selecting the independent variables in the first independent variable set whose first correlation coefficients meet a first preset condition to form a first candidate data set, and selecting the independent variables in the second independent variable set whose first correlation coefficients meet the first preset condition to form a second candidate data set; selecting any independent variable in the first candidate data set as a target variable, taking the other independent variables in the first candidate data set as first input variables, and taking the second candidate data set as second input variables; the sixth data end obtains a second input variable calculation value according to the second input variables and a second model parameter, and sends the second input variable calculation value to the fifth data end; the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variables and a first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient; iteratively executing the step of selecting any independent variable in the first candidate data set as a target variable until the second correlation coefficient corresponding to each independent variable in the first candidate data set, taken as the target variable, is output; selecting any independent variable in the second candidate data set as a target variable, taking the other independent variables in the second candidate data set as third input variables, and taking the first candidate data set as fourth input variables; the fifth data end obtains a fourth input variable calculation value according to the fourth input variables and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end; the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variables and a third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient; iteratively executing the step of selecting any independent variable in the second candidate data set as a target variable until the second correlation coefficient corresponding to each independent variable in the second candidate data set, taken as the target variable, is output; and selecting the independent variables in the first candidate data set whose second correlation coefficients meet a second preset condition to form a third candidate data set, selecting the independent variables in the second candidate data set whose second correlation coefficients meet the second preset condition to form a fourth candidate data set, and forming a final candidate data set from the third candidate data set and the fourth candidate data set.
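The second correlation coefficient step can be sketched as follows: one party contributes a partial linear sum over its own input variables, and the party holding the target variable completes the prediction and computes 1 - RSS/TSS. This is a plaintext illustration that assumes the model parameters are already fitted (how they are trained is not described in this section) and that the second correlation coefficient is the coefficient of determination. The link to the variance inflation factor, VIF = 1 / (1 - R^2), is the standard definition and is shown as an assumption about how the coefficient may be used; all names are hypothetical.

```python
def partial_sum(rows, weights):
    """Linear contribution of one party's input variables for every sample."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]

def second_correlation(target, own_rows, own_weights, bias, peer_partial):
    """Party holding the target: finish the prediction and return 1 - RSS / TSS."""
    own = partial_sum(own_rows, own_weights)
    predicted = [bias + o + p for o, p in zip(own, peer_partial)]
    mean = sum(target) / len(target)
    rss = sum((t - yhat) ** 2 for t, yhat in zip(target, predicted))  # residual square sum
    tss = sum((t - mean) ** 2 for t in target)                        # dispersion square sum
    return 1.0 - rss / tss

def vif(r2):
    """Standard variance inflation factor for a given coefficient of determination."""
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)

# Sixth data end: second input variables and an assumed, already fitted parameter.
peer_partial = partial_sum([[1.0], [2.0], [3.0], [4.0]], [0.5])
# Fifth data end: target variable, first input variables and assumed parameters.
target = [2.7, 2.0, 5.5, 5.2]
r2 = second_correlation(target, [[2.0], [1.0], [4.0], [3.0]], [1.0], 0.1, peer_partial)
factor = vif(r2)
```

A large factor indicates that the target variable is almost a linear combination of the other variables, which is the multicollinearity the screening step is meant to detect.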

In a ninth aspect, the present invention provides another variable screening method, which is applied to a variable screening system, where the variable screening system includes a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set; the method comprises the following steps: selecting any independent variable in the first independent variable set as a target variable, taking the other independent variables in the first independent variable set as first input variables, and taking the second independent variable set as second input variables; the sixth data end obtains a second input variable calculation value according to the second input variables and a second model parameter, and sends the second input variable calculation value to the fifth data end; the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variables and a first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient; iteratively executing the step of selecting any independent variable in the first independent variable set as a target variable until the second correlation coefficient corresponding to each independent variable in the first independent variable set, taken as the target variable, is output; selecting any independent variable in the second independent variable set as a target variable, taking the other independent variables in the second independent variable set as third input variables, and taking the first independent variable set as fourth input variables; the fifth data end obtains a fourth input variable calculation value according to the fourth input variables and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end; the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variables and a third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient; iteratively executing the step of selecting any independent variable in the second independent variable set as a target variable until the second correlation coefficient corresponding to each independent variable in the second independent variable set, taken as the target variable, is output; selecting the independent variables in the first independent variable set whose second correlation coefficients meet a second preset condition to form a fifth candidate data set, and selecting the independent variables in the second independent variable set whose second correlation coefficients meet the second preset condition to form a sixth candidate data set; the fifth data end acquires a first correlation coefficient between each independent variable in the fifth candidate data set and the dependent variable; the fifth data end obtains the difference value between the dependent variable and the dependent variable mean value and sends the difference value to the sixth data end; the sixth data end receives the difference value, and calculates, according to the difference value, a regression coefficient of a unary linear regression model constructed by any independent variable in the sixth candidate data set and the dependent variable; the sixth data end obtains a third parameter according to the regression coefficient and the independent variable mean value, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end; the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end; the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end; the fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient; iteratively executing the step of obtaining the difference value between the dependent variable and the dependent variable mean value until a first correlation coefficient corresponding to each independent variable in the sixth candidate data set is output; and selecting the independent variables in the fifth candidate data set whose first correlation coefficients meet a first preset condition to form a seventh candidate data set, selecting the independent variables in the sixth candidate data set whose first correlation coefficients meet the first preset condition to form an eighth candidate data set, and forming a final candidate data set from the seventh candidate data set and the eighth candidate data set.

In a tenth aspect, the present invention provides a variable screening system, which includes a fifth data terminal and a sixth data terminal; wherein the fifth data terminal and the sixth data terminal are used for executing the method according to the eighth aspect or the ninth aspect.

The technical scheme provided by the embodiment of the invention at least has part or all of the following advantages:

By transmitting intermediate parameters between the different data owners, the data security of both owners is guaranteed while the degree of linear correlation between independent variables and dependent variables can be effectively analyzed, which provides an effective basis for screening candidate feature variables.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

FIG. 1 is a system architecture diagram according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a univariate processing method according to an embodiment of the present invention;

FIG. 3 is a detailed implementation flowchart of steps S101 and S102 in the first embodiment of the present invention;

FIG. 4 is a schematic flowchart of a univariate processing method according to a second embodiment of the present invention;

FIG. 5 is a schematic diagram of a unary linear regression model fitted to a single independent variable and the dependent variable according to an embodiment of the present invention;

FIG. 6 is a schematic flowchart of a univariate processing method according to a third embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a first data end according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a second data end according to an embodiment of the present invention;

FIG. 9 is a flowchart illustrating a multi-variable processing method according to a fourth embodiment of the present invention;

FIG. 10 is a diagram illustrating the variance inflation factor (VIF) of each feature variable according to an embodiment of the present invention;

FIG. 11 is a flowchart illustrating a multi-variable processing method according to a fifth embodiment of the present invention;

FIG. 12 is a flowchart illustrating a multi-variable processing method according to a sixth embodiment of the present invention;

FIG. 13 is a schematic structural diagram of a third data end according to an embodiment of the present invention;

FIG. 14 is a schematic structural diagram of a fourth data end according to an embodiment of the present invention;

FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

The federated learning framework is a distributed framework for training artificial intelligence models. It helps different data owners carry out federated modeling and federated training without sharing private data, and effectively addresses the problems of data security and data silos.

Feature engineering is the most important step in machine learning modeling. It refers to processing raw data into model training data so that as much information as possible is extracted from the data for the model to use. It generally comprises the following steps:

(1) Feature preprocessing: raw data is very likely to be dirty, so abnormal feature samples and missing values must be handled, and standardization, normalization, and similar processing are required;

(2) Feature selection: after feature preprocessing is complete, meaningful features need to be screened out and fed into the model for training. Screening methods include feature univariate analysis, which analyzes the distribution of each feature and its ability to predict the label. Common indicators include WOE, IV, PSI, and KS, which are widely used in scorecard modeling in consumer finance;

(3) Feature dimensionality reduction: after feature selection is complete, dimensionality reduction can address the low computational efficiency and high model complexity caused by an overly large feature matrix. Specific methods include principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA).

Univariate analysis in the federated scenario includes Weight of Evidence (WOE) and Information Value (IV). However, both WOE and IV are used to select the more important variables to enter the model, and they use predictive strength as the basis for judging whether a variable is important. They do not characterize the linear correlation between an independent variable and the dependent variable, and they provide no measure of the severity of (multi)collinearity in a multiple linear regression model. If multicollinearity exists among the features, the estimates of the model's weight parameters are distorted or difficult to obtain accurately.

In view of the above technical problems, the technical idea of the present invention is as follows: by transmitting encrypted intermediate parameters between different data owners, the coefficient of determination R² is calculated to analyze how well an explanatory variable explains the dependent variable, and the variance inflation factor (VIF) is calculated to analyze the multicollinearity among features, thereby increasing the interpretability of the variables that enter the model.
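The relationship between the two quantities named above can be sketched as follows. This is a minimal illustration that assumes the standard textbook definition VIF = 1 / (1 − R²), where R² comes from regressing one feature on the other features; the function name `variance_inflation_factor` is hypothetical and is not taken from the described method.

```python
def variance_inflation_factor(r2):
    """Return VIF = 1 / (1 - R^2) for a feature regressed on the other features.

    Assumes the textbook definition; r2 must lie in [0, 1).
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


# A feature that the other features do not explain at all has VIF 1.
vif_uncorrelated = variance_inflation_factor(0.0)
# A feature that is 90% explained by the others has VIF close to 10.
vif_collinear = variance_inflation_factor(0.9)
```

Under this definition a larger VIF indicates stronger collinearity, which is why the quantity is useful as a screening signal.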

FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention. As shown in FIG. 1, the system architecture 100 includes an A-side server 101, a B-side server 102, and a network 103. The A-side server 101 and the B-side server 102 each store part of the sample data, and encrypted intermediate parameters are transmitted between them over the network 103, which may be a wired or wireless communication link, an optical fiber cable, or the like.

It should be noted that the A-side server 101 and the B-side server 102 cooperate to implement the following embodiments, which are described in detail below with reference to the accompanying drawings.

The invention provides a univariate processing method applied to a univariate processing system. The univariate processing system includes a first data end, which stores a dependent variable, and a second data end, which stores an independent variable. See FIG. 1, in which the A-side server corresponds to the first data end and the B-side server corresponds to the second data end.

FIG. 2 is a schematic flowchart of a univariate processing method according to an embodiment of the present invention. As shown in FIG. 2, the univariate processing method includes:

Step S101, the first data end obtains the difference between the dependent variable and the dependent variable mean, and sends the difference to the second data end.

The difference is used to calculate the regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable.

Referring to FIG. 1, the A-side server stores the dependent variable y for n samples, and the B-side server stores the independent variable x_t for the same n samples. Optionally, the A side or the B side also stores other independent variables. When the ability of a single variable x_t to explain the dependent variable y needs to be analyzed, the regression is computed by the least squares method; that is, the unary linear regression model constructed from the independent variable and the dependent variable is given by formula (1):

$$y' = f(x_{t,i}) = a_0 + a_1 x_{t,i} \tag{1}$$

where y′ denotes the predicted value of the dependent variable and x_t,i denotes the i-th sample value of the independent variable x_t; a₁ denotes the regression coefficient of the unary linear regression model, calculated as shown in formula (2):

$$a_1 = \frac{\sum_{i=1}^{n} (x_{t,i} - \bar{x}_t)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{t,i} - \bar{x}_t)^2} \tag{2}$$

where $\bar{x}_t$ denotes the mean of x_t, y_i denotes the label value of the i-th sample, and $\bar{y}$ denotes the mean of y.

a₀ denotes the constant term of the unary linear regression model, calculated as shown in formula (3):

$$a_0 = \bar{y} - a_1 \bar{x}_t \tag{3}$$

Then, from the predicted values of the dependent variable, the degree of linear correlation between the independent variable and the dependent variable, namely the coefficient of determination R², can be calculated as shown in formula (4):

$$R^2 = 1 - \frac{\mathrm{RMSE}}{\mathrm{SST}} \tag{4}$$

where RMSE denotes the residual sum of squares of the regression model, calculated as shown in formula (5), and SST denotes the sum of squared deviations of the dependent variable from its mean, calculated as shown in formula (6):

$$\mathrm{RMSE} = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_{t,i}\right)^2 \tag{5}$$

$$\mathrm{SST} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \tag{6}$$
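As a plain, unencrypted reference for the quantities defined above, the following sketch fits the unary model by ordinary least squares and evaluates R². It assumes the textbook least-squares forms of the regression coefficient, constant term, residual sum of squares, and sum of squared deviations; the function name and the sample data are illustrative only.

```python
def fit_unary_regression(x, y):
    """Least-squares fit of y' = a0 + a1 * x; returns (a0, a1, rss, sst, r2)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Regression coefficient: centered cross-products over centered squares.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    a1 = sxy / sxx
    # Constant term.
    a0 = y_mean - a1 * x_mean
    # Residual sum of squares (called RMSE in the text) and SST.
    rss = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return a0, a1, rss, sst, 1.0 - rss / sst


# Illustrative data: y is roughly 2 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a0, a1, rss, sst, r2 = fit_unary_regression(x, y)
```

In the federated setting described below, x and y live on different servers, so no single party can run this function directly; the protocol steps distribute the same arithmetic across the two sides.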

In this step, the A side obtains the dependent variable mean $\bar{y}$ from y_i, calculates the difference $y_i - \bar{y}$ between y_i and $\bar{y}$, and sends $y_i - \bar{y}$ to the B side. It should be noted that, because only the difference $y_i - \bar{y}$ is sent from the A side to the B side, the B side cannot back-calculate y_i from the difference, which ensures the data security of the A side.
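One way to see why the differences alone do not determine y_i is that centering removes any constant offset: two label vectors that differ by a constant produce identical differences. The sketch below illustrates only this property and is not a security proof; note that the receiver still learns the values relative to the unknown mean. The data are made up.

```python
def center(values):
    """Return each value minus the mean of the list."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


y_true = [3.0, 5.0, 4.0, 8.0]
y_shifted = [v + 100.0 for v in y_true]  # a different label vector

d_true = center(y_true)
d_shifted = center(y_shifted)
# Both vectors yield the same differences, so the receiver cannot tell
# which of them (or any other shifted copy) the sender actually holds.
```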

Step S102, the second data end receives the difference between the dependent variable and the dependent variable mean sent by the first data end, and calculates, according to the difference, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.

Specifically, after receiving $y_i - \bar{y}$, the B side can calculate the regression coefficient a₁ of the unary linear regression model according to formula (2).

Fig. 3 is a detailed implementation flowchart of steps S101 and S102 in the first embodiment of the present invention. As shown in fig. 3, steps S101 and S102 include steps S1011 to S1013 as follows:

Step S1011, the second data end calculates a first parameter from the independent variable and the independent variable mean, encrypts the first parameter, and sends the encrypted first parameter to the first data end.

Correspondingly, at the first data end side, the first data end receives the encrypted first parameter sent by the second data end, and the first parameter is obtained by calculation according to the independent variable and the mean value of the independent variable.

In this step, the B side calculates the independent variable mean $\bar{x}_t$ from x_t,i, and then calculates the first parameter u₁ from x_t,i and $\bar{x}_t$. The first parameter u₁ is related to the regression coefficient. The B side encrypts u₁ and then sends the encrypted u₁ to the first data end.

Optionally, the second data end holds a second key pair, which includes a second public key and a second private key; the encrypted first parameter is obtained by encrypting with the second public key. Specifically, the B-side server generates a second key pair comprising a second public key PK_B and a second private key SK_B. After obtaining u₁, the B side encrypts u₁ with PK_B to obtain the encrypted first parameter Enc_B(u₁), and sends Enc_B(u₁) to the first data end. Note that Enc_B(u₁) is encrypted with PK_B and can only be decrypted at the B side with the second private key SK_B; that is, the A side cannot recover u₁.

Step S1012, the first data end obtains an encrypted second parameter from the encrypted first parameter and the difference between the dependent variable and the dependent variable mean, and sends the encrypted second parameter to the second data end.

Correspondingly, on the second data end side, the encrypted second parameter sent by the first data end is received, where the encrypted second parameter is calculated from the encrypted first parameter and the difference between the dependent variable and the dependent variable mean.

In this step, the A side calculates the encrypted second parameter Enc_B(v₁) from Enc_B(u₁) and $y_i - \bar{y}$. Note that, because Enc_B(u₁) is encrypted with PK_B, the resulting second parameter v₁ is also encrypted under PK_B.

Step S1013, the second data end decrypts the encrypted second parameter, and calculates, according to the decrypted second parameter, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.

Optionally, the decrypted second parameter is obtained by decrypting with the second private key; that is, the B side uses the second private key SK_B to decrypt Enc_B(v₁) and obtain the decrypted second parameter v₁. The B side then calculates the regression coefficient a₁ from v₁; the calculation formula is shown in (7), which is a variant of formula (2).
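Steps S1011 to S1013 can be sketched with an additively homomorphic scheme. The text does not name the encryption scheme or give the exact forms of u₁ and v₁, so everything below is an assumption made for illustration: a toy Paillier cryptosystem built on two small Mersenne primes (not secure), u₁ taken as the centered value of each x sample, v₁ taken as the sum of u₁ times the label difference, and a fixed-point scale of 10⁶. All function names are hypothetical.

```python
import random

# Toy Paillier keys held by B (assumption: Paillier; demo-sized primes only).
P, Q = 2**89 - 1, 2**107 - 1          # two Mersenne primes
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)                  # modular inverse, Python 3.8+
SCALE = 10**6                         # fixed-point scale for real numbers

def encrypt(m):                       # public-key operation (uses N only)
    r = random.randrange(1, N)        # demo randomness, not cryptographic
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def decrypt(c):                       # private-key operation (uses PHI, MU)
    m = (pow(c, PHI, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m # map back to a signed integer

def add(c1, c2):                      # Enc(m1) (+) Enc(m2) = Enc(m1 + m2)
    return c1 * c2 % N2

def mul_plain(c, k):                  # Enc(m) (*) k = Enc(m * k)
    return pow(c, k % N, N2)

x = [1.0, 2.0, 3.0, 4.0, 5.0]         # held by B
y = [2.1, 3.9, 6.2, 7.8, 10.1]        # held by A

# S1011 (B): compute u1 and send its encryption to A.
x_mean = sum(x) / len(x)
u1 = [xi - x_mean for xi in x]        # assumed form of the first parameter
enc_u1 = [encrypt(round(u * SCALE)) for u in u1]

# S1012 (A): combine Enc(u1) with the plaintext label differences.
y_mean = sum(y) / len(y)
enc_v1 = encrypt(0)
for c, yi in zip(enc_u1, y):
    enc_v1 = add(enc_v1, mul_plain(c, round((yi - y_mean) * SCALE)))

# S1013 (B): decrypt v1 and derive the regression coefficient.
v1 = decrypt(enc_v1) / SCALE**2       # two fixed-point factors were multiplied
a1 = v1 / sum(u * u for u in u1)
```

With these assumptions A never sees x or u₁ in the clear, and B only sees the aggregate v₁ rather than the individual label differences.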

Step S103, the second data end obtains a third parameter from the regression coefficient and the independent variable mean, obtains a fourth parameter from the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the first data end.

The third parameter and the encrypted fourth parameter are used to calculate the encrypted first correlation coefficient corresponding to the independent variable. Correspondingly, on the first data end side, the third parameter and the encrypted fourth parameter sent by the second data end are received, where the third parameter is calculated from the regression coefficient and the independent variable mean, and the fourth parameter is calculated from the regression coefficient and the independent variable.

In this step, the B side obtains the third parameter u₂ from the regression coefficient a₁ and the independent variable mean $\bar{x}_t$, obtains the fourth parameter w₁ = a₁x_t from the regression coefficient a₁ and the independent variable x_t, encrypts w₁, and sends the encrypted w₁ to the A side. Optionally, the second data end encrypts w₁ with the second public key PK_B to obtain the encrypted fourth parameter Enc_B(w₁).

Optionally, a fifth parameter w₂ can also be obtained from the regression coefficient a₁ and the independent variable x_t and encrypted with the second public key PK_B to obtain Enc_B(w₂), which is sent to the first data end.

Step S104, the first data end obtains the constant term of the unary linear regression model from the third parameter and the dependent variable mean.

In this step, the A side obtains the constant term a₀ of the unary linear regression model from u₂ and $\bar{y}$; the calculation is a variant of formula (3).

Step S105, the first data end obtains the encrypted residual sum of squares from the encrypted fourth parameter, the constant term, and the dependent variable, and obtains the sum of squared deviations from the difference between the dependent variable and the dependent variable mean.

In this step, the first data end obtains the encrypted residual sum of squares Enc_B(RMSE) from Enc_B(w₁), a₀, and y_i; optionally, it obtains the encrypted RMSE from Enc_B(w₁), Enc_B(w₂), a₀, and y_i. The calculation formula is shown in (8), which is a variant of formula (5). The first data end obtains the sum of squared deviations SST from y_i and $\bar{y}$; the calculation formula is shown in (6).

Step S106, the first data end obtains the encrypted first correlation coefficient corresponding to the independent variable from the encrypted residual sum of squares and the sum of squared deviations, and sends the encrypted first correlation coefficient to the second data end.

In this step, the first correlation coefficient may also be called the coefficient of determination R². The A side obtains the encrypted coefficient of determination Enc_B(R²) from Enc_B(RMSE) and SST, and sends Enc_B(R²) to the B side.

Step S107, the second data end receives the encrypted first correlation coefficient sent by the first data end, decrypts the encrypted first correlation coefficient, sends the decrypted first correlation coefficient to the first data end, and outputs the first correlation coefficient.

In this step, the B side uses the second private key SK_B to decrypt Enc_B(R²) and obtain the decrypted first correlation coefficient, namely the coefficient of determination R². The B side sends R² to the A side and also outputs R².
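Steps S103 to S107 can be sketched in the same style. The exact forms of formula (8) and of u₂ and w₂ are not reproduced in the text, so the sketch makes labeled assumptions: u₂ is the regression coefficient times the independent variable mean, w₂ is the square of w₁, the squared residual is expanded as (y − a₀)² − 2(y − a₀)w₁ + w₂, and the cryptosystem is a toy Paillier scheme with demo-sized primes (not secure). Function names are hypothetical, and a₁ is computed in the clear here only to stand in for the result of the earlier steps.

```python
import random

# Toy Paillier keys held by B (assumption; Mersenne primes, demo only).
P, Q = 2**89 - 1, 2**107 - 1
N, PHI = P * Q, (P - 1) * (Q - 1)
N2, MU = N * N, pow(PHI, -1, N)       # pow(..., -1, N) needs Python 3.8+
SCALE = 10**6                         # fixed-point scale

def encrypt(m):
    r = random.randrange(1, N)        # demo randomness, not cryptographic
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def decrypt(c):
    m = (pow(c, PHI, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m

def add(c1, c2):                      # homomorphic addition
    return c1 * c2 % N2

def mul_plain(c, k):                  # multiply ciphertext by plaintext integer
    return pow(c, k % N, N2)

x = [1.0, 2.0, 3.0, 4.0, 5.0]         # held by B
y = [2.1, 3.9, 6.2, 7.8, 10.1]        # held by A
x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
a1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))   # stand-in for S101-S102

# S103 (B): plaintext u2, encrypted w1 and w2.
u2 = a1 * x_mean                                         # assumed form
enc_w1 = [encrypt(round(a1 * xi * SCALE)) for xi in x]
enc_w2 = [encrypt(round((a1 * xi) ** 2 * SCALE**2)) for xi in x]

# S104 (A): constant term.
a0 = y_mean - u2

# S105 (A): encrypted residual sum of squares and plaintext SST.
enc_rss = encrypt(0)
for c1, c2, yi in zip(enc_w1, enc_w2, y):
    e = yi - a0
    term = add(c2, mul_plain(c1, round(-2 * e * SCALE)))  # w2 - 2*e*w1
    enc_rss = add(enc_rss, add(term, encrypt(round(e * e * SCALE**2))))
sst = sum((yi - y_mean) ** 2 for yi in y)

# S106 (A): Enc(R^2) = Enc(1) (+) Enc(RSS) (*) (-1/SST), at scale SCALE**3.
enc_r2 = add(encrypt(SCALE**3), mul_plain(enc_rss, round(-SCALE / sst)))

# S107 (B): decrypt and rescale.
r2 = decrypt(enc_r2) / SCALE**3
```

The scale bookkeeping matters: products of two fixed-point numbers carry SCALE², and the final multiplication by 1/SST adds a third factor, which is why the result is divided by SCALE³.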

Step S108, the first data end receives the decrypted first correlation coefficient sent by the second data end, and outputs the first correlation coefficient.

Specifically, if the B-side includes independent variables of multiple dimensions, the present embodiment may be adopted to analyze a first correlation coefficient between each independent variable and a dependent variable in the first data side.

In particular, the first correlation coefficient, namely the coefficient of determination R², describes how strongly the independent variable affects the dependent variable. If R² is small, the independent variable explains little of the dependent variable and the degree of linear correlation is low. The first preset condition may be set as exceeding a certain threshold, for example 0.3; that is, only when R² is greater than 0.3 is the independent variable considered to explain the dependent variable relatively well, and the independent variable may then be selected into the candidate data set. The candidate data set may be used as a training data set to train a model, or for other purposes.
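The screening rule described above amounts to a simple filter. The sketch below assumes the example threshold of 0.3 from the text and a strict "greater than" comparison; the variable names and R² values are invented for illustration.

```python
def select_candidates(r2_by_variable, threshold=0.3):
    """Keep the variables whose R^2 is strictly greater than the threshold."""
    return [name for name, r2 in r2_by_variable.items() if r2 > threshold]


# Hypothetical first correlation coefficients for four B-side variables.
r2_by_variable = {"x_1": 0.62, "x_2": 0.05, "x_3": 0.31, "x_4": 0.30}
candidates = select_candidates(r2_by_variable)
```

Here `x_4` is excluded because its R² equals the threshold rather than exceeding it; a deployment could equally choose an inclusive comparison.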

According to the univariate processing method provided by this embodiment of the invention, the difference between the dependent variable and the dependent variable mean is obtained and sent to the second data end, where the difference is used to calculate the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable; a third parameter and an encrypted fourth parameter sent by the second data end are received, where the third parameter is calculated from the regression coefficient and the independent variable mean, and the fourth parameter is calculated from the regression coefficient and the independent variable; the constant term of the unary linear regression model is obtained from the third parameter and the dependent variable mean; the encrypted residual sum of squares is obtained from the encrypted fourth parameter, the constant term, and the dependent variable, and the sum of squared deviations is obtained from the difference between the dependent variable and the dependent variable mean; the encrypted first correlation coefficient corresponding to the independent variable is obtained from the encrypted residual sum of squares and the sum of squared deviations, and is sent to the second data end for decryption; and the decrypted first correlation coefficient sent by the second data end is received and output. In this embodiment, encrypted intermediate parameters are transmitted between the two data providers, which ensures the data security of both parties while allowing the degree of linear correlation between a single variable and the dependent variable to be analyzed, thereby providing an effective basis for the subsequent screening of feature variables.

On the basis of the foregoing embodiment, FIG. 4 is a flowchart illustrating a univariate processing method according to a second embodiment of the present invention. In this embodiment, the independent variables stored at the second data end are binned. As shown in FIG. 4, the univariate processing method includes:

Step S201, the first data end sends the sample identifiers and the encrypted sample label values corresponding to the sample identifiers to the second data end.

Correspondingly, on the second data end side, the sample identifiers sent by the first data end and the encrypted sample label values corresponding to the sample identifiers are received.

Step S202, the second data end aggregates the encrypted sample label values of each bin according to the sample identifiers to obtain encrypted sample label statistics, and sends the encrypted sample label statistics to the first data end.

Correspondingly, on the first data end side, the encrypted sample label statistics sent by the second data end are received, where the encrypted sample label statistics are obtained by the second data end by aggregating the encrypted sample label values of each bin according to the sample identifiers.

Step S203, the first data end decrypts the encrypted sample label statistics to obtain the dependent variable.

Step S204, the first data end obtains the difference between the dependent variable and the dependent variable mean, and sends the difference to the second data end.

Step S205, the second data end receives the difference between the dependent variable and the dependent variable mean sent by the first data end, and calculates, according to the difference, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.

Step S206, the second data end obtains a third parameter from the regression coefficient and the independent variable mean, obtains a fourth parameter from the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the first data end.

Step S207, the first data end obtains the constant term of the unary linear regression model from the third parameter and the dependent variable mean.

Step S208, the first data end obtains the encrypted residual sum of squares from the encrypted fourth parameter, the constant term, and the dependent variable, and obtains the sum of squared deviations from the difference between the dependent variable and the dependent variable mean.

Step S209, the first data end obtains the encrypted first correlation coefficient corresponding to the independent variable from the encrypted residual sum of squares and the sum of squared deviations, and sends the encrypted first correlation coefficient to the second data end.

Step S210, the second data end receives the encrypted first correlation coefficient sent by the first data end, decrypts the encrypted first correlation coefficient, sends the decrypted first correlation coefficient to the first data end, and outputs the first correlation coefficient.

Step S211, the first data end receives the decrypted first correlation coefficient sent by the second data end, and outputs the first correlation coefficient.

The implementation manners of step S204 to step S211 in this embodiment are similar to the implementation manners of step S101 to step S108 in the above embodiment, and are not described herein again.

The difference from the above embodiment is that, in order to reduce the risk of overfitting of the unary linear regression model to be constructed and to obtain a more stable regression model, in this embodiment the independent variables stored at the second data end are binned; the sample identifiers and the encrypted sample label values corresponding to the sample identifiers are sent to the second data end; the encrypted sample label statistics sent by the second data end are received, where the encrypted sample label statistics are obtained by the second data end by aggregating the encrypted sample label values of each bin according to the sample identifiers; and the encrypted sample label statistics are decrypted to obtain the dependent variable.

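Specifically, all variables at the B side are first binned into m bins in total, where the i-th bin contains n_i samples. The averages of the samples x_t and of the dependent variable y within the i-th bin are taken:

$$X_{t,i} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{t,j}, \qquad Y_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_j$$

where the sums run over the n_i samples in the i-th bin.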

Then the regression is computed by the least squares method; that is, the unary linear regression model constructed from the independent variable and the dependent variable is given by formula (9):

$$Y' = f(X_{t,i}) = a_0 + a_1 X_{t,i} \tag{9}$$

where X_t = [X_t,0, X_t,1, …, X_t,m]^T, and the regression coefficient a₁ and the constant term a₀ are obtained by least squares as in formulas (2) and (3), with the bin-level values X_t,i and Y_i in place of x_t,i and y_i.

Finally, the coefficient of determination R² is calculated from the regression predictions in the same way as in formula (4).

In this embodiment, the A side first sends the sample identifiers in the sample data and the encrypted sample label values y to the second data end. The B side then aggregates y for each bin according to the sample identifiers to obtain the encrypted sample label statistic y_bin_sum for each bin, and sends the encrypted y_bin_sum to the A side. The A side decrypts the encrypted y_bin_sum to obtain the decrypted y_bin_sum for each bin, and then uses the decrypted y_bin_sum to obtain the dependent variable Y_i for each bin. The A side and the B side then cooperatively perform the method steps described in the first embodiment to obtain the first correlation coefficient corresponding to the independent variable.

Optionally, before step S201, the method further includes: the first data end generates a first key pair, where the first key pair includes a first public key and a first private key, and encrypts the sample label values with the first public key to obtain the encrypted sample label values. Decrypting the encrypted sample label statistics in step S203 includes: decrypting the encrypted sample label statistics with the first private key.

Specifically, the A side generates a first key pair comprising a first public key PK_A and a first private key SK_A. The A side encrypts the sample label values y with PK_A to obtain Enc_A(y), and sends Enc_A(y) and the sample identifiers to the B side. The B side then aggregates the encrypted sample label values of each bin according to the sample identifiers to obtain the encrypted sample label statistic Enc_A(y_bin_sum), and sends Enc_A(y_bin_sum) to the A side. The A side decrypts it with the first private key SK_A to obtain the Y_i corresponding to each bin.
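The encrypted per-bin aggregation can be sketched as follows. The text does not name the encryption scheme or say how the bin sizes reach the A side, so the sketch assumes an additively homomorphic toy Paillier scheme (demo-sized primes, not secure) under A's key pair and assumes that B reports the number of samples per bin alongside each encrypted sum. Sample identifiers, labels, and bin assignments are invented.

```python
import random

# Toy Paillier keys held by A (assumption; Mersenne primes, demo only).
P, Q = 2**89 - 1, 2**107 - 1
N, PHI = P * Q, (P - 1) * (Q - 1)
N2, MU = N * N, pow(PHI, -1, N)       # pow(..., -1, N) needs Python 3.8+

def encrypt(m):                       # uses the public part N only
    r = random.randrange(1, N)        # demo randomness, not cryptographic
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def decrypt(c):                       # uses the private parts PHI and MU
    return (pow(c, PHI, N2) - 1) // N * MU % N

def add(c1, c2):                      # Enc(m1) (+) Enc(m2) = Enc(m1 + m2)
    return c1 * c2 % N2

# A: labels keyed by sample identifier; B: bin of each sample for one feature.
labels = {"id1": 1, "id2": 0, "id3": 1, "id4": 1, "id5": 0, "id6": 0}
bin_of = {"id1": 0, "id2": 0, "id3": 1, "id4": 1, "id5": 1, "id6": 2}

# A encrypts each label and sends (identifier, ciphertext) pairs to B.
enc_labels = {sid: encrypt(v) for sid, v in labels.items()}

# B sums the ciphertexts per bin without being able to read any label.
enc_bin_sum, bin_count = {}, {}
for sid, c in enc_labels.items():
    b = bin_of[sid]
    enc_bin_sum[b] = add(enc_bin_sum[b], c) if b in enc_bin_sum else c
    bin_count[b] = bin_count.get(b, 0) + 1

# A decrypts each bin sum and divides by the (assumed shared) bin size.
y_bin_sum = {b: decrypt(c) for b, c in enc_bin_sum.items()}
Y = {b: y_bin_sum[b] / bin_count[b] for b in y_bin_sum}
```

B handles only ciphertexts and therefore never learns an individual label, while A learns only the per-bin totals for B's feature.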

Then the A side obtains the dependent variable mean $\bar{Y}$ from the dependent variable Y_i, calculates the difference $Y_i - \bar{Y}$ between Y_i and $\bar{Y}$, and sends $Y_i - \bar{Y}$ to the B side, so that the B side calculates, from $Y_i - \bar{Y}$, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable. The B side then obtains the third parameter u₂ from the regression coefficient a₁ and the independent variable mean $\bar{X}_t$, obtains the fourth parameter w₁ = a₁X_t from the regression coefficient a₁ and the independent variable X_t, encrypts w₁, and sends the encrypted w₁ to the A side. Optionally, the B side encrypts w₁ with the second public key PK_B to obtain the encrypted fourth parameter Enc_B(w₁). Optionally, a fifth parameter w₂ = a₁²(X_tX_t) can also be obtained from the regression coefficient a₁ and the independent variable X_t and encrypted with the second public key PK_B to obtain Enc_B(w₂), which is sent to the A side.

Then the A side obtains the constant term a₀ of the unary linear regression model from u₂ and $\bar{Y}$, and obtains the encrypted residual sum of squares Enc_B(RMSE) from Enc_B(w₁), a₀, and Y_i; optionally, it obtains the encrypted RMSE from Enc_B(w₁), Enc_B(w₂), a₀, and Y_i. The calculation formula is shown in (10). The first data end obtains the sum of squared deviations SST from Y_i and $\bar{Y}$; the calculation formula is shown in (11).

The A side obtains the encrypted coefficient of determination Enc_B(R²) from Enc_B(RMSE) and SST, and sends Enc_B(R²) to the B side. The B side uses the second private key SK_B to decrypt Enc_B(R²) and obtain the decrypted first correlation coefficient, namely the coefficient of determination R², and sends R² to the A side. The A side outputs R², and the second data end also outputs R².

FIG. 5 is a schematic diagram of a unary linear regression model fitted to a single independent variable and the dependent variable according to an embodiment of the present invention. After the unary linear regression model is fitted as shown in FIG. 5, the coefficient of determination R² corresponding to the independent variable can be calculated.

According to the univariate processing method provided by this embodiment of the invention, the independent variables stored at the second data end are binned; the sample identifiers and the encrypted sample label values corresponding to the sample identifiers are sent to the second data end; the encrypted sample label statistics sent by the second data end are received, where the encrypted sample label statistics are obtained by the second data end by aggregating the encrypted sample label values of each bin according to the sample identifiers; and the encrypted sample label statistics are decrypted to obtain the dependent variable. By binning the independent variable, this embodiment makes the linear regression model more stable, and by encrypting the sample label values before sending them to the other data provider, it ensures the security of the data.

To further understand the embodiment of the present invention, fig. 6 is a schematic flowchart of a univariate processing method provided in a third embodiment of the present invention, and with reference to fig. 1 and fig. 6, the univariate processing method includes:

Step 1. A generates a public/private key pair PK_A, SK_A and sends the public key PK_A to B;

Step 2. B generates a public/private key pair PK_B, SK_B (PK_A ≠ PK_B, SK_A ≠ SK_B) and sends the public key PK_B to A;

Step 3. A sends the encrypted labels Enc_A(y) to B;

Step 4. B counts the positive and negative samples in each bin of its own features and sends the label statistics Enc_A(y_bin_sum) to A;

Step 5. A decrypts Enc_A(y_bin_sum) with the private key SK_A and calculates the Y_i values for the bins at the B side;

Step 6. B calculates u₁, encrypts u₁ with the public key PK_B, and sends Enc_B(u₁) to A;

Step 7. A calculates Enc_B(v₁) and sends Enc_B(v₁) to B;

Step 8. B decrypts Enc_B(v₁) with the private key SK_B, calculates a₁ and u₂, and sends u₂ to A;

Step 9. B calculates w₁ = a₁X_t and w₂ = a₁²(X_tX_t), encrypts w₁ and w₂ with the public key PK_B, and sends Enc_B(w₁) and Enc_B(w₂) to A;

Step 10. A calculates a₀, Enc_B(RMSE), SST, and Enc_B(R²), and sends Enc_B(R²) to B;

Step 11. B decrypts Enc_B(R²) with the private key SK_B and sends R² to A;

Step 12. Steps 6 to 11 are repeated until all independent variables at the B side have been analyzed.

Specifically, to ensure that the information exchanged by the two parties cannot be cracked by a third party, the A-side server may encrypt the messages it sends to the B-side server with its private key SK_A, and the B-side server may decrypt those messages with the received public key PK_A of the A-side server. Similarly, the B-side server may encrypt the messages it sends to the A-side server with SK_B, and the A-side server may decrypt those messages with the received public key PK_B of the B-side server.

In summary, the A-side server and the B-side server hold different data. Each side performs the calculations involving its own data locally, only encrypted intermediate parameters are transmitted between the two sides, and the receiver cannot recover the original data. The degree of linear correlation between each independent variable and the dependent variable is thus analyzed while the data security of both sides is ensured, which provides an effective basis for subsequent variable screening and the like.

The embodiment of the invention also provides a first data terminal. Fig. 7 is a schematic structural diagram of a first data end according to an embodiment of the present invention, and as shown in fig. 7, the first data end includes a first processing module 10, a first sending module 11, and a first receiving module 12;

the first processing module 10 is configured to obtain the difference between the dependent variable and the dependent variable mean, and send the difference to the second data end through the first sending module 11, where the difference is used to calculate the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable; the first receiving module 12 is configured to receive a third parameter and an encrypted fourth parameter sent by the second data end, where the third parameter is calculated from the regression coefficient and the independent variable mean, and the fourth parameter is calculated from the regression coefficient and the independent variable; the first processing module 10 is further configured to obtain the constant term of the unary linear regression model from the third parameter and the dependent variable mean; obtain the encrypted residual sum of squares from the encrypted fourth parameter, the constant term, and the dependent variable, and obtain the sum of squared deviations from the difference between the dependent variable and the dependent variable mean; and obtain the encrypted first correlation coefficient corresponding to the independent variable from the encrypted residual sum of squares and the sum of squared deviations, and send the encrypted first correlation coefficient to the second data end through the first sending module 11 for decryption; the first receiving module 12 is further configured to receive the decrypted first correlation coefficient sent by the second data end, and output the first correlation coefficient.

As an optional embodiment, the first receiving module 12 is specifically configured to receive an encrypted first parameter sent by a second data end, where the first parameter is obtained according to an independent variable and an independent variable mean value; the first processing module 10 is specifically configured to obtain an encrypted second parameter according to the difference between the encrypted first parameter, the dependent variable, and the mean of the dependent variable, and send the encrypted second parameter to a second data end through the first sending module 11 for decryption, where the decrypted second parameter is used to calculate a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable.

As an alternative embodiment, the independent variables stored in the second data terminal are subjected to binning processing; the first sending module 11 is further configured to send the sample identifier and the encrypted sample tag value corresponding to the sample identifier to a second data terminal; the first receiving module 12 is configured to receive an encrypted sample tag statistic value sent by a second data end, where the encrypted sample tag statistic value is obtained by the second data end by counting encrypted sample tag values of each box according to a sample identifier; the first processing module 10 is configured to decrypt the encrypted sample tag statistic to obtain the dependent variable.

As an alternative embodiment, the first processing module 10 is further configured to generate a first key pair, where the first key pair includes a first public key and a first private key; encrypting a sample tag value through the first public key to obtain the encrypted sample tag value; and decrypting the encrypted sample label statistic value through the first private key.

As an alternative embodiment, the first processing module 10 is further configured to select the independent variables whose first correlation coefficient satisfies a first preset condition, so as to constitute a candidate data set.

The implementation principle and technical effect of the first data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.
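The quantities handled by the first data end can be illustrated in plaintext. The following is a minimal sketch with all encryption omitted; it assumes that the third parameter is the product of the regression coefficient and the independent variable mean, that the fourth parameter is the product of the regression coefficient and the independent variable, and that the first correlation coefficient is the coefficient of determination 1 - RSS/TSS. These forms and the function name `univariate_r2` are assumptions made for illustration and are not confirmed by the text.

```python
def univariate_r2(x, y):
    """Plaintext sketch of the unary regression quantities (no encryption)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Difference between the dependent variable and its mean
    # (the value the first data end sends to the second data end).
    dy = [yi - y_mean for yi in y]
    dx = [xi - x_mean for xi in x]

    # Regression coefficient of y = a + b * x (least squares).
    b = sum(u * v for u, v in zip(dx, dy)) / sum(u * u for u in dx)

    # Assumed forms of the third and fourth parameters.
    third = b * x_mean
    fourth = [b * xi for xi in x]

    # Constant term from the third parameter and the dependent variable mean.
    a = y_mean - third

    # Residual sum of squares and dispersion sum of squares.
    rss = sum((yi - a - fi) ** 2 for yi, fi in zip(y, fourth))
    tss = sum(d * d for d in dy)
    return 1.0 - rss / tss


r2 = univariate_r2([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

In the federated setting the fourth parameter would arrive encrypted, so the residual sum of squares would be evaluated homomorphically rather than in the clear as here.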

The embodiment of the invention also provides a second data terminal. Fig. 8 is a schematic structural diagram of a second data end according to an embodiment of the present invention, as shown in fig. 8, the second data end includes a second processing module 20, a second sending module 21, and a second receiving module 22;

the second receiving module 22 is configured to receive the difference, sent by the first data end, between a dependent variable and the dependent variable mean; the second processing module 20 is configured to calculate, according to the difference, the regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable, to obtain a third parameter according to the regression coefficient and the independent variable mean, to obtain a fourth parameter according to the regression coefficient and the independent variable, and to encrypt the fourth parameter; the second sending module 21 is configured to send the third parameter and the encrypted fourth parameter to the first data end, where the third parameter and the encrypted fourth parameter are used to calculate an encrypted first correlation coefficient corresponding to the independent variable; the second receiving module 22 is further configured to receive the encrypted first correlation coefficient sent by the first data end, which is decrypted by the second processing module 20, and the decrypted first correlation coefficient is sent to the first data end through the second sending module 21.

As an optional embodiment, the second processing module 20 is configured to calculate and obtain a first parameter according to the independent variable and the independent variable mean, encrypt the first parameter, and send the encrypted first parameter to the first data end through the second sending module 21; the second receiving module 22 is configured to receive an encrypted second parameter sent by the first data end, where the encrypted second parameter is obtained by calculation according to the difference between the encrypted first parameter, the dependent variable, and the mean of the dependent variable; the second processing module 20 is further configured to decrypt the encrypted second parameter, and calculate a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the decrypted second parameter.

As an alternative embodiment, the second processing module 20 is further configured to perform binning processing on the independent variables; the second receiving module 22 is further configured to receive a sample identifier sent by the first data end and an encrypted sample tag value corresponding to the sample identifier; the second processing module 20 is further configured to count the encrypted sample tag values of each bin according to the sample identifier to obtain an encrypted sample tag statistical value, and to send the encrypted sample tag statistical value to the first data end through the second sending module 21, where it is decrypted to obtain the dependent variable.

The implementation principle and technical effect of the second data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.

The embodiment of the invention also provides a single variable processing system. Referring to fig. 1, the univariate processing system comprises a first data terminal and a second data terminal; wherein the first data terminal and the second data terminal are configured to perform the method of any of the above embodiments.

The implementation principle and technical effect of the single variable processing system provided in this embodiment are similar to those of the above embodiments, and are not described herein again.

The present invention further provides a multi-variable processing method, which is applied to a multi-variable processing system including a third data end and a fourth data end, wherein the third data end stores a first characteristic variable set and the fourth data end stores a second characteristic variable set, as shown in fig. 1 (the A-side server in fig. 1 corresponds to the third data end, and the B-side server in fig. 1 corresponds to the fourth data end).

Fig. 9 is a flowchart illustrating a multi-variable processing method according to a fourth embodiment of the invention. As shown in fig. 9, the multi-variable processing method includes:

step S301, selecting any one of the characteristic variables in the first characteristic variable set as a target variable, taking other characteristic variables in the first characteristic variable set as first input variables, and taking the second characteristic variable set as second input variables.

Specifically, the A end stores a first characteristic variable set $X_A$, whose feature dimension is $f\_dim_A$. Any one characteristic variable $x_t$ is selected from $X_A$ as the target variable (dependent variable), i.e. $Y = x_t$. The other characteristic variables in the first characteristic variable set, i.e. the variables of $X_A$ other than $x_t$, are used as the first input variables.

The B end stores a second characteristic variable set $X_B$, whose feature dimension is $f\_dim_B$; $X_B$ is used as the second input variable.

Step S302, the fourth data terminal obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the third data terminal.

And the second input variable calculation value is used for determining a second correlation coefficient corresponding to the target variable. Correspondingly, a second input variable calculation value sent by the fourth data end is received at the third data end side, wherein the second input variable calculation value is obtained through calculation according to a second input variable and a second model parameter.

In this step, the B end may obtain the second input variable calculation value $P_B = W_B X_B$ from the second input variable $X_B$ and the second model parameter $W_B$, and send $P_B$ to the A-side server.

And step S303, the third data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter.

In this step, the A end may calculate the target variable predicted value $Y' = W_A X_A + P_B$ from the second input variable calculation value $P_B$, the first input variable $X_A$ and the first model parameter $W_A$.

And S304, the third data end obtains a residual square sum according to the dependent variable and the target variable predicted value, and obtains a dispersion square sum according to the dependent variable and the dependent variable mean value.

In this step, the A end may obtain the residual sum of squares from the dependent variable $Y_i$ and the target variable predicted value $Y'_i$:

$$RSS = \sum_i (Y_i - Y'_i)^2$$

and obtain the dispersion sum of squares from $Y_i$ and the dependent variable mean $\bar{Y}$:

$$TSS = \sum_i (Y_i - \bar{Y})^2$$

Step S305, the third data end determines a second correlation coefficient corresponding to the target variable according to the residual sum of squares and the dispersion sum of squares, and outputs the second correlation coefficient.

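In this step, the second correlation coefficient is also called the variance inflation factor (VIF). In its standard form, $VIF = \frac{1}{1 - R^2}$ with $R^2 = 1 - \frac{RSS}{TSS}$, where $RSS$ is the residual sum of squares and $TSS$ is the dispersion sum of squares.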
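Steps S302 to S305 can be sketched in plaintext as follows. This is an illustrative sketch only: it assumes the standard definition of the variance inflation factor, VIF = 1 / (1 - R^2) = TSS / RSS, and the helper names `second_input_value` and `vif_for_target` are hypothetical.

```python
def second_input_value(w_b, x_b):
    """B end: P_B = W_B X_B, one value per sample."""
    return [sum(w * v for w, v in zip(w_b, row)) for row in x_b]


def vif_for_target(y, x_a, w_a, p_b):
    """A end: predict the target variable, then derive the VIF from RSS and TSS."""
    # Y' = W_A X_A + P_B
    pred = [sum(w * v for w, v in zip(w_a, row)) + p for row, p in zip(x_a, p_b)]
    y_mean = sum(y) / len(y)
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))   # residual sum of squares
    tss = sum((yi - y_mean) ** 2 for yi in y)              # dispersion sum of squares
    if rss == 0.0:
        return float("inf")  # target is perfectly explained by the other variables
    # 1 / (1 - R^2) with R^2 = 1 - RSS / TSS simplifies to TSS / RSS.
    return tss / rss


p_b = second_input_value([1.0], [[1.0], [1.0], [2.0], [2.0]])
vif = vif_for_target([1.0, 2.0, 3.0, 4.0], [[1.0], [1.0], [3.0], [3.0]], [0.5], p_b)
```

Only the aggregated value P_B crosses from the B end to the A end in this sketch; the raw second input variable stays with the B end.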

As an optional embodiment, the step of selecting any one of the characteristic variables in the first characteristic variable set as the target variable is performed iteratively until a second correlation coefficient corresponding to each characteristic variable as the target variable is output.

Specifically, steps S301-S305 are repeated $f\_dim_A$ times to output the second correlation coefficients corresponding to all the characteristic variables in the first characteristic variable set at the A end. Similarly, any characteristic variable in the second characteristic variable set may be iteratively selected as the target variable; that is, the A end and the B end exchange roles to execute the steps of the above embodiment, which are executed $f\_dim_B$ times in total to output the second correlation coefficients corresponding to all the characteristic variables in the second characteristic variable set at the B end. In summary, $f\_dim_A + f\_dim_B$ analyses are needed in total to output the second correlation coefficients corresponding to the characteristic variables of both parties.

As an optional embodiment, after outputting the second correlation coefficient corresponding to each feature variable as the target variable, the method further includes: and selecting the characteristic variables of which the second correlation coefficients meet second preset conditions to form a candidate data set.

Specifically, the second correlation coefficient, also called the variance inflation factor (VIF), describes the degree of collinearity among the characteristic variables; to obtain a more reliable model, characteristic variables with a large VIF may be eliminated. Fig. 10 is a schematic diagram of the VIF corresponding to each characteristic variable according to an embodiment of the present invention. As shown in fig. 10, the VIF values of all characteristic variables are observed, and a characteristic variable whose VIF value is found to be large (a significant outlier) is removed, so as to obtain a feature combination with low correlation and enhance the interpretability of the model.
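The screening described above can be sketched as a simple filter. The text does not specify the second preset condition; the sketch below assumes a fixed upper bound on the VIF, and the default of 10 is a common rule of thumb rather than a value taken from this document. The function name `screen_by_vif` and the feature names are hypothetical.

```python
def screen_by_vif(vif_by_feature, threshold=10.0):
    """Keep the characteristic variables whose VIF does not exceed the threshold."""
    return [name for name, vif in vif_by_feature.items() if vif <= threshold]


candidates = screen_by_vif({"x1": 1.2, "x2": 35.0, "x3": 4.8})
```

Here `x2` would be dropped as strongly collinear with the other variables, leaving `x1` and `x3` in the candidate data set.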

The multi-variable processing method provided by the embodiment of the invention is applied to a third data end in a multi-variable processing system, where the multi-variable processing system includes the third data end and a fourth data end, a first characteristic variable set is stored in the third data end, and a second characteristic variable set is stored in the fourth data end. The method includes: selecting any characteristic variable in the first characteristic variable set as a target variable, taking the other characteristic variables in the first characteristic variable set as first input variables, and taking the second characteristic variable set as second input variables; receiving a second input variable calculation value sent by the fourth data end, where the second input variable calculation value is calculated from the second input variable and a second model parameter; obtaining a target variable predicted value according to the second input variable calculation value, the first input variable and a first model parameter; obtaining a residual sum of squares according to the target variable and the target variable predicted value, and a dispersion sum of squares according to the target variable and the target variable mean; and determining a second correlation coefficient corresponding to the target variable according to the residual sum of squares and the dispersion sum of squares, and outputting the second correlation coefficient. By transmitting only intermediate parameters between different data owners, the embodiment of the invention can analyze the degree of linear correlation among the characteristic variables while ensuring the data security of each data owner, and provides an effective basis for subsequent feature screening.

On the basis of the foregoing embodiment, fig. 11 is a schematic flowchart of a multi-variable processing method according to a fifth embodiment of the present invention, and as shown in fig. 11, the multi-variable processing method includes:

step S401, selecting any one of the characteristic variables in the first characteristic variable set as a target variable, taking other characteristic variables in the first characteristic variable set as first input variables, and taking the second characteristic variable set as second input variables.

And S402, the fourth data terminal constructs a second model according to the initial value of the second model parameter and the second input variable, encrypts the second model and sends the encrypted second model to the third data terminal.

Correspondingly, the encrypted second model sent by the fourth data end is received at the third data end, wherein the second model is constructed and obtained according to the initial value of the second model parameter and the second input variable.

And S403, the third data terminal constructs and obtains a first model according to the initial value of the first model parameter, the first input variable and the target variable.

Optionally, steps S404, S406, S408, S410, and S412 are further included. The first model is encrypted at the third data end, and the encrypted first model is sent to the fourth data end, the encrypted first model is used for the fourth data end to calculate and obtain an encrypted second gradient value of the global loss function on the second model parameter, and the encrypted second gradient value is added with a second random number to obtain an encrypted second gradient related value; and receiving the encrypted second gradient correlation value sent by the fourth data end, decrypting the encrypted second gradient correlation value, sending the decrypted second gradient correlation value to the fourth data end, wherein the decrypted second gradient correlation value is used for the fourth data end to obtain a second gradient value, and updating a second model parameter according to the second gradient value. Correspondingly, at the fourth data end, receiving an encrypted first model sent by the third data end, wherein the first model is constructed and obtained according to a first model parameter initial value, a first input variable and a target variable; calculating and obtaining an encrypted second gradient value of the global loss function to a second model parameter according to the encrypted first model, the encrypted second model and a second input variable; adding the encrypted second gradient value and a second random number to obtain an encrypted second gradient correlation value, and sending the encrypted second gradient correlation value to a third data terminal for decryption; receiving a decrypted second gradient correlation value sent by a third data end, and obtaining a second gradient value according to the second gradient correlation value; and updating the second model parameter according to the second gradient value.

And S404, the third data end encrypts the first model and sends the encrypted first model to the fourth data end.

And S405, the third data end calculates and obtains the encrypted first gradient value of the global loss function to the first model parameter according to the encrypted second model, the encrypted first model and the first input variable.

Step S406, the fourth data end calculates and obtains an encrypted second gradient value of the global loss function to the second model parameter according to the encrypted first model, the encrypted second model, and the second input variable.

And step S407, the third data terminal adds the encrypted first gradient value and the first random number to obtain an encrypted first gradient correlation value, and sends the encrypted first gradient correlation value to the fourth data terminal.

And step S408, the fourth data terminal adds the encrypted second gradient value and the second random number to obtain an encrypted second gradient correlation value, and sends the encrypted second gradient correlation value to the third data terminal.

And step S409, the fourth data terminal receives the encrypted first gradient correlation value, decrypts the encrypted first gradient correlation value, and sends the decrypted first gradient correlation value to the third data terminal.

And step S410, the third data end receives the encrypted second gradient correlation value sent by the fourth data end, decrypts the encrypted second gradient correlation value, and sends the decrypted second gradient correlation value to the fourth data end.

Step S411, the third data end receives the decrypted first gradient correlation value sent by the fourth data end, obtains a first gradient value according to the first gradient correlation value, and updates the first model parameter according to the first gradient value.

Step S412, the fourth data end receives the decrypted second gradient correlation value sent by the third data end, obtains a second gradient value according to the second gradient correlation value, and updates the second model parameter according to the second gradient value.

Steps S402-S412 are executed iteratively until the regression model constructed from the first input variable, the second input variable and the target variable converges and the global loss function attains its minimum value.

Step S413, the fourth data end obtains a second input variable calculation value according to the second input variable and the second model parameter, and sends the second input variable calculation value to the third data end.

And S414, the third data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter.

And step S415, the third data end obtains a residual square sum according to the dependent variable and the target variable predicted value, and obtains a dispersion square sum according to the dependent variable and the dependent variable mean value.

And S416, the third data end determines a second correlation coefficient of the dependent variable, the first input variable and the second input variable according to the residual sum of squares and the dispersion sum of squares, and outputs the second correlation coefficient.

The implementation manners of step S401, step S413 to step S416 in this embodiment are similar to the implementation manners of step S301 to step S305 in the foregoing embodiment, and are not described herein again.

The difference from the above embodiment is that the present embodiment further defines how to determine the parameters of the model constructed from the first input variable, the second input variable and the target variable. In the present embodiment, the following steps are performed iteratively until the regression model constructed from the first input variable, the second input variable and the target variable converges and the global loss function attains its minimum value: receiving an encrypted second model sent by the fourth data end, where the second model is constructed from the initial value of the second model parameter and the second input variable; constructing a first model from the initial value of the first model parameter, the first input variable and the target variable; calculating an encrypted first gradient value of the global loss function with respect to the first model parameter according to the encrypted second model, the first model and the first input variable; adding the encrypted first gradient value and a first random number to obtain an encrypted first gradient correlation value, and sending the encrypted first gradient correlation value to the fourth data end for decryption; receiving the decrypted first gradient correlation value sent by the fourth data end, and obtaining a first gradient value from it; and updating the first model parameter according to the first gradient value. The method further includes: the third data end encrypts the first model and sends the encrypted first model to the fourth data end, where the encrypted first model is used by the fourth data end to calculate an encrypted second gradient value of the global loss function with respect to the second model parameter, and the encrypted second gradient value is added to a second random number to obtain an encrypted second gradient correlation value; and the third data end receives the encrypted second gradient correlation value sent by the fourth data end, decrypts it, and sends the decrypted second gradient correlation value to the fourth data end, where the decrypted second gradient correlation value is used by the fourth data end to obtain a second gradient value and to update the second model parameter according to the second gradient value.

Specifically, the embodiment of the present invention regresses the linear model by a gradient descent method until the regression model constructed by the first input variable, the second input variable, and the target variable converges and the global loss function obtains the minimum value.

For the first model at the A end, the first model parameter of the A end is initialized to obtain its initial value $w_A$, and the second model parameter of the B end is initialized to obtain its initial value $w_B$. The B end then constructs the second model $F_B = w_B X_B$ from $w_B$ and the second input variable $X_B$, encrypts $F_B$, and sends the encrypted $F_B$ to the A end.

Optionally, the fourth data end includes a fourth key pair, where the fourth key pair includes a fourth public key and a fourth private key, and the encrypted second model is obtained by encryption with the fourth public key. Specifically, the fourth key pair generated by the B end includes the fourth public key $PK_B$ and the fourth private key $SK_B$; the B end encrypts $F_B$ with $PK_B$ to obtain the encrypted second model $Enc_B(F_B)$ and sends $Enc_B(F_B)$ to the A end, which cannot decrypt it.

Then, the A end constructs the first model $F_A$ from the initial value $w_A$ of the first model parameter, the first input variable $X_A$ and the target variable $Y$, where $F_A = w_A X_A - Y$.

Then, the A end calculates the encrypted first gradient value of the global loss function with respect to the first model parameter from the encrypted second model $Enc_B(F_B)$, the first model $F_A$ and the first input variable $X_A$: $Enc_B(G_A) = (F_A + Enc_B(F_B)) X_A$.

Then, the A end adds the encrypted first gradient value $Enc_B(G_A)$ and a first random number $R_A$, where $R_A$ is a vector of random values with the same dimension as $G_A$, to obtain the encrypted first gradient correlation value $Enc_B(G_A + R_A)$. The A end sends $Enc_B(G_A + R_A)$ to the B end; the B end decrypts it with the fourth private key $SK_B$ and sends $G_A + R_A$ to the A end; the A end subtracts $R_A$ to obtain $G_A$ and then updates the first model parameter with $G_A$.
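The masked-gradient exchange above relies on an additively homomorphic encryption scheme. The text does not name the scheme; the following sketch assumes a textbook Paillier cryptosystem with deliberately small, insecure parameters and integer-valued data (real-valued data would need a fixed-point encoding). All function names are hypothetical and the code is an illustration of the message flow, not the patented implementation.

```python
import math
import random

# Toy Paillier parameters: two known Mersenne primes, far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1


def keygen():
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)       # valid because the generator is g = n + 1
    return n, (lam, mu)        # public key n, private key (lam, mu)


def encrypt(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n * n) % (n * n)


def decrypt(n, sk, c):
    lam, mu = sk
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m   # map back to signed integers


def add(n, c1, c2):
    """Enc(a) combined with Enc(b) gives Enc(a + b)."""
    return c1 * c2 % (n * n)


def mul_plain(n, c, k):
    """Enc(a) combined with plaintext k gives Enc(a * k)."""
    return pow(c, k % n, n * n)


def masked_gradient_round(f_a, f_b, x_a):
    n_b, sk_b = keygen()                              # B end: key pair (PK_B, SK_B)
    enc_f_b = [encrypt(n_b, v) for v in f_b]          # B -> A: Enc_B(F_B)
    # A end: Enc_B(F_A + F_B) per sample, then Enc_B(G_A) per feature.
    enc_res = [add(n_b, c, encrypt(n_b, v)) for c, v in zip(enc_f_b, f_a)]
    dim = len(x_a[0])
    enc_g_a = []
    for j in range(dim):
        acc = encrypt(n_b, 0)
        for c, row in zip(enc_res, x_a):
            acc = add(n_b, acc, mul_plain(n_b, c, row[j]))
        enc_g_a.append(acc)
    r_a = [random.randrange(-10**6, 10**6) for _ in range(dim)]   # mask R_A
    enc_masked = [add(n_b, c, encrypt(n_b, r)) for c, r in zip(enc_g_a, r_a)]
    masked = [decrypt(n_b, sk_b, c) for c in enc_masked]  # B end sees only G_A + R_A
    return [m - r for m, r in zip(masked, r_a)]           # A end removes the mask


g_a = masked_gradient_round([1, -2, 3], [2, 1, -1], [[1, 2], [3, 4], [5, 6]])
```

The random mask is what prevents the B end, which holds the private key, from learning the A end's gradient when it performs the decryption.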

Similarly, for the second model at the B end, after obtaining the first model $F_A = w_A X_A - Y$, the A end encrypts $F_A$ and sends the encrypted $F_A$ to the B end.

Optionally, the method further includes: the third data end generates a third key pair, where the third key pair includes a third public key and a third private key; and the encrypting of the first model includes encrypting the first model with the third public key. Specifically, the third key pair generated by the A end includes the third public key $PK_A$ and the third private key $SK_A$, and the A end encrypts $F_A$ with $PK_A$ to obtain $Enc_A(F_A)$.

Then, the B end calculates the encrypted second gradient value of the global loss function with respect to the second model parameter from $Enc_A(F_A)$, $F_B$ and the second input variable $X_B$: $Enc_A(G_B) = (Enc_A(F_A) + F_B) X_B$. The B end then adds $Enc_A(G_B)$ and a second random number $R_B$, where $R_B$ is a vector of random values with the same dimension as $G_B$, to obtain the encrypted second gradient correlation value $Enc_A(G_B + R_B)$. The B end sends $Enc_A(G_B + R_B)$ to the A end; the A end decrypts it with the third private key $SK_A$ and sends $G_B + R_B$ to the B end; the B end subtracts $R_B$ to obtain $G_B$ and then updates the second model parameter with $G_B$.
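With the encryption and masking removed, the two-party update reduces to ordinary gradient descent on a linear model whose features are split between the two ends. The sketch below simulates that plaintext loop. The learning rate, the iteration count and the averaging of the gradient over the number of samples are assumptions made for illustration; the function names are hypothetical.

```python
def dot(w, row):
    return sum(a * b for a, b in zip(w, row))


def train_plaintext(x_a, x_b, y, lr=0.1, steps=2000):
    """Plaintext simulation of the A-end and B-end parameter updates."""
    n = len(y)
    w_a = [0.0] * len(x_a[0])   # initial value of the first model parameter
    w_b = [0.0] * len(x_b[0])   # initial value of the second model parameter
    for _ in range(steps):
        f_a = [dot(w_a, row) - yi for row, yi in zip(x_a, y)]   # F_A = w_A X_A - Y
        f_b = [dot(w_b, row) for row in x_b]                    # F_B = w_B X_B
        resid = [a + b for a, b in zip(f_a, f_b)]
        # G_A = (F_A + F_B) X_A and G_B = (F_A + F_B) X_B, averaged over samples.
        g_a = [sum(r * row[j] for r, row in zip(resid, x_a)) / n for j in range(len(w_a))]
        g_b = [sum(r * row[j] for r, row in zip(resid, x_b)) / n for j in range(len(w_b))]
        w_a = [w - lr * g for w, g in zip(w_a, g_a)]
        w_b = [w - lr * g for w, g in zip(w_b, g_b)]
    return w_a, w_b


# Target generated as y = 2 * x_a + 3 * x_b, so the loop should recover 2 and 3.
w_a, w_b = train_plaintext(
    [[1.0], [2.0], [3.0], [4.0]],
    [[1.0], [0.0], [1.0], [0.0]],
    [5.0, 4.0, 9.0, 8.0],
)
```

In the protocol itself each end only ever sees the other end's model output in encrypted form, and each gradient is recovered through the masked decryption round trip rather than computed directly as here.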

As an optional embodiment, the method further comprises: the third data end obtains a first local loss function through calculation according to the first model, obtains an encrypted third local loss function through calculation according to the first model and the encrypted second model, and receives an encrypted second local loss function obtained through calculation by the fourth data end according to the second model; obtaining an encrypted global loss function value according to the first local loss function, the encrypted second local loss function and the third local loss function, and sending the encrypted global loss function value to a fourth data terminal for decryption; and receiving the decrypted global loss function value and the regression model convergence identifier sent by the fourth data terminal.

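Specifically, the global loss function is shown in equation (12), written here in the squared-error form consistent with the gradients $G_A$ and $G_B$ above:

$$Loss = \frac{1}{2}\sum_i \left(F_{A,i} + F_{B,i}\right)^2 \quad (12)$$

The global loss function Loss is decomposed into a local loss function $F_{sqrA}$ related to the first model, a local loss function $F_{sqrB}$ related to the second model, and a local loss function containing both the first model and the second model, as shown in equation (13):

$$Loss = \underbrace{\frac{1}{2}\sum_i F_{A,i}^2}_{F_{sqrA}} + \underbrace{\frac{1}{2}\sum_i F_{B,i}^2}_{F_{sqrB}} + \sum_i F_{A,i} F_{B,i} \quad (13)$$
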
Thus, to obtain the global loss function, the A end calculates a first local loss function $F_{sqrA}$ according to the first model $F_A$, calculates an encrypted third local loss function according to $F_A$ and the encrypted second model $Enc_B(F_B)$, and receives the encrypted second local loss function $F_{sqrB}$ calculated by the B end according to the second model $F_B$. The A end obtains the encrypted value of the global loss function Loss according to the first local loss function, the encrypted second local loss function and the third local loss function, and sends the encrypted global loss function value to the B end for decryption. The A end then receives the decrypted global loss function value and the regression model convergence identifier sent by the B end; when the regression model converges and the global loss function attains its minimum value, the updated first model parameter and second model parameter are determined as the final model parameters, which are used to calculate the second correlation coefficient of each independent variable.
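The three-part decomposition of the global loss can be checked numerically in plaintext. The sketch below assumes a squared-error loss of the form Loss = 1/2 * sum((F_A + F_B)^2), which is consistent with the gradient expressions in this embodiment but whose exact normalization is not recoverable from the text; the function names are hypothetical.

```python
def local_loss_terms(f_a, f_b):
    """Split the assumed squared-error loss into the three local terms."""
    f_sqr_a = 0.5 * sum(a * a for a in f_a)        # depends on the first model only
    f_sqr_b = 0.5 * sum(b * b for b in f_b)        # depends on the second model only
    cross = sum(a * b for a, b in zip(f_a, f_b))   # involves both models
    return f_sqr_a, f_sqr_b, cross


def global_loss(f_a, f_b):
    """Assumed global loss, computed directly from the summed model outputs."""
    return 0.5 * sum((a + b) ** 2 for a, b in zip(f_a, f_b))


terms = local_loss_terms([1.0, -2.0, 3.0], [2.0, 1.0, -1.0])
total = global_loss([1.0, -2.0, 3.0], [2.0, 1.0, -1.0])
```

Only the cross term needs both parties' model outputs, which is why it is the term the A end evaluates against the encrypted second model.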

The multi-variable processing method provided by the embodiment of the invention iteratively executes the following steps until the regression model constructed from the first input variable, the second input variable and the target variable converges and the global loss function attains its minimum value: receiving an encrypted second model sent by the fourth data end, where the second model is constructed from the initial value of the second model parameter and the second input variable; constructing a first model from the initial value of the first model parameter, the first input variable and the target variable; calculating an encrypted first gradient value of the global loss function with respect to the first model parameter according to the encrypted second model, the first model and the first input variable; adding the encrypted first gradient value and a first random number to obtain an encrypted first gradient correlation value, and sending the encrypted first gradient correlation value to the fourth data end for decryption; receiving the decrypted first gradient correlation value sent by the fourth data end, and obtaining a first gradient value from it; and updating the first model parameter according to the first gradient value. The method further includes: the third data end encrypts the first model and sends the encrypted first model to the fourth data end, where the encrypted first model is used by the fourth data end to calculate an encrypted second gradient value of the global loss function with respect to the second model parameter, and the encrypted second gradient value is added to a second random number to obtain an encrypted second gradient correlation value; and the third data end receives the encrypted second gradient correlation value sent by the fourth data end, decrypts it, and sends the decrypted second gradient correlation value to the fourth data end, where the decrypted second gradient correlation value is used by the fourth data end to obtain a second gradient value and to update the second model parameter according to the second gradient value. That is, in the embodiment of the present invention, the respective models are regressed by a gradient descent method to obtain more stable model parameters.

For further understanding of the embodiments of the present invention, fig. 12 is a schematic flowchart of a multi-variable processing method according to a sixth embodiment of the present invention. With reference to fig. 1 and fig. 12, the multi-variable processing method includes:

step 1, B generates a public/private key pair $PK_B$, $SK_B$ and sends the public key $PK_B$ to A;

step 2, A generates a public/private key pair $PK_A$, $SK_A$ ($PK_A \neq PK_B$, $SK_A \neq SK_B$) and sends the public key $PK_A$ to B;

step 3, B sends its own feature dimension $f\_dim_B$ to A;

step 4, A sends its own feature dimension $f\_dim_A$ to B;

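Step 5. A takes, from its full data, one feature as the target variable Y and the remaining features as X_A [the two expressions in this step are missing from the source text];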

Step 6. B calculates F_B = w_B X_B, encrypts F_B with its own public key PK_B, and sends Enc_B(F_B) to A;

Step 7. A calculates F_A = w_A X_A - Y, encrypts F_A with its own public key PK_A, and sends Enc_A(F_A) to B;

Step 8. B calculates Enc_A(G_B) = (Enc_A(F_A) + F_B) X_B and sends Enc_A(G_B + R_B) to A, where R_B is a vector of random values (with the same dimension as G_B);

Step 9. A calculates Enc_B(G_A) = (F_A + Enc_B(F_B)) X_A and sends Enc_B(G_A + R_A) to B, where R_A is a vector of random values (with the same dimension as G_A);

Step 10. B decrypts Enc_B(G_A + R_A) with its private key SK_B and sends G_A + R_A to A;

Step 11. A decrypts Enc_A(G_B + R_B) with its private key SK_A and sends G_B + R_B to B;

Step 12. B calculates F_sqrB [formula missing in source] and sends Enc_B(F_sqrB) to A;

Step 13. A calculates L and L_normA [formula missing in source] and sends Enc_B(L) + Enc_B(L_normA) to B;

Step 14. B decrypts Enc_B(L) + Enc_B(L_normA), calculates L_total = L + L_normA + L_normB, judges whether the model has currently converged, and sends L_total and the convergence flag (true/false) to A;

where L_normA is the regularization term of the A end and L_normB is the regularization term of the B end, both used to constrain the model against overfitting.

Step 15. The two parties each perform gradient optimization locally and update their model weights;

Step 16. Steps 5 to 15 are iterated until the model performance meets the requirement;

Step 17. B calculates P_B = W_B X_B and sends P_B to A;

Step 18. A calculates the predicted value Y = W_A X_A + P_B (if the current Y is a feature of party B, then B calculates the VIF, which then needs to be sent to A);

Step 19. Steps 5 to 18 are repeated until all feature variables have been analyzed, that is, all features of party A are analyzed in turn and then all features of party B are analyzed in turn, for a total of (f_dim_A + f_dim_B) analyses.

Steps 1 and 2 ensure that the information exchanged by the two parties cannot be decrypted by a third party: when sending a message to the B-end server, the A-end server may encrypt it with the private key SK_A, and the B-end server can decrypt the message with the received public key PK_A of the A-end server; similarly, when sending a message to the A-end server, the B-end server may encrypt it with SK_B, and the A-end server can decrypt the message with the received public key PK_B of the B-end server.

To sum up, the A-end server and the B-end server each hold different data; the two parties perform their data-related calculations locally and exchange only encrypted intermediate parameters, from which the receiver cannot recover the original data. The degree of multicollinearity corresponding to each independent variable is thus analyzed while the data security of both parties is ensured, which provides an effective basis for subsequent variable screening and the like.
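The masked gradient exchange of steps 7, 8 and 11 can be sketched with a toy additively homomorphic scheme. The sketch assumes a Paillier-style cryptosystem, which the text above does not name, together with a fixed-point encoding; the key size is far too small for real use, and every function name, constant and data value here is hypothetical.

```python
import math
import random

# Toy Paillier cryptosystem. The two Mersenne primes are hard-coded for
# illustration only and give no real security.
P, Q = 2**31 - 1, 2**61 - 1
SCALE = 10**6  # fixed-point scale used to encode real numbers as integers

def keygen(p=P, q=Q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, value, scale=SCALE):
    m = round(value * scale) % n       # negative values wrap around modulo n
    r = random.randrange(1, n)         # coprime to n with overwhelming probability
    return (1 + m * n) * pow(r, n, n * n) % (n * n)

def decrypt(n, sk, c, scale=SCALE):
    lam, mu = sk
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    if m > n // 2:                     # map the upper half back to negatives
        m -= n
    return m / scale

def add(n, c1, c2):
    """Ciphertext of the sum of the two underlying plaintexts."""
    return c1 * c2 % (n * n)

def mul_plain(n, c, k, scale=SCALE):
    """Ciphertext of plaintext * k; the result carries scale * scale."""
    return pow(c, round(k * scale) % n, n * n)

# Party A: key pair and encrypted partial model (steps 2 and 7).
n_A, sk_A = keygen()
F_A = [0.5, -1.25, 2.0]                # w_A.x_A - y per sample (made-up values)
enc_F_A = [encrypt(n_A, f) for f in F_A]

# Party B: encrypted gradient plus a random mask (step 8).
F_B = [0.25, 0.75, -1.0]               # w_B.x_B per sample (made-up values)
X_B = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
R_B = [random.uniform(-100, 100) for _ in X_B[0]]
masked = []
for j, r in enumerate(R_B):
    acc = encrypt(n_A, r, SCALE**2)    # start the sum from the mask R_B[j]
    for ef, fb, row in zip(enc_F_A, F_B, X_B):
        resid = add(n_A, ef, encrypt(n_A, fb))
        acc = add(n_A, acc, mul_plain(n_A, resid, row[j]))
    masked.append(acc)

# Party A: decrypts G_B + R_B without learning G_B (step 11).
opened = [decrypt(n_A, sk_A, c, SCALE**2) for c in masked]

# Party B: removes its own mask to recover the plaintext gradient.
G_B = [o - r for o, r in zip(opened, R_B)]
```

Because party A only ever sees G_B + R_B, the random vector hides the gradient from the party holding the private key, which is the role the text assigns to R_B.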

The embodiment of the invention also provides a third data terminal. Fig. 13 is a schematic structural diagram of a third data end according to an embodiment of the present invention, as shown in fig. 13, the third data end includes a third processing module 30 and a third receiving module 31;

the third processing module 30 is configured to select any one characteristic variable in the first characteristic variable set as a target variable, use other characteristic variables in the first characteristic variable set as first input variables, and use the second characteristic variable set as second input variables; a third receiving module 31, configured to receive a second input variable calculation value sent by the fourth data end, where the second input variable calculation value is obtained through calculation according to a second input variable and a second model parameter; the third processing module 30 is further configured to obtain a predicted value of the target variable according to the calculated value of the second input variable, the first input variable, and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.
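The computation assigned to the third processing module can be sketched in plaintext as follows. The sketch assumes that the coefficient derived from the two sums of squares is R² = 1 - RSS/TSS, with the variance inflation factor given by the standard identity VIF = 1/(1 - R²) = TSS/RSS; the function name and the sample values are hypothetical.

```python
def collinearity_scores(y, X_A, w_A, P_B):
    """Return (R^2, VIF) for target y from local features and the peer's P_B."""
    # Predicted target: local linear part plus the value received from the peer.
    y_hat = [sum(w * x for w, x in zip(w_A, row)) + pb
             for row, pb in zip(X_A, P_B)]
    y_mean = sum(y) / len(y)
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual sum of squares
    tss = sum((yi - y_mean) ** 2 for yi in y)               # dispersion sum of squares
    r_squared = 1.0 - rss / tss
    return r_squared, tss / rss

y = [1.0, 2.0, 3.0, 4.0]
X_A = [[1.0], [1.0], [1.0], [1.0]]   # a single bias column on the local side
w_A = [0.5]
P_B = [0.5, 1.5, 2.5, 3.0]           # computed value received from the peer
r_squared, vif = collinearity_scores(y, X_A, w_A, P_B)
```

The ratio TSS/RSS is undefined when the fit is perfect (RSS = 0), so an implementation would need to treat that case separately.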

As an alternative embodiment, the third data end further comprises a third sending module 32; the third processing module 30, the third receiving module 31 and the third sending module 32 are further configured to iteratively execute the following steps until the regression model constructed from the first input variable, the second input variable and the target variable converges and the global loss function takes its minimum value: the third receiving module 31 receives an encrypted second model sent by the fourth data end, wherein the second model is constructed according to an initial value of the second model parameter and the second input variable; the third processing module 30 constructs a first model according to an initial value of the first model parameter, the first input variable and the target variable; calculates an encrypted first gradient value of the global loss function with respect to the first model parameter according to the encrypted second model, the encrypted first model and the first input variable; adds the encrypted first gradient value and the first random number to obtain an encrypted first gradient correlation value, and sends the encrypted first gradient correlation value to the fourth data end through the third sending module 32 for decryption; the third receiving module 31 receives the decrypted first gradient correlation value sent by the fourth data end, and the third processing module 30 obtains a first gradient value according to the first gradient correlation value and updates the first model parameter according to the first gradient value.

As an optional implementation manner, the fourth data end includes a fourth key pair, where the fourth key pair includes a fourth public key and a fourth private key; wherein the encrypted second model is obtained by the fourth public key encryption; the decrypted first gradient correlation value is obtained by decrypting the fourth private key.

As an optional implementation manner, the third processing module 30 is further configured to encrypt the first model, and send the encrypted first model to the fourth data end, where the encrypted first model is used by the fourth data end to calculate and obtain an encrypted second gradient value of the global loss function on the second model parameter, and add the encrypted second gradient value and the second random number to obtain an encrypted second gradient correlation value; the third receiving module 31 is further configured to receive the encrypted second gradient correlation value sent by the fourth data end, decrypt the encrypted second gradient correlation value, and send the decrypted second gradient correlation value to the fourth data end, where the decrypted second gradient correlation value is used by the fourth data end to obtain a second gradient value, and update the second model parameter according to the second gradient value.

As an optional implementation manner, the third processing module 30 is further configured to: generating a third key pair, the third key pair comprising a third public key and a third private key; encrypting the first model through the third public key; and decrypting the encrypted second gradient correlation value by the third private key.

As an optional implementation manner, the third processing module 30 is further configured to obtain a first local loss function according to a first model calculation, obtain an encrypted third local loss function according to the first model and an encrypted second model calculation, and receive, by the third receiving module 31, an encrypted second local loss function obtained by a fourth data terminal according to the second model calculation; the third processing module 30 obtains an encrypted global loss function value according to the first local loss function, the encrypted second local loss function, and the third local loss function, and sends the encrypted global loss function value to the fourth data end through the third sending module 32 for decryption; the third receiving module 31 receives the decrypted global loss function value and the regression model convergence identifier sent by the fourth data terminal.
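One way to read the three loss terms above is as the expansion of a squared loss over the sum of the two partial models. The sketch below checks that identity numerically in plaintext; it assumes an unnormalized squared loss without the regularization terms, and the mapping of the first, second and third local loss functions onto these three terms is an interpretation rather than something the text states. All names and values are hypothetical.

```python
def local_loss(F):
    """Sum of squares of one party's partial model values."""
    return sum(f * f for f in F)

def cross_loss(F_A, F_B):
    """Cross term that needs both parties' partial model values."""
    return 2 * sum(fa * fb for fa, fb in zip(F_A, F_B))

F_A = [0.5, -1.25, 2.0]    # w_A.x_A - y per sample (made-up values)
F_B = [0.25, 0.75, -1.0]   # w_B.x_B per sample (made-up values)

# Loss computed directly from the joint residual F_A + F_B.
L_direct = sum((fa + fb) ** 2 for fa, fb in zip(F_A, F_B))

# The same loss assembled from two local terms and one cross term.
L_parts = local_loss(F_A) + local_loss(F_B) + cross_loss(F_A, F_B)
```

Under this reading, only the cross term requires one party to operate on the other's encrypted values; the two local terms can each be computed by a single party.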

As an optional implementation manner, the third processing module 30 is configured to iteratively execute the step of selecting any one of the feature variables in the first feature variable set as a target variable until a second correlation coefficient corresponding to each feature variable as a target variable is output.

As an alternative embodiment, the third processing module 30 is configured to select a feature variable of which the second correlation coefficient satisfies a second preset condition, and constitute a candidate data set.

The implementation principle and technical effect of the third data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.

The embodiment of the invention also provides a fourth data terminal. Fig. 14 is a schematic structural diagram of a fourth data end according to an embodiment of the present invention, and as shown in fig. 14, the fourth data end includes a fourth processing module 40 and a fourth sending module 41; the fourth processing module 40 is configured to obtain a second input variable calculation value according to a second input variable and a second model parameter; the fourth sending module 41 is configured to send the second input variable calculated value to the third data end, where the second input variable calculated value is used to determine a second correlation coefficient corresponding to the target variable.

As an optional embodiment, the fourth data end further includes a fourth receiving module 42; the fourth processing module 40 is further configured to iteratively execute the following steps until the regression model constructed from the first input variable, the second input variable and the target variable converges and the global loss function takes its minimum value: a second model is constructed according to an initial value of the second model parameter and the second input variable, the second model is encrypted, and the encrypted second model is sent to the third data end through the fourth sending module 41, where the encrypted second model is used by the third data end to obtain an encrypted first gradient value of the global loss function with respect to the first model parameter and to add the encrypted first gradient value and a first random number to obtain an encrypted first gradient correlation value; the fourth receiving module 42 is configured to receive the encrypted first gradient correlation value sent by the third data end, the fourth processing module 40 decrypts the encrypted first gradient correlation value, and the fourth sending module 41 sends the decrypted first gradient correlation value to the third data end, where the decrypted first gradient correlation value is used by the third data end to obtain a first gradient value and to update the first model parameter according to the first gradient value.

As an optional embodiment, the fourth receiving module 42 is further configured to receive an encrypted first model sent by the third data end, where the first model is constructed according to an initial value of the first model parameter, the first input variable and the target variable; the fourth processing module 40 is further configured to calculate, according to the encrypted first model, the encrypted second model and the second input variable, an encrypted second gradient value of the global loss function with respect to the second model parameter, to add the encrypted second gradient value and a second random number to obtain an encrypted second gradient correlation value, and to send the encrypted second gradient correlation value to the third data end for decryption; the fourth receiving module 42 is configured to receive the decrypted second gradient correlation value sent by the third data end, and the fourth processing module 40 obtains a second gradient value according to the second gradient correlation value and updates the second model parameter according to the second gradient value.

The implementation principle and technical effect of the fourth data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.

The embodiment of the invention also provides a multi-variable processing system. Referring to fig. 1, the multi-variable processing system comprises a third data end and a fourth data end, wherein the third data end and the fourth data end cooperate to perform the multi-variable processing method of any one of the above embodiments.

The multi-variable processing system provided in this embodiment has similar implementation principles and technical effects as those of the above embodiments, and will not be described herein again.

The invention further provides a variable screening method, which is applied to a variable screening system, wherein the variable screening system comprises a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set, which can be shown in fig. 1 (an a-end server in fig. 1 is equivalent to the fifth data end, and a B-end server in fig. 1 is equivalent to the sixth data end).

The variable screening method comprises the following steps of S501-S517:

step S501, the fifth data end obtains a first correlation coefficient of each independent variable and dependent variable in the first independent variable set.

Specifically, the A end stores the dependent variable and part of the independent variables (i.e., the first independent variable set) corresponding to the sample data, and the B end stores the other part of the independent variables (i.e., the second independent variable set) corresponding to the sample data. First, the A end obtains the degree of linear correlation between each of its own independent variables and the dependent variable (i.e., the first correlation coefficient, the variance expansion coefficient R² in the above embodiment).

Step S502, the fifth data end obtains the difference value of the dependent variable and the mean value of the dependent variable, and sends the difference value to the sixth data end.

Step S503, the sixth data end receives the difference, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable and dependent variable in the first independent variable set according to the difference.

Step S504, the sixth data end obtains a third parameter according to the regression coefficient and the independent variable mean, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end.

Step S505, the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference between the dependent variable and the dependent variable mean; it then obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end.

Step S506, the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end.

Step S507, the fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient; the step of obtaining the difference between the dependent variable and the dependent variable mean is executed iteratively until a first correlation coefficient corresponding to each independent variable in the second independent variable set has been output.

Step S508, selecting an independent variable in the first independent variable set whose first correlation coefficient satisfies a first preset condition to form a first candidate data set, and selecting an independent variable in the second independent variable set whose first correlation coefficient satisfies the first preset condition to form a second candidate data set.

Specifically, steps S502 to S507 are implemented similarly to the univariate processing method described in the first aspect, and are used to obtain the variance expansion coefficient R² between each independent variable at the B end and the dependent variable at the A end. Independent variables whose variance expansion coefficient R² is greater than a certain threshold (e.g., 0.3) are then selected from the first independent variable set to constitute the first candidate data set, and independent variables whose variance expansion coefficient R² is greater than a certain threshold (e.g., 0.3) are selected from the second independent variable set to constitute the second candidate data set.
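The arithmetic behind steps S502 to S508 can be sketched in plaintext as follows, with all encryption and message passing omitted. The sketch assumes the third parameter is the regression coefficient times the independent variable mean, the fourth parameter is the regression coefficient times the independent variable, and the first correlation coefficient is R² = 1 - RSS/TSS; these concrete forms, the function names and the data are assumptions for illustration.

```python
def univariate_r_squared(x, y):
    """R^2 of the unary linear regression of y on x (plaintext sketch)."""
    n = len(y)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    d = [yi - y_mean for yi in y]                 # difference sent by the y holder
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * di for xi, di in zip(x, d)) / sxx  # regression coefficient
    third = b * x_mean                            # assumed form of the third parameter
    fourth = [b * xi for xi in x]                 # assumed form of the fourth parameter
    a = y_mean - third                            # constant term of the model
    rss = sum((yi - a - fi) ** 2 for yi, fi in zip(y, fourth))
    tss = sum(di ** 2 for di in d)
    return 1.0 - rss / tss

def screen(r2_by_name, threshold=0.3):
    """Keep the variables whose R^2 exceeds the threshold (0.3 as in the text)."""
    return [name for name, r2 in r2_by_name.items() if r2 > threshold]

y = [1.0, 3.0, 5.0, 7.0]
candidates = {
    "x1": [0.0, 1.0, 2.0, 3.0],     # perfectly linear in y
    "x2": [1.0, -1.0, -1.0, 1.0],   # uncorrelated with y
}
r2 = {name: univariate_r_squared(x, y) for name, x in candidates.items()}
selected = screen(r2)
```

In the protocol, the party holding x never sees y itself, only the centered values d, and the party holding y only sees the fourth parameter in encrypted form.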

Step S509, selecting any independent variable in the first candidate data set as a target variable, taking other independent variables in the first candidate data set as first input variables, and taking the second candidate data set as second input variables.

Step S510, the sixth data end obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the fifth data end.

Step S511, the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.

Step S512, iteratively executing the step of selecting any one of the independent variables in the first candidate data set as the target variable until a second correlation coefficient corresponding to each independent variable in the first candidate data set as the target variable is output.

Step S513, selecting any independent variable in the second candidate data set as a target variable, taking other independent variables in the second candidate data set as third input variables, and taking the first candidate data set as a fourth input variable.

Step S514, the fifth data end obtains a fourth input variable calculation value according to a fourth input variable and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end.

Step S515, the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variable and the third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.

Step S516, iteratively executing the step of selecting any independent variable in the second candidate data set as the target variable until a second correlation coefficient corresponding to each independent variable in the second candidate data set as the target variable is output.

Step S517, selecting the independent variables in the first candidate data set whose second correlation coefficient satisfies a second preset condition to form a third candidate data set, selecting the independent variables in the second candidate data set whose second correlation coefficient satisfies the second preset condition to form a fourth candidate data set, and forming a final candidate data set from the third candidate data set and the fourth candidate data set.

Specifically, after obtaining the first candidate data set and the second candidate data set, the multivariate processing method as described in the above embodiment is then performed to obtain the degree of multicollinearity of each independent variable with other independent variables in the two candidate data sets, i.e., the coefficient of variance expansion VIF, and to reject the independent variable with a larger VIF value, thereby forming the final candidate data set.
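The VIF-based rejection described above can be sketched on pooled plaintext data as follows. This is a centralized illustration only: it fits each regression by solving the normal equations instead of running the federated gradient descent, and the cutoff of 10 is a common rule of thumb rather than a value given in the text. The function names and data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def vif_of(target, others):
    """VIF of one column regressed, with an intercept, on the other columns."""
    rows = [[1.0] + list(vals) for vals in zip(*others)]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    w = solve(XtX, Xty)
    pred = [sum(wi * ri for wi, ri in zip(w, r)) for r in rows]
    mean = sum(target) / len(target)
    rss = sum((t - p) ** 2 for t, p in zip(target, pred))
    tss = sum((t - mean) ** 2 for t in target)
    return tss / rss  # equals 1 / (1 - R^2)

def screen_by_vif(columns, max_vif=10.0):
    """Compute each variable's VIF and keep those at or below max_vif."""
    vifs = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        vifs[name] = vif_of(col, others)
    kept = [name for name, v in vifs.items() if v <= max_vif]
    return vifs, kept

columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.1, 7.9, 10.1, 11.9],   # almost exactly 2 * a
    "c": [1.0, 1.0, -2.0, -2.0, 1.0, 1.0],   # unrelated to a and b
}
vifs, kept = screen_by_vif(columns)
```

Here the nearly collinear pair "a" and "b" both receive large VIF values and are rejected, while "c" is kept; exactly collinear columns would make the ratio undefined and need separate handling.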

To sum up, the variable screening method provided by the embodiment of the present invention first analyzes the variance expansion coefficient R² between each independent variable and the dependent variable to obtain the independent variables with larger R², and then analyzes the degree of multicollinearity (VIF) of each of these independent variables to select a feature combination with a larger VIF value, thereby providing an effective basis for subsequent feature screening.

The embodiment of the invention also provides another variable screening method, which comprises the following steps:

step S601, selecting any one of the independent variables in the first independent variable set as a target variable, taking other independent variables in the first independent variable set as first input variables, and taking the second independent variable set as second input variables.

Step S602, the sixth data end obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the fifth data end.

Step S603, the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.

Step S604, iteratively executing the step of selecting any one of the independent variables in the first independent variable set as a target variable until a second correlation coefficient corresponding to each independent variable in the first independent variable set as a target variable is output.

Step S605, selecting any independent variable in the second independent variable set as a target variable, using the other independent variables in the second independent variable set as third input variables, and using the first independent variable set as fourth input variables.

Step S606, the fifth data end obtains a fourth input variable calculation value according to a fourth input variable and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end.

Step S607, the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variable and the third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.

Step S608, iteratively executing the step of selecting any one of the independent variables in the second independent variable set as the target variable until a second correlation coefficient corresponding to each independent variable in the second independent variable set as the target variable is output.

Step S609, selecting the independent variables in the first independent variable set whose second correlation coefficient satisfies a second preset condition to form a fifth candidate data set, and selecting the independent variables in the second independent variable set whose second correlation coefficient satisfies the second preset condition to form a sixth candidate data set.

Step S610, the fifth data end obtains a first correlation coefficient between each independent variable and dependent variable in the fifth candidate data set.

Step S611, the fifth data end obtains a difference between the dependent variable and the mean of the dependent variable, and sends the difference to the sixth data end.

Step S612, the sixth data end receives the difference, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable and dependent variable in the sixth candidate data set according to the difference.

Step S613, the sixth data end obtains a third parameter according to the regression coefficient and the mean of the independent variables, obtains a fourth parameter according to the regression coefficient and the independent variables, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end.

Step S614, the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference between the dependent variable and the dependent variable mean; it then obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end.

Step S615, the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end.

Step S616, the fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient.

Step S617, iteratively executing the step of obtaining the difference between the dependent variable and the dependent variable mean until the first correlation coefficient corresponding to each independent variable in the sixth candidate data set is output.

Step S618, selecting the independent variables in the fifth candidate data set whose first correlation coefficient satisfies a first preset condition to form a seventh candidate data set, selecting the independent variables in the sixth candidate data set whose first correlation coefficient satisfies the first preset condition to form an eighth candidate data set, and forming a final candidate data set from the seventh candidate data set and the eighth candidate data set.

In summary, the variable screening method provided by the embodiment of the present invention first selects a feature combination with a larger VIF value by analyzing the degree of multicollinearity (VIF) of each independent variable, and then analyzes the variance expansion coefficient R² between each independent variable in that feature combination and the dependent variable to obtain the independent variables with larger R², thereby providing an effective basis for subsequent feature screening.

As shown in fig. 1, the variable screening system includes a fifth data terminal and a sixth data terminal; and the fifth data end and the sixth data end are used for realizing the variable screening method.

As shown in fig. 15, an embodiment of the present invention provides an electronic device, which includes a processor 111, a communication interface 112, a memory 113 and a communication bus 114, where the processor 111, the communication interface 112 and the memory 113 communicate with one another via the communication bus 114;

the memory 113 is configured to store a computer program;

in an embodiment of the present invention, the processor 111 is configured to implement, when executing the program stored in the memory 113, the steps of the single variable processing method or the multi-variable processing method provided in any one of the foregoing method embodiments.

Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the univariate processing method or the multivariate processing method provided in any of the foregoing method embodiments.

It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A univariate processing method, characterized in that the method is applied to a first data terminal in a univariate processing system, wherein the univariate processing system comprises the first data terminal and a second data terminal, the first data terminal stores a dependent variable, and the second data terminal stores an independent variable; the method comprises the following steps:

obtaining a difference value between the dependent variable and the mean value of the dependent variable, and sending the difference value to a second data terminal, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable;

receiving a third parameter and an encrypted fourth parameter sent by a second data end, wherein the third parameter is obtained by calculation according to the regression coefficient and the independent variable mean value, and the fourth parameter is obtained by calculation according to the regression coefficient and the independent variable;

obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value;

obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtaining a dispersion square sum according to a difference value of the dependent variable and the dependent variable mean value;

obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a second data terminal for decryption;

and receiving the decrypted first correlation coefficient sent by the second data terminal, and outputting the first correlation coefficient.
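
For orientation, the quantities named in claim 1 can be traced in plaintext, with all encryption and message passing omitted. The sketch below is illustrative only and rests on assumptions that the claim does not state: the third parameter is taken to be the regression coefficient times the independent variable mean, the fourth parameter is taken to be the regression coefficient times each independent variable value, and the first correlation coefficient is taken to be the coefficient of determination, 1 - RSS/TSS. The function name `univariate_r2` is hypothetical.

```python
from statistics import mean

def univariate_r2(x, y):
    """Plaintext walk-through of the quantities in claim 1 (no encryption)."""
    x_bar, y_bar = mean(x), mean(y)
    # Difference between the dependent variable and its mean (first data terminal).
    dy = [yi - y_bar for yi in y]
    # Regression coefficient b of y = a + b*x (second data terminal side).
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * d for xi, d in zip(x, dy)) / sxx
    # ASSUMED forms of the third and fourth parameters.
    third = b * x_bar
    fourth = [b * xi for xi in x]
    # Constant term from the third parameter and the dependent variable mean.
    a = y_bar - third
    # Residual square sum and dispersion square sum.
    rss = sum((yi - a - fi) ** 2 for yi, fi in zip(y, fourth))
    tss = sum(d ** 2 for d in dy)
    # ASSUMED definition of the first correlation coefficient.
    return a, b, 1.0 - rss / tss
```

In the claimed method the fourth parameter reaches the first data terminal only in encrypted form, so the residual square sum would be evaluated on ciphertexts rather than on plain numbers as it is here.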

2. The method according to claim 1, wherein the obtaining a difference value between the dependent variable and a mean value of the dependent variable, and sending the difference value to a second data terminal, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable, comprises:

receiving an encrypted first parameter sent by a second data end, wherein the first parameter is obtained by calculation according to an independent variable and an independent variable mean value;

and obtaining an encrypted second parameter according to the encrypted first parameter and the difference value between the dependent variable and the mean value of the dependent variable, and sending the encrypted second parameter to a second data terminal for decryption, wherein the decrypted second parameter is used for calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable.
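
The exchange in claim 2 can be sketched with an additively homomorphic scheme. The example below uses a toy Paillier cryptosystem with small, fixed primes and a fixed-point scale; none of these choices come from the claims. It further assumes that the first parameter is (x_i - x̄) / Σ(x_j - x̄)², so that the decrypted second parameter is already the regression coefficient. All function names are hypothetical, and the key sizes are far too small for real use.

```python
import math
import random

SCALE = 10**6  # fixed-point scale for encoding real numbers (assumption)

def keygen(p=2**31 - 1, q=2**61 - 1):
    # Toy primes for illustration only; not secure.
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m  # map back to a signed integer

def encrypted_first_parameter(n, x):
    # Second data terminal: ASSUMED form (x_i - mean) / sum of squared deviations.
    x_bar = sum(x) / len(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [encrypt(n, round((xi - x_bar) / sxx * SCALE)) for xi in x]

def encrypted_second_parameter(n, enc_first, y):
    # First data terminal: weight each ciphertext by (y_i - mean) and add them up.
    y_bar = sum(y) / len(y)
    n2 = n * n
    acc = encrypt(n, 0)
    for c, yi in zip(enc_first, y):
        k = round((yi - y_bar) * SCALE) % n  # plaintext multiplier
        acc = acc * pow(c, k, n2) % n2       # ciphertext product = plaintext sum
    return acc

def regression_coefficient(x, y):
    n, priv = keygen()                                         # second data terminal
    enc_first = encrypted_first_parameter(n, x)                # sent to first terminal
    enc_second = encrypted_second_parameter(n, enc_first, y)   # sent back
    return decrypt(n, priv, enc_second) / SCALE**2             # second data terminal
```

Under these assumptions the first data terminal never sees the independent variable in the clear, and the second data terminal sees only the aggregated second parameter.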

3. The method of claim 2, wherein the second data terminal comprises a second key pair, the second key pair comprising a second public key and a second private key;

the encrypted fourth parameter and the encrypted first parameter are obtained by encryption with the second public key;

and the decrypted first correlation coefficient and the decrypted second parameter are obtained by decryption with the second private key.

4. The method according to any one of claims 1 to 3, wherein the independent variables stored at the second data terminal are subjected to binning; before obtaining the difference value between the dependent variable and the dependent variable mean, the method further includes:

sending the sample identifier and the encrypted sample label value corresponding to the sample identifier to a second data terminal;

receiving an encrypted sample label statistic value sent by a second data end, wherein the encrypted sample label statistic value is obtained by the second data end through statistics on the encrypted sample label value of each bin according to a sample identifier;

and decrypting the encrypted sample label statistic value to obtain the dependent variable.
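
The per-bin statistic in claim 4 amounts to the second data terminal combining ciphertexts that it cannot read. The sketch below keeps the encryption scheme abstract: `mock_encrypt`, `mock_add` and `mock_decrypt` are placeholders standing in for an additively homomorphic scheme and provide no secrecy. It also assumes the statistic is a per-bin sum of label values; how that sum is turned into the dependent variable is not specified in the claims and is not modeled here.

```python
def bin_label_statistics(bin_of, enc_labels, add):
    """Second data terminal: combine encrypted label values per bin.

    bin_of:     sample identifier -> bin index of the binned independent variable
    enc_labels: sample identifier -> encrypted sample label value
    add:        homomorphic addition of two ciphertexts
    """
    stats = {}
    for sample_id, ciphertext in enc_labels.items():
        if sample_id not in bin_of:
            continue  # sample is not held by the second data terminal
        b = bin_of[sample_id]
        stats[b] = ciphertext if b not in stats else add(stats[b], ciphertext)
    return stats

# Placeholder scheme for illustration only: ciphertexts are tagged tuples.
def mock_encrypt(m):
    return ("ct", m)

def mock_add(c1, c2):
    return ("ct", c1[1] + c2[1])

def mock_decrypt(c):
    return c[1]

def per_bin_label_sums(labels, bin_of):
    # First data terminal encrypts the labels and sends them with their identifiers.
    enc_labels = {sid: mock_encrypt(v) for sid, v in labels.items()}
    # Second data terminal aggregates per bin without decrypting.
    enc_stats = bin_label_statistics(bin_of, enc_labels, mock_add)
    # First data terminal decrypts the per-bin statistics.
    return {b: mock_decrypt(c) for b, c in enc_stats.items()}
```

Passing the addition operation as a parameter keeps the aggregation logic independent of whichever homomorphic scheme is actually deployed.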

5. The method of claim 4, wherein before sending the sample identifier and the encrypted sample label value corresponding to the sample identifier to the second data terminal, the method further comprises:

generating a first key pair, the first key pair comprising a first public key and a first private key;

encrypting a sample label value through the first public key to obtain the encrypted sample label value;

the decrypting the encrypted sample label statistic value includes:

and decrypting the encrypted sample label statistic value through the first private key.

6. The method according to any one of claims 1 to 3, wherein if the second data terminal includes a plurality of independent variables, the step of obtaining the difference between the dependent variable and the mean of the dependent variable is performed iteratively until a first correlation coefficient corresponding to each independent variable is output.

7. The method of claim 6, wherein after outputting the first correlation coefficient corresponding to each independent variable, the method further comprises:

and selecting the independent variable of which the first correlation coefficient meets a first preset condition to form a candidate data set.

8. A univariate processing method, characterized in that the method is applied to a second data terminal in a univariate processing system, wherein the univariate processing system comprises a first data terminal and the second data terminal, the first data terminal stores a dependent variable, and the second data terminal stores an independent variable; the method comprises the following steps:

receiving a difference value of a dependent variable and a dependent variable mean value sent by a first data end, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the difference value;

obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter;

sending the third parameter and the encrypted fourth parameter to a first data end, wherein the third parameter and the encrypted fourth parameter are used for calculating an encrypted first correlation coefficient corresponding to the independent variable;

and receiving the encrypted first correlation coefficient sent by the first data end, decrypting the encrypted first correlation coefficient, sending the decrypted first correlation coefficient to the first data end, and outputting the first correlation coefficient.

9. The method according to claim 8, wherein the receiving a difference between a dependent variable sent by the first data terminal and a mean value of the dependent variable, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the difference comprises:

calculating according to the independent variable and the independent variable mean value to obtain a first parameter, encrypting the first parameter, and sending the encrypted first parameter to a first data terminal;

receiving an encrypted second parameter sent by a first data end, wherein the encrypted second parameter is obtained by calculation according to the encrypted first parameter, a difference value of a dependent variable and a mean value of the dependent variable;

and decrypting the encrypted second parameter, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the decrypted second parameter.

10. The method according to claim 8 or 9, wherein before receiving the difference between the dependent variable and the mean of the dependent variable sent by the first data terminal, the method further comprises:

performing binning on the independent variable;

receiving a sample identifier sent by a first data end and an encrypted sample label value corresponding to the sample identifier;

and counting the encrypted sample label value of each box according to the sample identification to obtain an encrypted sample label statistical value, and sending the encrypted sample label statistical value to a first data end for decryption to obtain the dependent variable.

11. A first data end, characterized by comprising a first processing module, a first sending module and a first receiving module;

the first processing module is used for obtaining a difference value between a dependent variable and a dependent variable mean value, and sending the difference value to a second data end through the first sending module, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by an independent variable and the dependent variable;

the first receiving module is configured to receive a third parameter and an encrypted fourth parameter sent by a second data end, where the third parameter is obtained through calculation according to the regression coefficient and an independent variable mean value, and the fourth parameter is obtained through calculation according to the regression coefficient and an independent variable;

the first processing module is further used for obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value; obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtaining a dispersion square sum according to a difference value of the dependent variable and the dependent variable mean; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a second data end through the first sending module for decryption;

the first receiving module is further configured to receive the decrypted first correlation coefficient sent by the second data end, and output the first correlation coefficient.

12. A second data terminal, characterized by comprising a second processing module, a second sending module and a second receiving module;

the second receiving module is used for receiving a difference value between the dependent variable and the mean value of the dependent variable sent by the first data terminal;

the second processing module is used for calculating a regression coefficient of a unary linear regression model constructed by independent variables and dependent variables according to the difference value; obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter;

the second sending module is configured to send the third parameter and the encrypted fourth parameter to the first data end, where the third parameter and the encrypted fourth parameter are used to calculate and obtain an encrypted first correlation coefficient corresponding to the independent variable;

the second receiving module is further configured to receive the encrypted first correlation coefficient sent by the first data end, decrypt the encrypted first correlation coefficient through the second processing module, and send the decrypted first correlation coefficient to the first data end through the second sending module.

13. A univariate processing system is characterized by comprising a first data terminal and a second data terminal;

wherein the first data terminal is configured to perform the method according to any one of claims 1 to 7, and the second data terminal is configured to perform the method according to any one of claims 8 to 10.

14. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method of any one of claims 1 to 10 when executing a program stored on a memory.

15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.

16. A variable screening method, characterized in that the method is applied to a variable screening system, wherein the variable screening system comprises a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set; the method comprises the following steps:

a fifth data terminal acquires a first correlation coefficient of each independent variable and dependent variable in the first independent variable set;

the fifth data end obtains the difference value of the dependent variable and the mean value of the dependent variable and sends the difference value to the sixth data end;

a sixth data end receives the difference value, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable in the second independent variable set and the dependent variable according to the difference value;

the sixth data end obtains a third parameter according to the regression coefficient and the mean value of the independent variable, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end;

a fifth data terminal receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to a difference value between the dependent variable and the dependent variable mean value; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a sixth data end;

the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end;

a fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient; iteratively executing the step of obtaining the difference value between the dependent variable and the dependent variable mean value until a first correlation coefficient corresponding to each independent variable in the second independent variable set is output;

selecting independent variables of which the first correlation coefficients in the first independent variable set meet first preset conditions to form a first candidate data set, and selecting independent variables of which the first correlation coefficients in the second independent variable set meet the first preset conditions to form a second candidate data set;

selecting any independent variable in the first candidate data set as a target variable, taking other independent variables in the first candidate data set as first input variables, and taking the second candidate data set as second input variables;

the sixth data end obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the fifth data end;

the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient;

iteratively executing the step of selecting any independent variable in the first candidate data set as a target variable until the second correlation coefficient corresponding to each independent variable in the first candidate data set, taken as the target variable, is output;

selecting any independent variable in the second candidate data set as a target variable, taking other independent variables in the second candidate data set as third input variables, and taking the first candidate data set as fourth input variables;

the fifth data end obtains a fourth input variable calculation value according to a fourth input variable and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end;

the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variable and the third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient;

iteratively executing the step of selecting any independent variable in the second candidate data set as a target variable until the second correlation coefficient corresponding to each independent variable in the second candidate data set, taken as the target variable, is output;

and selecting an independent variable of which the second correlation coefficient in the first candidate data set meets a second preset condition to form a third candidate data set, selecting an independent variable of which the second correlation coefficient in the second candidate data set meets the second preset condition to form a fourth candidate data set, and forming a final candidate data set by the third candidate data set and the fourth candidate data set.
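
The second correlation coefficient in claim 16 can be illustrated as follows. One data end sends only its partial sum (its input variables weighted by its model parameters), and the other data end adds its own contribution, predicts the target variable, and compares the residual square sum with the dispersion square sum. The sketch assumes a linear model whose parameters are already available, an optional intercept, and a coefficient of the form 1 - RSS/TSS; none of these details are fixed by the claim, and the function names are hypothetical.

```python
def partial_value(rows, weights):
    """Local contribution of one data end: dot product of its input
    variables (one row per sample) with its own model parameters."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]

def second_correlation_coefficient(target, own_rows, own_weights,
                                   received_partial, intercept=0.0):
    """Data end holding the target variable: combine its own contribution
    with the calculation value received from the other data end."""
    own_partial = partial_value(own_rows, own_weights)
    predicted = [intercept + o + r for o, r in zip(own_partial, received_partial)]
    t_bar = sum(target) / len(target)
    rss = sum((t - p) ** 2 for t, p in zip(target, predicted))  # residual square sum
    tss = sum((t - t_bar) ** 2 for t in target)                 # dispersion square sum
    return 1.0 - rss / tss  # ASSUMED form of the second correlation coefficient
```

A coefficient close to 1 would indicate that the target variable is almost a linear combination of the other candidates, which is presumably what the second preset condition is meant to screen out; the claim itself leaves the condition open.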

17. A variable screening method, characterized in that the method is applied to a variable screening system, wherein the variable screening system comprises a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set; the method comprises the following steps:

selecting any independent variable in the first independent variable set as a target variable, taking other independent variables in the first independent variable set as first input variables, and taking the second independent variable set as second input variables;

iteratively executing the step of selecting any independent variable in the first independent variable set as a target variable until a second correlation coefficient corresponding to each independent variable in the first independent variable set as the target variable is output;

selecting any independent variable in the second independent variable set as a target variable, taking other independent variables in the second independent variable set as third input variables, and taking the first independent variable set as fourth input variables;

iteratively executing the step of selecting any independent variable in the second independent variable set as a target variable until the second correlation coefficient corresponding to each independent variable in the second independent variable set, taken as the target variable, is output;

selecting independent variables of which the second correlation coefficients in the first independent variable set meet second preset conditions to form a fifth candidate data set, and selecting independent variables of which the second correlation coefficients in the second independent variable set meet the second preset conditions to form a sixth candidate data set;

a fifth data terminal acquires a first correlation coefficient of each independent variable and dependent variable in the fifth candidate data set;

a sixth data end receives the difference value, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable and dependent variable in the sixth candidate data set according to the difference value;

the sixth data end obtains a third parameter according to the regression coefficient and the mean value of the independent variable, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end;

a fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient;

iteratively executing the step of obtaining the difference value between the dependent variable and the dependent variable mean value until a first correlation coefficient corresponding to each independent variable in a sixth candidate data set is output;

and selecting an independent variable of which the first correlation coefficient in the fifth candidate data set meets a first preset condition to form a seventh candidate data set, selecting an independent variable of which the first correlation coefficient in the sixth candidate data set meets the first preset condition to form an eighth candidate data set, and forming a final candidate data set by the seventh candidate data set and the eighth candidate data set.

18. A variable screening system, characterized by comprising a fifth data end and a sixth data end; wherein the fifth data end and the sixth data end are adapted to perform the method according to claim 16 or 17.