CN114692089A - Single variable processing method and variable screening method - Google Patents

Single variable processing method and variable screening method

Info

Publication number
CN114692089A
Authority
CN
China
Prior art keywords
variable
encrypted
parameter
value
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210418824.5A
Other languages
Chinese (zh)
Inventor
陈行
张德
彭南博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd
Priority to CN202210418824.5A
Publication of CN114692089A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L63/0442 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload wherein the sending and receiving network entities apply asymmetric encryption, i.e. different keys for encryption and decryption

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Algebra (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

Embodiments of the invention relate to a univariate processing method and a variable screening method. A first data end obtains a difference value between a dependent variable and the dependent variable mean value and sends the difference value to a second data end; receives a third parameter and an encrypted fourth parameter sent by the second data end; obtains a constant term of a unary linear regression model according to the third parameter and the dependent variable mean value; obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the second data end for decryption; and receives the decrypted first correlation coefficient sent by the second data end and outputs the first correlation coefficient. Embodiments of the invention can thereby effectively analyze the degree of linear correlation between the independent variable and the dependent variable in a federated scenario.

Description

Single variable processing method and variable screening method
Technical Field
The invention relates to the technical field of computers, in particular to a univariate processing method and a variable screening method.
Background
A federated learning framework is a distributed artificial intelligence model training framework. It helps different data owners carry out federated modeling and federated training without sharing private data, and can effectively address the problems of data security and data silos.
Feature engineering is one of the most important stages of machine learning modeling. It refers to the process of turning raw data into model training data, and generally comprises three steps: feature preprocessing, feature selection and feature dimension reduction. During feature selection, univariate feature analysis is used to analyze the distribution of each feature and its ability to predict the label. Univariate analysis in a federated scenario includes Weight of Evidence (WOE) and Information Value (IV).
However, indicators such as WOE and IV cannot represent the degree of linear correlation between independent and dependent variables in a federated scenario.
Disclosure of Invention
The invention provides a univariate processing method and a variable screening method, which aim to solve the problem that the prior art lacks a description of the degree of linear correlation between independent variables and dependent variables in a federated scenario.
In a first aspect, the present invention provides a univariate processing method applied to a first data end in a univariate processing system, where the univariate processing system includes the first data end and a second data end, the first data end stores a dependent variable, and the second data end stores an independent variable. The method comprises the following steps: obtaining a difference value between the dependent variable and the dependent variable mean value, and sending the difference value to the second data end, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable; receiving a third parameter and an encrypted fourth parameter sent by the second data end, wherein the third parameter is calculated from the regression coefficient and the independent variable mean value, and the fourth parameter is calculated from the regression coefficient and the independent variable; obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value; obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtaining a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to the second data end for decryption; and receiving the decrypted first correlation coefficient sent by the second data end, and outputting the first correlation coefficient.
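The arithmetic behind the first aspect can be sketched in plaintext as follows. This is a minimal illustration under stated assumptions, not the claimed protocol: encryption and the split between the two data ends are omitted, the third parameter is assumed to be the regression coefficient times the independent variable mean, the fourth parameter is assumed to be the regression coefficient times each independent variable value, and the first correlation coefficient is assumed to be the coefficient of determination (one minus the residual square sum divided by the dispersion square sum). The function name is hypothetical.

```python
from statistics import fmean


def first_correlation_coefficient(x, y):
    """Plaintext walk-through of the first-aspect steps (encryption omitted)."""
    x_mean, y_mean = fmean(x), fmean(y)
    # First data end: differences between the dependent variable and its mean.
    y_diff = [yi - y_mean for yi in y]
    # Second data end: regression coefficient b of the unary model y = a + b*x.
    s_xy = sum((xi - x_mean) * di for xi, di in zip(x, y_diff))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b = s_xy / s_xx
    third = b * x_mean             # assumed form of the third parameter
    fourth = [b * xi for xi in x]  # assumed form of the fourth parameter
    # First data end: constant term, residual square sum, dispersion square sum.
    a = y_mean - third
    sse = sum((yi - a - fi) ** 2 for yi, fi in zip(y, fourth))
    sst = sum(di ** 2 for di in y_diff)
    return 1.0 - sse / sst         # assumed: coefficient of determination


x = [1.0, 2.0, 3.0, 4.0]
exact = first_correlation_coefficient(x, [3.0, 5.0, 7.0, 9.0])  # y = 2x + 1
noisy = first_correlation_coefficient(x, [1.0, 3.0, 2.0, 4.0])
```

Under these assumptions a perfectly linear relationship yields a coefficient of 1, and weaker linear relationships yield smaller values.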
As an optional embodiment, obtaining the difference value between the dependent variable and the dependent variable mean value, and sending the difference value to the second data end, where the difference value is used to calculate a regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable, includes: receiving an encrypted first parameter sent by the second data end, wherein the first parameter is calculated from the independent variable and the independent variable mean value; and obtaining an encrypted second parameter according to the encrypted first parameter and the difference value between the dependent variable and the dependent variable mean value, and sending the encrypted second parameter to the second data end for decryption, wherein the decrypted second parameter is used for calculating the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.
As an optional embodiment, the second data end holds a second key pair, and the second key pair includes a second public key and a second private key; the encrypted fourth parameter and the encrypted first parameter are obtained by encrypting with the second public key; and the decrypted first correlation coefficient and the decrypted second parameter are obtained by decrypting with the second private key.
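The exchange of the encrypted first and second parameters can be illustrated with an additively homomorphic scheme. The sketch below assumes a toy Paillier cryptosystem with small fixed primes (insecure, for illustration only), assumes the first parameter is the deviation of each independent variable value from its mean, and assumes the second parameter is the sum of products of those deviations with the dependent variable differences. The text above does not name the encryption scheme; `keygen`, `encrypt`, `decrypt` and `SCALE` are hypothetical names.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy Paillier key pair from small fixed primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (negative values are mapped into the range 0..n-1)."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Decrypt and map the result back to a signed integer."""
    lam, mu = priv
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m


SCALE = 100                    # fixed-point scale, an assumption of this sketch
x = [1.0, 2.0, 3.0, 4.0]       # independent variable, held by the second data end
y = [3.0, 5.0, 7.0, 9.0]       # dependent variable, held by the first data end

# Second data end: encrypt the first parameter with its own public key.
n, priv = keygen()
x_mean = sum(x) / len(x)
enc_first = [encrypt(n, round((xi - x_mean) * SCALE)) for xi in x]

# First data end: combine ciphertexts with its plaintext differences.
y_mean = sum(y) / len(y)
y_diff = [round((yi - y_mean) * SCALE) for yi in y]
enc_second = encrypt(n, 0)
for c, d in zip(enc_first, y_diff):
    # Raising a ciphertext to d multiplies the hidden plaintext by d;
    # multiplying ciphertexts adds the hidden plaintexts.
    enc_second = enc_second * pow(c, d % n, n * n) % (n * n)

# Second data end: decrypt the second parameter and form the regression coefficient.
s_xy = decrypt(n, priv, enc_second) / SCALE ** 2
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b = s_xy / s_xx
```

In this sketch the first data end never sees the independent variable values, and the second data end sees only the aggregated sum rather than the individual dependent variable differences.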
As an optional embodiment, the independent variables stored in the second data end are subjected to binning; before obtaining the difference value between the dependent variable and the dependent variable mean value, the method further includes: sending a sample identifier and the encrypted sample label value corresponding to the sample identifier to the second data end; receiving an encrypted sample label statistic value sent by the second data end, wherein the encrypted sample label statistic value is obtained by the second data end by aggregating, according to the sample identifiers, the encrypted sample label values in each bin; and decrypting the encrypted sample label statistic value to obtain the dependent variable.
As an optional embodiment, before sending the sample identifier and the encrypted sample label value corresponding to the sample identifier to the second data end, the method further includes: generating a first key pair, the first key pair comprising a first public key and a first private key; and encrypting a sample label value with the first public key to obtain the encrypted sample label value. Decrypting the encrypted sample label statistic value includes: decrypting the encrypted sample label statistic value with the first private key.
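The per-bin label statistics can be sketched as below. This is a plaintext stand-in under stated assumptions: in the described method the label values arrive encrypted and the per-bin aggregation would be carried out on ciphertexts, which this sketch replaces with ordinary addition. The bin edges, the sample identifiers and the choice of sum and count as the statistics are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def bin_index(value, edges):
    """Index of the bin that holds value, given sorted upper bin edges."""
    return bisect_right(edges, value)


def label_stats_per_bin(features, labels_by_id, edges):
    """Sum and count label values per bin, matching samples by identifier."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for sample_id, value in features.items():
        if sample_id not in labels_by_id:
            continue  # only samples known to both data ends are aggregated
        b = bin_index(value, edges)
        sums[b] += labels_by_id[sample_id]  # ciphertext addition in the real flow
        counts[b] += 1
    return dict(sums), dict(counts)


# Second data end: binned independent variable, keyed by sample identifier.
features = {"u1": 18, "u2": 25, "u3": 33, "u4": 41, "u5": 52}
# First data end: label values (sent encrypted in the described method).
labels = {"u1": 1, "u2": 0, "u3": 1, "u4": 0, "u5": 1}
sums, counts = label_stats_per_bin(features, labels, edges=[30, 45])
```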
As an optional embodiment, if the second data end includes a plurality of independent variables, the step of obtaining the difference between the dependent variable and the dependent variable mean is performed iteratively until a first correlation coefficient corresponding to each independent variable is output.
As an optional embodiment, after outputting the first correlation coefficient corresponding to each independent variable, the method further includes: selecting the independent variables whose first correlation coefficients meet a first preset condition to form a candidate data set.
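The selection step can be sketched as a simple filter. The first preset condition is not specified in the text above; the sketch assumes, purely for illustration, that it is a minimum threshold on the first correlation coefficient. The variable names and the threshold value are hypothetical.

```python
def select_candidates(coefficients, threshold=0.3):
    """Keep independent variables whose coefficient reaches the assumed threshold."""
    return [name for name, value in coefficients.items() if value >= threshold]


# Hypothetical first correlation coefficients for three independent variables.
coefficients = {"age": 0.64, "income": 0.12, "tenure": 0.35}
candidates = select_candidates(coefficients)
```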
In a second aspect, the present invention provides another univariate processing method applied to a second data end in a univariate processing system, where the univariate processing system includes a first data end and the second data end, the first data end stores a dependent variable, and the second data end stores an independent variable. The method comprises the following steps: receiving a difference value between the dependent variable and the dependent variable mean value sent by the first data end, and calculating, according to the difference value, a regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable; obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter; sending the third parameter and the encrypted fourth parameter to the first data end, wherein the third parameter and the encrypted fourth parameter are used for calculating an encrypted first correlation coefficient corresponding to the independent variable; and receiving the encrypted first correlation coefficient sent by the first data end, decrypting the encrypted first correlation coefficient, and sending the decrypted first correlation coefficient to the first data end, which outputs the first correlation coefficient.
As an optional embodiment, the receiving a difference between a dependent variable sent by the first data end and a mean value of the dependent variable, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the difference includes: calculating according to the independent variable and the independent variable mean value to obtain a first parameter, encrypting the first parameter, and sending the encrypted first parameter to a first data terminal; receiving an encrypted second parameter sent by a first data end, wherein the encrypted second parameter is obtained by calculation according to the encrypted first parameter, a difference value of a dependent variable and a mean value of the dependent variable; and decrypting the encrypted second parameter, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the decrypted second parameter.
As an optional embodiment, before receiving the difference value between the dependent variable and the dependent variable mean value sent by the first data end, the method further includes: binning the independent variable; receiving a sample identifier and the encrypted sample label value corresponding to the sample identifier sent by the first data end; and aggregating, according to the sample identifiers, the encrypted sample label values in each bin to obtain an encrypted sample label statistic value, and sending the encrypted sample label statistic value to the first data end for decryption to obtain the dependent variable.
In a third aspect, the present invention provides a first data end, including a first processing module, a first sending module, and a first receiving module. The first processing module is used for obtaining a difference value between a dependent variable and the dependent variable mean value, and sending the difference value to a second data end through the first sending module, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed from an independent variable and the dependent variable; the first receiving module is configured to receive a third parameter and an encrypted fourth parameter sent by the second data end, where the third parameter is calculated from the regression coefficient and the independent variable mean value, and the fourth parameter is calculated from the regression coefficient and the independent variable; the first processing module is further used for obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value; obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtaining a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; and obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to the second data end through the first sending module for decryption; the first receiving module is further configured to receive the decrypted first correlation coefficient sent by the second data end, and output the first correlation coefficient.
In a fourth aspect, the present invention provides a second data end, including a second processing module, a second sending module, and a second receiving module. The second receiving module is used for receiving a difference value between the dependent variable and the dependent variable mean value sent by the first data end; the second processing module is used for calculating, according to the difference value, a regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable; obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter; the second sending module is configured to send the third parameter and the encrypted fourth parameter to the first data end, where the third parameter and the encrypted fourth parameter are used to calculate an encrypted first correlation coefficient corresponding to the independent variable; the second receiving module is further configured to receive the encrypted first correlation coefficient sent by the first data end, the second processing module decrypts the encrypted first correlation coefficient, and the second sending module sends the decrypted first correlation coefficient to the first data end.
In a fifth aspect, the present invention provides a univariate processing system, which includes a first data terminal and a second data terminal; wherein the first data terminal is configured to perform the method according to any of the first aspects, and the second data terminal is configured to perform the method according to any of the second aspects.
In a sixth aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; a memory for storing a computer program; a processor configured to implement the method of any one of the first aspect when executing a program stored in a memory.
In a seventh aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any of the first aspect.
In an eighth aspect, the present invention provides a variable screening method applied to a variable screening system, where the variable screening system includes a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set. The method comprises the following steps:
the fifth data end acquires a first correlation coefficient between each independent variable in the first independent variable set and the dependent variable; the fifth data end obtains a difference value between the dependent variable and the dependent variable mean value and sends the difference value to the sixth data end; the sixth data end receives the difference value, and calculates, according to the difference value, a regression coefficient of a unary linear regression model constructed from the dependent variable and any independent variable in the second independent variable set; the sixth data end obtains a third parameter according to the regression coefficient and the independent variable mean value, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end; the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; the fifth data end obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end; the sixth data end receives the encrypted first correlation coefficient, decrypts it, and sends the decrypted first correlation coefficient to the fifth data end; the fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient; the step of obtaining the difference value between the dependent variable and the dependent variable mean value is executed iteratively until a first correlation coefficient corresponding to each independent variable in the second independent variable set is output; independent variables in the first independent variable set whose first correlation coefficients meet a first preset condition are selected to form a first candidate data set, and independent variables in the second independent variable set whose first correlation coefficients meet the first preset condition are selected to form a second candidate data set;
any independent variable in the first candidate data set is selected as a target variable, the other independent variables in the first candidate data set are taken as first input variables, and the second candidate data set is taken as second input variables; the sixth data end obtains a second input variable calculation value according to the second input variables and a second model parameter, and sends the second input variable calculation value to the fifth data end; the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variables and a first model parameter; obtains a residual square sum according to the target variable and the target variable predicted value, and obtains a dispersion square sum according to the target variable and the target variable mean value; determines a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputs the second correlation coefficient; the step of selecting any independent variable in the first candidate data set as a target variable is executed iteratively until the second correlation coefficient obtained with each independent variable in the first candidate data set as the target variable is output;
any independent variable in the second candidate data set is selected as a target variable, the other independent variables in the second candidate data set are taken as third input variables, and the first candidate data set is taken as fourth input variables; the fifth data end obtains a fourth input variable calculation value according to the fourth input variables and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end; the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variables and a third model parameter; obtains a residual square sum according to the target variable and the target variable predicted value, and obtains a dispersion square sum according to the target variable and the target variable mean value; determines a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputs the second correlation coefficient; the step of selecting any independent variable in the second candidate data set as a target variable is executed iteratively until the second correlation coefficient obtained with each independent variable in the second candidate data set as the target variable is output;
and independent variables in the first candidate data set whose second correlation coefficients meet a second preset condition are selected to form a third candidate data set, independent variables in the second candidate data set whose second correlation coefficients meet the second preset condition are selected to form a fourth candidate data set, and the third candidate data set and the fourth candidate data set together form a final candidate data set.
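The computation of the second correlation coefficient across the two data ends can be sketched in plaintext as follows. This is a minimal illustration under stated assumptions: the model is assumed to be linear, the model parameters are assumed to have been trained already, the "input variable calculation value" is assumed to be the weighted sum of the remote party's input variables for each sample, and the second correlation coefficient is assumed to be one minus the residual square sum divided by the dispersion square sum. All function names, weights and data are hypothetical.

```python
from statistics import fmean


def partial_sum(rows, weights):
    """Linear contribution of one party's input variables for every sample."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]


def second_correlation_coefficient(target, own_rows, own_weights, bias, remote_partial):
    """Combine the local and remote contributions and score the fit to the target."""
    own_partial = partial_sum(own_rows, own_weights)
    predicted = [bias + o + r for o, r in zip(own_partial, remote_partial)]
    sse = sum((t - p) ** 2 for t, p in zip(target, predicted))  # residual square sum
    t_mean = fmean(target)
    sst = sum((t - t_mean) ** 2 for t in target)                # dispersion square sum
    return 1.0 - sse / sst


# Sixth data end: second input variables and (assumed) second model parameter.
remote_rows = [[0.0], [1.0], [0.0], [1.0]]
remote_value = partial_sum(remote_rows, [3.0])  # sent to the fifth data end

# Fifth data end: target variable, first input variables, first model parameter.
target = [3.0, 8.0, 7.0, 12.0]
own_rows = [[1.0], [2.0], [3.0], [4.0]]
full = second_correlation_coefficient(target, own_rows, [2.0], 1.0, remote_value)
local_only = second_correlation_coefficient(target, own_rows, [2.0], 1.0, [0.0] * 4)
```

Only the aggregated contribution of the remote input variables crosses between the data ends in this sketch. A value near 1 indicates that the target variable is almost fully explained linearly by the other candidate variables; how the second preset condition uses this value is not specified in the text above.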
In a ninth aspect, the present invention provides another variable screening method applied to a variable screening system, where the variable screening system includes a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set. The method comprises the following steps:
any independent variable in the first independent variable set is selected as a target variable, the other independent variables in the first independent variable set are taken as first input variables, and the second independent variable set is taken as second input variables; the sixth data end obtains a second input variable calculation value according to the second input variables and a second model parameter, and sends the second input variable calculation value to the fifth data end; the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variables and a first model parameter; obtains a residual square sum according to the target variable and the target variable predicted value, and obtains a dispersion square sum according to the target variable and the target variable mean value; determines a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputs the second correlation coefficient; the step of selecting any independent variable in the first independent variable set as a target variable is executed iteratively until the second correlation coefficient obtained with each independent variable in the first independent variable set as the target variable is output;
any independent variable in the second independent variable set is selected as a target variable, the other independent variables in the second independent variable set are taken as third input variables, and the first independent variable set is taken as fourth input variables; the fifth data end obtains a fourth input variable calculation value according to the fourth input variables and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end; the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variables and a third model parameter; obtains a residual square sum according to the target variable and the target variable predicted value, and obtains a dispersion square sum according to the target variable and the target variable mean value; determines a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputs the second correlation coefficient; the step of selecting any independent variable in the second independent variable set as a target variable is executed iteratively until the second correlation coefficient obtained with each independent variable in the second independent variable set as the target variable is output; independent variables in the first independent variable set whose second correlation coefficients meet a second preset condition are selected to form a fifth candidate data set, and independent variables in the second independent variable set whose second correlation coefficients meet the second preset condition are selected to form a sixth candidate data set;
the fifth data end acquires a first correlation coefficient between each independent variable in the fifth candidate data set and the dependent variable; the fifth data end obtains a difference value between the dependent variable and the dependent variable mean value and sends the difference value to the sixth data end; the sixth data end receives the difference value, and calculates, according to the difference value, a regression coefficient of a unary linear regression model constructed from the dependent variable and any independent variable in the sixth candidate data set; the sixth data end obtains a third parameter according to the regression coefficient and the independent variable mean value, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end; the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference value between the dependent variable and the dependent variable mean value; the fifth data end obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end; the sixth data end receives the encrypted first correlation coefficient, decrypts it, and sends the decrypted first correlation coefficient to the fifth data end; the fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient; the step of obtaining the difference value between the dependent variable and the dependent variable mean value is executed iteratively until a first correlation coefficient corresponding to each independent variable in the sixth candidate data set is output;
and independent variables in the fifth candidate data set whose first correlation coefficients meet a first preset condition are selected to form a seventh candidate data set, independent variables in the sixth candidate data set whose first correlation coefficients meet the first preset condition are selected to form an eighth candidate data set, and the seventh candidate data set and the eighth candidate data set together form a final candidate data set.
In a tenth aspect, the present invention provides a variable screening system, which includes a fifth data terminal and a sixth data terminal; wherein the fifth data terminal and the sixth data terminal are used for executing the method according to the eighth aspect or the ninth aspect.
The technical solutions provided by the embodiments of the present invention have at least some or all of the following advantages:
By transmitting intermediate parameters between different data owners, the data security of both owners is guaranteed; at the same time, the degree of linear correlation between independent variables and the dependent variable can be analyzed effectively, which provides an effective basis for screening candidate feature variables.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a system architecture diagram according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a univariate processing method according to an embodiment of the present invention;
Fig. 3 is a detailed implementation flowchart of steps S101 and S102 in the first embodiment of the present invention;
Fig. 4 is a schematic flowchart of a univariate processing method according to a second embodiment of the present invention;
Fig. 5 is a schematic diagram of a unary linear regression model fitted to a single independent variable and the dependent variable according to an embodiment of the present invention;
Fig. 6 is a schematic flowchart of a univariate processing method according to a third embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a first data end according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a second data end according to an embodiment of the present invention;
Fig. 9 is a schematic flowchart of a multivariable processing method according to a fourth embodiment of the present invention;
Fig. 10 is a diagram illustrating the variance expansion coefficient (VIF) of each feature variable according to an embodiment of the present invention;
Fig. 11 is a schematic flowchart of a multivariable processing method according to a fifth embodiment of the present invention;
Fig. 12 is a schematic flowchart of a multivariable processing method according to a sixth embodiment of the present invention;
Fig. 13 is a schematic structural diagram of a third data end according to an embodiment of the present invention;
Fig. 14 is a schematic structural diagram of a fourth data end according to an embodiment of the present invention;
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The federated learning framework is a distributed artificial intelligence model training framework. It helps different data owners carry out joint modeling and joint training without sharing private data, and can effectively address the problems of data security and data silos.
Feature engineering is the most important link in machine learning modeling. It refers to the process of turning raw data into model training data so that as much information as possible is extracted from the data for the model to use. It generally comprises the following steps:
(1) Feature preprocessing: the raw data is very likely to be dirty, so abnormal-sample handling, missing-value handling, standardization, normalization, and the like are required;
(2) Feature selection: after feature preprocessing is completed, meaningful features need to be screened out and input into the model for training. Screening methods include univariate feature analysis, that is, analyzing the distribution of each feature and its ability to predict the label. Common indicators include WOE, IV, PSI, KS, and the like, which are widely used in scorecard modeling in the consumer finance field.
(3) Feature dimensionality reduction: after feature selection is completed, problems such as low computational efficiency and high model complexity caused by an overly large feature matrix can be addressed through feature dimensionality reduction. Specific methods include principal component analysis (PCA), linear discriminant analysis (LDA), independent component analysis (ICA), and the like.
Univariate analysis in the federated setting includes Weight of Evidence (WOE) and Information Value (IV). However, both WOE and IV serve to select the more important variables for the model, using predictive strength as the basis for judging whether a variable is important. They do not characterize the linear correlation between an independent variable and the dependent variable, and a measure of the severity of (multi)collinearity in a multiple linear regression model is also lacking. If multicollinearity exists among the features, the weight parameter estimates of the model are distorted or difficult to estimate accurately.
In view of the above technical problems, the technical idea of the present invention is as follows: encrypted intermediate parameters are transmitted between different data owners to calculate the variance expansion coefficient $R^2$, which is used to analyze how well an explanatory variable explains the dependent variable, and to calculate the variance expansion coefficient VIF, which is used to analyze the multicollinearity among features, thereby increasing the interpretability of the variables entering the model.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention. As shown in Fig. 1, the system architecture 100 includes an A-side server 101, a B-side server 102, and a network 103. The A-side server 101 and the B-side server 102 each store part of the sample data and exchange encrypted intermediate parameters over the network 103, which may be a wired or wireless communication link, an optical fiber cable, or the like.
It should be noted that the A-side server 101 and the B-side server 102 cooperate to implement the following embodiments, which are described in detail below with reference to the accompanying drawings.
The invention provides a univariate processing method applied to a univariate processing system. The system comprises a first data end, which stores a dependent variable, and a second data end, which stores an independent variable; see Fig. 1, in which the A-side server corresponds to the first data end and the B-side server corresponds to the second data end.
Fig. 2 is a schematic flowchart of a univariate processing method according to an embodiment of the present invention, and as shown in fig. 2, the univariate processing method includes:
Step S101, the first data end obtains the difference between the dependent variable and the dependent-variable mean, and sends the difference to the second data end.
The difference is used to calculate the regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable.
Referring to Fig. 1, the A-side server stores the dependent variable $y$ for $n$ samples, and the B-side server stores the independent variable $x_t$ for the same $n$ samples; optionally, the A end or the B end also stores other independent variables. When the ability of a single variable $x_t$ to explain the dependent variable $y$ needs to be analyzed, regression is performed by the least squares method; that is, the unary linear regression model constructed from the independent variable and the dependent variable is shown in formula (1):
$$y' = f(x_{t,i}) = a_0 + a_1 x_t \tag{1}$$
where $y'$ denotes the predicted value of the dependent variable and $x_{t,i}$ denotes the $i$-th sample value of the independent variable $x_t$; $a_1$ denotes the regression coefficient of the unary linear regression model, calculated as shown in formula (2):
$$a_1 = \frac{\sum_i (x_{t,i}-\bar{x}_t)(y_i-\bar{y})}{\sum_i (x_{t,i}-\bar{x}_t)^2} \tag{2}$$
where $\bar{x}_t$ denotes the mean of $x_t$, $y_i$ denotes the label value of the $i$-th sample, and $\bar{y}$ denotes the mean of $y$.
$a_0$ denotes the constant term of the unary linear regression model, calculated as shown in formula (3):
$$a_0 = \bar{y} - a_1\bar{x}_t \tag{3}$$
Then, from the predicted values of the dependent variable, the degree of linear correlation between the independent variable and the dependent variable, namely the variance expansion coefficient $R^2$, can be calculated as shown in formula (4):
$$R^2 = 1 - \frac{RMSE}{SST} \tag{4}$$
where RMSE denotes the residual sum of squares of the regression model, calculated as shown in formula (5), and SST denotes the sum of squared deviations of the constructed regression model, calculated as shown in formula (6):
$$RMSE = \sum_i \left(y_i - f(x_{t,i})\right)^2 \tag{5}$$
$$SST = \sum_i (y_i-\bar{y})^2 \tag{6}$$
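For illustration only, formulas (1) to (6) can be sketched in plaintext Python as follows. The function name `fit_unary_regression` and the sample data in the usage note are hypothetical and are not part of the described method; the sketch involves no encryption and no split of data between the two ends.

```python
from typing import Sequence, Tuple


def fit_unary_regression(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (a0, a1, r2) for the least-squares line y' = a0 + a1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Formula (2): regression coefficient.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x is constant, so the regression coefficient is undefined")
    a1 = sxy / sxx
    # Formula (3): constant term.
    a0 = y_mean - a1 * x_mean
    # Formulas (5) and (6): residual sum of squares and sum of squared deviations.
    rss = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    if sst == 0:
        raise ValueError("y is constant, so R^2 is undefined")
    # Formula (4).
    r2 = 1.0 - rss / sst
    return a0, a1, r2
```

For example, `fit_unary_regression([1, 2, 3, 4], [3, 5, 7, 9])` fits the exact line with constant term 1 and regression coefficient 2, so the residual sum of squares is zero.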
In this step, the A end obtains the dependent-variable mean $\bar{y}$ from $y_i$, then calculates the difference $(y_i - \bar{y})$ between $y_i$ and $\bar{y}$, and sends $(y_i - \bar{y})$ to the B end. It should be noted that, because only the difference $(y_i - \bar{y})$ is sent from the A end to the B end, the B end cannot back-calculate $y_i$ from the difference, which ensures the data security of the A end.
Step S102, the second data end receives the difference between the dependent variable and the dependent-variable mean sent by the first data end, and calculates, from the difference, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.
Specifically, after receiving $(y_i - \bar{y})$, the B end can calculate the regression coefficient $a_1$ of the unary linear regression model according to formula (2).
Fig. 3 is a detailed implementation flowchart of steps S101 and S102 in the first embodiment of the present invention. As shown in fig. 3, steps S101 and S102 include steps S1011 to S1013 as follows:
Step S1011, the second data end calculates a first parameter from the independent variable and the independent-variable mean, encrypts the first parameter, and sends the encrypted first parameter to the first data end.
Correspondingly, at the first data end side, the first data end receives the encrypted first parameter sent by the second data end, and the first parameter is obtained by calculation according to the independent variable and the mean value of the independent variable.
In this step, the B end calculates the independent-variable mean $\bar{x}_t$ from $x_{t,i}$, and then calculates the first parameter from $x_{t,i}$ and $\bar{x}_t$:
Figure BDA0003606035450000103
The first parameter $u_1$ is related to the regression coefficient. The B end encrypts the first parameter $u_1$ and then sends the encrypted $u_1$ to the first data end.
Optionally, the second data end holds a second key pair comprising a second public key and a second private key, and the encrypted first parameter is obtained by encrypting with the second public key. Specifically, the B-side server generates a second key pair comprising a second public key $PK_B$ and a second private key $SK_B$. After obtaining $u_1$, the B end encrypts $u_1$ with $PK_B$ to obtain the encrypted first parameter $Enc_B(u_1)$ and sends $Enc_B(u_1)$ to the first data end. Note that $Enc_B(u_1)$ is encrypted with $PK_B$ and can be decrypted only with the second private key $SK_B$ held by the B end; that is, the A end cannot recover $u_1$.
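The text does not name the encryption scheme. One scheme that would allow the A end to compute on values encrypted under $PK_B$ is an additively homomorphic scheme such as Paillier; this is an assumption, not something stated in the source. The toy sketch below uses small fixed primes, integer plaintexts only, and a hypothetical class name `ToyPaillier` that bundles both keys for brevity. It is not secure and is meant only to show ciphertext addition and multiplication by a plaintext scalar.

```python
import math
import secrets


class ToyPaillier:
    """Toy Paillier cryptosystem with g = n + 1 (demonstration only, NOT secure)."""

    def __init__(self, p: int = 65537, q: int = 2147483647) -> None:
        # Public part: n (and n^2). Private part: lam and mu.
        self.n = p * q
        self.n2 = self.n * self.n
        self._lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
        self._mu = pow(self._lam, -1, self.n)  # modular inverse (Python 3.8+)

    def encrypt(self, m: int) -> int:
        """Encrypt a (possibly negative) integer with |m| < n / 2."""
        while True:
            r = secrets.randbelow(self.n - 1) + 1
            if math.gcd(r, self.n) == 1:
                break
        # (n + 1)^m mod n^2 equals 1 + m * n mod n^2.
        return ((1 + (m % self.n) * self.n) * pow(r, self.n, self.n2)) % self.n2

    def decrypt(self, c: int) -> int:
        u = pow(c, self._lam, self.n2)
        m = ((u - 1) // self.n) * self._mu % self.n
        # Map the upper half of the range back to negative integers.
        return m - self.n if m > self.n // 2 else m

    def add(self, c1: int, c2: int) -> int:
        """Ciphertext of (m1 + m2), computed without the private key."""
        return (c1 * c2) % self.n2

    def mul_plain(self, c: int, k: int) -> int:
        """Ciphertext of (k * m) for a plaintext integer k."""
        return pow(c, k % self.n, self.n2)
```

In such a scheme, real-valued parameters such as $u_1$ would first have to be encoded as scaled integers; that encoding is omitted here.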
Step S1012, the first data end obtains an encrypted second parameter according to the encrypted first parameter, the difference between the dependent variable and the dependent variable mean, and sends the encrypted second parameter to the second data end.
Correspondingly, the encrypted second parameter sent by the first data end is received at the second data end side, wherein the encrypted second parameter is obtained by calculation according to the encrypted first parameter, the difference value between the dependent variable and the mean value of the dependent variable.
In this step, the A end calculates the encrypted second parameter from $Enc_B(u_1)$ and the difference $(y_i - \bar{y})$:
Figure BDA0003606035450000105
Note that, because $Enc_B(u_1)$ is encrypted with $PK_B$, the resulting second parameter $v_1$ is also encrypted under $PK_B$.
Step S1013, the second data end decrypts the encrypted second parameter, and calculates, from the decrypted second parameter, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.
Optionally, the decrypted second parameter is obtained by decrypting with the second private key; that is, the B end uses the second private key $SK_B$ to decrypt $Enc_B(v_1)$ and obtain the decrypted second parameter $v_1$. The B end then calculates the regression coefficient $a_1$ from $v_1$, as shown in formula (7), which is a variant of formula (2).
Figure BDA0003606035450000106
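The exchange in steps S1011 to S1013 can be sketched as follows. Because the defining formulas for $u_1$ and $v_1$ are not reproduced in this text, the sketch assumes $u_{1,i} = x_{t,i} - \bar{x}_t$ and $v_1 = \sum_i u_{1,i}(y_i - \bar{y})$, which is one decomposition consistent with formula (2) but may differ from the original. `MockCipher` is a hypothetical stand-in that only mimics the operations an additively homomorphic ciphertext would support; it performs no real encryption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MockCipher:
    """Placeholder for a value encrypted under B's public key (NOT real encryption)."""

    hidden: float

    def __add__(self, other: "MockCipher") -> "MockCipher":
        # Ciphertext + ciphertext.
        return MockCipher(self.hidden + other.hidden)

    def times(self, k: float) -> "MockCipher":
        # Ciphertext * plaintext scalar.
        return MockCipher(self.hidden * k)


def b_send_encrypted_u1(x: Sequence[float]) -> List[MockCipher]:
    """Step S1011 (B end). Assumption: u1_i = x_i - mean(x)."""
    x_mean = sum(x) / len(x)
    return [MockCipher(xi - x_mean) for xi in x]


def a_compute_encrypted_v1(
    enc_u1: Sequence[MockCipher], y: Sequence[float]
) -> MockCipher:
    """Step S1012 (A end): combine Enc(u1) with the differences y_i - mean(y)."""
    y_mean = sum(y) / len(y)
    total = MockCipher(0.0)
    for c, yi in zip(enc_u1, y):
        total = total + c.times(yi - y_mean)
    return total


def b_regression_coefficient(enc_v1: MockCipher, x: Sequence[float]) -> float:
    """Step S1013 (B end): decrypt v1, then divide by sum((x_i - mean(x))^2)."""
    v1 = enc_v1.hidden  # stands in for decryption with B's private key
    x_mean = sum(x) / len(x)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return v1 / sxx
```

Under this assumed decomposition, the A end never sees $x_{t,i}$ in the clear and the B end learns only the aggregate $v_1$, not the individual differences.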
Step S103, the second data end obtains a third parameter from the regression coefficient and the independent-variable mean, obtains a fourth parameter from the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the first data end.
The third parameter and the encrypted fourth parameter are used to calculate the encrypted first correlation coefficient corresponding to the independent variable. Correspondingly, on the first data end side, the third parameter and the encrypted fourth parameter sent by the second data end are received, where the third parameter is calculated from the regression coefficient and the independent-variable mean, and the fourth parameter is calculated from the regression coefficient and the independent variable.
In this step, the B end obtains the third parameter $u_2 = a_1\bar{x}_t$ from the regression coefficient $a_1$ and the independent-variable mean $\bar{x}_t$, obtains the fourth parameter $w_1 = a_1 x_t$ from the regression coefficient $a_1$ and the independent variable $x_t$, encrypts $w_1$, and sends the encrypted $w_1$ to the A end. Optionally, the second data end encrypts $w_1$ with the second public key $PK_B$ to obtain the encrypted fourth parameter $Enc_B(w_1)$.
Optionally, a fifth parameter $w_2 = a_1^2 (x_t x_t)$ can also be obtained from the regression coefficient $a_1$ and the independent variable $x_t$; the fifth parameter $w_2$ is encrypted with the second public key $PK_B$ to obtain $Enc_B(w_2)$, and $Enc_B(w_2)$ is sent to the first data end.
Step S104, the first data end obtains the constant term of the unary linear regression model from the third parameter and the dependent-variable mean.
In this step, the A end obtains the constant term of the unary linear regression model, $a_0 = \bar{y} - u_2$, from $u_2$ and $\bar{y}$; this is a variant of formula (3).
Step S105, the first data end obtains the encrypted residual sum of squares from the encrypted fourth parameter, the constant term, and the dependent variable, and obtains the sum of squared deviations from the difference between the dependent variable and the dependent-variable mean.
In this step, the first data end obtains the encrypted residual sum of squares (RMSE) from $Enc_B(w_1)$, $a_0$, and $y_i$; optionally, it obtains the encrypted RMSE from $Enc_B(w_1)$, $Enc_B(w_2)$, $a_0$, and $y_i$, as shown in formula (8), which is a variant of formula (5). The first data end obtains the sum of squared deviations SST from $y_i$ and $\bar{y}$, as shown in formula (6).
$$Enc_B(RMSE) = \sum_i \left[(y_i - a_0)^2 - 2\,(y_i - a_0)\,Enc_B(w_{1,i}) + Enc_B(w_{2,i})\right] \tag{8}$$
where $w_{1,i}$ and $w_{2,i}$ denote the $i$-th components of $w_1$ and $w_2$.
Step S106, the first data end obtains the encrypted first correlation coefficient corresponding to the independent variable from the encrypted residual sum of squares and the sum of squared deviations, and sends the encrypted first correlation coefficient to the second data end.
In this step, the first correlation coefficient may also be called the variance expansion coefficient $R^2$. The A end obtains the encrypted variance expansion coefficient from $Enc_B(RMSE)$ and SST:
$$Enc_B(R^2) = 1 - \frac{Enc_B(RMSE)}{SST}$$
and sends $Enc_B(R^2)$ to the B end.
Step S107, the second data end receives the encrypted first correlation coefficient sent by the first data end, decrypts the encrypted first correlation coefficient, sends the decrypted first correlation coefficient to the first data end, and outputs the first correlation coefficient.
In this step, the B end decrypts $Enc_B(R^2)$ with the second private key $SK_B$ to obtain the decrypted first correlation coefficient, namely the variance expansion coefficient $R^2$, sends $R^2$ to the A end, and outputs $R^2$.
Step S108, the first data end receives the decrypted first correlation coefficient sent by the second data end, and outputs the first correlation coefficient.
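Steps S103 to S108 can be sketched as follows, assuming an additively homomorphic scheme and assuming that the A end expands the squared residual as $(y_i - a_0)^2 - 2(y_i - a_0)w_{1,i} + w_{2,i}$ so that only additions and plaintext multiplications are applied to ciphertexts. `MockCipher` and the function names are hypothetical; the mock class performs no real encryption.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MockCipher:
    """Placeholder for a value encrypted under B's public key (NOT real encryption)."""

    hidden: float

    def __add__(self, other: "MockCipher") -> "MockCipher":
        return MockCipher(self.hidden + other.hidden)

    def times(self, k: float) -> "MockCipher":
        return MockCipher(self.hidden * k)


def b_step_s103(
    a1: float, x: Sequence[float]
) -> Tuple[float, List[MockCipher], List[MockCipher]]:
    """B end: third parameter u2, encrypted fourth (w1) and fifth (w2) parameters."""
    u2 = a1 * (sum(x) / len(x))
    enc_w1 = [MockCipher(a1 * xi) for xi in x]
    enc_w2 = [MockCipher((a1 * xi) ** 2) for xi in x]
    return u2, enc_w1, enc_w2


def a_steps_s104_to_s106(
    u2: float,
    enc_w1: Sequence[MockCipher],
    enc_w2: Sequence[MockCipher],
    y: Sequence[float],
) -> MockCipher:
    """A end: constant term, encrypted residual sum of squares, encrypted R^2."""
    y_mean = sum(y) / len(y)
    a0 = y_mean - u2  # step S104
    enc_rss = MockCipher(0.0)
    for c1, c2, yi in zip(enc_w1, enc_w2, y):
        d = yi - a0
        # (d - w1)^2 = d^2 - 2*d*w1 + w2, using only ciphertext-friendly operations.
        enc_rss = enc_rss + MockCipher(d * d) + c1.times(-2.0 * d) + c2
    sst = sum((yi - y_mean) ** 2 for yi in y)  # plaintext at the A end
    if sst == 0:
        raise ValueError("y is constant, so R^2 is undefined")
    # Step S106: Enc(R^2) = 1 - Enc(RSS) / SST.
    return MockCipher(1.0) + enc_rss.times(-1.0 / sst)


def b_step_s107(enc_r2: MockCipher) -> float:
    """B end: decrypt R^2 (stand-in for decryption with B's private key)."""
    return enc_r2.hidden
```

The fifth parameter is what lets the A end avoid squaring a ciphertext, which an additively homomorphic scheme cannot do.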
As an optional embodiment, if the second data end includes a plurality of independent variables, the step of obtaining the difference between the dependent variable and the dependent variable mean is performed iteratively until a first correlation coefficient corresponding to each independent variable is output.
Specifically, if the B-side includes independent variables of multiple dimensions, the present embodiment may be adopted to analyze a first correlation coefficient between each independent variable and a dependent variable in the first data side.
As an optional embodiment, after outputting the first correlation coefficient corresponding to each argument, the method further includes: and selecting the independent variable of which the first correlation coefficient meets a first preset condition to form a candidate data set.
Specifically, the first correlation coefficient, namely the variance expansion coefficient $R^2$, describes the magnitude of the effect of the independent variable on the dependent variable. If $R^2$ is small, the independent variable explains the dependent variable poorly and the degree of linear correlation is low. The first preset condition may be that the coefficient exceeds a certain threshold, for example 0.3; that is, only when $R^2$ is greater than 0.3 is the independent variable considered to explain the dependent variable relatively well, and it may then be selected into the candidate data set. The candidate data set may be used as a training data set to train a model, or used for other purposes.
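The screening rule described above can be sketched as a simple threshold filter. The function name `select_candidates` and the variable names in the usage note are hypothetical; the default threshold of 0.3 is the example value given in the text.

```python
from typing import Dict, List


def select_candidates(
    r2_by_variable: Dict[str, float], threshold: float = 0.3
) -> List[str]:
    """Keep the independent variables whose first correlation coefficient exceeds the threshold."""
    return [name for name, r2 in r2_by_variable.items() if r2 > threshold]
```

For instance, with coefficients `{"x1": 0.45, "x2": 0.12, "x3": 0.30}` only `x1` is kept, because the comparison is strict.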
According to the univariate processing method provided by this embodiment of the present invention, the difference between the dependent variable and the dependent-variable mean is obtained and sent to the second data end, where it is used to calculate the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable. A third parameter and an encrypted fourth parameter sent by the second data end are received, where the third parameter is calculated from the regression coefficient and the independent-variable mean, and the fourth parameter is calculated from the regression coefficient and the independent variable. The constant term of the unary linear regression model is obtained from the third parameter and the dependent-variable mean. The encrypted residual sum of squares is obtained from the encrypted fourth parameter, the constant term, and the dependent variable, and the sum of squared deviations is obtained from the difference between the dependent variable and the dependent-variable mean. The encrypted first correlation coefficient corresponding to the independent variable is obtained from the encrypted residual sum of squares and the sum of squared deviations, and is sent to the second data end for decryption. The decrypted first correlation coefficient sent by the second data end is received and output. By transmitting encrypted intermediate parameters between the two data providers, this embodiment ensures the data security of both providers while allowing the degree of linear correlation between a single variable and the dependent variable to be analyzed, thereby providing an effective basis for the subsequent screening of feature variables.
On the basis of the foregoing embodiment, Fig. 4 is a schematic flowchart of a univariate processing method according to a second embodiment of the present invention. In this embodiment, the independent variables stored at the second data end are binned. As shown in Fig. 4, the univariate processing method includes:
step S201, the first data end sends the sample identifier and the encrypted sample tag value corresponding to the sample identifier to the second data end.
Correspondingly, at the second data end side, the sample identifier sent by the first data end and the encrypted sample tag value corresponding to the sample identifier are received.
Step S202, the second data terminal counts the encrypted sample label value of each box according to the sample identification to obtain an encrypted sample label statistic value, and sends the encrypted sample label statistic value to the first data terminal.
Correspondingly, at the first data end side, the encrypted sample tag statistic value sent by the second data end is received, wherein the encrypted sample tag statistic value is obtained by the second data end by counting the encrypted sample tag value of each box according to the sample identifier.
Step S203, the first data end decrypts the encrypted sample label statistic to obtain the dependent variable.
Step S204, the first data end obtains the difference between the dependent variable and the dependent-variable mean, and sends the difference to the second data end.
The difference is used to calculate the regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable.
Step S205, the second data end receives the difference between the dependent variable and the dependent-variable mean sent by the first data end, and calculates, from the difference, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.
Step S206, the second data end obtains a third parameter from the regression coefficient and the independent-variable mean, obtains a fourth parameter from the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the first data end.
Step S207, the first data end obtains the constant term of the unary linear regression model from the third parameter and the dependent-variable mean.
Step S208, the first data end obtains the encrypted residual sum of squares from the encrypted fourth parameter, the constant term, and the dependent variable, and obtains the sum of squared deviations from the difference between the dependent variable and the dependent-variable mean.
Step S209, the first data end obtains the encrypted first correlation coefficient corresponding to the independent variable from the encrypted residual sum of squares and the sum of squared deviations, and sends the encrypted first correlation coefficient to the second data end.
Step S210, the second data end receives the encrypted first correlation coefficient sent by the first data end, decrypts the encrypted first correlation coefficient, sends the decrypted first correlation coefficient to the first data end, and outputs the first correlation coefficient.
Step S211, the first data end receives the decrypted first correlation coefficient sent by the second data end, and outputs the first correlation coefficient.
Steps S204 to S211 in this embodiment are implemented in a manner similar to steps S101 to S108 in the above embodiment, and are not described again here.
The difference from the above embodiment is that, to reduce the risk of overfitting of the unary linear regression model to be constructed and to obtain a more stable regression model, the independent variables stored at the second data end are binned in this embodiment. The sample identifiers and the encrypted sample label values corresponding to the sample identifiers are sent to the second data end; the encrypted sample label statistic sent by the second data end is received, where the encrypted sample label statistic is obtained by the second data end by aggregating the encrypted sample label values of each bin according to the sample identifiers; and the encrypted sample label statistic is decrypted to obtain the dependent variable.
Specifically, all variables at the B end are first binned into $m$ bins in total, where the $i$-th bin contains $n_i$ samples. The averages of the independent-variable samples and of the dependent variable within the $i$-th bin are taken:
$$X_{t,i} = \frac{1}{n_i}\sum_{j \in \mathrm{bin}\ i} x_{t,j}, \qquad Y_i = \frac{1}{n_i}\sum_{j \in \mathrm{bin}\ i} y_j$$
Then, regression is performed by the least squares method; that is, the unary linear regression model constructed from the independent variable and the dependent variable is shown in formula (9):
$$Y' = f(X_{t,i}) = a_0 + a_1 X_t \tag{9}$$
where $X_t = [X_{t,0}, X_{t,1}, \ldots, X_{t,m}]^T$ and
$$a_1 = \frac{\sum_i (X_{t,i}-\bar{X}_t)(Y_i-\bar{Y})}{\sum_i (X_{t,i}-\bar{X}_t)^2}$$
where $\bar{X}_t$ and $\bar{Y}$ denote the means of $X_{t,i}$ and $Y_i$ over the bins.
Finally, the variance expansion coefficient is calculated from the regression predictions:
$$R^2 = 1 - \frac{RMSE}{SST}$$
where
$$RMSE = \sum_i \left(Y_i - f(X_{t,i})\right)^2, \qquad SST = \sum_i (Y_i - \bar{Y})^2$$
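As a plaintext illustration of the binned variant, the sketch below averages the samples in each bin and then fits the unary regression on the bin averages. The text does not specify how the bins are formed or whether bins are weighted by size, so the sketch takes precomputed bin identifiers as input and gives every bin equal weight; both choices, and the function names, are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def bin_means(
    x: Sequence[float], y: Sequence[float], bin_ids: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Return per-bin averages (X, Y), ordered by bin identifier."""
    sums: Dict[int, List[float]] = defaultdict(lambda: [0.0, 0.0, 0.0])
    for xi, yi, b in zip(x, y, bin_ids):
        acc = sums[b]
        acc[0] += xi   # running sum of x in the bin
        acc[1] += yi   # running sum of y in the bin
        acc[2] += 1.0  # sample count n_i
    keys = sorted(sums)
    big_x = [sums[k][0] / sums[k][2] for k in keys]
    big_y = [sums[k][1] / sums[k][2] for k in keys]
    return big_x, big_y


def binned_r_squared(big_x: Sequence[float], big_y: Sequence[float]) -> float:
    """Fit Y' = a0 + a1 * X on the bin averages and return R^2."""
    m = len(big_x)
    x_mean = sum(big_x) / m
    y_mean = sum(big_y) / m
    sxx = sum((xi - x_mean) ** 2 for xi in big_x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(big_x, big_y))
    a1 = sxy / sxx
    a0 = y_mean - a1 * x_mean
    rss = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(big_x, big_y))
    sst = sum((yi - y_mean) ** 2 for yi in big_y)
    return 1.0 - rss / sst
```

Fitting on bin averages smooths out sample-level noise, which is the stability benefit the embodiment refers to.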
In this embodiment, the A end first sends the sample identifiers in the sample data and the encrypted sample label values $y$ to the second data end. The B end then aggregates the encrypted $y$ of each bin according to the sample identifiers to obtain the encrypted sample label statistic y_bin_sum for each bin, and sends the encrypted y_bin_sum to the A end. The A end decrypts the encrypted y_bin_sum to obtain the decrypted y_bin_sum for each bin, and then uses the decrypted y_bin_sum to obtain the dependent variable $Y_i$ for each bin, that is, the bin average $Y_i = \text{y\_bin\_sum}_i / n_i$. Then, the A end and the B end cooperatively perform the method steps described in the first embodiment to obtain the first correlation coefficient corresponding to the independent variable.
Optionally, before step S201, the method further includes: the first data end generates a first key pair comprising a first public key and a first private key, and encrypts the sample label values with the first public key to obtain the encrypted sample label values. Decrypting the encrypted sample label statistic in step S203 includes: decrypting the encrypted sample label statistic with the first private key.
Specifically, the A end generates a first key pair comprising a first public key $PK_A$ and a first private key $SK_A$. The A end encrypts the sample label values $y$ with $PK_A$ to obtain $Enc_A(y)$ and sends $Enc_A(y)$ and the sample identifiers to the B end. The B end then aggregates the encrypted sample label values of each bin according to the sample identifiers to obtain the encrypted sample label statistic $Enc_A(\text{y\_bin\_sum})$ and sends $Enc_A(\text{y\_bin\_sum})$ to the A end. The A end decrypts it with the first private key $SK_A$ to obtain the $Y_i$ corresponding to each bin.
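The label aggregation in steps S201 to S203 can be sketched as follows. `MockCipherA` is a hypothetical stand-in for a ciphertext under $PK_A$ that supports only addition; it performs no real encryption. The sketch also assumes that the B end returns the sample count of each bin alongside the encrypted sum so that the A end can form the bin average; the text does not say how the A end learns these counts.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MockCipherA:
    """Placeholder for a label encrypted under A's public key (NOT real encryption)."""

    hidden: float

    def __add__(self, other: "MockCipherA") -> "MockCipherA":
        return MockCipherA(self.hidden + other.hidden)


def a_encrypt_labels(labels: Dict[str, float]) -> Dict[str, MockCipherA]:
    """Step S201 (A end): encrypt each label, keyed by sample identifier."""
    return {sample_id: MockCipherA(value) for sample_id, value in labels.items()}


def b_sum_per_bin(
    enc_labels: Dict[str, MockCipherA], bin_of_sample: Dict[str, int]
) -> Dict[int, Tuple[MockCipherA, int]]:
    """Step S202 (B end): add up the encrypted labels of each bin; also count samples."""
    stats: Dict[int, Tuple[MockCipherA, int]] = {}
    for sample_id, cipher in enc_labels.items():
        b = bin_of_sample[sample_id]
        if b in stats:
            total, count = stats[b]
            stats[b] = (total + cipher, count + 1)
        else:
            stats[b] = (cipher, 1)
    return stats


def a_bin_means(enc_stats: Dict[int, Tuple[MockCipherA, int]]) -> Dict[int, float]:
    """Step S203 (A end): decrypt each bin sum and divide by the bin's sample count."""
    return {b: total.hidden / count for b, (total, count) in enc_stats.items()}
```

The B end works only on ciphertexts and bin assignments, so it never sees an individual label value in the clear.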
Then, the A end obtains the dependent-variable mean $\bar{Y}$ from the dependent variable $Y_i$, calculates the difference $(Y_i - \bar{Y})$ between the dependent variable $Y_i$ and the dependent-variable mean $\bar{Y}$, and sends $(Y_i - \bar{Y})$ to the B end, so that the B end calculates, from $(Y_i - \bar{Y})$, the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.
Then, the B end obtains the third parameter $u_2 = a_1\bar{X}_t$ from the regression coefficient $a_1$ and the independent-variable mean $\bar{X}_t$, obtains the fourth parameter $w_1 = a_1 X_t$ from the regression coefficient $a_1$ and the independent variable $X_t$, encrypts $w_1$, and sends the encrypted $w_1$ to the A end. Optionally, the B end encrypts $w_1$ with the second public key $PK_B$ to obtain the encrypted fourth parameter $Enc_B(w_1)$. Optionally, a fifth parameter $w_2 = a_1^2 (X_t X_t)$ can also be obtained from the regression coefficient $a_1$ and the independent variable $X_t$; the fifth parameter $w_2$ is encrypted with the second public key $PK_B$ to obtain $Enc_B(w_2)$, and $Enc_B(w_2)$ is sent to the A end.
Then, the A end obtains the constant term of the unary linear regression model, $a_0 = \bar{Y} - u_2$, from $u_2$ and $\bar{Y}$, and obtains the encrypted residual sum of squares RMSE from $Enc_B(w_1)$, $a_0$, and $Y_i$, or optionally from $Enc_B(w_1)$, $Enc_B(w_2)$, $a_0$, and $Y_i$, as shown in formula (10). The first data end obtains the sum of squared deviations SST from $Y_i$ and $\bar{Y}$, as shown in formula (11).
$$Enc_B(RMSE) = \sum_i \left[(Y_i - a_0)^2 - 2\,(Y_i - a_0)\,Enc_B(w_{1,i}) + Enc_B(w_{2,i})\right] \tag{10}$$
$$SST = \sum_i (Y_i - \bar{Y})^2 \tag{11}$$
where $w_{1,i}$ and $w_{2,i}$ denote the $i$-th components of $w_1$ and $w_2$.
The A end obtains the encrypted variance expansion coefficient from $Enc_B(RMSE)$ and SST:
$$Enc_B(R^2) = 1 - \frac{Enc_B(RMSE)}{SST}$$
and sends $Enc_B(R^2)$ to the B end. The B end decrypts $Enc_B(R^2)$ with the second private key $SK_B$ to obtain the decrypted first correlation coefficient, namely the variance expansion coefficient $R^2$, and sends $R^2$ to the A end. The A end outputs $R^2$, and the second data end also outputs $R^2$.
Fig. 5 is a schematic diagram of a unary linear regression model fitted from an independent variable and a dependent variable according to an embodiment of the present invention. After the unary linear regression model shown in fig. 5 is fitted, the coefficient of variance expansion $R^2$ corresponding to the independent variable can be calculated.
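The fit described above can be illustrated in plaintext. This is a minimal sketch rather than the patented protocol: it assumes ordinary least squares for $a_1$ and the conventional definition $R^2 = 1 - RSS/SST$, because the patent's formula images are not reproduced here. The function name `unary_regression_r2` and the sample values are hypothetical, and all encryption is omitted.

```python
def unary_regression_r2(x, y):
    """Fit y = a0 + a1 * x by least squares and return (a1, a0, R^2)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope from centred values; only the differences Y_i - mean(Y) are needed.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    a1 = sxy / sxx
    # Constant term: a0 = mean(Y) - u2, with u2 = a1 * mean(X).
    a0 = y_mean - a1 * x_mean
    rss = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    # Assumed definition: R^2 = 1 - RSS / SST.
    return a1, a0, 1.0 - rss / sst


# Hypothetical per-bin values: x held by the B end, y recovered by the A end.
x_bins = [1.0, 2.0, 3.0, 4.0]
y_bins = [3.0, 5.0, 7.0, 9.0]
a1, a0, r2 = unary_regression_r2(x_bins, y_bins)
```

In the two-party setting the same quantities are split between the ends, so that neither side sees the other's raw column.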
According to the single variable processing method provided by the embodiment of the invention, the independent variables stored at the second data end are binned; the sample identifier and the encrypted sample label value corresponding to the sample identifier are sent to the second data end; the encrypted sample label statistic sent by the second data end is received, the statistic being obtained by the second data end aggregating the encrypted sample label values of each bin according to the sample identifier; and the encrypted sample label statistic is decrypted to obtain the dependent variable. By binning the independent variables, the embodiment of the invention makes the linear regression model more stable, and by encrypting the sample label values before sending them to the other data provider, it ensures the security of the data.
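The encrypted per-bin label statistic can be sketched with an additively homomorphic scheme. The patent text does not name the scheme, so the toy Paillier cryptosystem below is an assumption chosen for illustration; the tiny primes make it insecure, and the sample identifiers, labels and bin assignments are hypothetical.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid for generator g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 equals 1 + m * n.
    return (1 + m * n) * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


labels = {"s1": 1, "s2": 0, "s3": 1, "s4": 1}  # held by the A end
bins = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}    # bin index per sample, held by the B end

n, sk = keygen()                                        # A keeps sk, shares n
enc_labels = {sid: encrypt(n, y) for sid, y in labels.items()}  # A -> B

enc_bin_sum = {}                                        # B aggregates ciphertexts per bin
for sid, c in enc_labels.items():
    b = bins[sid]
    enc_bin_sum[b] = add(n, enc_bin_sum[b], c) if b in enc_bin_sum else c

y_bin_sum = {b: decrypt(n, sk, c) for b, c in enc_bin_sum.items()}  # B -> A, A decrypts
```

The B end only ever handles ciphertexts, while the A end learns the per-bin sums but not which samples fall into which bin beyond what the sums reveal.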
To further understand the embodiment of the present invention, fig. 6 is a schematic flowchart of a univariate processing method provided in a third embodiment of the present invention, and with reference to fig. 1 and fig. 6, the univariate processing method includes:
Step 1. A generates a public/private key pair $PK_A$, $SK_A$ and sends the public key $PK_A$ to B;
Step 2. B generates a public/private key pair $PK_B$, $SK_B$ ($PK_A \neq PK_B$, $SK_A \neq SK_B$) and sends the public key $PK_B$ to A;
Step 3. A sends the encrypted label $Enc_A(y)$ to B;
Step 4. B counts the positive and negative samples in each bin of its own feature and sends the label statistic $Enc_A(y\_bin\_sum)$ to A;
Step 5. A decrypts $Enc_A(y\_bin\_sum)$ with the private key $SK_A$ and calculates the $y_i$ values for the B end;
Step 6. B performs the calculation in Figure BDA0003606035450000161, encrypts $u_1$ with the public key $PK_B$, and sends $Enc_B(u_1)$ to A;
Step 7. A performs the calculation in Figure BDA0003606035450000162 and sends $Enc_B(v_1)$ to B;
Step 8. B decrypts $Enc_B(v_1)$ with the private key $SK_B$, performs the calculation in Figure BDA0003606035450000163, and sends $u_2$ to A;
Step 9. B performs the calculation in Figure BDA0003606035450000164 and $w_2 = a_1 X_t$, encrypts $w_1$ and $w_2$ with the public key $PK_B$, and sends $Enc_B(w_1)$ and $Enc_B(w_2)$ to A;
Step 10. A performs the calculations in Figures BDA0003606035450000171, BDA0003606035450000172 and BDA0003606035450000173, and sends $Enc_B(R^2)$ to B;
Step 11. B decrypts $Enc_B(R^2)$ with the private key $SK_B$ and sends $R^2$ to A;
Step 12. Steps 6 to 11 are repeated until all independent variables at the B end have been analysed.
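Steps 9 and 10 rely on the A end being able to form the residual sum of squares from $w_1$ and $w_2$ without seeing $X_t$. Formula (10) is not reproduced in this text, so the expansion below is only one form consistent with the definitions $w_1 = a_1 X_t$ and $w_2 = a_1^2 (X_t X_t)$ given earlier: $(Y_i - a_0 - w_1)^2 = (Y_i - a_0)^2 - 2 (Y_i - a_0) w_1 + w_2$. It is linear in $w_1$ and $w_2$, which is what a scheme supporting ciphertext addition and multiplication by a plaintext scalar would need (an assumption about the scheme). The sketch checks the identity on plaintext values; function names and data are hypothetical.

```python
def rss_direct(y, a0, a1, x):
    """Residual sum of squares computed with access to both x and y."""
    return sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))


def rss_from_parameters(y, a0, w1, w2):
    """Same quantity using only y, a0 and the parameters w1, w2.

    w1 and w2 appear only linearly, so in the protocol they could stay
    encrypted while y and a0 are plaintext scalars known to the A end.
    """
    return sum((yi - a0) ** 2 - 2 * (yi - a0) * w1i + w2i
               for yi, w1i, w2i in zip(y, w1, w2))


x = [1.0, 2.0, 3.0, 4.0]   # independent variable, held by the B end
y = [2.9, 5.2, 6.8, 9.1]   # dependent variable, held by the A end
a1, a0 = 2.0, 1.0          # illustrative coefficients

w1 = [a1 * xi for xi in x]            # fourth parameter
w2 = [a1 ** 2 * xi * xi for xi in x]  # fifth parameter

direct = rss_direct(y, a0, a1, x)
expanded = rss_from_parameters(y, a0, w1, w2)
```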
Specifically, in order to ensure that the information transmitted between the two parties is not cracked by a third party, the A-end server may encrypt a message sent to the B-end server with its private key $SK_A$, and the B-end server may decrypt the message with the received public key $PK_A$ of the A-end server; similarly, the B-end server may encrypt a message sent to the A-end server with $SK_B$, and the A-end server may decrypt the message with the received public key $PK_B$ of the B-end server.
In summary, the A-end server and the B-end server hold different data, and each performs the calculations involving its own data locally. Only encrypted intermediate parameters are transmitted between the two parties, and the receiver cannot recover the original data from them. The degree of linear correlation between each independent variable and the dependent variable is therefore analysed while the data security of both parties is ensured, which provides an effective basis for subsequent variable screening and the like.
The embodiment of the invention also provides a first data terminal. Fig. 7 is a schematic structural diagram of a first data end according to an embodiment of the present invention, and as shown in fig. 7, the first data end includes a first processing module 10, a first sending module 11, and a first receiving module 12;
the first processing module 10 is configured to obtain a difference between a dependent variable and the dependent variable mean, and send the difference to a second data end through the first sending module 11, where the difference is used to calculate a regression coefficient of a unary linear regression model constructed from an independent variable and the dependent variable; the first receiving module 12 is configured to receive a third parameter and an encrypted fourth parameter sent by the second data end, where the third parameter is calculated from the regression coefficient and the independent variable mean, and the fourth parameter is calculated from the regression coefficient and the independent variable; the first processing module 10 is further configured to obtain a constant term of the unary linear regression model according to the third parameter and the dependent variable mean; obtain an encrypted residual sum of squares according to the encrypted fourth parameter, the constant term and the dependent variable, and obtain a sum of squared deviations according to the difference between the dependent variable and the dependent variable mean; obtain an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual sum of squares and the sum of squared deviations, and send the encrypted first correlation coefficient to the second data end through the first sending module 11 for decryption; the first receiving module 12 is further configured to receive the decrypted first correlation coefficient sent by the second data end, and output the first correlation coefficient.
As an optional embodiment, the first receiving module 12 is specifically configured to receive an encrypted first parameter sent by the second data end, where the first parameter is obtained according to the independent variable and the independent variable mean; the first processing module 10 is specifically configured to obtain an encrypted second parameter according to the encrypted first parameter and the difference between the dependent variable and the dependent variable mean, and send the encrypted second parameter to the second data end through the first sending module 11 for decryption, where the decrypted second parameter is used to calculate the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable.
As an alternative embodiment, the second data terminal includes a second key pair, and the second key pair includes a second public key and a second private key; the encrypted fourth parameter and the encrypted first parameter are obtained by encrypting the second public key; and the decrypted first correlation coefficient and the decrypted second parameter are obtained by decrypting through the second private key.
As an alternative embodiment, the independent variables stored in the second data terminal are subjected to binning processing; the first sending module 11 is further configured to send the sample identifier and the encrypted sample tag value corresponding to the sample identifier to a second data terminal; the first receiving module 12 is configured to receive an encrypted sample tag statistic value sent by a second data end, where the encrypted sample tag statistic value is obtained by the second data end by counting encrypted sample tag values of each box according to a sample identifier; the first processing module 10 is configured to decrypt the encrypted sample tag statistic to obtain the dependent variable.
As an alternative embodiment, the first processing module 10 is further configured to generate a first key pair, where the first key pair includes a first public key and a first private key; encrypting a sample tag value through the first public key to obtain the encrypted sample tag value; and decrypting the encrypted sample label statistic value through the first private key.
As an optional embodiment, if the second data end includes a plurality of independent variables, the step of obtaining the difference between the dependent variable and the dependent variable mean is performed iteratively until a first correlation coefficient corresponding to each independent variable is output.
As an alternative embodiment, the first processing module 10 is further configured to select the independent variables whose first correlation coefficient satisfies a first preset condition, and to constitute a candidate data set from them.
The implementation principle and technical effect of the first data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.
The embodiment of the invention also provides a second data terminal. Fig. 8 is a schematic structural diagram of a second data end according to an embodiment of the present invention, as shown in fig. 8, the second data end includes a second processing module 20, a second sending module 21, and a second receiving module 22;
the second receiving module 22 is configured to receive the difference between a dependent variable and the dependent variable mean sent by the first data end; the second processing module 20 is configured to calculate, according to the difference, a regression coefficient of a unary linear regression model constructed from the independent variable and the dependent variable; obtain a third parameter according to the regression coefficient and the independent variable mean, obtain a fourth parameter according to the regression coefficient and the independent variable, and encrypt the fourth parameter; the second sending module 21 is configured to send the third parameter and the encrypted fourth parameter to the first data end, where the third parameter and the encrypted fourth parameter are used to calculate an encrypted first correlation coefficient corresponding to the independent variable; the second receiving module 22 is further configured to receive the encrypted first correlation coefficient sent by the first data end, decrypt it through the second processing module 20, and send the decrypted first correlation coefficient to the first data end through the second sending module 21.
As an optional embodiment, the second processing module 20 is configured to calculate a first parameter according to the independent variable and the independent variable mean, encrypt the first parameter, and send the encrypted first parameter to the first data end through the second sending module 21; the second receiving module 22 is configured to receive an encrypted second parameter sent by the first data end, where the encrypted second parameter is calculated from the encrypted first parameter and the difference between the dependent variable and the dependent variable mean; the second processing module 20 is further configured to decrypt the encrypted second parameter, and calculate the regression coefficient of the unary linear regression model constructed from the independent variable and the dependent variable according to the decrypted second parameter.
As an alternative embodiment, the second processing module 20 is further configured to perform binning processing on the independent variables; the second receiving module 22 is further configured to receive a sample identifier sent by the first data end, and an encrypted sample tag value corresponding to the sample identifier; the second processing module is further configured to count the encrypted sample tag values of each box according to the sample identifier to obtain an encrypted sample tag statistical value, and send the encrypted sample tag statistical value to the first data end through the second sending module 21 for decryption to obtain the dependent variable.
The implementation principle and technical effect of the second data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.
The embodiment of the invention also provides a single variable processing system. Referring to fig. 1, the univariate processing system comprises a first data terminal and a second data terminal; wherein the first data terminal and the second data terminal are configured to perform the method of any of the above embodiments.
The implementation principle and technical effect of the single variable processing system provided in this embodiment are similar to those of the above embodiments, and are not described herein again.
The present invention further provides a multi-variable processing method, which is applied to a multi-variable processing system including a third data end and a fourth data end, wherein the third data end stores a first characteristic variable set and the fourth data end stores a second characteristic variable set, as shown in fig. 1 (the A-end server in fig. 1 corresponds to the third data end, and the B-end server in fig. 1 corresponds to the fourth data end).
Fig. 9 is a flowchart illustrating a multi-variable processing method according to a fourth embodiment of the invention. As shown in fig. 9, the multi-variable processing method includes:
step S301, selecting any one of the characteristic variables in the first characteristic variable set as a target variable, taking other characteristic variables in the first characteristic variable set as first input variables, and taking the second characteristic variable set as second input variables.
Specifically, the A end stores a first characteristic variable set (Figure BDA0003606035450000191). The dimension of this set (Figure BDA0003606035450000192) is $f\_dim_A$. From this set (Figure BDA0003606035450000193), any one characteristic variable $x_t$ is selected as the target variable (dependent variable), i.e. Figure BDA0003606035450000194, and the other characteristic variables in the first characteristic variable set are used as the first input variables, i.e. Figure BDA0003606035450000201. The B end stores a second characteristic variable set $X_B$, whose dimension is $f\_dim_B$; $X_B$ is used as the second input variable.
Step S302, the fourth data terminal obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the third data terminal.
And the second input variable calculation value is used for determining a second correlation coefficient corresponding to the target variable. Correspondingly, a second input variable calculation value sent by the fourth data end is received at the third data end side, wherein the second input variable calculation value is obtained through calculation according to a second input variable and a second model parameter.
In this step, the B end can obtain the second input variable calculation value $P_B = W_B X_B$ from the second input variable $X_B$ and the second model parameter $W_B$, and send $P_B$ to the A-end server.
Step S303, the third data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter.
In this step, the A end can calculate the target variable predicted value $Y' = W_A X_A + P_B$ from the second input variable calculation value $P_B$, the first input variable $X_A$ and the first model parameter $W_A$.
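Steps S302 and S303 amount to each end computing a partial linear score on its own columns and the A end summing the two. A minimal plaintext sketch follows; the helper `partial_score` and all numeric values are hypothetical, and samples are assumed to be aligned row by row between the two ends.

```python
def partial_score(weights, rows):
    """Linear score w . x for each sample, using one party's columns only."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]


# B end: second input variables and second model parameters.
X_B = [[1.0, 0.5], [2.0, 1.5], [3.0, 2.5]]
W_B = [0.4, -0.2]
P_B = partial_score(W_B, X_B)  # second input variable calculation value, sent to A

# A end: first input variables and first model parameters.
X_A = [[0.1], [0.2], [0.3]]
W_A = [2.0]
Y_pred = [pa + pb for pa, pb in zip(partial_score(W_A, X_A), P_B)]  # Y' = W_A X_A + P_B
```

Only the aggregated score $P_B$ crosses the boundary, so the A end does not receive the individual columns of $X_B$.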
Step S304, the third data end obtains a residual sum of squares according to the dependent variable and the target variable predicted value, and obtains a sum of squared deviations according to the dependent variable and the dependent variable mean.
In this step, the A end can obtain the residual sum of squares $\sum_i (Y_i - Y'_i)^2$ from the dependent variable $Y_i$ and the target variable predicted value $Y'$, and obtain the sum of squared deviations $\sum_i (Y_i - \bar{Y})^2$ from $Y_i$ and the dependent variable mean $\bar{Y}$.
Step S305, the third data end determines a second correlation coefficient corresponding to the target variable according to the residual sum of squares and the sum of squared deviations, and outputs the second correlation coefficient.
In this step, the second correlation coefficient is also called the coefficient of variance expansion (Figure BDA0003606035450000205).
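The formula image for the coefficient of variance expansion is not reproduced in this text. As a hedged illustration only, the sketch below uses the conventional statistical definitions, $R^2 = 1 - RSS/SST$ and $VIF = 1/(1 - R^2)$; whether the patent's formula matches these exactly is an assumption. The function name and sample values are hypothetical.

```python
def r2_and_vif(y, y_pred):
    """Return (R^2, VIF) under the conventional definitions (assumed here)."""
    y_mean = sum(y) / len(y)
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # residual sum of squares
    sst = sum((yi - y_mean) ** 2 for yi in y)               # sum of squared deviations
    r2 = 1.0 - rss / sst
    vif = 1.0 / (1.0 - r2)  # equals sst / rss; undefined when the fit is exact
    return r2, vif


# Target variable and its prediction from the remaining variables (hypothetical).
r2, vif = r2_and_vif([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
```

A target variable that the other variables predict well has a small residual sum of squares and therefore a large value under this definition.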
As an optional embodiment, the step of selecting any one of the characteristic variables in the first characteristic variable set as the target variable is performed iteratively until a second correlation coefficient corresponding to each characteristic variable as the target variable is output.
Specifically, steps S301-S305 are executed $f\_dim_A$ times, outputting the second correlation coefficients corresponding to all characteristic variables in the first characteristic variable set at the A end. Similarly, any characteristic variable in the second characteristic variable set may be iteratively selected as the target variable; that is, the A end and the B end exchange roles and execute the steps of the above embodiment $f\_dim_B$ times in total, outputting the second correlation coefficients corresponding to all characteristic variables in the second characteristic variable set at the B end. In summary, $f\_dim_A + f\_dim_B$ analyses are needed in total, after which the second correlation coefficients corresponding to the characteristic variables of both parties are output.
As an optional embodiment, after outputting the second correlation coefficient corresponding to each feature variable as the target variable, the method further includes: and selecting the characteristic variables of which the second correlation coefficients meet second preset conditions to form a candidate data set.
Specifically, the second correlation coefficient, also called the coefficient of variance expansion (VIF), describes the degree of collinearity among the characteristic variables. To obtain a more reliable model, characteristic variables with a large VIF may be eliminated. Fig. 10 is a schematic diagram of the VIF corresponding to each characteristic variable according to an embodiment of the present invention. As shown in fig. 10, the VIF values of all characteristic variables are examined; if the VIF of a characteristic variable is found to be large (a significant outlier), that characteristic variable is removed, yielding a feature combination with low correlation and enhancing the interpretability of the model.
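The screening step can be sketched as a simple threshold filter. The text only says that variables with a large, outlying VIF are removed, so the fixed cut-off of 10 used here is an assumption (a common rule of thumb, not taken from the patent), and the feature names and VIF values are hypothetical.

```python
def drop_high_vif(vif_by_feature, threshold=10.0):
    """Keep the features whose VIF does not exceed the threshold."""
    return [name for name, vif in vif_by_feature.items() if vif <= threshold]


# Hypothetical VIF values gathered from both ends after all analyses.
vif_by_feature = {"x1": 1.3, "x2": 25.0, "x3": 2.1}
candidate_features = drop_high_vif(vif_by_feature, threshold=10.0)
```

An outlier-based rule, for example one relative to the median VIF, would fit the description equally well; the fixed threshold is simply the shortest variant to show.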
The multi-variable processing method provided by the embodiment of the invention is applied to a third data end in a multi-variable processing system, where the multi-variable processing system comprises the third data end and a fourth data end, a first characteristic variable set is stored at the third data end, and a second characteristic variable set is stored at the fourth data end. Any characteristic variable in the first characteristic variable set is selected as a target variable, the other characteristic variables in the first characteristic variable set are taken as first input variables, and the second characteristic variable set is taken as second input variables. A second input variable calculation value sent by the fourth data end is received, the second input variable calculation value being calculated from the second input variable and a second model parameter. A target variable predicted value is obtained from the second input variable calculation value, the first input variable and a first model parameter. A residual sum of squares is obtained from the target variable and the target variable predicted value, and a sum of squared deviations is obtained from the target variable and the target variable mean. A second correlation coefficient corresponding to the target variable is determined from the residual sum of squares and the sum of squared deviations, and the second correlation coefficient is output. By transmitting only intermediate parameters between the different data owners, the embodiment of the invention makes it possible to analyse the degree of linear correlation among the characteristic variables while ensuring the data security of the different data owners, and provides an effective basis for subsequent feature screening.
On the basis of the foregoing embodiment, fig. 11 is a schematic flowchart of a multi-variable processing method according to a fifth embodiment of the present invention, and as shown in fig. 11, the multi-variable processing method includes:
step S401, selecting any one of the characteristic variables in the first characteristic variable set as a target variable, taking other characteristic variables in the first characteristic variable set as first input variables, and taking the second characteristic variable set as second input variables.
Step S402, the fourth data end constructs a second model according to the initial value of the second model parameter and the second input variable, encrypts the second model, and sends the encrypted second model to the third data end.
Correspondingly, the encrypted second model sent by the fourth data end is received at the third data end, wherein the second model is constructed and obtained according to the initial value of the second model parameter and the second input variable.
Step S403, the third data end constructs a first model according to the initial value of the first model parameter, the first input variable and the target variable.
Optionally, steps S404, S406, S408, S410, and S412 are further included. The first model is encrypted at the third data end, and the encrypted first model is sent to the fourth data end, the encrypted first model is used for the fourth data end to calculate and obtain an encrypted second gradient value of the global loss function on the second model parameter, and the encrypted second gradient value is added with a second random number to obtain an encrypted second gradient related value; and receiving the encrypted second gradient correlation value sent by the fourth data end, decrypting the encrypted second gradient correlation value, sending the decrypted second gradient correlation value to the fourth data end, wherein the decrypted second gradient correlation value is used for the fourth data end to obtain a second gradient value, and updating a second model parameter according to the second gradient value. Correspondingly, at the fourth data end, receiving an encrypted first model sent by the third data end, wherein the first model is constructed and obtained according to a first model parameter initial value, a first input variable and a target variable; calculating and obtaining an encrypted second gradient value of the global loss function to a second model parameter according to the encrypted first model, the encrypted second model and a second input variable; adding the encrypted second gradient value and a second random number to obtain an encrypted second gradient correlation value, and sending the encrypted second gradient correlation value to a third data terminal for decryption; receiving a decrypted second gradient correlation value sent by a third data end, and obtaining a second gradient value according to the second gradient correlation value; and updating the second model parameter according to the second gradient value.
Step S404, the third data end encrypts the first model and sends the encrypted first model to the fourth data end.
Step S405, the third data end calculates the encrypted first gradient value of the global loss function with respect to the first model parameter according to the encrypted second model, the encrypted first model and the first input variable.
Step S406, the fourth data end calculates and obtains an encrypted second gradient value of the global loss function to the second model parameter according to the encrypted first model, the encrypted second model, and the second input variable.
Step S407, the third data end adds the encrypted first gradient value and the first random number to obtain an encrypted first gradient correlation value, and sends the encrypted first gradient correlation value to the fourth data end.
Step S408, the fourth data end adds the encrypted second gradient value and the second random number to obtain an encrypted second gradient correlation value, and sends the encrypted second gradient correlation value to the third data end.
Step S409, the fourth data end receives the encrypted first gradient correlation value, decrypts the encrypted first gradient correlation value, and sends the decrypted first gradient correlation value to the third data end.
Step S410, the third data end receives the encrypted second gradient correlation value sent by the fourth data end, decrypts the encrypted second gradient correlation value, and sends the decrypted second gradient correlation value to the fourth data end.
Step S411, the third data end receives the decrypted first gradient correlation value sent by the fourth data end, obtains a first gradient value according to the first gradient correlation value, and updates the first model parameter according to the first gradient value.
Step S412, the fourth data end receives the decrypted second gradient correlation value sent by the third data end, obtains a second gradient value according to the second gradient correlation value, and updates the second model parameter according to the second gradient value.
And iteratively executing the steps S402-S413 until the regression model constructed by the first input variable, the second input variable and the target variable converges and the global loss function obtains the minimum value.
Step S413, the fourth data end obtains a second input variable calculation value according to the second input variable and the second model parameter, and sends the second input variable calculation value to the third data end.
Step S414, the third data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter.
Step S415, the third data end obtains a residual sum of squares according to the dependent variable and the target variable predicted value, and obtains a sum of squared deviations according to the dependent variable and the dependent variable mean.
Step S416, the third data end determines a second correlation coefficient of the dependent variable, the first input variable and the second input variable according to the residual sum of squares and the sum of squared deviations, and outputs the second correlation coefficient.
The implementation manners of step S401, step S413 to step S416 in this embodiment are similar to the implementation manners of step S301 to step S305 in the foregoing embodiment, and are not described herein again.
The difference from the above embodiment is that the present embodiment further defines how to determine the model parameters constructed by the first input variable, the second input variable and the target variable, and in the present embodiment, the following steps are iteratively performed until the regression model constructed by the first input variable, the second input variable and the target variable converges and the global loss function takes the minimum value: receiving an encrypted second model sent by a fourth data end, wherein the second model is constructed and obtained according to a second model parameter initial value and a second input variable; constructing and obtaining a first model according to the initial value of the first model parameter, the first input variable and the target variable; calculating and obtaining an encrypted first gradient value of the global loss function to the first model parameter according to the encrypted second model, the encrypted first model and the first input variable; adding the encrypted first gradient value and a first random number to obtain an encrypted first gradient correlation value, and sending the encrypted first gradient correlation value to a fourth data terminal for decryption; receiving a decrypted first gradient correlation value sent by a fourth data end, and obtaining a first gradient value according to the first gradient correlation value; updating the first model parameter according to the first gradient value; further comprising: the third data end encrypts the first model and sends the encrypted first model to the fourth data end, wherein the encrypted first model is used for the fourth data end to calculate and obtain an encrypted second gradient value of the global loss function on the second model parameter, and the encrypted second gradient value and a second random number are added to obtain an encrypted second gradient correlation value; and receiving the encrypted second gradient 
correlation value sent by the fourth data end, decrypting the encrypted second gradient correlation value, sending the decrypted second gradient correlation value to the fourth data end, wherein the decrypted second gradient correlation value is used for the fourth data end to obtain a second gradient value, and updating a second model parameter according to the second gradient value.
Specifically, the embodiment of the present invention regresses the linear model by a gradient descent method until the regression model constructed by the first input variable, the second input variable, and the target variable converges and the global loss function obtains the minimum value.
For the first model at the A end, the first model parameter of the A end is initialized to obtain its initial value $w_A$, and the second model parameter of the B end is initialized to obtain its initial value $w_B$. The B end then constructs the second model $F_B = w_B X_B$ from $w_B$ and the second input variable $X_B$, encrypts $F_B$, and sends the encrypted $F_B$ to the A end.
Optionally, the fourth data end includes a fourth key pair comprising a fourth public key and a fourth private key, and the encrypted second model is obtained by encryption with the fourth public key. Specifically, the fourth key pair generated by the B end comprises a fourth public key $PK_B$ and a fourth private key $SK_B$; the B end encrypts $F_B$ with $PK_B$ to obtain the encrypted second model $Enc_B(F_B)$ and sends $Enc_B(F_B)$ to the A end, which cannot decrypt it.
Then, the A terminal starts value w according to the first model parameterAFirst input variable XAConstructing a first model with a target variable YFAIn which FA=wAXA-Y。
Then, the A terminal is according to the encrypted second model EncB(FB) First model FAAnd a first input variable XACalculating a first gradient value Enc for obtaining the encryption of the global loss function to the first model parameterB(GA)=(FA+EncB(FB))XA
Then, the A end encrypts the first gradient value EncB(GA) And a first random number RAAddition treatment of RAIs composed of random values, and GAObtaining the encrypted first gradient correlation value Enc by using vectors with the same dimensionB(GA+RA) (ii) a Will EncB(GA+RA) Sending to the B terminal, and the B terminal utilizes the fourth private key SKBFor EncB(GA+RA) Decrypting and sending GA+RATo terminal A, terminal A subtracts RAObtaining GAThen using GAThe first model parameters are updated.
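The masked gradient exchange described above can be sketched in a single process as follows. The text does not name the encryption scheme; this sketch assumes an additively homomorphic scheme and uses a toy Paillier construction with small, insecure parameters. The helper names (`enc`, `dec`, `add`, `mul`, `encrypted_gradient`) and the integer sample data are hypothetical, and real-valued data would additionally need a fixed-point encoding.

```python
import random

# Toy Paillier cryptosystem: additively homomorphic, tiny parameters, NOT secure.
P, Q = 2**31 - 1, 2**61 - 1      # two known Mersenne primes
N = P * Q                        # public key (plays the role of PK_B)
N2 = N * N
PHI = (P - 1) * (Q - 1)          # private key (plays the role of SK_B)
MU = pow(PHI, -1, N)             # modular inverse, needs Python 3.8+

def enc(m):
    """Encrypt an integer m using only the public key."""
    r = random.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def dec(c):
    """Decrypt with the private key and map back to a signed integer."""
    m = (pow(c, PHI, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m

def add(c1, c2):
    """Homomorphic addition: Enc(a) and Enc(b) give Enc(a + b)."""
    return c1 * c2 % N2

def mul(c, k):
    """Homomorphic scaling: Enc(a) and plaintext k give Enc(k * a)."""
    return pow(c, k % N, N2)

def encrypted_gradient(F_own, enc_F_other, X_own):
    """Compute Enc(G) = (F_own + Enc(F_other)) X_own on ciphertexts."""
    enc_resid = [add(c, enc(f)) for f, c in zip(F_own, enc_F_other)]
    grads = []
    for j in range(len(X_own[0])):
        acc = enc(0)
        for c, row in zip(enc_resid, X_own):
            acc = add(acc, mul(c, row[j]))
        grads.append(acc)
    return grads

# B end: build F_B = w_B X_B and send Enc_B(F_B) to A.
X_B, w_B = [[2], [1], [0]], [3]
F_B = [sum(w * x for w, x in zip(w_B, row)) for row in X_B]
enc_F_B = [enc(f) for f in F_B]

# A end: build F_A = w_A X_A - Y, compute Enc_B(G_A), mask it with R_A.
X_A, w_A, Y = [[1, 2], [3, 4], [5, 6]], [1, -1], [2, 0, 1]
F_A = [sum(w * x for w, x in zip(w_A, row)) - y for row, y in zip(X_A, Y)]
enc_G_A = encrypted_gradient(F_A, enc_F_B, X_A)
R_A = [random.randrange(-1000, 1000) for _ in enc_G_A]
masked = [add(g, enc(r)) for g, r in zip(enc_G_A, R_A)]

# B end: decrypts, but only ever sees G_A + R_A.
plain_masked = [dec(c) for c in masked]

# A end: removes the mask to recover G_A.
G_A = [m - r for m, r in zip(plain_masked, R_A)]
print(G_A)  # [-1, 2]
```

In this sketch the B end learns only the masked sum $G_A + R_A$, and the A end never sees $F_B$ in the clear, which is the property the random mask is meant to provide.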
Similarly, for the second model at the B end, after obtaining the first model $F_A = w_A X_A - Y$, the A end encrypts $F_A$ and sends the encrypted $F_A$ to the B end.
Optionally, the method further includes: the third data end generates a third key pair, wherein the third key pair comprises a third public key and a third private key; the encrypting the first model comprises: carrying out encryption processing on the first model through the third public key. Specifically, the third key pair generated by the A end includes the third public key $PK_A$ and the third private key $SK_A$, and the A end encrypts $F_A$ with $PK_A$ to obtain $Enc_A(F_A)$.
Then, the B end calculates, from $Enc_A(F_A)$, $F_B$, and the second input variable $X_B$, the encrypted second gradient value of the global loss function with respect to the second model parameter: $Enc_A(G_B) = (Enc_A(F_A) + F_B) X_B$. The B end then adds a second random number $R_B$ to $Enc_A(G_B)$, where $R_B$ is a vector of random values with the same dimension as $G_B$, to obtain the encrypted second gradient correlation value $Enc_A(G_B + R_B)$, and sends $Enc_A(G_B + R_B)$ to the A end. The A end decrypts $Enc_A(G_B + R_B)$ with the third private key $SK_A$ and sends $G_B + R_B$ to the B end. The B end subtracts $R_B$ to obtain $G_B$ and then uses $G_B$ to update the second model parameter.
As an optional embodiment, the method further comprises: the third data end obtains a first local loss function through calculation according to the first model, obtains an encrypted third local loss function through calculation according to the first model and the encrypted second model, and receives an encrypted second local loss function obtained through calculation by the fourth data end according to the second model; obtaining an encrypted global loss function value according to the first local loss function, the encrypted second local loss function and the third local loss function, and sending the encrypted global loss function value to a fourth data terminal for decryption; and receiving the decrypted global loss function value and the regression model convergence identifier sent by the fourth data terminal.
Specifically, the global loss function is shown in equation (12):
Figure BDA0003606035450000251
The global loss function Loss is decomposed into a local loss function $F_{sqrA}$ related to the first model, a local loss function $F_{sqrB}$ related to the second model, and a local loss function involving both the first model and the second model, as shown in equation (13):
Figure BDA0003606035450000252
Thus, to obtain the global loss function, the A end calculates the first local loss function $F_{sqrA}$ from the first model $F_A$, calculates the encrypted third local loss function from $F_A$ and the encrypted second model $Enc_B(F_B)$, and receives the encrypted second local loss function $F_{sqrB}$ calculated by the B end from the second model $F_B$. The A end obtains the encrypted value of the global loss function Loss from the first local loss function, the encrypted second local loss function, and the third local loss function, and sends the encrypted global loss function value to the B end for decryption. The A end then receives the decrypted global loss function value and the regression model convergence identifier sent by the B end. When the regression model converges and the global loss function reaches its minimum value, the updated first model parameter and second model parameter are determined as the final model parameters, which are used for calculating the second correlation coefficient of each independent variable.
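Equations (12) and (13) are not reproduced in this text. One reading consistent with the description, assuming a plain squared-error loss over samples $i$ and leaving out any normalization constant or regularization term the original equations may contain, is the following. The per-sample index $i$ is notation introduced here for illustration.

```latex
\begin{aligned}
\mathrm{Loss}
  &= \sum_i \bigl(F_{A,i} + F_{B,i}\bigr)^2
   = \underbrace{\sum_i F_{A,i}^2}_{F_{sqrA}}
   + \underbrace{\sum_i F_{B,i}^2}_{F_{sqrB}}
   + 2 \sum_i F_{A,i} F_{B,i}, \\
Enc_B(\mathrm{Loss})
  &= F_{sqrA} + Enc_B(F_{sqrB}) + 2 \sum_i F_{A,i}\, Enc_B(F_{B,i}).
\end{aligned}
```

Under this reading, the cross term is the third local loss function: the A end can evaluate it on ciphertexts because it only multiplies $Enc_B(F_{B,i})$ by its own plaintext values $F_{A,i}$ and sums the results, both of which an additively homomorphic scheme supports.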
The multivariate processing method provided by the embodiment of the invention iteratively executes the following steps until a regression model constructed by the first input variable, the second input variable and the target variable converges and a global loss function obtains a minimum value: receiving an encrypted second model sent by a fourth data end, wherein the second model is constructed and obtained according to a second model parameter initial value and a second input variable; constructing and obtaining a first model according to the initial value of the first model parameter, the first input variable and the target variable; calculating and obtaining an encrypted first gradient value of the global loss function to the first model parameter according to the encrypted second model, the encrypted first model and the first input variable; adding the encrypted first gradient value and a first random number to obtain an encrypted first gradient correlation value, and sending the encrypted first gradient correlation value to a fourth data terminal for decryption; receiving a decrypted first gradient correlation value sent by a fourth data end, and obtaining a first gradient value according to the first gradient correlation value; updating the first model parameter according to the first gradient value; further comprising: the third data end encrypts the first model and sends the encrypted first model to the fourth data end, the encrypted first model is used for the fourth data end to calculate and obtain an encrypted second gradient value of the global loss function to the second model parameter, and the encrypted second gradient value and a second random number are added to obtain an encrypted second gradient correlation value; receiving the encrypted second gradient correlation value sent by a fourth data end, decrypting the encrypted second gradient correlation value, and sending the decrypted second gradient correlation value to the fourth data end, 
wherein the decrypted second gradient correlation value is used for the fourth data end to obtain a second gradient value, and a second model parameter is updated according to the second gradient value; namely, in the embodiment of the present invention, the respective models are regressed by using a gradient descent method to obtain more stable model parameters.
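The iteration summarized above can be sketched without encryption to show the arithmetic that the two ends jointly carry out. This is a minimal plaintext sketch, not the protected protocol: the learning rate `lr`, the `1/n` scaling of the gradients, the stopping tolerance `tol`, and the function names are assumptions that the text does not specify.

```python
def dot(w, row):
    """Inner product of a weight vector and one sample row."""
    return sum(wi * xi for wi, xi in zip(w, row))

def train_two_party(X_A, X_B, Y, lr=0.1, tol=1e-12, max_iter=10000):
    """Gradient descent on F_A + F_B, with F_A = w_A X_A - Y and F_B = w_B X_B."""
    n = len(Y)
    w_A = [0.0] * len(X_A[0])
    w_B = [0.0] * len(X_B[0])
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_iter):
        F_A = [dot(w_A, row) - y for row, y in zip(X_A, Y)]   # held by the A end
        F_B = [dot(w_B, row) for row in X_B]                  # held by the B end
        resid = [a + b for a, b in zip(F_A, F_B)]
        loss = sum(r * r for r in resid) / (2 * n)
        if abs(prev_loss - loss) < tol:                       # convergence check
            break
        prev_loss = loss
        # G_A = (F_A + F_B) X_A and G_B = (F_A + F_B) X_B, scaled by 1/n.
        G_A = [sum(r * row[j] for r, row in zip(resid, X_A)) / n
               for j in range(len(w_A))]
        G_B = [sum(r * row[j] for r, row in zip(resid, X_B)) / n
               for j in range(len(w_B))]
        w_A = [w - lr * g for w, g in zip(w_A, G_A)]
        w_B = [w - lr * g for w, g in zip(w_B, G_B)]
    return w_A, w_B, loss

# Hypothetical data generated from Y = 2 * x_A + 3 * x_B.
X_A = [[1.0], [2.0], [3.0], [4.0]]
X_B = [[1.0], [0.0], [1.0], [0.0]]
Y = [5.0, 4.0, 9.0, 8.0]
w_A, w_B, final_loss = train_two_party(X_A, X_B, Y)
```

In the protected version each end would compute only its own gradient, on ciphertexts of the other end's model, and the loss check would be performed on the decrypted global loss value.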
For further understanding of the embodiments of the present invention, fig. 12 is a schematic flowchart of a multi-variable processing method according to a sixth embodiment of the present invention. With reference to fig. 1 and fig. 12, the multi-variable processing method includes:
Step 1. B generates a public/private key pair $PK_B$, $SK_B$ and sends the public key $PK_B$ to A;
Step 2. A generates a public/private key pair $PK_A$, $SK_A$ ($PK_A \neq PK_B$, $SK_A \neq SK_B$) and sends the public key $PK_A$ to B;
Step 3. B sends its own feature dimension $f\_dim_B$ to A;
Step 4. A sends its own feature dimension $f\_dim_A$ to B;
step 5, A from the full data
Figure BDA0003606035450000261
Is prepared from
Figure BDA0003606035450000262
Step 6. B calculates $F_B = w_B X_B$, encrypts $F_B$ with its own public key $PK_B$, and sends $Enc_B(F_B)$ to A;
Step 7. A calculates $F_A = w_A X_A - Y$, encrypts $F_A$ with its own public key $PK_A$, and sends $Enc_A(F_A)$ to B;
Step 8. B calculates $Enc_A(G_B) = (Enc_A(F_A) + F_B) X_B$ and sends $Enc_A(G_B + R_B)$ to A, where $R_B$ is a vector of random values (with the same dimension as $G_B$);
Step 9. A calculates $Enc_B(G_A) = (F_A + Enc_B(F_B)) X_A$ and sends $Enc_B(G_A + R_A)$ to B, where $R_A$ is a vector of random values (with the same dimension as $G_A$);
Step 10. B decrypts $Enc_B(G_A + R_A)$ with its private key $SK_B$ and sends $G_A + R_A$ to A;
Step 11. A decrypts $Enc_A(G_B + R_B)$ with its private key $SK_A$ and sends $G_B + R_B$ to B;
Step 12. B calculates
Figure BDA0003606035450000271
and sends $Enc_B(F_{sqrB})$ to A;
Step 13. A calculates
Figure BDA0003606035450000272
Figure BDA0003606035450000273
and sends $Enc_B(L) + Enc_B(L_{normA})$ to B;
Step 14. B decrypts $Enc_B(L) + Enc_B(L_{normA})$, calculates $L_{total} = L + L_{normA} + L_{normB}$, judges whether the model has currently been fitted, and sends $L_{total}$ and the fit flag (true/false) to A;
where $L_{normA}$ is the regularization term of the A end and $L_{normB}$ is the regularization term of the B end, used to constrain the model against overfitting.
Step 15. Both parties perform gradient optimization locally and update their model weights;
Step 16. Steps 5 to 15 are iterated until the model performance meets the requirement;
Step 17. B calculates $P_B = W_B X_B$ and sends $P_B$ to A;
Step 18. A calculates $Y = W_A X_A + P_B$
Figure BDA0003606035450000274
Figure BDA0003606035450000275
(if the current Y is a feature of party B, then B calculates the VIF, which then needs to be sent to A);
Step 19. Steps 5 to 8 are repeated until all feature variables have been analyzed, that is, all features of party A are analyzed in turn and then all features of party B are analyzed in turn, for a total of $(f\_dim_A + f\_dim_B)$ analyses.
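Steps 17 and 18 can be sketched as follows. The formulas for step 18 are not reproduced in this text, so the sketch assumes the standard definitions $R^2 = 1 - SSE/SST$ and $VIF = 1/(1 - R^2)$, where $SSE$ is the residual sum of squares and $SST$ is the dispersion (total) sum of squares. The function name and the sample values are hypothetical.

```python
def vif_from_partial_prediction(X_A, W_A, P_B, Y):
    """A end: combine its own linear term with B's partial prediction P_B."""
    Y_hat = [sum(w * x for w, x in zip(W_A, row)) + p
             for row, p in zip(X_A, P_B)]
    y_mean = sum(Y) / len(Y)
    sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # residual sum of squares
    sst = sum((y - y_mean) ** 2 for y in Y)              # dispersion sum of squares
    r2 = 1.0 - sse / sst
    # A perfect fit means the target is fully explained by the other variables.
    vif = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return r2, vif

# Hypothetical values: Y is the feature currently treated as the target variable.
X_A = [[1.0], [2.0], [3.0], [4.0]]
W_A = [0.5]
P_B = [0.5, 1.0, 1.0, 2.5]        # W_B X_B, received from the B end in step 17
Y = [1.0, 2.0, 3.0, 4.0]
r2, vif = vif_from_partial_prediction(X_A, W_A, P_B, Y)
```

With these values the residual sum of squares is 0.5 and the dispersion sum of squares is 5.0, so $R^2$ is 0.9 and the VIF is about 10.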
Steps 1 and 2 ensure that the information transmitted between the two parties cannot be decrypted by a third party. When sending a message to the B-end server, the A-end server may encrypt it with the private key $SK_A$, and the B-end server can decrypt the message with the received public key $PK_A$ of the A-end server; similarly, when sending a message to the A-end server, the B-end server can encrypt it with $SK_B$, and the A-end server can decrypt the message with the received public key $PK_B$ of the B-end server.
To sum up, the A-end server and the B-end server each hold different data. The two parties perform the data-related calculations locally, only encrypted intermediate parameters are transmitted between them, and the receiver cannot reverse-engineer the original data. The degree of multicollinearity of each independent variable is therefore analyzed while the data security of both parties is ensured, which provides an effective basis for subsequent variable screening and the like.
The embodiment of the invention also provides a third data terminal. Fig. 13 is a schematic structural diagram of a third data end according to an embodiment of the present invention, as shown in fig. 13, the third data end includes a third processing module 30 and a third receiving module 31;
the third processing module 30 is configured to select any one characteristic variable in the first characteristic variable set as a target variable, use other characteristic variables in the first characteristic variable set as first input variables, and use the second characteristic variable set as second input variables; a third receiving module 31, configured to receive a second input variable calculation value sent by the fourth data end, where the second input variable calculation value is obtained through calculation according to a second input variable and a second model parameter; the third processing module 30 is further configured to obtain a predicted value of the target variable according to the calculated value of the second input variable, the first input variable, and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.
As an alternative embodiment, the third data end further comprises a third sending module 32; the third processing module 30, the third receiving module 31 and the third sending module 32 are further configured to iteratively execute the following steps until the regression model constructed by the first input variable, the second input variable and the target variable converges and the global loss function takes the minimum value: the third receiving module 31 receives an encrypted second model sent by a fourth data end, wherein the second model is constructed and obtained according to a second model parameter initial value and a second input variable; the third processing module 30 builds a first model according to the initial value of the first model parameter, the first input variable and the target variable; calculating and obtaining an encrypted first gradient value of the global loss function to the first model parameter according to the encrypted second model, the encrypted first model and the first input variable; adding the encrypted first gradient value and the first random number to obtain an encrypted first gradient correlation value, and sending the encrypted first gradient correlation value to a fourth data end through the third sending module 32 for decryption; the third receiving module 31 receives the decrypted first gradient correlation value sent by the fourth data end, and obtains a first gradient value according to the first gradient correlation value through the third processing module 30; and updating the first model parameter according to the first gradient value.
As an optional implementation manner, the fourth data end includes a fourth key pair, where the fourth key pair includes a fourth public key and a fourth private key; wherein the encrypted second model is obtained by the fourth public key encryption; the decrypted first gradient correlation value is obtained by decrypting the fourth private key.
As an optional implementation manner, the third processing module 30 is further configured to encrypt the first model, and send the encrypted first model to the fourth data end, where the encrypted first model is used by the fourth data end to calculate and obtain an encrypted second gradient value of the global loss function on the second model parameter, and add the encrypted second gradient value and the second random number to obtain an encrypted second gradient correlation value; the third receiving module 31 is further configured to receive the encrypted second gradient correlation value sent by the fourth data end, decrypt the encrypted second gradient correlation value, and send the decrypted second gradient correlation value to the fourth data end, where the decrypted second gradient correlation value is used by the fourth data end to obtain a second gradient value, and update the second model parameter according to the second gradient value.
As an optional implementation manner, the third processing module 30 is further configured to: generating a third key pair, the third key pair comprising a third public key and a third private key; encrypting the first model through the third public key; and decrypting the encrypted second gradient correlation value by the third private key.
As an optional implementation manner, the third processing module 30 is further configured to obtain a first local loss function according to a first model calculation, obtain an encrypted third local loss function according to the first model and an encrypted second model calculation, and receive, by the third receiving module 31, an encrypted second local loss function obtained by a fourth data terminal according to the second model calculation; the third processing module 30 obtains an encrypted global loss function value according to the first local loss function, the encrypted second local loss function, and the third local loss function, and sends the encrypted global loss function value to the fourth data end through the third sending module 32 for decryption; the third receiving module 31 receives the decrypted global loss function value and the regression model convergence identifier sent by the fourth data terminal.
As an optional implementation manner, the third processing module 30 is configured to iteratively execute the step of selecting any one of the feature variables in the first feature variable set as a target variable until a second correlation coefficient corresponding to each feature variable as a target variable is output.
As an alternative embodiment, the third processing module 30 is configured to select a feature variable of which the second correlation coefficient satisfies a second preset condition, and constitute a candidate data set.
The implementation principle and technical effect of the third data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.
The embodiment of the invention also provides a fourth data terminal. Fig. 14 is a schematic structural diagram of a fourth data end according to an embodiment of the present invention, and as shown in fig. 14, the fourth data end includes a fourth processing module 40 and a fourth sending module 41; the fourth processing module 40 is configured to obtain a second input variable calculation value according to a second input variable and a second model parameter; the fourth sending module 41 is configured to send the second input variable calculated value to the third data end, where the second input variable calculated value is used to determine a second correlation coefficient corresponding to the target variable.
As an optional embodiment, the fourth data end further includes a fourth receiving module 42; the fourth processing module 40 is further configured to iteratively execute the following steps until the regression model constructed by the first input variable, the second input variable, and the target variable converges and the global loss function takes a minimum value: a second model is constructed according to a second model parameter initial value and a second input variable, the second model is encrypted, the encrypted second model is sent to a third data end through a fourth sending module 41, the encrypted second model is used for the third data end to obtain an encrypted first gradient value of the global loss function on the first model parameter, and the encrypted first gradient value and a first random number are added to obtain an encrypted first gradient correlation value; the fourth receiving module 42 is configured to receive the encrypted first gradient correlation value sent by the third data end, decrypt the encrypted first gradient correlation value through the fourth processing module 40, and send the decrypted first gradient correlation value to the third data end through the fourth sending module 41, where the decrypted first gradient correlation value is used by the third data end to obtain a first gradient value, and update the first model parameter according to the first gradient value.
As an optional embodiment, the fourth receiving module 42 is further configured to receive an encrypted first model sent by a third data end, where the first model is constructed and obtained according to a first model parameter initial value, a first input variable, and a target variable; the fourth processing module 40 is further configured to calculate, according to the encrypted first model, the encrypted second model, and the second input variable, an encrypted second gradient value of the global loss function to the second model parameter; adding the encrypted second gradient value and a second random number to obtain an encrypted second gradient correlation value, and sending the encrypted second gradient correlation value to a third data terminal for decryption; the fourth receiving module 42 is configured to receive the decrypted second gradient correlation value sent by the third data end, and obtain a second gradient value according to the second gradient correlation value through the fourth processing module 40; and updating the second model parameter according to the second gradient value.
The implementation principle and technical effect of the fourth data terminal provided in this embodiment are similar to those of the above embodiments, and are not described herein again.
The embodiment of the invention also provides a multi-variable processing system. Referring to FIG. 1, the multivariable system comprises a third data terminal and a fourth data terminal; wherein the third data terminal and the fourth data terminal cooperate to perform the method of any one of the embodiments.
The multi-variable processing system provided in this embodiment has similar implementation principles and technical effects as those of the above embodiments, and will not be described herein again.
The invention further provides a variable screening method, which is applied to a variable screening system, wherein the variable screening system comprises a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set, which can be shown in fig. 1 (an a-end server in fig. 1 is equivalent to the fifth data end, and a B-end server in fig. 1 is equivalent to the sixth data end).
The variable screening method comprises the following steps of S501-S517:
step S501, the fifth data end obtains a first correlation coefficient of each independent variable and dependent variable in the first independent variable set.
Specifically, the A end stores the dependent variable and some of the independent variables (i.e., the first independent variable set) corresponding to the sample data, and the B end stores the remaining independent variables (i.e., the second independent variable set) corresponding to the sample data. First, the A end obtains the degree of linear correlation between each of its own independent variables and the dependent variable (i.e., the first correlation coefficient, the variance expansion coefficient $R^2$ in the above embodiment).
Step S502, the fifth data end obtains the difference value of the dependent variable and the mean value of the dependent variable, and sends the difference value to the sixth data end.
Step S503, the sixth data end receives the difference, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable and dependent variable in the first independent variable set according to the difference.
Step S504, the sixth data end obtains a third parameter according to the regression coefficient and the independent variable mean value, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end.
Step S505, the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the mean value of the dependent variable, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference value between the dependent variable and the mean value of the dependent variable; and obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end.
Step S506, the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end.
Step S507, the fifth data end receives the decrypted first correlation coefficient, and outputs the first correlation coefficient; and iteratively executing the step of obtaining the difference value between the dependent variable and the dependent variable mean value until a first correlation coefficient corresponding to each independent variable in the second independent variable set is output.
Step S508, selecting an independent variable in the first independent variable set whose first correlation coefficient satisfies a first preset condition to form a first candidate data set, and selecting an independent variable in the second independent variable set whose first correlation coefficient satisfies the first preset condition to form a second candidate data set.
Specifically, steps S502 to S507 are implemented similarly to the univariate processing method described in the first aspect and are used to obtain the variance expansion coefficient $R^2$ between each independent variable at the B end and the dependent variable at the A end. Independent variables whose $R^2$ is greater than a certain threshold (e.g., 0.3) are then selected from the first independent variable set to form the first candidate data set, and independent variables whose $R^2$ is greater than a certain threshold (e.g., 0.3) are selected from the second independent variable set to form the second candidate data set.
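The exchange in steps S502 to S508 can be sketched in plaintext as follows. Encryption of the fourth parameter and of the resulting coefficient is omitted here. The sketch assumes that the regression coefficient is the least-squares slope, that the third parameter is the product of the regression coefficient and the independent variable mean, and that the fourth parameter is the product of the regression coefficient and each independent variable value; the text only says these parameters are obtained "according to" those quantities. All function names and sample values are hypothetical.

```python
def b_side_params(x, diff):
    """Sixth data end: slope from the received differences y - mean(y)."""
    x_mean = sum(x) / len(x)
    b = (sum((xi - x_mean) * d for xi, d in zip(x, diff))
         / sum((xi - x_mean) ** 2 for xi in x))
    third = b * x_mean                 # assumed form of the third parameter
    fourth = [b * xi for xi in x]      # assumed form; encrypted before sending
    return third, fourth

def a_side_r2(y, third, fourth):
    """Fifth data end: constant term, residual and dispersion sums, then R^2."""
    y_mean = sum(y) / len(y)
    const = y_mean - third
    sse = sum((yi - const - fi) ** 2 for yi, fi in zip(y, fourth))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - sse / sst

def screen(r2_by_name, threshold=0.3):
    """Keep the variables whose R^2 exceeds the threshold (0.3 in the text)."""
    return [name for name, r2 in r2_by_name.items() if r2 > threshold]

y = [1.0, 2.0, 3.0, 4.0]                                  # dependent variable (A end)
features_B = {"x1": [2.0, 4.0, 6.0, 8.0],                 # independent variables (B end)
              "x2": [1.0, -1.0, -1.0, 1.0]}
y_mean = sum(y) / len(y)
diff = [yi - y_mean for yi in y]                          # sent from A to B (S502)
r2_by_name = {name: a_side_r2(y, *b_side_params(x, diff))
              for name, x in features_B.items()}
second_candidates = screen(r2_by_name)
```

Here `x1` is perfectly linear in `y` and is kept, while `x2` is uncorrelated with `y` and is dropped.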
Step S509, selecting any independent variable in the first candidate data set as a target variable, taking other independent variables in the first candidate data set as first input variables, and taking the second candidate data set as second input variables.
And step S510, the sixth data end obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the fifth data end.
Step S511, the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.
Step S512, iteratively executing the step of selecting any one of the independent variables in the first candidate data set as the target variable until a second correlation coefficient corresponding to each independent variable in the first candidate data set as the target variable is output.
Step S513, selecting any independent variable in the second candidate data set as a target variable, taking other independent variables in the second candidate data set as third input variables, and taking the first candidate data set as a fourth input variable.
And step S514, the fifth data end obtains a fourth input variable calculation value according to a fourth input variable and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end.
Step S515, the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variable and the third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.
Step S516, iteratively executing the step of selecting any independent variable in the second candidate data set as the target variable until a second correlation coefficient corresponding to each independent variable in the second candidate data set as the target variable is output.
Step S517, selecting an independent variable whose second correlation coefficient in the first candidate data set satisfies a second preset condition to form a third candidate data set, selecting an independent variable whose second correlation coefficient in the second candidate data set satisfies the second preset condition to form a fourth candidate data set, and forming a final candidate data set by using the third candidate data set and the fourth candidate data set.
Specifically, after obtaining the first candidate data set and the second candidate data set, the multivariate processing method as described in the above embodiment is then performed to obtain the degree of multicollinearity of each independent variable with other independent variables in the two candidate data sets, i.e., the coefficient of variance expansion VIF, and to reject the independent variable with a larger VIF value, thereby forming the final candidate data set.
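The order of the two stages can be sketched as follows. This is a minimal sketch: the callable `vif_of` is a hypothetical stand-in for the multivariate processing method, the VIF cut-off of 10 is a common rule of thumb rather than a value given in the text (which only says that independent variables with larger VIF values are rejected), and the sample values are invented.

```python
def two_stage_screen(r2, vif_of, r2_min=0.3, vif_max=10.0):
    """Stage 1 keeps variables with R^2 above r2_min; stage 2 rejects high VIF."""
    candidates = [name for name, value in r2.items() if value > r2_min]
    # VIF is evaluated only against the variables that survived stage 1.
    return [name for name in candidates if vif_of(name, candidates) <= vif_max]

# Hypothetical first correlation coefficients for variables of both ends.
r2 = {"a1": 0.8, "a2": 0.1, "b1": 0.5, "b2": 0.6}
# Hypothetical VIF values standing in for the output of the two-party protocol.
stub_vif = {"a1": 2.0, "b1": 15.0, "b2": 3.0}
final = two_stage_screen(r2, lambda name, cands: stub_vif[name])
```

In this example `a2` is removed in the first stage for its low $R^2$, and `b1` is removed in the second stage for its high VIF, leaving `a1` and `b2`.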
To sum up, the variable screening method provided by the embodiment of the present invention first analyzes the variance expansion coefficient $R^2$ between each independent variable and the dependent variable and keeps the independent variables whose $R^2$ is larger, and then analyzes the degree of multicollinearity (VIF) of each remaining independent variable and rejects the independent variables whose VIF value is larger, thereby providing an effective basis for subsequent feature screening.
The embodiment of the invention also provides another variable screening method, which comprises the following steps:
step S601, selecting any one of the independent variables in the first independent variable set as a target variable, taking other independent variables in the first independent variable set as first input variables, and taking the second independent variable set as second input variables.
Step S602, the sixth data end obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the fifth data end.
Step S603, the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.
Step S604, iteratively executing the step of selecting any one of the independent variables in the first independent variable set as a target variable until a second correlation coefficient corresponding to each independent variable in the first independent variable set as a target variable is output.
Step S605, selecting any independent variable in the second independent variable set as a target variable, using other independent variables in the second independent variable set as third input variables, and using the first independent variable set as fourth input variables.
Step S606, the fifth data end obtains a fourth input variable calculation value according to a fourth input variable and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end.
Step S607, the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variable and the third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; and determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient.
Step S608, iteratively executing the step of selecting any one of the independent variables in the second independent variable set as the target variable until a second correlation coefficient corresponding to each independent variable in the second independent variable set as the target variable is output.
Step S609, selecting the independent variables in the first independent variable set whose second correlation coefficient satisfies a second preset condition to form a fifth candidate data set, and selecting the independent variables in the second independent variable set whose second correlation coefficient satisfies the second preset condition to form a sixth candidate data set.
Step S610, the fifth data end obtains a first correlation coefficient between each independent variable and dependent variable in the fifth candidate data set.
Step S611, the fifth data end obtains a difference between the dependent variable and the mean of the dependent variable, and sends the difference to the sixth data end.
Step S612, the sixth data end receives the difference, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable and dependent variable in the sixth candidate data set according to the difference.
Step S613, the sixth data end obtains a third parameter according to the regression coefficient and the mean of the independent variables, obtains a fourth parameter according to the regression coefficient and the independent variables, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end.
Step S614, the fifth data end receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the mean of the dependent variable, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to the difference between the dependent variable and the mean of the dependent variable; it then obtains an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sends the encrypted first correlation coefficient to the sixth data end.
Step S615, the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end.
Step S616, the fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient.
Step S617, iteratively executing the step of obtaining the difference between the dependent variable and the dependent variable mean until the first correlation coefficient corresponding to each independent variable in the sixth candidate data set is output.
Step S618, selecting the independent variables in the fifth candidate data set whose first correlation coefficient satisfies a first preset condition to form a seventh candidate data set, selecting the independent variables in the sixth candidate data set whose first correlation coefficient satisfies the first preset condition to form an eighth candidate data set, and forming a final candidate data set from the seventh candidate data set and the eighth candidate data set.
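The arithmetic behind steps S611 to S616 can be illustrated with a plaintext sketch. Encryption is deliberately left out, and the exact forms of the "third parameter" (taken here as slope times the independent variable mean) and the "fourth parameter" (taken here as slope times each independent variable value) are assumptions made for illustration; the function name `univariate_r2` is hypothetical.

```python
def univariate_r2(x, y):
    """Plaintext walk-through of steps S611 to S616 (no encryption)."""
    n = len(y)
    y_mean = sum(y) / n
    x_mean = sum(x) / n

    # Fifth data end: difference between the dependent variable and its mean,
    # which is what gets sent to the sixth data end.
    y_diff = [yi - y_mean for yi in y]

    # Sixth data end: least-squares regression coefficient (slope) of the
    # unary linear regression model, computed from the received differences.
    sxy = sum((xi - x_mean) * di for xi, di in zip(x, y_diff))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx

    # Assumed forms of the third and fourth parameters.
    third = slope * x_mean
    fourth = [slope * xi for xi in x]

    # Fifth data end: constant term, residual square sum, dispersion square sum.
    constant = y_mean - third
    rss = sum((yi - constant - fi) ** 2 for yi, fi in zip(y, fourth))
    tss = sum(di ** 2 for di in y_diff)

    # First correlation coefficient, taken here as R^2 = 1 - RSS / TSS.
    return 1.0 - rss / tss
```

In the protocol described above the fourth parameter would travel in encrypted form and the result would be decrypted by the sixth data end; the sketch only shows which party holds which quantity.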
In summary, the variable screening method provided by the embodiments of the present invention first analyzes the degree of multicollinearity (VIF) of each independent variable and selects the feature combination with the larger VIF value; it then analyzes the correlation coefficient R² between each independent variable and the dependent variable in that feature combination to obtain the independent variables with the larger R², which provides an effective basis for subsequent feature screening.
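As a rough illustration of the multicollinearity analysis, the sketch below computes a second correlation coefficient for one target variable when the input variables are split across two data ends, and converts it to a VIF with the standard definition VIF = 1 / (1 - R²). The linear form of the model, the `bias` argument and all function names are assumptions for illustration; the text does not specify how the model parameters are obtained.

```python
def second_correlation_coefficient(target, local_inputs, local_params,
                                   remote_inputs, remote_params, bias=0.0):
    """R^2 of a target variable predicted from inputs held by two data ends."""
    # Remote data end: weighted sum over its own input variables only.
    # Only these per-sample values are sent, not the raw inputs.
    remote_value = [sum(w * v for w, v in zip(remote_params, row))
                    for row in remote_inputs]

    # Local data end: add its own weighted sum to the received values.
    predicted = [bias + sum(w * v for w, v in zip(local_params, row)) + r
                 for row, r in zip(local_inputs, remote_value)]

    mean = sum(target) / len(target)
    rss = sum((t - p) ** 2 for t, p in zip(target, predicted))  # residual square sum
    tss = sum((t - mean) ** 2 for t in target)                  # dispersion square sum
    return 1.0 - rss / tss


def vif(r2):
    """Variance inflation factor from R^2; undefined when r2 equals 1."""
    return 1.0 / (1.0 - r2)
```

For example, with `target = [2.0, 4.0, 7.0, 7.0]` and one input on each side equal to `[1, 2, 3, 4]` with weight `1.0`, the residual square sum is 2 and the dispersion square sum is 18, so R² is 8/9 and the VIF is 9.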
As shown in fig. 1, the variable screening system includes a fifth data terminal and a sixth data terminal; and the fifth data end and the sixth data end are used for realizing the variable screening method.
As shown in fig. 15, an embodiment of the present invention provides an electronic device, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 communicate with one another via the communication bus 114;
the memory 113 is configured to store a computer program;
in an embodiment of the present invention, the processor 111 is configured to implement, when executing the program stored in the memory 113, the steps of the single variable processing method or the multi-variable processing method provided in any one of the foregoing method embodiments.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the univariate processing method or the multivariate processing method provided in any of the foregoing method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (18)

1. A univariate processing method, characterized by being applied to a first data terminal in a univariate processing system, wherein the univariate processing system comprises the first data terminal and a second data terminal, the first data terminal stores a dependent variable, and the second data terminal stores an independent variable; the method comprises the following steps:
obtaining a difference value between the dependent variable and the mean value of the dependent variable, and sending the difference value to a second data terminal, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable;
receiving a third parameter and an encrypted fourth parameter sent by a second data end, wherein the third parameter is obtained by calculation according to the regression coefficient and the independent variable mean value, and the fourth parameter is obtained by calculation according to the regression coefficient and the independent variable;
obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value;
obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtaining a dispersion square sum according to a difference value of the dependent variable and the dependent variable mean value;
obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a second data terminal for decryption;
and receiving the decrypted first correlation coefficient sent by the second data terminal, and outputting the first correlation coefficient.
2. The method according to claim 1, wherein the obtaining a difference value between the dependent variable and a mean value of the dependent variable, and sending the difference value to a second data terminal, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable, comprises:
receiving an encrypted first parameter sent by a second data end, wherein the first parameter is obtained by calculation according to an independent variable and an independent variable mean value;
and obtaining an encrypted second parameter according to the encrypted first parameter and the difference value between the dependent variable and the mean value of the dependent variable, and sending the encrypted second parameter to a second data terminal for decryption, wherein the decrypted second parameter is used for calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable.
3. The method of claim 2, wherein the second data terminal comprises a second key pair, the second key pair comprising a second public key and a second private key;
the encrypted fourth parameter and the encrypted first parameter are obtained by encrypting the second public key;
and the decrypted first correlation coefficient and the decrypted second parameter are obtained by decrypting through the second private key.
4. The method according to any one of claims 1 to 3, wherein the independent variables stored at the second data terminal are subjected to binning; before obtaining the difference value between the dependent variable and the dependent variable mean, the method further includes:
sending the sample identifier and the encrypted sample tag value corresponding to the sample identifier to a second data terminal;
receiving an encrypted sample tag statistic value sent by a second data end, wherein the encrypted sample tag statistic value is obtained by the second data end through statistics on the encrypted sample tag value of each box according to a sample identifier;
and decrypting the encrypted sample label statistic to obtain the dependent variable.
5. The method of claim 4, wherein before sending the sample identifier and the encrypted sample tag value corresponding to the sample identifier to the second data terminal, the method further comprises:
generating a first key pair, the first key pair comprising a first public key and a first private key;
encrypting a sample tag value through the first public key to obtain the encrypted sample tag value;
the decrypting the encrypted sample tag statistic includes:
and decrypting the encrypted sample label statistic value through the first private key.
6. The method according to any one of claims 1 to 3, wherein if the second data terminal includes a plurality of independent variables, the step of obtaining the difference between the dependent variable and the mean of the dependent variable is performed iteratively until a first correlation coefficient corresponding to each independent variable is output.
7. The method of claim 6, wherein after outputting the first correlation coefficient corresponding to each argument, further comprising:
and selecting the independent variable of which the first correlation coefficient meets a first preset condition to form a candidate data set.
8. A univariate processing method, characterized by being applied to a second data terminal in a univariate processing system, wherein the univariate processing system comprises a first data terminal and the second data terminal, the first data terminal stores a dependent variable, and the second data terminal stores an independent variable; the method comprises the following steps:
receiving a difference value of a dependent variable and a dependent variable mean value sent by a first data end, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the difference value;
obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter;
sending the third parameter and the encrypted fourth parameter to the first data end, wherein the third parameter and the encrypted fourth parameter are used for calculating an encrypted first correlation coefficient corresponding to the independent variable;
and receiving the encrypted first correlation coefficient sent by the first data end, decrypting the encrypted first correlation coefficient, sending the decrypted first correlation coefficient to the first data end, and outputting the first correlation coefficient.
9. The method according to claim 8, wherein the receiving a difference between a dependent variable sent by the first data terminal and a mean value of the dependent variable, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the difference comprises:
calculating according to the independent variable and the independent variable mean value to obtain a first parameter, encrypting the first parameter, and sending the encrypted first parameter to a first data terminal;
receiving an encrypted second parameter sent by a first data end, wherein the encrypted second parameter is obtained by calculation according to the encrypted first parameter, a difference value of a dependent variable and a mean value of the dependent variable;
and decrypting the encrypted second parameter, and calculating a regression coefficient of a unary linear regression model constructed by the independent variable and the dependent variable according to the decrypted second parameter.
10. The method according to claim 8 or 9, wherein before receiving the difference between the dependent variable and the mean of the dependent variable sent by the first data terminal, the method further comprises:
performing box separation processing on the independent variable;
receiving a sample identifier sent by a first data end and an encrypted sample label value corresponding to the sample identifier;
and counting the encrypted sample label value of each box according to the sample identification to obtain an encrypted sample label statistical value, and sending the encrypted sample label statistical value to a first data end for decryption to obtain the dependent variable.
11. A first data end, characterized by comprising a first processing module, a first sending module and a first receiving module;
the first processing module is used for obtaining a difference value between a dependent variable and a dependent variable mean value, and sending the difference value to a second data end through the first sending module, wherein the difference value is used for calculating a regression coefficient of a unary linear regression model constructed by an independent variable and the dependent variable;
the first receiving module is configured to receive a third parameter and an encrypted fourth parameter sent by a second data end, where the third parameter is obtained through calculation according to the regression coefficient and an independent variable mean value, and the fourth parameter is obtained through calculation according to the regression coefficient and an independent variable;
the first processing module is further used for obtaining a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value; obtaining an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtaining a dispersion square sum according to a difference value of the dependent variable and the dependent variable mean; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a second data end through the first sending module for decryption;
the first receiving module is further configured to receive the decrypted first correlation coefficient sent by the second data end, and output the first correlation coefficient.
12. A second data terminal, characterized by comprising a second processing module, a second sending module and a second receiving module;
the second receiving module is used for receiving a difference value between the dependent variable and the mean value of the dependent variable sent by the first data terminal;
the second processing module is used for calculating a regression coefficient of a unary linear regression model constructed by independent variables and dependent variables according to the difference value; obtaining a third parameter according to the regression coefficient and the independent variable mean value, obtaining a fourth parameter according to the regression coefficient and the independent variable, and encrypting the fourth parameter;
the second sending module is configured to send the third parameter and the encrypted fourth parameter to the first data end, where the third parameter and the encrypted fourth parameter are used to calculate an encrypted first correlation coefficient corresponding to the independent variable;
the second receiving module is further configured to receive the encrypted first correlation coefficient sent by the first data end, decrypt the encrypted first correlation coefficient through the second processing module, and send the decrypted first correlation coefficient to the first data end through the second sending module.
13. A univariate processing system is characterized by comprising a first data terminal and a second data terminal;
wherein the first data terminal is configured to perform the method according to any one of claims 1 to 7, and the second data terminal is configured to perform the method according to any one of claims 8 to 10.
14. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of any one of claims 1 to 10 when executing a program stored on a memory.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.
16. A variable screening method, characterized by being applied to a variable screening system, wherein the variable screening system comprises a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set; the method comprises the following steps:
a fifth data terminal acquires a first correlation coefficient of each independent variable and dependent variable in the first independent variable set;
the fifth data end obtains the difference value of the dependent variable and the mean value of the dependent variable and sends the difference value to the sixth data end;
a sixth data end receives the difference value, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable in the second independent variable set and the dependent variable according to the difference value;
the sixth data end obtains a third parameter according to the regression coefficient and the mean value of the independent variable, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end;
a fifth data terminal receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to a difference value between the dependent variable and the dependent variable mean value; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a sixth data end;
the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end;
a fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient; iteratively executing the step of obtaining the difference value between the dependent variable and the dependent variable mean value until a first correlation coefficient corresponding to each independent variable in the second independent variable set is output;
selecting independent variables of which the first correlation coefficients in the first independent variable set meet first preset conditions to form a first candidate data set, and selecting independent variables of which the first correlation coefficients in the second independent variable set meet the first preset conditions to form a second candidate data set;
selecting any independent variable in the first candidate data set as a target variable, taking other independent variables in the first candidate data set as first input variables, and taking the second candidate data set as second input variables;
the sixth data end obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the fifth data end;
the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient;
iteratively executing the step of selecting any independent variable in the first candidate data set as a target variable until a second correlation coefficient corresponding to each independent variable in the first candidate data set as the target variable is output;
selecting any independent variable in the second candidate data set as a target variable, taking other independent variables in the second candidate data set as third input variables, and taking the first candidate data set as fourth input variables;
the fifth data end obtains a fourth input variable calculation value according to a fourth input variable and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end;
the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variable and the third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient;
iteratively executing the step of selecting any independent variable in the second candidate data set as a target variable until a second correlation coefficient corresponding to each independent variable in the second candidate data set as the target variable is output;
and selecting an independent variable of which the second correlation coefficient in the first candidate data set meets a second preset condition to form a third candidate data set, selecting an independent variable of which the second correlation coefficient in the second candidate data set meets the second preset condition to form a fourth candidate data set, and forming a final candidate data set by the third candidate data set and the fourth candidate data set.
17. A variable screening method, characterized by being applied to a variable screening system, wherein the variable screening system comprises a fifth data end and a sixth data end, the fifth data end stores a dependent variable and a first independent variable set, and the sixth data end stores a second independent variable set; the method comprises the following steps:
selecting any independent variable in the first independent variable set as a target variable, taking other independent variables in the first independent variable set as first input variables, and taking the second independent variable set as second input variables;
the sixth data end obtains a second input variable calculation value according to a second input variable and a second model parameter, and sends the second input variable calculation value to the fifth data end;
the fifth data end obtains a target variable predicted value according to the second input variable calculation value, the first input variable and the first model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient;
iteratively executing the step of selecting any independent variable in the first independent variable set as a target variable until a second correlation coefficient corresponding to each independent variable in the first independent variable set as the target variable is output;
selecting any independent variable in the second independent variable set as a target variable, taking other independent variables in the second independent variable set as third input variables, and taking the first independent variable set as fourth input variables;
the fifth data end obtains a fourth input variable calculation value according to a fourth input variable and a fourth model parameter, and sends the fourth input variable calculation value to the sixth data end;
the sixth data end obtains a target variable predicted value according to the fourth input variable calculation value, the third input variable and the third model parameter; obtaining a residual square sum according to the target variable and the target variable predicted value, and obtaining a dispersion square sum according to the target variable and the target variable mean value; determining a second correlation coefficient corresponding to the target variable according to the residual square sum and the dispersion square sum, and outputting the second correlation coefficient;
iteratively executing the step of selecting any independent variable in the second independent variable set as a target variable until a second correlation coefficient corresponding to each independent variable in the second independent variable set as the target variable is output;
selecting independent variables of which the second correlation coefficients in the first independent variable set meet second preset conditions to form a fifth candidate data set, and selecting independent variables of which the second correlation coefficients in the second independent variable set meet the second preset conditions to form a sixth candidate data set;
a fifth data terminal acquires a first correlation coefficient of each independent variable and dependent variable in the fifth candidate data set;
the fifth data end obtains the difference value of the dependent variable and the mean value of the dependent variable and sends the difference value to the sixth data end;
a sixth data end receives the difference value, and calculates a regression coefficient of a unary linear regression model constructed by any independent variable and dependent variable in the sixth candidate data set according to the difference value;
the sixth data end obtains a third parameter according to the regression coefficient and the mean value of the independent variable, obtains a fourth parameter according to the regression coefficient and the independent variable, encrypts the fourth parameter, and sends the third parameter and the encrypted fourth parameter to the fifth data end;
a fifth data terminal receives the third parameter and the encrypted fourth parameter, obtains a constant term of the unary linear regression model according to the third parameter and the dependent variable mean value, obtains an encrypted residual square sum according to the encrypted fourth parameter, the constant term and the dependent variable, and obtains a dispersion square sum according to a difference value between the dependent variable and the dependent variable mean value; obtaining an encrypted first correlation coefficient corresponding to the independent variable according to the encrypted residual square sum and the encrypted dispersion square sum, and sending the encrypted first correlation coefficient to a sixth data end;
the sixth data end receives the encrypted first correlation coefficient, decrypts the encrypted first correlation coefficient, and sends the decrypted first correlation coefficient to the fifth data end;
a fifth data end receives the decrypted first correlation coefficient and outputs the first correlation coefficient;
iteratively executing the step of obtaining the difference value between the dependent variable and the dependent variable mean value until a first correlation coefficient corresponding to each independent variable in a sixth candidate data set is output;
and selecting an independent variable of which the first correlation coefficient in the fifth candidate data set meets a first preset condition to form a seventh candidate data set, selecting an independent variable of which the first correlation coefficient in the sixth candidate data set meets the first preset condition to form an eighth candidate data set, and forming a final candidate data set by the seventh candidate data set and the eighth candidate data set.
18. A variable screening system, characterized by comprising a fifth data end and a sixth data end; wherein the fifth data terminal and the sixth data terminal are adapted to perform the method according to claim 16 or 17.
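The binning exchange recited in claims 4, 5 and 10 (encrypted sample tag values are summed per box by the second data terminal and decrypted by the first data terminal) can be sketched with an additively homomorphic scheme. The claims do not name a cryptosystem; the toy Paillier scheme below, its tiny primes, and all function names are assumptions for illustration only and are not secure.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy Paillier key pair (tiny primes, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n, m, rng):
    """Encrypt integer m (0 <= m < n) under public key n, with g = n + 1."""
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Decrypt ciphertext c with the private key."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n


def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


def encrypted_bin_sums(n, bin_by_id, enc_label_by_id):
    """Second data terminal: sum encrypted label values per bin by sample id."""
    totals = {}
    for sample_id, c in enc_label_by_id.items():
        b = bin_by_id[sample_id]
        totals[b] = add(n, totals[b], c) if b in totals else c
    return totals


# First data terminal: generate the first key pair and encrypt the labels.
rng = random.Random(0)
n, priv = keygen()
labels = {"s1": 1, "s2": 0, "s3": 1, "s4": 1}
enc_labels = {sid: encrypt(n, y, rng) for sid, y in labels.items()}

# Second data terminal: knows only the bin of each sample and the ciphertexts.
bins = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
enc_stats = encrypted_bin_sums(n, bins, enc_labels)

# First data terminal: decrypt the per-bin label statistics.
stats = {b: decrypt(n, priv, c) for b, c in enc_stats.items()}
```

Here `stats` holds the number of positive labels in each bin, which the first data terminal could then use as the dependent variable. A real deployment would use a vetted homomorphic encryption library with full-size keys.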
CN202210418824.5A 2022-04-20 2022-04-20 Single variable processing method and variable screening method Pending CN114692089A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210418824.5A CN114692089A (en) 2022-04-20 2022-04-20 Single variable processing method and variable screening method

Publications (1)

Publication Number Publication Date
CN114692089A true CN114692089A (en) 2022-07-01

Family

ID=82145711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210418824.5A Pending CN114692089A (en) 2022-04-20 2022-04-20 Single variable processing method and variable screening method

Country Status (1)

Country Link
CN (1) CN114692089A (en)

Similar Documents

Publication Publication Date Title
EP3965023A1 (en) Method and device for constructing decision trees
WO2021179720A1 (en) Federated-learning-based user data classification method and apparatus, and device and medium
US20210409191A1 (en) Secure Machine Learning Analytics Using Homomorphic Encryption
CN111723404B (en) Method and device for jointly training business model
WO2015155896A1 (en) Support vector machine learning system and support vector machine learning method
CN111563267B (en) Method and apparatus for federal feature engineering data processing
CN113542228B (en) Data transmission method and device based on federal learning and readable storage medium
Ke et al. Steganography security: Principle and practice
CN114401079A (en) Multi-party joint information value calculation method, related equipment and storage medium
WO2021106077A1 (en) Update method for neural network, terminal device, calculation device, and program
CN112101531B (en) Neural network model training method, device and system based on privacy protection
CN115049070A (en) Screening method and device of federal characteristic engineering data, equipment and storage medium
CN113609781A (en) Automobile production mold optimization method, system, equipment and medium based on federal learning
CN111522973A (en) Privacy protection image retrieval method fusing compressed sensing
CN114881247A (en) Longitudinal federal feature derivation method, device and medium based on privacy computation
CN114417388B (en) Power load prediction method, system, equipment and medium based on longitudinal federal learning
CN115333775A (en) Data processing method and device based on privacy calculation, equipment and storage medium
CN111177740A (en) Data confusion processing method, system and computer readable medium
CN113935050A (en) Feature extraction method and device based on federal learning, electronic device and medium
CN113033824B (en) Model hyper-parameter determination method, model training method and system
CN113792892A (en) Federal learning modeling optimization method, apparatus, readable storage medium, and program product
CN111523674A (en) Model training method, device and system
CN114692089A (en) Single variable processing method and variable screening method
US20210266383A1 (en) Conversion system, method and program
CN112948883A (en) Multi-party combined modeling method, device and system for protecting private data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination