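WO2022247620A1 - Privacy-protecting method and apparatus for determining the effective value of a service data feature - Google Patents
Privacy-protecting method and apparatus for determining the effective value of a service data feature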
- Publication number
- WO2022247620A1 (PCT/CN2022/091637)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- slices
- matrix
- multiple participants
- participant
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/067—Enterprise or organisation modelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- the embodiment provides a privacy-protecting method for determining the effective value of a service data feature.
- the service data is distributed among multiple participants, and the respective service data of the multiple participants constitute joint data under the assumption of concatenation.
- the joint data includes feature values of multiple objects for multiple feature items; the method is executed by the device of any first participant and includes: obtaining the first participant's joint data slice, the predicted value slices corresponding to the multiple objects, and the model parameter slices corresponding to the multiple feature items, where the predicted value slices and the model parameter slices are obtained from the trained business prediction model;
- using secure multi-party computation, through interaction between the multiple participant devices, determining the correlation data slices corresponding to the multiple participants based on their joint data slices and predicted value slices, where the correlation data includes correlation data between the multiple feature items;
- using a significance test method, through secure interaction between the multiple participant devices, determining, based on the multiple participants' model parameter slices and the corresponding data in the correlation data slices, the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model.
- the step of obtaining the joint data slice of the first participant includes: using additive secret sharing, through interaction with the other participant devices, performing splitting and splicing operations on the business data of the multiple participants, so that each participant obtains its own joint data slice; the joint data slices of the multiple participants would yield the joint data if they were reconstructed.
- the step of obtaining the predicted value slices corresponding to the multiple objects and the model parameter slices corresponding to the multiple feature items includes: acquiring the first participant device's local model parameter slice of the trained business prediction model; and, through interaction between the multiple participant devices, based on the multiple participants' joint data slices and the trained business prediction model, enabling the multiple participants to each determine the predicted value slices of the objects.
- the correlation data includes covariance matrix data
- the correlation data slices include covariance matrix slices
- the step of determining the correlation data slices corresponding to the multiple participants includes: determining the intermediate matrix slices corresponding to the multiple participants based on the multiple participants' joint data slices and predicted value slices and on the functional relationship in the business prediction model; and, based on the multiple participants' intermediate matrix slices, computing the slices of the inverse of the intermediate matrix corresponding to the multiple participants, thereby obtaining the covariance matrix slices corresponding to the multiple participants.
- the step of determining the Hessian matrix slices corresponding to the multiple participants includes: using secret sharing multiplication, based on the expression of the predicted value matrix, multiplying the corresponding vector elements of the multiple participants' predicted value slices, so that each participant obtains an intermediate vector slice; constructing the first participant's diagonal predicted value matrix slice by using the elements of the first participant's intermediate vector slice as diagonal elements; and determining the Hessian matrix slices corresponding to the multiple participants based on their joint data slices, their predicted value matrix slices, and the Hessian matrix expression.
- the step of determining the Hessian matrix slices corresponding to the multiple participants based on the joint data slices, the predicted value matrix slices, and the Hessian matrix expression includes: when performing the secure multiplication of the multiple participants' joint data slices and predicted value matrix slices, securely multiplying each column vector in the joint data slice by the corresponding diagonal elements in the predicted value matrix slice.
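The column-wise multiplication described above can be illustrated in plaintext (this is not the secure protocol itself). The sketch assumes the usual logistic-regression form of the Hessian, H = Xᵀ·diag(ŷ ⊙ (1 − ŷ))·X, which the text does not spell out; the function names `hessian_dense` and `hessian_scaled` are hypothetical.

```python
import numpy as np

def hessian_dense(X, y_hat):
    """H = X^T diag(a) X with an explicit N x N diagonal matrix (assumed expression)."""
    a = y_hat * (1.0 - y_hat)          # element-wise product of predicted values
    return X.T @ np.diag(a) @ X

def hessian_scaled(X, y_hat):
    """Same H, but each column of X is multiplied element-wise by the diagonal entries."""
    a = y_hat * (1.0 - y_hat)
    return X.T @ (X * a[:, None])      # no N x N matrix is ever built

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # 6 objects, 3 feature items
w = rng.normal(size=3)                 # illustrative model parameters
y_hat = 1.0 / (1.0 + np.exp(-(X @ w))) # logistic-regression predicted values

H_dense = hessian_dense(X, y_hat)
H_scaled = hessian_scaled(X, y_hat)
```

Both functions return the same D x D matrix; the second form needs only N·D element-wise products for the diagonal step, which is presumably why the column-wise variant is attractive when every multiplication is a secure one.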
- the step of computing the slices of the inverse of the intermediate matrix based on the multiple participants' intermediate matrix slices, and obtaining the covariance matrix slices corresponding to the multiple participants, includes: using the secret sharing matrix inversion algorithm (SMI), obtaining the covariance matrix slices corresponding to the multiple participants through iterative computation based on the multiple participants' intermediate matrix slices.
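The text does not state which iteration SMI uses. As one plausible illustration, the plaintext sketch below uses the Newton-Schulz iteration, which needs only matrix additions and multiplications, the operations that secret sharing handles naturally. The choice of iteration, the starting value, and the name `newton_schulz_inverse` are assumptions, not the patent's algorithm.

```python
import numpy as np

def newton_schulz_inverse(A, iterations=30):
    """Approximate A^{-1} with X_{k+1} = X_k (2I - A X_k), using only + and matmul."""
    n = A.shape[0]
    # Classical starting value that makes the iteration converge for nonsingular A.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    two_I = 2.0 * np.eye(n)
    for _ in range(iterations):
        X = X @ (two_I - A @ X)
    return X

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3))
H = M.T @ M + np.eye(3)            # a well-conditioned stand-in for the intermediate matrix
cov = newton_schulz_inverse(H)     # plays the role of the covariance matrix
```

In a secret-shared setting each matrix product in the loop would be replaced by a secure multiplication on slices, so the iteration count directly determines the number of interaction rounds.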
- the step of determining the effective value of the feature item corresponding to the model parameter in improving the effect of the business prediction model includes: taking the diagonal elements of the multiple participants' covariance matrix slices as the variance slices corresponding to the multiple model parameters; for any model parameter, using the secret sharing inverse square root algorithm (SNSI) and the significance test method, based on the first participant's corresponding model parameter slice and corresponding variance slice, jointly performing the secure inverse square root operation through interaction between the multiple participant devices, and determining the first participant's significance test value slice for the model parameter; and determining the effective value of the feature item corresponding to the model parameter based on the multiple participants' significance test value slices for the model parameter.
- the method further includes: for any first feature item, obtaining the effective value slices of the first feature item from the other participant devices; and determining the reconstructed effective value of the first feature item based on the local effective value slice and the obtained effective value slices.
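A plaintext sketch of the significance test and the final reconstruction may help. It assumes a Wald-type statistic, z = w · (1/√variance), with a two-sided normal p-value; the text only says "significance test method" and "inverse square root", so the exact statistic, the sample numbers, and the name `wald_statistics` are assumptions.

```python
import math
import numpy as np

def wald_statistics(w, cov):
    """Return z = w / sqrt(diag(cov)) and two-sided normal p-values (assumed test)."""
    var = np.diag(cov)                       # variances are the diagonal of the covariance matrix
    z = w * (1.0 / np.sqrt(var))             # parameter times inverse square root of its variance
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return z, p

w = np.array([2.0, 0.0, -3.0])               # illustrative model parameters
cov = np.diag([1.0, 4.0, 0.25])              # illustrative covariance matrix
z, p = wald_statistics(w, cov)

# Reconstruction from two additive slices: the value is the sum of the slices.
rng = np.random.default_rng(2)
z_slice_a = rng.normal(size=z.shape)         # slice held by participant A
z_slice_b = z - z_slice_a                    # slice held by participant B
z_reconstructed = z_slice_a + z_slice_b
```

A small p-value suggests the feature item contributes to the model; in the secure version only the final slices would be exchanged, never the parameters or variances themselves.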
- the object includes one of users, commodities, and events; the feature items include at least one of the following: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used to make business predictions about objects.
- the service prediction model is obtained based on a logistic regression model.
- the embodiment provides a privacy-protecting device for determining the effective value of service data features.
- the service data is distributed among multiple participants, and the respective service data of the multiple participants constitute joint data under the assumption of concatenation.
- the joint data includes feature values of multiple objects for multiple feature items;
- the device is deployed in the device of any first participant and includes: an acquisition module configured to acquire the first participant's joint data slice, the predicted value slices corresponding to the multiple objects, and the model parameter slices corresponding to the multiple feature items, where the predicted value slices and the model parameter slices are obtained from the trained business prediction model;
- an interaction module configured to use secure multi-party computation, through interaction between the multiple participant devices, to determine the correlation data slices corresponding to the multiple participants based on their joint data slices and predicted value slices, where the correlation data includes correlation data between the multiple feature items;
- a verification module configured to use a significance test method, through secure interaction between the multiple participant devices, to determine, based on the multiple participants' model parameter slices and the corresponding data in the correlation data slices, the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model.
- the acquisition module, when acquiring the joint data slice of the first participant, is configured to: use additive secret sharing and, through interaction with the other participant devices, perform splitting and splicing operations on the business data of the multiple participants, so that each participant obtains its own joint data slice; the joint data slices of the multiple participants would yield the joint data if they were reconstructed.
- the service prediction model is obtained through security joint training based on joint data slices of multiple participants; the service prediction model is used to perform service prediction on objects.
- the acquisition module, when acquiring the predicted value slices corresponding to the multiple objects and the model parameter slices corresponding to the multiple feature items, is configured to: acquire the first participant device's local model parameter slice of the trained business prediction model; and, through interaction between the multiple participant devices, based on the multiple participants' joint data slices and the trained business prediction model, enable the multiple participants to each determine the predicted value slices of the objects.
- the correlation data includes covariance matrix data
- the correlation data slices include covariance matrix slices
- the interaction module includes: a determination submodule configured to determine the intermediate matrix slices corresponding to the multiple participants based on the multiple participants' joint data slices and predicted value slices and on the functional relationship in the business prediction model;
- a calculation submodule configured to compute, based on the multiple participants' intermediate matrix slices, the slices of the inverse of the intermediate matrix corresponding to the multiple participants, thereby obtaining the covariance matrix slices corresponding to the multiple participants.
- the embodiment provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed in a computer, the computer is instructed to execute the method described in any one of the first aspect.
- the embodiment provides a computing device, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method described in any one of the first aspect is implemented.
- multiple participants can obtain correlation data slices, and then use the model parameter slices and the correlation data slices to determine the effective value of each feature item in improving the model effect.
- the multiple participants perform secure multi-party computation on the various data slices, and the results they obtain are also slices.
- private data such as the correlation data between feature items is never reconstructed, which improves data privacy and security during processing.
- FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification
- FIG. 2 is a schematic flowchart of a method for determining an effective value of a service data feature to protect privacy provided by this embodiment
- Fig. 4 is a schematic block diagram of an apparatus for determining an effective value of a feature of service data to protect privacy provided by an embodiment.
- Fig. 1 is a schematic diagram of an implementation scene of an embodiment disclosed in this specification.
- the data set is jointly provided by multiple participants 1, 2, ..., W (W is a natural number); each participant owns part of the data in the data set, and that part constitutes the participant's business data (i.e., the original matrix).
- the data set can be a training data set for training the model, a test data set for testing the model, or a data set to be predicted.
- the data set may include characteristic data of an object, and the object may be one of various business objects to be analyzed such as users, commodities, and events.
- the above-mentioned model may include a service prediction model trained by machine learning.
- each participant has different feature data for all objects.
- each participant has the same N object samples, and the private data of each sample contains D features; the D features are distributed among the W participants, and each participant has D/W features.
- the two platforms have the same batch of users, but the user characteristics in their business data are different.
- the types of features owned by each participant are different, and the number of features owned may be the same (for example, they each have D/W features), or they may be different.
- N, D and W are all natural numbers. This is a scenario where the data in the dataset is vertically split. Table 1 shows the distribution of business data in the scenario of vertical data splitting.
- xx represents a specific characteristic value, which belongs to the private data of the participant.
- Each row in Table 1 represents a piece of sample data, each column represents the characteristic value of a feature item of N objects, and D feature items belong to W participants.
- the feature values of the D feature items of the N objects constitute all the business data.
- each participant has all the characteristic data of different objects. For example, there are a total of N object samples, and the business data of each sample contains D feature items. These N pieces of business data are distributed among W participants, and each participant owns a part of all N samples. The samples contain the same feature items. The number of object samples stored by different parties can be the same or different. For another example, there are two banks that serve different user groups, but they both have the same user credit characteristics. This is the scenario of horizontal data segmentation in the data set, and Table 2 shows the business data distribution of the horizontal data segmentation scenario.
- xx represents a specific characteristic value, which belongs to the private data of the participant.
- Each row in Table 2 represents a piece of sample data, and each column represents the feature value of a feature item of N objects, and N pieces of sample data belong to W participants. Different parties have different object samples.
- the feature values of the D feature items of the N objects constitute all the business data.
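The two partitioning scenarios can be made concrete with a toy matrix. This is a minimal sketch with made-up sizes (N = 4 objects, D = 6 feature items, two participants); the variable names are hypothetical, and the plaintext concatenation is shown only to illustrate the hypothetical joint data, which real participants never assemble.

```python
import numpy as np

N, D = 4, 6
X = np.arange(N * D, dtype=float).reshape(N, D)   # rows are objects, columns are feature items

# Vertical partitioning (Table 1): same objects, different feature items per participant.
X_A_vertical, X_B_vertical = X[:, :3], X[:, 3:]
joint_from_vertical = np.hstack([X_A_vertical, X_B_vertical])   # column-wise concatenation

# Horizontal partitioning (Table 2): same feature items, different objects per participant.
X_A_horizontal, X_B_horizontal = X[:1], X[1:]
joint_from_horizontal = np.vstack([X_A_horizontal, X_B_horizontal])   # row-wise concatenation
```

In both cases the hypothetical joint data is the same N x D matrix; only the axis along which the participants' blocks are laid out differs, and the participants may hold blocks of unequal size.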
- the business data owned by the participants may include multiple feature items.
- the feature item of the object may include at least one of the following: basic attribute information of the object, association relationship information, interaction information, historical behavior information, and the like.
- basic attribute information may include the user's gender, age, income, etc.
- the user's association relationship information may include other users, companies, regions, etc. that are associated with the user
- the user's interaction information may include information such as clicks, views, and participation in certain activities performed by the user on a certain website
- historical behavior information of users may include historical transaction behaviors, payment behaviors, and purchase behaviors of users.
- its basic attribute information can include the category, place of origin, price, etc. of the commodity, and the relationship information of the commodity can include users, shops, or other commodities that are associated with the commodity, and the interaction information of the commodity can include user 1.
- the historical behavior information of the product can include information such as the purchase, transfer, and return of the product.
- the event may include a transaction event, a login event, a purchase event, a social event, and the like.
- the basic attribute information of an event can be text information used to describe the event; the association relationship information can include text that has a contextual relationship with the event and information on other events related to the event; and the historical behavior information can include records of how the event develops and changes over time.
- Each participant may correspond to a different service platform, and the service platform may include various enterprises, institutions, organizations, and the like.
- Business data is often the private data of the service platform, and requires high privacy and security during processing.
- the feature values (ie, feature data) corresponding to the feature items of the object belong to private data and can be stored as a private data matrix.
- each participant needs to keep its private data locally, and does not output plaintext data or perform plaintext aggregation.
- each participant can adopt a secure multi-party computation method, use its own predicted values and original matrix, and interact with a third party so that the third party obtains covariance matrix data representing the correlation data among multiple feature items.
- the third party uses the covariance matrix data and model parameters to determine the effective value of the feature items corresponding to the model parameters in improving the effect of the business prediction model by using the significance test method.
- each participant stores its own data slices, including its joint data slice, the predicted value slices corresponding to multiple objects, and the model parameter slices corresponding to multiple feature items.
- the multiple participant devices interact based on secure multi-party computation and use the joint data slices and predicted value slices to determine the correlation data slices corresponding to the multiple participants.
- the correlation data includes the correlation data between multiple feature items; using the significance test method, and based on the corresponding data in the multiple participants' model parameter slices and correlation data slices, each participant determines the effective value of the feature items in improving the effect of the business prediction model.
- the multiple participants perform secure multi-party computation on the various data slices.
- the correlation data obtained is also in the form of slices, and private data such as the correlation data between feature items is never reconstructed; data privacy and security during processing can therefore be improved.
- participant equipment includes but is not limited to any device, equipment, platform, equipment cluster, etc. with computing and processing capabilities. Embodiments of the present invention will be described below in combination with specific embodiments.
- FIG. 2 is a schematic flowchart of a privacy-protecting method for determining an effective value of a feature of service data provided by this embodiment.
- Business data is distributed among multiple participants, and the respective business data of multiple participants constitute joint data under the assumption of concatenation.
- the business data of the participants is private data with high privacy, and the business data will not be sent in clear text between multiple participants, and the business data will not be truly spliced to form joint data.
- the joint data is merely a hypothetical data set formed from the business data of multiple participants.
- Table 1 and Table 2 are the specific forms of joint data in the scenarios of vertical data segmentation and horizontal data segmentation respectively.
- the joint data includes feature values of multiple objects for multiple feature items, for example, may include feature values of N objects for D feature items, where N and D are both natural numbers.
- the following examples mostly use two participants as examples for illustration.
- the two participants are a first participant A and a second participant B respectively, the first participant A corresponds to the first participant device, and the second participant B corresponds to the second participant device.
- the participant's device is used to perform the operations of the participant and store the data of the participant.
- the participant device may also obtain the participant's data from other devices.
- the method in this embodiment specifically includes the following steps S210-S230.
- in step S210, the first participant device acquires the joint data slice of the first participant A, the predicted value slices corresponding to multiple objects, and the model parameter slices corresponding to multiple feature items.
- the second participant device obtains the joint data slice of the second participant B, obtains the predicted value slices corresponding to multiple objects, and the model parameter slices corresponding to multiple feature items respectively.
- the feature items among multiple participants are different, but the objects are the same.
- Multiple participants can respectively represent their original data in original matrices.
- $X = (X_A, X_B)$
- the columns in the original matrix represent feature items and the rows represent samples, corresponding to the data distribution in Table 1.
- the columns in the original matrix can represent objects, and the rows represent feature items.
- the business data of multiple parties such as the first party A and the second party B are hypothetically spliced vertically
- the joint data can be obtained in the form of
- the feature items among multiple participants are the same, but the objects are different.
- the original matrices of the first participant A and the second participant B are $X_A$ and $X_B$ respectively
- the numbers of objects are $n_A$ and $n_B$ respectively
- the joint data can be obtained by hypothetical horizontal splicing of the business data of multiple parties such as the first party A and the second party B, in the form of
- multiple participants can use additive secret sharing to split each participant's business data into random numbers, and complete the slicing by passing the random numbers among the participants.
- when the first participant device obtains the joint data slice of the first participant A, it can use additive secret sharing and, through interaction with the other participant devices, perform splitting and splicing operations on the business data of the multiple participants, so that each participant obtains its own joint data slice.
- the second participant B also obtains its joint data fragments.
- the secret sharing addition can split the original matrix into random matrices, and complete the fragmentation by passing the random matrices among multiple participants.
- the first participant A and the second participant B own the original matrices $X_A$ and $X_B$ of business data, respectively.
- the first participant device can splice $R_A$ and the received $X_3$ sent by the second participant device into its joint data slice, and the second participant device can splice $R_B$ and the received $X_2$ sent by the first participant device into its joint data slice.
- the data sent between multiple participants is a random matrix, and the private data of the original matrix is not revealed.
- the joint data slices of the multiple participants would yield the joint data if they were reconstructed.
- the reconstruction can be realized by adding the data slices of all participants.
- the reconstruction may also involve other matrix transformation operations in addition to the addition.
- the matrix transformation includes, for example, multiplication by preset values.
- the joint data contains private data, and each participant does not directly aggregate the private data in plaintext.
- the joint data is only a representation of a hypothetical situation; in practice the data slices of the participants are never directly reconstructed together. References to reconstruction in the following text are to be understood in this sense.
- the joint data slice of the first participant A can be represented by $\langle X \rangle_A$
- the joint data slice of the second participant B can be represented by $\langle X \rangle_B$
- the joint data $X = \langle X \rangle_A + \langle X \rangle_B$
- $\langle X \rangle$ represents a slice of the parameter $X$
- its subscript represents the participant to which the slice belongs.
- angle brackets with a subscript are used in the following to denote the slice of a piece of data held by a given participant.
- the joint data sharding of a participant is obtained based on the business data of multiple participants, and the sum of the joint data shards of multiple participants is conceptually or theoretically equal to the joint data.
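The splitting and splicing can be sketched for the vertically partitioned, two-participant case. The names `R_A`, `R_B`, `X_2`, and `X_3` follow the text, but their definitions here (`X_2 = X_A - R_A`, `X_3 = X_B - R_B`) are assumptions, since the visible text does not define them. Floating-point masks are used only for readability; a real protocol would typically draw uniform masks over a finite ring.

```python
import numpy as np

rng = np.random.default_rng(3)
X_A = rng.integers(0, 10, size=(4, 2)).astype(float)   # A's original matrix (4 objects, 2 features)
X_B = rng.integers(0, 10, size=(4, 3)).astype(float)   # B's original matrix (4 objects, 3 features)

# Each participant masks its matrix with a local random matrix and sends only the remainder.
R_A = rng.normal(size=X_A.shape)
X_2 = X_A - R_A            # sent from A to B (assumed definition)
R_B = rng.normal(size=X_B.shape)
X_3 = X_B - R_B            # sent from B to A (assumed definition)

# Splicing: each participant concatenates its own random part with the part it received.
slice_A = np.hstack([R_A, X_3])    # <X>_A
slice_B = np.hstack([X_2, R_B])    # <X>_B

# Hypothetical reconstruction, shown only to check the identity X = <X>_A + <X>_B.
X_joint = np.hstack([X_A, X_B])
```

Neither slice on its own contains an unmasked block of the other participant's data, while the element-wise sum of the two slices equals the hypothetical joint data.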
- the predicted value slice and the model parameter slice are based on the data obtained from the trained service forecast model.
- the business prediction model is a model obtained through security joint training based on joint data shards of multiple participants.
- the business forecasting model can be pre-trained.
- the business prediction model may be a model trained based on a logistic regression model, or may be trained based on other types of models.
- the business prediction model is used to perform business prediction on the object, for example, classification prediction or regression prediction can be performed on the characteristic data of the input object.
- Multiple participant devices can obtain predicted value slices and model parameter slices through the trained business forecasting model.
- the first participant device can obtain its local model parameter slice of the trained business prediction model and, through secure interaction between the multiple participant devices, based on the multiple participants' joint data slices and the trained business prediction model, the multiple participants can each determine the predicted value slices of the objects.
- Multiple participant devices use the N objects in the joint data shard as samples to train the service prediction model. After training, the model parameter fragmentation of the service prediction model in the participant's device can be obtained. Through the secure interaction between multiple participant devices, the joint data fragments of multiple participants are input into the service prediction model, and each participant device can determine the prediction value fragments of multiple objects of this participant.
- the trained service prediction model includes a plurality of model parameters, which respectively correspond to the D feature items.
- the corresponding predicted value slices owned by multiple participants would yield the predicted value data if reconstructed.
- the corresponding model parameter slices owned by multiple participants would yield the model parameters if reconstructed.
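- The relationship between slices and the value they would reconstruct can be illustrated with a minimal additive secret sharing sketch. The ring size `MODULUS` and the helper names `split_into_slices` and `reconstruct` are assumptions made for illustration; they are not taken from the source, and the method itself never reconstructs private data.

```python
import secrets

MODULUS = 2**64  # hypothetical ring size, chosen only for illustration


def split_into_slices(value, num_parties=2):
    """Split an integer into additive slices that sum to value mod MODULUS."""
    slices = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    last = (value - sum(slices)) % MODULUS
    return slices + [last]


def reconstruct(slices):
    """Sum the slices; shown only to illustrate the 'assumed reconstruction'."""
    return sum(slices) % MODULUS


# Each participant holds one slice; a single slice reveals nothing about 42.
slice_a, slice_b = split_into_slices(42)
```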
- Step S220: using secure multi-party computation, through interaction between multiple participant devices, based on the joint data slices and predicted value slices of multiple participants, determine the correlation data slices corresponding to the multiple participants, which include correlation data between multiple feature items.
- the correlation data slices of multiple participants would yield the correlation data if reconstructed, that is, the correlation data between feature items. This includes correlation data between feature items owned by the same participant and between feature items owned by different participants, as well as correlation data between different feature items and between a feature item and itself.
- the joint data slices and predicted value slices can be used to determine, through secure multi-party computation, the correlation data slices corresponding to multiple participants.
- Formulas that can represent correlation data between feature items include covariance matrix formulas, correlation coefficient formulas, and so on.
- Secure multi-party computation is an existing data privacy protection technology for computations involving multiple parties. Its specific implementations include homomorphic encryption, garbled circuits, oblivious transfer, secret sharing and other techniques.
- secure multi-party computation enables secure interactive calculation on the joint data slices and predicted value slices among multiple participant devices, so that the multiple participants can determine their corresponding correlation data slices.
- Step S230: using the significance test method, through secure interaction between multiple participant devices, based on the corresponding data in the model parameter slices and correlation data slices of multiple participants, determine the effective value of the feature items corresponding to the model parameters in improving the effect of the business prediction model.
- the significance test method may include the Wald test, the likelihood ratio (LR) test, the Lagrange multiplier (LM) test and the like.
- feature items correspond to model parameters
- data corresponding to feature items exist in both the model parameter slice and the correlation data slice.
- the significance test method can be used to determine the significance test value slices corresponding to multiple model parameters, that is, the significance test value slices of the corresponding feature items, and the effective value slices can be determined based on the significance test value slices.
- the first participant device can obtain the effective value slices of the first feature item from the other participant devices, and determine the reconstructed effective value of the first feature item based on its local effective value slice and the obtained effective value slices.
- the effective value of the characteristic item may also be reconstructed in the second participant device or other participant devices, and this embodiment only takes the reconstruction of the effective value in the first participant device as an example for illustration.
- the first participant device may also, based on the multiple effective values, remove from the multiple feature items those whose effective values do not meet the preset conditions, so that the multiple participants use the business data with these feature items removed for secure joint training of the business prediction model.
- the business data after removing the characteristic items realizes the dimension reduction processing of the original matrix, which makes the characteristic items more refined, and at the same time ensures that the privacy data is not leaked.
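- The feature removal described above can be sketched in plaintext as follows. The function name `drop_ineffective_features` and the preset condition "effective value at least `threshold`" are assumptions for illustration; the source does not fix a particular condition.

```python
def drop_ineffective_features(rows, feature_names, effective_values, threshold=1):
    """Keep only the feature columns whose effective value meets the threshold."""
    keep = [j for j, value in enumerate(effective_values) if value >= threshold]
    kept_names = [feature_names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


# Hypothetical data: three feature items, the second one is not significant.
names, data = drop_ineffective_features(
    rows=[[1.0, 5.0, 0.2], [0.5, 4.0, 0.9]],
    feature_names=["f1", "f2", "f3"],
    effective_values=[1, 0, 1],
)
```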
- the following describes how the correlation data slices are determined in step S220, and the specific implementation of determining the effective value of a feature item in step S230.
- the calculation formula of the predicted value includes:
$$\pi(X) = \frac{1}{1 + e^{-\beta^{T} X}} \tag{1}$$
- X is the feature data of the sample, which can be used as the independent variable
- π(X) is the predicted value function of the sample, which can be used as the dependent variable
- β is the model parameter, which is the coefficient of the feature item
- e is the natural constant.
- Wald_k is the significance test value, which follows a chi-square distribution with 1 degree of freedom:
$$Wald_k = \left( \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \right)^{2} \tag{2}$$
- here $\hat{\beta}_k$ is the estimated model parameter and $SE(\hat{\beta}_k)$ is its standard deviation.
- the standard deviation of $\hat{\beta}_k$ is also equal to the square root of the corresponding diagonal element of the covariance matrix:
$$SE(\hat{\beta}_k) = \sqrt{Cov(\hat{\beta})_{kk}}$$
- the diagonal elements of the covariance matrix are the variances of the feature items.
- the covariance matrix of the model parameters is the inverse of the negative Hessian matrix of the log-likelihood function, evaluated at $\hat{\beta}$.
- X_i represents the feature data of the i-th sample.
- N is the total number of samples, that is, the total number of objects
- D is the dimension of the feature data
- π(X_N) is the predicted value of the logistic regression model for sample X_N
- M is a diagonal matrix obtained based on the predicted values, also known as the predictor matrix.
- when the p value is less than the significance level threshold, the null hypothesis is rejected, the model parameter can be retained for modeling, and the effective value of the feature item corresponding to the model parameter can be taken as 1 or another higher value; when the p value is not less than the significance level threshold, the null hypothesis is accepted, the model parameter is not retained, and the effective value of the feature item corresponding to the model parameter can be taken as 0 or another lower value.
- the significance level threshold can usually be 0.05 or 0.01, etc.
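- A plaintext sketch of this decision rule is given below, assuming the usual form of the Wald statistic (the squared ratio of the parameter estimate to its standard deviation) and a chi-square distribution with 1 degree of freedom. The function names and the 0/1 encoding of the effective value are illustrative assumptions; in the method itself these quantities are only handled as slices.

```python
import math


def wald_statistic(beta_k, std_err_k):
    """Wald statistic for one coefficient: (estimate / standard deviation)^2."""
    return (beta_k / std_err_k) ** 2


def chi2_1df_p_value(wald):
    """Upper tail of chi-square with 1 d.o.f.: P(|Z| > sqrt(w)) = erfc(sqrt(w / 2))."""
    return math.erfc(math.sqrt(wald / 2.0))


def effective_value(beta_k, std_err_k, alpha=0.05):
    """Return 1 if the null hypothesis beta_k = 0 is rejected at level alpha, else 0."""
    p_value = chi2_1df_p_value(wald_statistic(beta_k, std_err_k))
    return 1 if p_value < alpha else 0
```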
- Logistic regression analysis is a statistical method for analyzing independent variables and dependent variables and clarifying the relationship between them.
- the established regression equation is meaningful only when there is indeed some relationship between the independent variable and the dependent variable. Therefore, whether the factor as an independent variable is related to the predicted object as a dependent variable, what is the degree of correlation, and how sure is it to judge the degree of correlation are the problems to be solved by regression analysis.
- Logistic regression analysis can use the Wald test to test the coefficients of the regression terms one by one. If the Wald test shows that the coefficients of particular independent variables are significant, those variables should be included in the model. If the Wald test indicates that they are not significant, those independent variables can be omitted from the model.
- the model parameters of the business prediction model can be evaluated by using logistic regression analysis and Wald test, and then based on the evaluation results, the feature items of the object samples can be screened to achieve the purpose of dimensionality reduction processing of business data.
- the correlation data includes covariance matrix data
- the correlation data slices include covariance matrix slices.
- the covariance matrix slices of multiple participants would yield the covariance matrix if reconstructed.
- the covariance matrix is a matrix composed of the covariances between pairs of feature items among the multiple feature items in the joint data.
- the elements on the main diagonal are the variances of the feature items, and the off-diagonal elements are the covariances between pairs of feature items.
- the covariance matrix is a symmetric matrix. When there are D feature items in the joint data, the covariance matrix can be a D*D-dimensional symmetric matrix.
- the participant devices of multiple participants can perform the following steps 1 and 2.
- Step 1: based on the joint data slices and predicted value slices of multiple participants, as well as the functional relationship in the business prediction model, determine the intermediate matrix slices corresponding to the multiple participants. For example, the first participant A obtains the intermediate matrix slice <H>_A and the second participant B obtains the intermediate matrix slice <H>_B; the intermediate matrix slices would yield the intermediate matrix H if reconstructed. The participants do not actually reconstruct the intermediate matrix; this only expresses the relationship between the intermediate matrix slices.
- Step 2: based on the intermediate matrix slices of multiple participants, calculate the slices of the inverse of the intermediate matrix corresponding to the multiple participants, and obtain the covariance matrix slices corresponding to the multiple participants.
- the first participant A obtains the slice <H^-1>_A of the inverse of the intermediate matrix
- the second participant B obtains the slice <H^-1>_B of the inverse of the intermediate matrix
- the slices of the inverse of the intermediate matrix would yield the inverse H^-1 of the intermediate matrix if reconstructed.
- The participants do not actually reconstruct the slices of the inverse of the intermediate matrix; this only expresses the relationship between these slices.
- in step 1, when determining the intermediate matrix slices corresponding to multiple participants, the Hessian matrix slices corresponding to the multiple participants can be determined as the intermediate matrix slices, based on the joint data slices and predicted value slices of the multiple participants and on the Hessian matrix expression obtained from the functional relationship in the business prediction model; the Hessian matrix expression includes the joint data matrix and the predictor matrix.
- the functional relational expression of the business prediction model, that is, the functional relational expression of the model's predicted value, is shown in formula (1) above.
- the corresponding model parameters are obtained, for example β.
- the Hessian matrix expression is in fact the second derivative with respect to the model parameter β. From formulas (1) to (5) above, the expression of the Hessian matrix obtained based on the functional relationship in the business prediction model is
$$H = X^{T} M X \tag{9}$$
- Through secure interaction between the devices of multiple participants, based on the joint data slices <X> owned by the multiple participants and the slices of the matrix M obtained from the slices of the predicted values π(X_N), formula (9) enables the multiple participants to determine their respective slices of H, which are used as the intermediate matrix slices. Here M can be called the predictor matrix.
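- For orientation, a plaintext (non-private) sketch of the quantity the participants compute jointly is given below, assuming the standard logistic regression form H = X^T M X in which the i-th diagonal element of M is the predicted value times one minus the predicted value. The function names are illustrative; in the method this computation is carried out on slices only.

```python
import math


def predict(beta, x):
    """Logistic prediction for one sample x with coefficients beta."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))


def hessian(X, beta):
    """Accumulate H = X^T M X sample by sample, with M_ii = p_i * (1 - p_i)."""
    dim = len(beta)
    H = [[0.0] * dim for _ in range(dim)]
    for x in X:
        p = predict(beta, x)
        m = p * (1.0 - p)  # diagonal element of M for this sample
        for r in range(dim):
            for c in range(dim):
                H[r][c] += x[r] * m * x[c]
    return H
```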
- the form of the matrix M can be transformed to simplify the process of determining the H slice by multiple participants.
- when the participant devices use the joint data slices <X>, the multiple predicted value slices and formula (9) above to determine the Hessian matrix slices <H> corresponding to the multiple participants, they can perform the following steps 1a to 3a.
- Step 1a: using secret sharing multiplication, and based on the expression of the predictor matrix, perform element-wise multiplication on the predicted value slices of multiple participants, so that the multiple participants each obtain an intermediate vector slice.
- the first participant A and the second participant B can use secret sharing multiplication to perform element-wise multiplication on the predicted value slices, obtaining the intermediate vector slice of the first participant A and the intermediate vector slice of the second participant B.
- the intermediate vector slices of multiple participants would yield the intermediate vector if reconstructed. The participants do not actually reconstruct the intermediate vector; this only expresses the relationship between the intermediate vector slices.
- Step 2a Using the elements in the intermediate vector slice of the first participant A as diagonal elements, construct a diagonalized predictive value matrix slice of the first participant A.
- the second participant device also uses the elements in the intermediate vector slice of the second participant B as diagonal elements to construct a diagonalized predictive value matrix slice of the second participant B.
- Step 3a: based on the joint data slices <X>, the predictor matrix slices and the Hessian matrix expression of multiple participants, determine the Hessian matrix slices corresponding to the multiple participants.
- the Hessian matrix slices <H>_A and <H>_B can be determined between the first participant A and the second participant B through secret sharing matrix multiplication, for example.
- in step 1a, the expression for the predictor matrix M includes
$$M = \mathrm{diag}\big(\pi(X_1)(1-\pi(X_1)), \ldots, \pi(X_N)(1-\pi(X_N))\big) \tag{10}$$
- the predicted value slices owned by multiple participants, for example the predicted value slice <π>_A of the first participant A and the predicted value slice <π>_B of the second participant B, can be used to obtain another expression of formula (10), in which the i-th diagonal element is
$$M_{ii} = \big(\langle \pi_i \rangle_A + \langle \pi_i \rangle_B\big)\big(1 - \langle \pi_i \rangle_A - \langle \pi_i \rangle_B\big) \tag{11}$$
- Multiple participants can use secret sharing multiplication to perform element-wise multiplication according to formula (11). That is, for any group of predicted value slices among the multiple participants, this group of slices is used as the input of the secret sharing multiplication, the multiplication is carried out in the form of the predictor matrix expression, and the elements of the participants' respective intermediate vector slices are output.
- the intermediate vector slice elements corresponding to the multiple groups of predicted value slices form the intermediate vector slices. The intermediate vector slices would yield the intermediate vector if reconstructed.
- each predicted value slice <π>_A of the first participant A and the corresponding predicted value slice <π>_B of the second participant B can be used as the input of the secret sharing multiplication, which proceeds according to formula (11) and outputs the elements of the <intermediate vector>_A slice and of the <intermediate vector>_B slice corresponding to the first participant A and the second participant B respectively.
- the first participant A uses the elements in the <intermediate vector>_A slice as diagonal elements to construct a diagonal matrix <M>_A, which is the diagonalized predictor matrix slice of the first participant A.
- the second participant B uses the elements in the <intermediate vector>_B slice as diagonal elements to construct a diagonal matrix <M>_B, which is the diagonalized predictor matrix slice of the second participant B.
- the dimension of the <intermediate vector>_A slice is N
- the dimension of the constructed diagonal matrix is N*N.
- the diagonal elements of the predictor matrix slice <M>_A are the elements in the <intermediate vector>_A slice, and the off-diagonal elements of <M>_A are all 0.
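- Steps 1a and 2a can be sketched as follows for two participants. Expanding the product (a + b)(1 - a - b) gives a(1 - a) + b(1 - b) - 2ab, so only the cross term ab needs a secret sharing multiplication. The function `shared_product` below is a stand-in that simply simulates the output of such a multiplication (it sees both inputs, which a real protocol would not); it and the other names are assumptions for illustration.

```python
import random


def shared_product(a, b):
    """Stand-in for a secret sharing multiplication: additive slices of a * b."""
    r = random.uniform(-1.0, 1.0)
    return r, a * b - r


def intermediate_slices(pi_a, pi_b):
    """Slices of (pi_a + pi_b) * (1 - pi_a - pi_b) from local terms plus a cross term."""
    c_a, c_b = shared_product(pi_a, pi_b)
    m_a = pi_a * (1.0 - pi_a) - 2.0 * c_a
    m_b = pi_b * (1.0 - pi_b) - 2.0 * c_b
    return m_a, m_b


def diag_slice(vector_slice):
    """Step 2a: place the intermediate vector slice on the diagonal of an N*N matrix."""
    n = len(vector_slice)
    return [[vector_slice[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
```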
- SMM stands for Secret Matrix Multiplication.
- the predictor matrix slice is a diagonal matrix, which contains a large number of 0 elements, and the matrix dimension is N*N.
- the magnitude of the sample size N is very large, such as one hundred thousand, one million or more, that is, the dimensionality of the joint data X is very high.
- the column vectors in the joint data slice are multiplied by the corresponding diagonal elements in the predictor matrix slice.
- the predictor matrix slices are all diagonal matrices: the elements on the main diagonal are non-zero, and the elements off the main diagonal are all 0.
- when the matrix multiplication is performed between the joint data slice and the predictor matrix slice, it can be split into multiplications of the column vectors of the joint data slice by the corresponding diagonal elements of the predictor matrix slice, that is, multiplications of column vectors by non-zero elements.
- the result of multiplying a column vector by a 0 element is 0, so it can be omitted and not calculated. In this way, the high-dimensional matrix multiplication between multiple participants can be decomposed, saving a large amount of calculation and thereby reducing the amount of communication between the participants. Communication volume plays a decisive role in processing efficiency in privacy protection scenarios.
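- The saving can be checked on plaintext data: multiplying by a diagonal matrix gives the same result as scaling each column by the matching diagonal element. The helper names below are illustrative assumptions.

```python
def scale_columns(Xt, diag):
    """Compute Xt @ diag(diag) by scaling column i of Xt by diag[i]."""
    return [[row[i] * diag[i] for i in range(len(diag))] for row in Xt]


def full_matmul(A, B):
    """Reference dense matrix product, including all the multiplications by 0."""
    return [
        [sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
        for r in range(len(A))
    ]


# Hypothetical example with D = 2 feature items and N = 3 samples.
Xt = [[1, 2, 3], [4, 5, 6]]
diag = [2, 0.5, 1]
M = [[2, 0, 0], [0, 0.5, 0], [0, 0, 1]]
```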
- X is the joint data
- T is the matrix transpose symbol
- each element of the vector x = (x_11 ... x_1D) needs to be multiplied by the corresponding diagonal element of M. Taking the multiplication between the first participant A and the second participant B as an example, refer to the flowchart shown in FIG. 3, which is a schematic diagram of a calculation flow for applying secret sharing matrix multiplication in this embodiment.
- the first participant A has a D*1-dimensional vector slice <x>_A and a 1*1-dimensional numerical slice <m>_A, where m is used as shorthand for the diagonal element.
- the second participant B has a D*1-dimensional vector slice <x>_B and a 1*1-dimensional numerical slice <m>_B.
- step 1: both parties each obtain a triple of random numbers.
- the first participant A obtains <u>_A (D*1), <v>_A (1*1), <z>_A (D*1)
- the second participant B obtains <u>_B (D*1), <v>_B (1*1), <z>_B (D*1)
- D*1 and 1*1 are matrix dimensions.
- step 2: the first participant A uses the random numbers to split its private data, so as to mask the private data and obtain the secret matrices.
- <d>_A = <x>_A - <u>_A, <e>_A = <m>_A - <v>_A
- the second participant B uses the random numbers to split its private data and obtain the secret matrices.
- <d>_B = <x>_B - <u>_B, <e>_B = <m>_B - <v>_B.
- Step 3: the participants send their secret matrices to each other, and each processes its own secret matrices together with the received ones.
- the first participant A sends <d>_A and <e>_A to the second participant B, and the second participant B sends <d>_B and <e>_B to the first participant A.
- d = <d>_A + <d>_B, e = <e>_A + <e>_B.
- step 4: the participants calculate their respective data slices.
- <Y>_A + <Y>_B = x*m.
- the first participant A and the second participant B obtain the slices <Y>_A and <Y>_B respectively, without exposing the private data <x>_A and <m>_A or <x>_B and <m>_B; these two slices would yield the product of the vector x and the value m if reconstructed.
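- The four steps can be simulated in a single process as below. This is a sketch under stated assumptions: values are integers in a ring of size `Q`, the random triple satisfies z = u*v and is produced by a trusted dealer, and the step 4 formulas are the standard ones for this kind of triple-based multiplication. None of these details are confirmed by the source, and all names are illustrative.

```python
import secrets

Q = 2**64  # hypothetical ring size


def share(value):
    """Additive slices of one ring element."""
    a = secrets.randbelow(Q)
    return a, (value - a) % Q


def share_vec(vec):
    pairs = [share(v) for v in vec]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def triple_multiply(x_a, m_a, x_b, m_b):
    """Return slices (Y_A, Y_B) of x * m given slices of the vector x and scalar m."""
    dim = len(x_a)
    # Step 1: a dealer (assumed trusted) distributes slices of u, v and z = u * v.
    u = [secrets.randbelow(Q) for _ in range(dim)]
    v = secrets.randbelow(Q)
    z = [(ui * v) % Q for ui in u]
    (u_a, u_b), (v_a, v_b), (z_a, z_b) = share_vec(u), share(v), share_vec(z)
    # Step 2: each party masks its private slices with the random slices.
    d_a = [(x - w) % Q for x, w in zip(x_a, u_a)]
    d_b = [(x - w) % Q for x, w in zip(x_b, u_b)]
    e_a, e_b = (m_a - v_a) % Q, (m_b - v_b) % Q
    # Step 3: the masked values are exchanged and opened.
    d = [(p + q) % Q for p, q in zip(d_a, d_b)]
    e = (e_a + e_b) % Q
    # Step 4: local computation of the output slices; only A adds the d * e term.
    y_a = [(d[i] * e + d[i] * v_a + u_a[i] * e + z_a[i]) % Q for i in range(dim)]
    y_b = [(d[i] * v_b + u_b[i] * e + z_b[i]) % Q for i in range(dim)]
    return y_a, y_b
```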
- the communication volume between the participants, which consists of the data exchanged in step 3 above, is 2(D+1), so the communication volume required to calculate X^T M is 2(D+1)*N. Compared with the communication volume 2(D*N+N*N) required for general matrix multiplication, this saves a large amount of communication.
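- As a worked example of the two expressions above, the following sketch evaluates them for hypothetical sizes D = 10 and N = 100000 (the sizes and function names are chosen for illustration only).

```python
def comm_diagonal(D, N):
    """Elements exchanged when each column is scaled separately: 2(D+1)*N."""
    return 2 * (D + 1) * N


def comm_general(D, N):
    """Elements exchanged for a general D*N by N*N matrix product: 2(D*N+N*N)."""
    return 2 * (D * N + N * N)


saving_ratio = comm_general(10, 100_000) / comm_diagonal(10, 100_000)
```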
- multiple participants multiply each column in X^T by the corresponding diagonal element in M.
- the matrix formed by splicing the multiple slices <Y>_A obtained in this way is the participant's slice of X^T M.
- the processing between the first participant A and the second participant B can follow the schematic diagram in FIG. 3: the data <x>_A of the first participant A in FIG. 3 is replaced by <X^T M>_A and <m>_A by <X>_A, the data <x>_B of the second participant B is replaced by <X^T M>_B and <m>_B by <X>_B, and the matrix dimensions of each parameter are adjusted accordingly. That is, based on the flow shown in FIG. 3, the first participant A and the second participant B can obtain the Hessian matrix slices <X^T M X>_A and <X^T M X>_B respectively.
- <X^T M X>_A corresponds to <Y>_A
- <X^T M X>_B corresponds to <Y>_B.
- the operations performed by the first participant A and the second participant B are respectively performed by corresponding participant devices of each party in actual operation.
- in step 2, when calculating the slices <H^-1> of the inverse of the intermediate matrix based on the intermediate matrix slices <H> of multiple participants, the Secure Matrix Inverse (SMI) algorithm can be used to obtain, through iterative calculation on the intermediate matrix slices <H>, the covariance matrix slices <Cov> corresponding to the multiple participants respectively.
- the intermediate matrix slices <H>_A and <H>_B can be used for iterative calculation with SMI.
- the intermediate matrix slices <H>_A and <H>_B would yield the intermediate matrix H if reconstructed, and H^-1 is the inverse matrix of H, but the first participant A and the second participant B do not reconstruct H. Therefore, the first participant A and the second participant B need to determine <H^-1>_A and <H^-1>_B without reconstructing H.
- the intermediate matrix H is not reconstructed, which avoids leaking private data.
- the first participant A and the second participant B respectively obtain L_0 through joint calculation
- tr is the trace of the matrix.
- SMM is used among multiple participants, and the calculations are performed according to the following iterative formula
- I is the identity matrix.
- two SMMs are required.
- the number of iteration rounds can be preset, for example, it can be set to 20 to 32 times, and k is the number of iterations.
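- The iterative formula itself did not survive in this text. One iteration that matches the description (an initial value built from the trace, the identity matrix I, and two matrix multiplications per round) is the Newton-Schulz iteration L_{k+1} = L_k (2I - H L_k) with L_0 = I / tr(H). The plaintext sketch below assumes that form; it is not confirmed to be the formula used in the source, and in the method each multiplication would be an SMM on slices.

```python
def matmul(A, B):
    return [
        [sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
        for r in range(len(A))
    ]


def newton_schulz_inverse(H, rounds=20):
    """Approximate H^-1 for a symmetric positive definite H (assumed iteration)."""
    n = len(H)
    trace = sum(H[i][i] for i in range(n))
    # Assumed initial value: L_0 = I / tr(H).
    L = [[(1.0 / trace if i == j else 0.0) for j in range(n)] for i in range(n)]
    for _ in range(rounds):
        HL = matmul(H, L)  # first multiplication of the round
        R = [[(2.0 if i == j else 0.0) - HL[i][j] for j in range(n)] for i in range(n)]
        L = matmul(L, R)  # second multiplication of the round
    return L
```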
- in step S230, when determining the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model, based on the model parameter slices and covariance matrix slices of multiple participants, the Wald test formula (2) can be used.
- the numerator is the model parameter
- the denominator is the standard deviation of the model parameter
- the standard deviation can be obtained as the square root of the variance of the model parameter
- the diagonal elements of the covariance matrix are the variances of the corresponding model parameters.
- SNSI stands for Secure Number Sqrt Invert.
- Step 1b: multiple participant devices use the diagonal elements in the covariance matrix slices of the multiple participants as the variance slices corresponding to the multiple model parameters.
- the diagonal elements here refer to the main diagonal elements.
- in the covariance matrix, the main diagonal elements are the variances of the feature items.
- in a covariance matrix slice, the main diagonal elements are the variance slices of the feature items.
- Step 2b: for any model parameter, the first participant device uses the SNSI algorithm and the significance test method, based on the corresponding model parameter slice of the first participant A and the corresponding variance slices of multiple participants, and jointly performs the secure inverse square root operation through interaction between the devices of the multiple participants, to determine the significance test value slice of the first participant A for the model parameter.
- the effective value of the feature item corresponding to the model parameter is then determined.
- the second participant device likewise uses the SNSI algorithm and the significance test method, based on the corresponding model parameter slice of the second participant B and the corresponding variance slices of multiple participants, and jointly performs the secure inverse square root operation through interaction between the devices of the multiple participants, to determine the significance test value slice of the second participant B for the model parameter.
- the significance test value slices of multiple participants can be sent to one participant device or to a third-party device, which reconstructs the significance test value; the effective value of the corresponding feature item can then be determined from the significance test value according to a predetermined transformation.
- the significance test value slices of multiple participants can also be used directly as effective value slices, and the significance test value slices can be reconstructed to obtain the effective values.
- the significance test value can be calculated based on formula (2) or formula (8) above, or on the p_value formula, and the resulting significance test value slices can be, but are not limited to, Wald_k value slices, z_k value slices or p value slices.
- the model parameter slices of multiple participants would yield the model parameters if reconstructed. For example, for any model parameter β_1, the model parameter slice <β_1>_A of the first participant A and the model parameter slice <β_1>_B of the second participant B would yield the model parameter β_1 if reconstructed. The model parameter slices are not actually reconstructed; this only illustrates the relationship between the model parameter slices and the model parameters.
- in step 2b, for any model parameter β_k, the first participant device uses the SNSI algorithm and the significance test method and, through interaction between multiple participant devices, based on the model parameter slice <β_k>_A of the first participant A and the variance slices of multiple participants, jointly performs the secure inverse square root operation and determines the significance test value slice of the first participant A for the model parameter β_k.
- the significance test value slice for the model parameter β_k determined by the second participant device for the second participant B can be obtained in the same way.
- the significance test value slice of the first participant A is given by formula (12):
$$\langle z_k \rangle_A = \frac{\langle \beta_k \rangle_A}{\sqrt{\langle Cov_{kk} \rangle_A + \langle Cov_{kk} \rangle_B}} \tag{12}$$
- <z_k>_A is the significance test value slice of the first participant A for the model parameter β_k
- the numerator is the model parameter slice <β_k>_A of the first participant A
- <Cov_kk>_A is the variance slice corresponding to the model parameter β_k owned by the first participant A, which is also the (k,k)-th element (a diagonal element) of the covariance matrix slice of the first participant A
- <Cov_kk>_B is the variance slice corresponding to the model parameter β_k owned by the second participant B, which is also the (k,k)-th element of the covariance matrix slice of the second participant B.
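- Because both participants divide by the same square root, their test value slices add up to the test value of the reconstructed parameter. The short derivation below spells this out, writing ⟨β_k⟩ and ⟨Cov_kk⟩ for the additive slices of the model parameter and of its variance, and assuming (this is not stated explicitly in the source) that the Wald value is the square of z_k.

```latex
\begin{aligned}
\langle z_k \rangle_A + \langle z_k \rangle_B
  &= \frac{\langle \beta_k \rangle_A + \langle \beta_k \rangle_B}
          {\sqrt{\langle Cov_{kk} \rangle_A + \langle Cov_{kk} \rangle_B}}
   = \frac{\beta_k}{\sqrt{Cov_{kk}}} = z_k, \\
Wald_k &= z_k^{2} \quad \text{(assumed relationship between the two test values)}.
\end{aligned}
```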
- the numerator is owned by the first participant A, while the denominator is jointly owned by the first participant A and the second participant B. The key problem is therefore how to calculate the inverse square root in formula (12).
- the inverse square root of the sum of the variance slice of the model parameter β_k of the first participant A and the variance slice of the model parameter β_k of the second participant B is determined using the SNSI algorithm; multiplying this inverse square root by the model parameter slice <β_k>_A of the first participant A gives the significance test value slice of the first participant A for the model parameter β_k.
- the inverse square root in formula (12) is calculated as follows
- step 1c the first participant device and the second participant device convert the additive slice into a multiplicative slice through interaction.
- the first participant device locally generates a random number x_a and calculates a corresponding local value; the first participant device and the second participant device then perform a joint calculation through secret sharing matrix multiplication and obtain x_ba2 and x_bb respectively;
- the first participant A owns x_a
- the second participant B owns x_b.
- step 2c: the two participant devices each initialize the iterative estimate locally.
- the first participant device reads the stored bits of the 64-bit floating-point number x_a as a 64-bit integer and shifts it right by one bit (division by 2, rounded down), recording the result as int_a; it then calculates 0x5fe6eb50c7b537a9 - int_a, reads the result as a 64-bit floating-point number, and records it as y_a.
- x_a is initialized to y_a.
- the second participant device performs the same initialization and initializes x_b to y_b.
- the first participant A owns y_a
- the second participant B owns y_b.
- step 3c: the two participants jointly use Newton's method to iteratively calculate n^(-1/2).
- the iteration formula is as follows
- two secret sharing matrix multiplications are used in the iterative process, a total of one iteration, and the first participant A and the second participant B obtain floating-point numbers c_a and c_b respectively.
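- The bit-level initialization in step 2c and the Newton refinement in step 3c can be illustrated on a single plaintext number. The constant is the one given in the text; the refinement y <- y(1.5 - 0.5 n y^2) is the standard Newton step for n^(-1/2) and is an assumption here, since the iteration formula did not survive in this text. The secure version operates on the participants' values through secret sharing multiplication rather than on a plaintext n.

```python
import struct

MAGIC = 0x5FE6EB50C7B537A9  # constant quoted in step 2c


def initial_estimate(n):
    """Reinterpret the float64 bits as an int64, halve, subtract from MAGIC."""
    bits = struct.unpack("<q", struct.pack("<d", n))[0]
    guess_bits = MAGIC - (bits >> 1)
    return struct.unpack("<d", struct.pack("<q", guess_bits))[0]


def inverse_sqrt(n, rounds=4):
    """Refine the initial estimate with Newton steps (assumed iteration formula)."""
    y = initial_estimate(n)
    for _ in range(rounds):
        y = y * (1.5 - 0.5 * n * y * y)
    return y
```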
- step 2b can also be implemented in other ways.
- the variance slice of the first participant A and the variance slice of the second participant B are first securely normalized, the initial value of the iteration is then obtained through a linear approximation, and finally the iteration is performed based on the Goldschmidt algorithm.
- the secret sharing matrix multiplication may be performed on the variance slices of the first participant A and the variance slices of the second participant B before other operations are performed.
- the number of participants can be 2, 3 or more; each participant performs its operations through the corresponding participant device, which can be implemented by any apparatus, device, platform or device cluster with computing and processing capabilities.
- two participants are taken as examples for illustration.
- for secure multi-party computation algorithms such as secret sharing matrix multiplication, secret sharing inverse square root and secret sharing matrix inversion, the two-participant implementation can be easily extended to scenarios with more participants; the specific process is not repeated here.
- Fig. 4 is a schematic block diagram of an apparatus for determining an effective value of a feature of service data to protect privacy provided by an embodiment.
- the business data is distributed among multiple participants, and the respective business data of the multiple participants would constitute joint data if spliced together; the joint data includes feature values of multiple objects for multiple feature items. The apparatus 400 is deployed in any first participant device, which may be implemented by any apparatus, device, platform or device cluster having computing and processing capabilities. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2.
- the apparatus 400 includes: an acquisition module 410, configured to acquire the joint data slice of the first participant, and to acquire predicted value slices corresponding to multiple objects and model parameter slices corresponding to multiple feature items respectively, where the predicted value slices and the model parameter slices are obtained based on the trained business prediction model; an interaction module 420, configured to use secure multi-party computation, through interaction between multiple participant devices, based on the joint data slices and predicted value slices of multiple participants, to determine the correlation data slices corresponding to the multiple participants, which include correlation data between multiple feature items; and an inspection module 430, configured to use the significance test method, through secure interaction between multiple participant devices, based on the corresponding data in the model parameter slices and correlation data slices of multiple participants, to determine the effective value of the feature items corresponding to the model parameters in improving the effect of the business prediction model.
- the acquisition module 410, when acquiring the joint data slice of the first participant, is configured to: use additive secret sharing, through interaction with other participant devices, to perform splitting and splicing operations on the business data of multiple participants, so that the multiple participants each obtain a joint data slice; the joint data slices of the multiple participants would yield the joint data if reconstructed.
- the business prediction model is obtained through secure joint training based on the joint data slices of multiple participants; the business prediction model is used to perform business prediction on objects.
- the acquisition module 410, when acquiring the predicted value slices corresponding to multiple objects and the model parameter slices corresponding to multiple feature items respectively, is configured to: acquire the local model parameter slice of the trained business prediction model in the first participant device; and, through interaction between multiple participant devices, based on the joint data slices of multiple participants and the trained business prediction model, enable the multiple participants to determine the predicted value slices of the objects.
- The correlation data includes covariance matrix data, and the correlation data shares include covariance matrix shares. The interaction module 420 includes: a determination sub-module 421, configured to determine the intermediate matrix shares corresponding to the respective participants based on the participants' joint data shares and predicted value shares and on the functional relation in the business prediction model; and a calculation sub-module 422, configured to compute, based on the participants' intermediate matrix shares, shares of the inverse of the intermediate matrix for the respective participants, thereby obtaining the covariance matrix shares corresponding to the respective participants.
- The determination sub-module 421 is specifically configured to determine, based on the participants' joint data shares and predicted value shares and on the Hessian matrix expression derived from the functional relation in the business prediction model, the Hessian matrix shares corresponding to the respective participants, which serve as the intermediate matrix shares; the Hessian matrix expression includes the joint data matrix and the predicted value matrix.
- When determining the Hessian matrix shares corresponding to the respective participants, the determination sub-module 421: uses secret-sharing multiplication, based on the expression of the predicted value matrix, to multiply the participants' predicted value shares element by element, so that each participant obtains an intermediate vector share; uses the elements of the first participant's intermediate vector share as diagonal elements to construct the first participant's diagonalized predicted value matrix share; and determines the Hessian matrix shares corresponding to the respective participants based on the participants' joint data shares, the predicted value matrix shares, and the Hessian matrix expression.
- When determining the Hessian matrix shares based on the participants' joint data shares, the predicted value matrix shares, and the Hessian matrix expression, the determination sub-module 421 computes the secure multiplication of the joint data shares with the predicted value matrix shares by securely multiplying each column vector in the joint data share with the corresponding diagonal element of the predicted value matrix share.
- The calculation sub-module 422 is specifically configured to use the secret-sharing matrix inverse (SMI) algorithm to obtain, through iterative computation based on the participants' intermediate matrix shares, the covariance matrix shares corresponding to the respective participants.
- The test module 430 is specifically configured to: take the diagonal elements of the participants' covariance matrix shares as the variance shares corresponding to the respective model parameters; for any model parameter, use the secret-sharing inverse square root (SNSI) algorithm and the significance test, based on the first participant's corresponding model parameter share and the participants' corresponding variance shares and through interaction among the participant devices, to jointly perform a secure inverse square root operation and determine the first participant's significance test value share for that model parameter; and determine, based on the participants' significance test value shares for that model parameter, the effective value of the feature item corresponding to that model parameter.
- The apparatus 400 further includes a reconstruction module (not shown in the figure), configured to: for any first feature item, obtain the effective value shares of the first feature item from the other participant devices, and determine the reconstructed effective value of the first feature item based on the locally held effective value share and the obtained effective value shares.
- The apparatus 400 further includes a removal module (not shown in the figure), configured to remove, based on the effective values, the feature items whose effective values do not meet a preset condition from the multiple feature items, so that the participants use the business data with those feature items removed to perform secure joint training of the business prediction model.
- The object is one of a user, a commodity, and an event; the feature items include at least one of the following: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used to make business predictions about objects.
- The business prediction model is obtained based on a logistic regression model.
- The foregoing apparatus embodiments correspond to the method embodiments; for details, refer to the description of the method embodiments, which is not repeated here.
- The apparatus embodiments are derived from the corresponding method embodiments and have the same technical effects; for details, refer to the corresponding method embodiments.
- The embodiments of this specification also provide a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method described with reference to any one of Fig. 1 to Fig. 3.
- The embodiments of this specification also provide a computing device including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method described with reference to any one of Fig. 1 to Fig. 3.
- The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others.
- The storage medium and computing device embodiments are described relatively briefly because they are essentially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
- The functions described in the embodiments of the present invention may be implemented in hardware, software, firmware, or any combination thereof.
- When implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- General Physics & Mathematics (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Entrepreneurship & Innovation (AREA)
- Economics (AREA)
- General Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Educational Administration (AREA)
- Operations Research (AREA)
- General Business, Economics & Management (AREA)
- Tourism & Hospitality (AREA)
- Development Economics (AREA)
- Quality & Reliability (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
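The embodiments of this specification provide a privacy-preserving method and apparatus for determining effective values of business data features. The business data is distributed among multiple participants, and the participants' business data can hypothetically be concatenated into joint data, which includes feature values of multiple objects for multiple feature items. Each party obtains a joint data share, predicted value shares corresponding to the multiple objects, and model parameter shares corresponding to the multiple feature items. The predicted value shares and model parameter shares are all obtained from a business prediction model. Using secure multi-party computation, and based on the parties' joint data shares and predicted value shares, the parties determine their respective correlation data shares, which include correlation data between the feature items. Then, using a significance test and based on the parties' model parameter shares and the corresponding data in the correlation data shares, they determine the effective value of the feature item corresponding to each model parameter in improving the performance of the business prediction model.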
Description
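One or more embodiments of this specification relate to the technical field of data security, and in particular to a privacy-preserving method and apparatus for determining effective values of business data features.
The data required for machine learning often spans multiple platforms and domains. For example, in a machine-learning-based merchant classification scenario, an electronic payment platform holds the merchants' transaction records, an e-commerce platform stores the merchants' sales data, and a banking institution holds the merchants' loan data. To improve their services, multiple parties often jointly train a business prediction model while preserving the privacy and security of their business data.
As data volumes grow, the feature dimension of the data also grows. Such multi-dimensional feature data often contains redundant information, which may harm machine learning performance and reduce model stability. Therefore, the multi-dimensional feature data can be reduced in dimension according to feature effectiveness: with as little loss of information as possible, redundant features that are not significant for improving model performance are removed, converting the data into low-dimensional features.
An improved solution is therefore desired that determines feature effectiveness as securely as possible and without leaking private data.
Summary of the Invention
One or more embodiments of this specification describe a privacy-preserving method and apparatus for determining effective values of business data features, which can determine the effective values of feature items for business data distributed among multiple parties securely and without leaking private data. The specific technical solutions are as follows.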
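In a first aspect, an embodiment provides a privacy-preserving method for determining effective values of business data features. The business data is distributed among multiple participants, and the participants' respective business data would form joint data if hypothetically concatenated; the joint data includes feature values of multiple objects for multiple feature items. The method is performed by any first participant device and includes: acquiring the first participant's joint data share, the predicted value shares corresponding to the multiple objects, and the model parameter shares corresponding to the multiple feature items, where the predicted value shares and the model parameter shares are obtained from the trained business prediction model; using secure multi-party computation, through interaction among the participant devices and based on the participants' joint data shares and predicted value shares, determining the correlation data shares corresponding to the respective participants, which include correlation data between the feature items; and applying a significance test, through secure interaction among the participant devices and based on the participants' model parameter shares and the corresponding data in the correlation data shares, to determine the effective value of the feature item corresponding to each model parameter in improving the performance of the business prediction model.
In one implementation, the step of acquiring the first participant's joint data share includes: using additive secret sharing and, through interaction with the other participant devices, performing splitting and concatenation operations on the participants' business data so that each participant obtains a joint data share; the participants' joint data shares would yield the joint data if hypothetically reconstructed.
In one implementation, the business prediction model is obtained through secure joint training based on the participants' respective joint data shares; the business prediction model is used to make business predictions about objects.
In one implementation, the step of acquiring the predicted value shares corresponding to the multiple objects and the model parameter shares corresponding to the multiple feature items includes: acquiring the trained business prediction model's model parameter share held locally by the first participant device; and, through interaction among the participant devices and based on the participants' joint data shares and the trained business prediction model, enabling each participant to determine the predicted value shares of the objects.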
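In one implementation, the correlation data includes covariance matrix data, and the correlation data shares include covariance matrix shares. The step of determining the correlation data shares corresponding to the respective participants includes: determining the intermediate matrix shares corresponding to the respective participants based on the participants' joint data shares and predicted value shares and on the functional relation in the business prediction model; and computing, based on the participants' intermediate matrix shares, shares of the inverse of the intermediate matrix for the respective participants, thereby obtaining the covariance matrix shares corresponding to the respective participants.
In one implementation, the step of determining the intermediate matrix shares includes: determining, based on the participants' joint data shares and predicted value shares and on the Hessian matrix expression derived from the functional relation in the business prediction model, the Hessian matrix shares corresponding to the respective participants, which serve as the intermediate matrix shares; the Hessian matrix expression includes the joint data matrix and the predicted value matrix.
In one implementation, the step of determining the Hessian matrix shares includes: using secret-sharing multiplication, based on the expression of the predicted value matrix, to multiply the participants' predicted value shares element by element, so that each participant obtains an intermediate vector share; using the elements of the first participant's intermediate vector share as diagonal elements to construct the first participant's diagonalized predicted value matrix share; and determining the Hessian matrix shares corresponding to the respective participants based on the participants' joint data shares, the predicted value matrix shares, and the Hessian matrix expression.
In one implementation, the step of determining the Hessian matrix shares based on the participants' joint data shares, the predicted value matrix shares, and the Hessian matrix expression includes: when computing the secure multiplication of the participants' joint data shares with the predicted value matrix shares, securely multiplying each column vector in the joint data share with the corresponding diagonal element of the predicted value matrix share.
In one implementation, the step of computing the shares of the inverse of the intermediate matrix and obtaining the covariance matrix shares includes: using the secret-sharing matrix inverse (SMI) algorithm to obtain, through iterative computation based on the participants' intermediate matrix shares, the covariance matrix shares corresponding to the respective participants.
In one implementation, the step of determining the effective value of the feature item corresponding to a model parameter in improving the performance of the business prediction model includes: taking the diagonal elements of the participants' covariance matrix shares as the variance shares corresponding to the respective model parameters; for any model parameter, using the secret-sharing inverse square root (SNSI) algorithm and the significance test, based on the first participant's corresponding model parameter share and the participants' corresponding variance shares and through interaction among the participant devices, to jointly perform a secure inverse square root operation and determine the first participant's significance test value share for that model parameter; and determining, based on the participants' significance test value shares for that model parameter, the effective value of the feature item corresponding to that model parameter.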
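In one implementation, the method further includes: for any first feature item, obtaining the effective value shares of the first feature item from the other participant devices; and determining the reconstructed effective value of the first feature item based on the locally held effective value share and the obtained effective value shares.
In one implementation, the method further includes: based on the effective values, removing from the multiple feature items those whose effective values do not meet a preset condition, so that the participants use the business data with those feature items removed to perform secure joint training of the business prediction model.
In one implementation, the object is one of a user, a commodity, and an event; the feature items include at least one of the following: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used to make business predictions about objects.
In one implementation, the business prediction model is obtained based on a logistic regression model.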
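In a second aspect, an embodiment provides a privacy-preserving apparatus for determining effective values of business data features. The business data is distributed among multiple participants, and the participants' respective business data would form joint data if hypothetically concatenated; the joint data includes feature values of multiple objects for multiple feature items. The apparatus is deployed in any first participant device and includes: an acquisition module, configured to acquire the first participant's joint data share, the predicted value shares corresponding to the multiple objects, and the model parameter shares corresponding to the multiple feature items, where the predicted value shares and the model parameter shares are obtained from the trained business prediction model; an interaction module, configured to use secure multi-party computation, through interaction among the participant devices and based on the participants' joint data shares and predicted value shares, to determine the correlation data shares corresponding to the respective participants, which include correlation data between the feature items; and a test module, configured to apply a significance test, through secure interaction among the participant devices and based on the participants' model parameter shares and the corresponding data in the correlation data shares, to determine the effective value of the feature item corresponding to each model parameter in improving the performance of the business prediction model.
In one implementation, when acquiring the first participant's joint data share, the acquisition module uses additive secret sharing and, through interaction with the other participant devices, performs splitting and concatenation operations on the participants' business data so that each participant obtains a joint data share; the participants' joint data shares would yield the joint data if hypothetically reconstructed.
In one implementation, the business prediction model is obtained through secure joint training based on the participants' respective joint data shares; the business prediction model is used to make business predictions about objects.
In one implementation, when acquiring the predicted value shares corresponding to the multiple objects and the model parameter shares corresponding to the multiple feature items, the acquisition module acquires the trained business prediction model's model parameter share held locally by the first participant device and, through interaction among the participant devices and based on the participants' joint data shares and the trained business prediction model, enables each participant to determine the predicted value shares of the objects.
In one implementation, the correlation data includes covariance matrix data, and the correlation data shares include covariance matrix shares. The interaction module includes: a determination sub-module, configured to determine the intermediate matrix shares corresponding to the respective participants based on the participants' joint data shares and predicted value shares and on the functional relation in the business prediction model; and a calculation sub-module, configured to compute, based on the participants' intermediate matrix shares, shares of the inverse of the intermediate matrix for the respective participants, thereby obtaining the covariance matrix shares corresponding to the respective participants.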
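In a third aspect, an embodiment provides a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method of any implementation of the first aspect.
In a fourth aspect, an embodiment provides a computing device including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method of any implementation of the first aspect.
In the method and apparatus provided by the embodiments of this specification, through interaction among the participants and based on the first participant's joint data share and predicted value shares together with those of the other participants, secure multi-party computation enables the participants to obtain correlation data shares; the model parameter shares and correlation data shares are then used to determine the effective value of each feature item in improving model performance. The participants perform secure multi-party computation on shares of the various data, and the results are also shares; private data such as the correlation data between feature items is never reconstructed during processing, which improves the privacy and security of the data during processing.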
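To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;
Fig. 2 is a schematic flowchart of a privacy-preserving method for determining effective values of business data features provided by this embodiment;
Fig. 3 is a schematic diagram of a computation flow applying secret-sharing matrix multiplication in this embodiment;
Fig. 4 is a schematic block diagram of a privacy-preserving apparatus for determining effective values of business data features provided by an embodiment.
The solutions provided in this specification are described below with reference to the drawings.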
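Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. As shown in Fig. 1, in a shared learning scenario the data set is provided jointly by participants 1, 2, …, W (W is a natural number). Each participant holds part of the data set, which constitutes that participant's business data (its raw matrix). The data set may be a training data set used to train a model, a test data set used to test a model, or a data set to be predicted. The data set may include feature data of objects, where an object may be one of the various things to be analyzed in a business, such as a user, a commodity, or an event. The model may include a business prediction model trained by machine learning.
The data set can be distributed in at least two ways. In one distribution, each participant holds different feature data for all objects. For example, every participant has samples of the same N objects, the private data of each sample contains D features, and these features are distributed among the W participants, each holding D/W features. As another example, two platforms have the same group of users, but the user features in their business data differ. The participants hold different kinds of features, and the numbers of features they hold may be equal (for example, D/W each) or different. N, D, and W are all natural numbers. This is the vertical partitioning scenario; Table 1 shows the distribution of business data under vertical partitioning.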
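Table 1
Here xx denotes a specific feature value, which is the participant's private data. In Table 1, each row represents one sample and each column represents the values of one feature item for the N objects; the D feature items belong to the W participants. The feature values of the D feature items of the N objects constitute the entire business data.
In the other distribution, each participant holds all feature data for different objects. For example, there are samples of N objects in total, the business data of each sample contains D feature items, and these N records are distributed among the W participants; each participant holds a subset of the N samples, and every sample contains the same feature items. The numbers of object samples stored by different participants may be equal or different. As another example, two banks serve different user groups but both hold the same user credit features. This is the horizontal partitioning scenario; Table 2 shows the distribution of business data under horizontal partitioning.
Table 2
Here xx denotes a specific feature value, which is the participant's private data. In Table 2, each row represents one sample and each column represents the values of one feature item for the N objects; the N samples belong to the W participants, and different participants hold different object samples. The feature values of the D feature items of the N objects constitute the entire business data.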
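The business data held by a participant may include multiple feature items. The feature items of an object may include at least one of the following: the object's basic attribute information, association relationship information, interaction information, and historical behavior information. For example, when the object is a user, the basic attribute information may include gender, age, and income; the association relationship information may include other users, companies, and regions associated with the user; the interaction information may include the user's clicks and views on a website and activities the user took part in; and the historical behavior information may include the user's past transaction, payment, and purchase behavior.
When the object is a commodity, the basic attribute information may include the commodity's category, place of origin, and price; the association relationship information may include users, shops, or other commodities associated with it; the interaction information may include interaction features between users or shops and the commodity; and the historical behavior information may include records of the commodity being purchased, transferred, or returned.
When the object is an event, the event may be a transaction event, login event, purchase event, social event, and so on. The basic attribute information of an event may be text describing the event; the association relationship information may include text contextually related to the event and information about other related events; and the historical behavior information may include records of how the event developed over time.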
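The participants may correspond to different service platforms, which may include various enterprises, institutions, and organizations. Business data is usually the private data of a service platform and must be kept highly private and secure during processing. Whichever data distribution applies, the feature values (feature data) of the objects' feature items are private data and can be stored as a private data matrix. To protect this private data, each participant keeps it locally, outputs no plaintext data, and performs no plaintext aggregation.
To prevent the participants' private data from leaking, in one implementation the participants can use secure multi-party computation with their own predicted values and raw matrices and, through interaction with a third party, let the third party obtain covariance matrix data that represents the correlation between the feature items. The third party then uses this covariance matrix data and the model parameters with a significance test to determine the effective value of the feature item corresponding to each model parameter in improving the performance of the business prediction model.
Because the covariance matrix data itself contains some private information, improving on this approach can further strengthen the security of private data. Referring to Fig. 1, in the embodiments of this specification each participant stores its own data shares, including its joint data share, the predicted value shares corresponding to the multiple objects, and the model parameter shares corresponding to the multiple features. The participant devices interact by secure multi-party computation and use the joint data shares and predicted value shares to determine the correlation data shares corresponding to the respective participants, which include correlation data between the feature items. Each participant then applies a significance test, based on the participants' model parameter shares and the corresponding data in the correlation data shares, to determine the effective value of each feature item in improving the performance of the business prediction model. The participants perform secure multi-party computation on shares of the various data, and the resulting correlation data is also in shares; private data such as the correlation data between feature items is never reconstructed, which improves the privacy and security of data during processing.
In this specification, each participant has a corresponding participant device and uses it to perform the operations of the embodiments. Participant devices include, but are not limited to, any apparatus, device, platform, or device cluster with computing and processing capabilities. The embodiments of the present invention are described below with reference to specific embodiments.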
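Fig. 2 is a schematic flowchart of a privacy-preserving method for determining effective values of business data features provided by this embodiment. The business data is distributed among multiple participants, and the participants' respective business data would form joint data if hypothetically concatenated. A participant's business data is highly private; the participants neither send business data to each other in plaintext nor actually concatenate it into joint data. The joint data is only the data set that the participants' business data would form hypothetically. For example, Table 1 and Table 2 above show the form of the joint data under vertical and horizontal partitioning respectively. The joint data includes feature values of multiple objects for multiple feature items, for example the feature values of N objects for D feature items, where N and D are natural numbers.
For convenience, the examples below mostly use two participants: a first participant A, corresponding to a first participant device, and a second participant B, corresponding to a second participant device. A participant device performs its participant's operations and stores its participant's data. In specific implementations, a participant device may also obtain its participant's data from other devices. The method of this embodiment includes steps S210 to S230.
Step S210: the first participant device acquires the joint data share of the first participant A, the predicted value shares corresponding to the multiple objects, and the model parameter shares corresponding to the multiple feature items. The second participant device acquires the joint data share of the second participant B, the predicted value shares corresponding to the multiple objects, and the model parameter shares corresponding to the multiple feature items.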
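Each participant holds its own business data, which is both raw data and private data. In the vertical partitioning scenario, the participants hold different feature items for the same objects. Each participant can represent its raw data as a raw matrix. For example, let the raw matrices of the first participant A and the second participant B be $X_A$ and $X_B$, their numbers of feature items be $d_A$ and $d_B$, and their numbers of objects be $n_A$ and $n_B$. The joint data then has $D = d_A + d_B$ feature items in total, and the total number of objects (samples) is $N = n_A = n_B$. When the columns of the raw matrices represent feature items and the rows represent objects or samples, hypothetically concatenating the business data of participants such as A and B horizontally yields the joint data $X = (X_A, X_B)$. This case corresponds to the data distribution in Table 1. In other implementations, the columns of the raw matrix may represent objects and the rows feature items; in that case the participants' business data is hypothetically concatenated vertically, giving joint data of the form

$$X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}$$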
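In the horizontal partitioning scenario, the participants hold the same feature items for different objects. Let the raw matrices of A and B be $X_A$ and $X_B$, with $d_A = d_B = D$ feature items and $n_A$ and $n_B$ objects respectively. The joint data then has $D = d_A = d_B$ feature items in total, and the total number of objects (samples) is $N = n_A + n_B$. When the rows of the raw matrices represent objects and the columns represent feature items, hypothetically concatenating the participants' business data vertically yields joint data of the form

$$X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}$$

This corresponds to the data distribution in Table 2. When the rows of the raw matrices represent feature items and the columns represent objects, the participants' business data is hypothetically concatenated horizontally, giving joint data of the form

$$X = (X_A, X_B)$$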
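To give each participant a joint data share, the participants can use additive secret sharing: each participant's business data is split into random numbers, and the sharing is completed by passing those random numbers among the participants. Specifically, when acquiring the joint data share of the first participant A, the first participant device can use additive secret sharing and, through interaction with the other participant devices, perform splitting and concatenation operations on the participants' business data so that each participant obtains a joint data share. The second participant B obtains its joint data share in the same way.
Additive secret sharing splits a raw matrix into random matrices, and the sharing is completed by passing the random matrices among the participants. Taking two participants as an example, A and B hold the raw matrices $X_A$ and $X_B$ of their business data. The first participant device generates a random matrix $R_A$ in a finite field and computes $X_2 = X_A - R_A$; it may then send either of the two random matrices $R_A$ and $X_2$, for example $X_2$, to the second participant device. The second participant device likewise generates a random matrix $R_B$ in the finite field, computes $X_3 = X_B - R_B$, and may send either of $R_B$ and $X_3$, for example $X_3$, to the first participant device.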
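The first participant device can concatenate $R_A$ with the $X_3$ received from the second participant device to form its joint data share, and the second participant device can concatenate $R_B$ with the $X_2$ received from the first participant device to form its joint data share. In practical scenarios there are often three or more participants, and the additive secret sharing procedure above extends readily to that case. All data sent between the participants are random matrices, so no private data of the raw matrices is leaked.
The participants' joint data shares would yield the joint data if hypothetically reconstructed. Reconstruction can be based on adding the parties' shares together; a specific reconstruction may apply further matrix transformations on top of the addition, such as multiplying by a preset value. The joint data contains private data, and the participants do not aggregate it in plaintext. The joint data is only a representation under a hypothetical; in practice the participants' shares are never directly reconstructed together. This meaning of reconstruction applies throughout the rest of this description.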
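The joint data share of the first participant A can be written $\langle X \rangle_A$ and that of the second participant B $\langle X \rangle_B$, so that the joint data is $X = \langle X \rangle_A + \langle X \rangle_B$. Here $\langle X \rangle$ denotes a share of the quantity $X$, and the subscript indicates the participant that holds the share. For consistency, the "angle brackets plus subscript" form is used below to denote the share of a piece of data held by a given participant.
In this embodiment, each participant's joint data share is derived from the business data of the multiple participants, and the sum of the participants' joint data shares equals the joint data conceptually, or in theory.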
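The splitting and concatenation just described can be sketched for two parties in the vertical partitioning case. This is a minimal single-process simulation, not the patent's implementation: the modulus `P`, the helper `split`, and the toy matrices are assumptions introduced for illustration.

```python
import numpy as np

P = 2**31 - 1  # assumed modulus of the finite field used for sharing

def split(X, rng):
    """Split an integer matrix into two additive shares modulo P."""
    R = rng.integers(0, P, size=X.shape, dtype=np.int64)
    return R, (X - R) % P

rng = np.random.default_rng(0)
X_A = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)  # A's feature columns
X_B = np.array([[7], [8], [9]], dtype=np.int64)           # B's feature column

R_A, X_2 = split(X_A, rng)  # A keeps R_A and sends X_2 to B
R_B, X_3 = split(X_B, rng)  # B keeps R_B and sends X_3 to A

share_A = np.hstack([R_A, X_3])  # <X>_A: A's own mask plus B's masked block
share_B = np.hstack([X_2, R_B])  # <X>_B: A's masked block plus B's own mask

X = np.hstack([X_A, X_B])                # joint data, never built in practice
reconstructed = (share_A + share_B) % P  # hypothetical reconstruction
```

Only the masked blocks `X_2` and `X_3` cross the party boundary, and the sum of the two shares modulo `P` recovers the horizontally concatenated joint data.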
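In step S210, the predicted value shares and model parameter shares are data obtained from the trained business prediction model. The business prediction model is obtained through secure joint training based on the participants' respective joint data shares, and may be trained in advance. It may be a model trained from a logistic regression model or from another type of model. The business prediction model is used to make business predictions about objects, for example classification or regression predictions on the input feature data of an object.
The participant devices can obtain the predicted value shares and model parameter shares from the trained business prediction model. For example, the first participant device can acquire the trained model's model parameter share held locally by the first participant device and, through secure interaction among the participant devices and based on the participants' joint data shares and the trained model, enable each participant to determine the predicted value shares of the objects.
The participant devices use the N objects in the joint data shares as samples to train the business prediction model. After training, each participant device holds its own model parameter share of the business prediction model. Through secure interaction among the participant devices, the participants' joint data shares are fed into the business prediction model, and each participant device can determine its own predicted value shares for the multiple objects.
Thus, in the data a participant obtains, each object corresponds to one predicted value share; the N objects correspond to N predicted value shares, which can form a vector as its elements. When the business data contains D feature items, the trained business prediction model contains multiple model parameters corresponding to the D feature items. For any predicted value, the participants' corresponding predicted value shares would yield that predicted value if hypothetically reconstructed. For any model parameter, the participants' corresponding model parameter shares would yield that model parameter if hypothetically reconstructed.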
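Step S220: using secure multi-party computation, through interaction among the participant devices and based on the participants' joint data shares and predicted value shares, determine the correlation data shares corresponding to the respective participants, which include correlation data between the feature items.
The participants' correlation data shares would yield the correlation data, that is, the correlation data between feature items, if hypothetically reconstructed. It covers correlations between feature items held by the same participant as well as between feature items held by different participants, and includes both the correlation between different feature items and the correlation of a feature item with itself.
This step can be implemented on the basis of an existing formula for computing correlation data between feature items, using the joint data shares and predicted value shares and secure multi-party computation to determine the correlation data shares corresponding to the respective participants. Formulas that can express the correlation between feature items include the covariance matrix formula, the correlation coefficient formula, and so on.
Secure multi-party computation (MPC) is an existing privacy-preserving technique for computations involving multiple parties; its concrete realizations include homomorphic encryption, garbled circuits, oblivious transfer, and secret sharing. With secure multi-party computation, the participant devices can securely and interactively compute on the joint data shares and predicted value shares, so that each participant can determine its correlation data share.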
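Step S230: applying a significance test, through secure interaction among the participant devices and based on the participants' model parameter shares and the corresponding data in the correlation data shares, determine the effective value of the feature item corresponding to each model parameter in improving the performance of the business prediction model.
The significance test may be the Wald test, the likelihood ratio (LR) test, the Lagrange multiplier (LM) test, or the like. After the existing formula of the significance test is transformed, the participant devices securely compute on the participants' model parameter shares and correlation data shares through secure interaction and determine the effective value share corresponding to each participant.
In this embodiment, feature items correspond to model parameters, and both the model parameter shares and the correlation data shares contain data corresponding to each feature item. Using the corresponding data in the model parameter shares and correlation data shares with the significance test, the significance test value shares of the model parameters, that is, of the corresponding feature items, can be determined, and the effective value shares can be determined from them.
When the effective value of a feature item needs to be determined, for example for any first feature item, the first participant device can obtain the effective value shares of the first feature item from the other participant devices and determine the reconstructed effective value from the locally held effective value share and the obtained shares. The effective value of a feature item can equally be reconstructed in the second participant device or another participant device; this embodiment uses reconstruction in the first participant device only as an example.
After the effective values of the feature items are obtained, the first participant device can also remove, based on those values, the feature items whose effective values do not meet a preset condition, so that the participants use the business data with those feature items removed to perform secure joint training of the business prediction model. Removing feature items reduces the dimension of the raw matrices and makes the feature set more compact, while keeping private data secure.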
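A specific embodiment is described in detail below: how the correlation data shares are determined in step S220, and how the effective values of feature items are determined in step S230, when the business prediction model includes a logistic regression model and the significance test is the Wald test.
First, the principle of applying the Wald test to logistic regression is explained. When a logistic regression model is used to regress on the feature data of samples, the predicted value is computed as

$$\pi(X) = \frac{1}{1 + e^{-\beta X}} \qquad (1)$$

where $X$ is the feature data of a sample and serves as the independent variable; $\pi(X)$ is the predicted value function of the sample and serves as the dependent variable; $\beta$ is the model parameter, that is, the coefficient of the feature items; and $e$ is the natural constant.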
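The null hypothesis and alternative hypothesis of the Wald test are:
$H_0$: $\omega_j = 0$ $(j = 1, 2, \dots, k)$, that is, the independent variable has no influence on the likelihood of the dependent variable occurring, in other words the independent variable is assumed to have no effect on the estimate of the dependent variable;
$H_1$: $\omega_j \neq 0$.
If the null hypothesis is rejected, the variation of the dependent variable depends on independent variable $j$.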
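The test statistic of the Wald test is

$$\mathrm{Wald}_k = \left( \frac{\beta_k}{SE(\beta_k)} \right)^2 \qquad (2)$$

where $SE(\beta_k)$ is the standard deviation of the $k$-th model parameter, taken from the covariance matrix, which is the inverse of the Hessian matrix $H$. The elements of $H$ are

$$h_{kr} = \sum_{i=1}^{N} x_{ik}\, x_{ir}\, \pi(X_i)\,[\pi(X_i) - 1]$$

where the subscripts $k$ and $r$ are natural numbers no greater than $D$, $x_{ik}$ and $x_{ir}$ are elements of the joint data $X$, and $X_i$ is the feature data of the $i$-th sample.
From the formulas above, the matrix $H$ can be written as $H = X^T M X$, where

$$M = \mathrm{diag}\big(\pi(X_1)[\pi(X_1) - 1],\ \dots,\ \pi(X_N)[\pi(X_N) - 1]\big)$$

Here $N$ is the total number of samples (objects), $D$ is the dimension of the feature data, $\pi(X_N)$ is the logistic regression model's predicted value for sample $X_N$, and $M$ is the diagonal matrix built from the predicted values, also called the predicted value matrix.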
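Formula (2) shows that, for the $k$-th model parameter, the larger its standard deviation, that is, the larger the entry in row $k$, column $k$ of the covariance matrix, the more that parameter makes the logistic regression model oscillate and the smaller its Wald test value.
After the significance test value $\mathrm{Wald}_k$ of the $k$-th model parameter is determined, the $z_k$ statistic can be obtained from

$$z_k = \frac{\beta_k}{SE(\beta_k)} \qquad (8)$$

and the corresponding p value computed as $p\_value = 2[1 - \mathrm{norm.cdf}(|z_k|)]$, where the function norm.cdf returns the cumulative distribution function of the normal distribution. If the p value is below the significance level threshold, the null hypothesis is rejected, the model parameter can be kept in the model, and the effective value of the corresponding feature item can be set to 1 or another high value. If the p value is not below the threshold, the null hypothesis is accepted, the model parameter is not kept, and the effective value of the corresponding feature item can be set to 0 or another low value. The significance level threshold is usually 0.05 or 0.01.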
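As a plaintext reference for what the secure protocol computes, the sketch below evaluates the z statistic and the p value for each coefficient without any secret sharing. It is an illustration under stated assumptions: the function name `wald_test`, the toy data, and the coefficient vector are hypothetical, and the weights are taken as π(1−π) (the positive-sign convention) so that the inverse of H has a positive diagonal. The normal CDF is computed with `math.erf` instead of a statistics library.

```python
import math
import numpy as np

def wald_test(X, beta):
    """Plaintext z statistics and two-sided p values for logistic coefficients."""
    pi = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted values of the model
    w = pi * (1.0 - pi)                    # assumed positive diagonal weights
    H = X.T @ (w[:, None] * X)             # H = X^T M X with diagonal M
    cov = np.linalg.inv(H)                 # covariance matrix = inverse of H
    z = beta / np.sqrt(np.diag(cov))       # z_k = beta_k / SE(beta_k)
    cdf = np.array([0.5 * (1.0 + math.erf(abs(v) / math.sqrt(2.0))) for v in z])
    p = 2.0 * (1.0 - cdf)                  # p_value = 2[1 - norm.cdf(|z_k|)]
    return z, p

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))              # toy joint data, N=500, D=3
beta = np.array([1.5, 0.0, -0.8])          # hypothetical trained coefficients
z, p = wald_test(X, beta)
keep = p < 0.05                            # effective value 1 if kept, else 0
```

With a zero coefficient the z statistic is zero and the p value is 1, so the second feature item would be dropped at any usual threshold.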
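Logistic regression analysis is a statistical method for analyzing independent and dependent variables and establishing the relationship between them. A regression equation is meaningful only if the independent and dependent variables are actually related. Regression analysis therefore has to answer whether the factors used as independent variables are related to the predicted target used as the dependent variable, how strong the relationship is, and how confidently that strength can be judged. Logistic regression analysis can use the Wald test to test the coefficient of each regression term one by one. If the Wald test shows that particular independent variables are significant, they should be included in the model; if it shows they are not significant, they can be omitted. Logistic regression analysis combined with the Wald test can thus evaluate the parameters of the business prediction model, and the evaluation results can be used to select the feature items of the object samples and reduce the dimension of the business data.
In this embodiment, in step S220 the correlation data includes covariance matrix data and the correlation data shares include covariance matrix shares. The participants' covariance matrix shares would form the covariance matrix if hypothetically reconstructed. The covariance matrix consists of the pairwise covariances of the feature items in the joint data: its main diagonal holds the variances of the feature items, and its off-diagonal entries hold the covariances between pairs of feature items. The covariance matrix is symmetric; when the joint data has D feature items, it can be a D*D symmetric matrix.
When determining the correlation data shares in step S220, that is, the covariance matrix shares corresponding to the respective participants, the participant devices can perform the following steps 1 and 2.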
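Step 1: based on the participants' joint data shares and predicted value shares and on the functional relation in the business prediction model, determine the intermediate matrix shares corresponding to the respective participants. For example, the first participant A obtains the intermediate matrix share $\langle H \rangle_A$ and the second participant B obtains $\langle H \rangle_B$; the intermediate matrix shares would yield the intermediate matrix $H$ if hypothetically reconstructed. The participants do not actually reconstruct the intermediate matrix; this only expresses the relationship between the shares.
Step 2: based on the participants' intermediate matrix shares, compute the shares of the inverse of the intermediate matrix for the respective participants, obtaining the covariance matrix shares. For example, the first participant A obtains $\langle H^{-1} \rangle_A$ and the second participant B obtains $\langle H^{-1} \rangle_B$; these shares would yield the inverse $H^{-1}$ of the intermediate matrix if hypothetically reconstructed. The participants do not actually reconstruct the inverse; this only expresses the relationship between the shares.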
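In step 1, the intermediate matrix shares can be determined as follows: based on the participants' joint data shares and predicted value shares and on the Hessian matrix expression derived from the functional relation in the business prediction model, determine the Hessian matrix shares corresponding to the respective participants and use them as the intermediate matrix shares. The Hessian matrix expression includes the joint data matrix and the predicted value matrix.
When the business prediction model is a logistic regression model, its functional relation, that is, the functional relation of the model's predicted value, is given by formula (1) above. After the logistic regression model is trained, the corresponding model parameters, for example $\beta$, are obtained. The Hessian matrix expression is in fact the second derivative with respect to the model parameter $\beta$. From formulas (1) to (5) above, the Hessian matrix expression derived from the functional relation in the business prediction model is

$$H = X^T M X \qquad (9)$$

Through secure interaction among the participant devices, and based on the joint data shares $\langle X \rangle$ held by the participants and the shares of the matrix $M$ derived from the shares of the predicted values $\pi(X_N)$, formula (9) lets each participant determine its share of $H$, which serves as the intermediate matrix share. $M$ is called the predicted value matrix.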
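In one application scenario the joint data $X$ is a high-dimensional matrix, with the number of objects $N$ often in the hundreds of thousands, millions, or more. Computing $H = X^T M X$ from the participants' shares then involves too much exchanged data and is inefficient. To simplify the computation of the shares of $H$ and reduce the data exchanged among the participants as far as possible, the form of the matrix $M$ can be transformed so that determining the shares of $H$ becomes simpler.
Specifically, when the first participant device uses its joint data share $\langle X \rangle_A$, the predicted value shares, and formula (9) to determine the Hessian matrix shares $\langle H \rangle$ corresponding to the respective participants, it can perform the following steps 1a to 3a.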
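Step 1a: using secret-sharing multiplication, based on the expression of the predicted value matrix, multiply the participants' predicted value shares element by element so that each participant obtains an intermediate vector share.
For example, with two participants, A and B can use secret-sharing multiplication to multiply the predicted value shares element by element, giving A's intermediate vector share and B's intermediate vector share. The participants' intermediate vector shares would yield the intermediate vector if hypothetically reconstructed. The participants do not actually reconstruct the intermediate vector; this only expresses the relationship between the shares.
Step 2a: use the elements of A's intermediate vector share as diagonal elements to construct A's diagonalized predicted value matrix share.
Likewise, the second participant device uses the elements of B's intermediate vector share as diagonal elements to construct B's diagonalized predicted value matrix share.
Step 3a: based on the participants' joint data shares $\langle X \rangle$, the predicted value matrix shares, and the Hessian matrix expression, determine the Hessian matrix shares corresponding to the respective participants. For example, A and B can determine the Hessian matrix shares $\langle H \rangle_A$ and $\langle H \rangle_B$ through secret-sharing matrix multiplication.
Through steps 1a and 2a, each participant obtains a diagonalized predicted value matrix share from its own predicted value shares. In a diagonalized matrix the main diagonal elements are nonzero and all other elements are 0, so the predicted value matrix is simplified and processing becomes more efficient.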
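In step 1a, the expression of the predicted value matrix $M$ contains

$$\pi(X_N)\,[\pi(X_N) - 1] \qquad (10)$$

Therefore, using the predicted value shares held by the participants, for example A's predicted value share $\langle \pi \rangle_A$ and B's predicted value share $\langle \pi \rangle_B$, formula (10) can be rewritten as

$$(\langle \pi \rangle_A + \langle \pi \rangle_B) * (\langle \pi \rangle_A + \langle \pi \rangle_B - 1) = \langle \text{intermediate vector} \rangle_A + \langle \text{intermediate vector} \rangle_B \qquad (11)$$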
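The participants can use secret-sharing multiplication to perform the element-wise multiplication of formula (11). That is, for any group of corresponding predicted value shares across the participants, the group is the input of the secret-sharing multiplication, which is carried out in the form of the predicted value matrix expression and outputs one element of each participant's intermediate vector share. The elements obtained for all groups make up the intermediate vector shares, which would yield the intermediate vector if hypothetically reconstructed.
For example, each predicted value share $\langle \pi \rangle_A$ of A and the corresponding predicted value share $\langle \pi \rangle_B$ of B serve as the input of the secret-sharing multiplication, which follows formula (11) and outputs the corresponding elements of $\langle \text{intermediate vector} \rangle_A$ and $\langle \text{intermediate vector} \rangle_B$.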
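In step 2a, A uses the elements of $\langle \text{intermediate vector} \rangle_A$ as diagonal elements to construct the diagonal matrix $\langle \Lambda \rangle_A$, which is A's diagonalized predicted value matrix share. B uses the elements of $\langle \text{intermediate vector} \rangle_B$ as diagonal elements to construct the diagonal matrix $\langle \Lambda \rangle_B$, which is B's diagonalized predicted value matrix share. When $\langle \text{intermediate vector} \rangle_A$ has dimension N, the constructed diagonal matrix has dimension N*N. The diagonal elements of $\langle \Lambda \rangle_A$ are the elements of $\langle \text{intermediate vector} \rangle_A$, and its off-diagonal elements are all 0.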
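One way to realize formula (11) is to expand the product: (a + b)(a + b − 1) = (a² − a) + (b² − b) + 2ab, where a and b are the two parties' predicted value shares. The first two terms are computed locally, and only the cross term ab needs a secure multiplication. The sketch below simulates this in one process with a Beaver-style multiplication triple. The expansion, the trusted triple dealer, the helper name `beaver_mul`, and the use of floating-point masks (a real deployment would use fixed-point values in a finite ring) are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def beaver_mul(a_A, b_B, rng):
    """Additive shares of a_A * b_B (element-wise); a_A is A's input, b_B is B's.
    The triple (u, v, z = u*v) comes from an assumed trusted dealer."""
    n = a_A.shape
    u, v = rng.normal(size=n), rng.normal(size=n)
    z = u * v
    u_A = rng.normal(size=n); u_B = u - u_A
    v_A = rng.normal(size=n); v_B = v - v_A
    z_A = rng.normal(size=n); z_B = z - z_A
    d = (a_A - u_A) + (0 - u_B)   # opened value a - u
    e = (0 - v_A) + (b_B - v_B)   # opened value b - v
    y_A = z_A + u_A * e + d * v_A + d * e
    y_B = z_B + u_B * e + d * v_B
    return y_A, y_B

pi = np.array([0.2, 0.7, 0.9])           # predicted values (never revealed)
pi_A = rng.normal(size=3)                # A's predicted value share
pi_B = pi - pi_A                         # B's predicted value share

c_A, c_B = beaver_mul(pi_A, pi_B, rng)   # shares of the cross term pi_A * pi_B
vec_A = pi_A**2 - pi_A + 2 * c_A         # <intermediate vector>_A
vec_B = pi_B**2 - pi_B + 2 * c_B         # <intermediate vector>_B

Lam_A, Lam_B = np.diag(vec_A), np.diag(vec_B)  # diagonalized matrix shares
```

The two diagonal matrices sum to the predicted value matrix with diagonal π(π − 1), yet neither party ever sees π itself.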
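In step 3a, the matrix $M$ in the Hessian matrix expression $H = X^T M X$ can be replaced by the predicted value matrix $\Lambda$, so the expression becomes $H = X^T \Lambda X$. A and B can use secret-sharing matrix multiplication (Secret Matrix Multiplication, SMM), based on A's joint data share $\langle X \rangle_A$ and predicted value matrix share $\langle \Lambda \rangle_A$ and on B's joint data share $\langle X \rangle_B$ and predicted value matrix share $\langle \Lambda \rangle_B$, to determine A's Hessian matrix share $\langle H \rangle_A$ and B's Hessian matrix share $\langle H \rangle_B$ according to $H = X^T \Lambda X$.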
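The predicted value matrix shares are diagonal matrices of dimension N*N containing a large number of zeros. In some business scenarios the sample size N is very large, for example in the hundreds of thousands, millions, or more, so the joint data $X$ has a very high dimension. When performing secret-sharing matrix multiplication of $X^T$ with the diagonal matrix $\Lambda$, a simpler method can be used for $X^T \Lambda$ to improve efficiency and reduce communication between the participants.
That is, when computing the secure multiplication of the participants' joint data shares with the predicted value matrix shares, each column vector in the joint data share is securely multiplied with the corresponding diagonal element of the predicted value matrix share.
The predicted value matrix shares are all diagonal matrices: the elements on the main diagonal are nonzero and all other elements are 0. The matrix multiplication of a joint data share with a predicted value matrix share can therefore be broken down into multiplying each column vector of the joint data share by the corresponding diagonal element of the predicted value matrix share, that is, multiplying column vectors by nonzero elements. Multiplying a column vector by a zero element always gives 0 and can be skipped. The high-dimensional matrix multiplication between the participants is thereby decomposed, saving a large amount of computation and much of the communication between participants. In privacy-preserving scenarios, communication volume is decisive for processing efficiency.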
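The saving can be seen in plaintext first. The short sketch below, with hypothetical toy sizes, checks that scaling each column of Xᵀ by the matching diagonal element gives the same result as the full product with the N×N diagonal matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 6, 3
X = rng.normal(size=(N, D))     # toy joint data
lam = rng.normal(size=N)        # diagonal elements of Lambda

full = X.T @ np.diag(lam)       # D x N product through the N x N matrix
fast = X.T * lam                # scale column i of X^T by lam[i]
```

The first form touches N² matrix entries, almost all of them zero, while the second performs only N column-by-scalar products, which is what the secure protocol exploits.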
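The following uses the matrix expression to show how multiplying column vectors by nonzero elements reduces communication. In the Hessian matrix expression $H = X^T \Lambda X$, writing the diagonal elements of $\Lambda$ as $\lambda_1, \dots, \lambda_N$, the product $X^T \Lambda$ has the form

$$X^T \Lambda = \begin{pmatrix} x_{11}\lambda_1 & x_{21}\lambda_2 & \cdots & x_{N1}\lambda_N \\ \vdots & \vdots & & \vdots \\ x_{1D}\lambda_1 & x_{2D}\lambda_2 & \cdots & x_{ND}\lambda_N \end{pmatrix}$$

Take the computation of the first column of $X^T \Lambda$ as an example. To obtain it, every element of the vector $x = (x_{11}, \dots, x_{1D})$ must be multiplied by the first diagonal element of $\Lambda$, denoted $m$ below. The multiplication between A and B is used as the example; see the flowchart in Fig. 3, a schematic diagram of a computation flow applying secret-sharing matrix multiplication in this embodiment.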
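Step 1: both parties obtain a random triple. A obtains $\langle u \rangle_A$ (D*1), $\langle v \rangle_A$ (1*1), and $\langle z \rangle_A$ (D*1); B obtains $\langle u \rangle_B$ (D*1), $\langle v \rangle_B$ (1*1), and $\langle z \rangle_B$ (D*1). They satisfy $z_{(D*1)} = u_{(D*1)} * v_{(1*1)}$, where $z = \langle z \rangle_A + \langle z \rangle_B$, $u = \langle u \rangle_A + \langle u \rangle_B$, and $v = \langle v \rangle_A + \langle v \rangle_B$. D*1 and 1*1 are matrix dimensions.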
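Step 2: A uses the random numbers to split its private data, masking it to obtain masked matrices: A computes $\langle d \rangle_A = \langle x \rangle_A - \langle u \rangle_A$ and $\langle e \rangle_A = \langle m \rangle_A - \langle v \rangle_A$. B masks its private data in the same way, computing $\langle d \rangle_B = \langle x \rangle_B - \langle u \rangle_B$ and $\langle e \rangle_B = \langle m \rangle_B - \langle v \rangle_B$.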
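Step 3: the participants send each other their masked matrices and each processes its own masked matrices together with the received ones. A sends $\langle d \rangle_A$ and $\langle e \rangle_A$ to B, and B sends $\langle d \rangle_B$ and $\langle e \rangle_B$ to A. A computes $d = \langle d \rangle_A + \langle d \rangle_B$ and $e = \langle e \rangle_A + \langle e \rangle_B$, and B computes the same $d = \langle d \rangle_A + \langle d \rangle_B$ and $e = \langle e \rangle_A + \langle e \rangle_B$. With these sums, $d = x - u$ and $e = m - v$, which is what makes the shares in step 4 add up to $x * m$.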
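Step 4: each participant computes its own data share. A computes $\langle Y \rangle_A = \langle z \rangle_A + \langle u \rangle_A * e + d * \langle v \rangle_A + d * e$, and B computes $\langle Y \rangle_B = \langle z \rangle_B + \langle u \rangle_B * e + d * \langle v \rangle_B$. Moreover, $\langle Y \rangle_A + \langle Y \rangle_B = x * m$.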
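Thus A and B obtain the shares $\langle Y \rangle_A$ and $\langle Y \rangle_B$ without exposing the private data $\langle x \rangle_A$ and $\langle m \rangle_A$ or $\langle x \rangle_B$ and $\langle m \rangle_B$; if hypothetically reconstructed, these two shares give the product of the vector $x$ and the scalar $m$. For each such multiplication, the communication between the participants, namely the data exchanged in step 3, amounts to 2(D+1), so computing $X^T \Lambda$ requires 2(D+1)*N. Compared with the 2(D*N+N*N) required by a general matrix multiplication, this is a large reduction in communication.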
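The four steps can be simulated in one process as a sanity check. This is a sketch under stated assumptions: floating-point masks stand in for finite-field elements, the triple is dealt by an assumed trusted source, and the helper `share` is hypothetical. The opened values d and e are formed as the sums of the two masked shares, which is what makes the step-4 shares add up to x*m.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 4

def share(value, rng):
    """Split a float array into two additive shares."""
    r = rng.normal(size=np.shape(value))
    return r, value - r

x = rng.normal(size=D)        # one column of X^T (D x 1)
m = np.array([0.37])          # matching diagonal element of Lambda (1 x 1)
x_A, x_B = share(x, rng)
m_A, m_B = share(m, rng)

# Step 1: random triple with z = u * v, dealt in shares
u = rng.normal(size=D); v = rng.normal(size=1); z = u * v
u_A, u_B = share(u, rng); v_A, v_B = share(v, rng); z_A, z_B = share(z, rng)

# Step 2: each party masks its private shares
d_A, e_A = x_A - u_A, m_A - v_A
d_B, e_B = x_B - u_B, m_B - v_B

# Step 3: exchange the masked values and open d = x - u, e = m - v
d = d_A + d_B
e = e_A + e_B
sent = d_A.size + e_A.size + d_B.size + e_B.size   # numbers exchanged

# Step 4: local computation of the output shares
Y_A = z_A + u_A * e + d * v_A + d * e
Y_B = z_B + u_B * e + d * v_B
```

Here `sent` equals 2(D+1), matching the per-multiplication communication count in the text.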
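In this way the participants multiply every column of $X^T$ by the corresponding diagonal element of $\Lambda$. Any one participant obtains multiple shares $\langle Y \rangle_A$, and the matrix formed by concatenating them is that participant's share of $X^T \Lambda$.
After the participants have jointly computed $X^T \Lambda$, SMM can be used, based on the shares $\langle X^T \Lambda \rangle$ and the joint data shares $\langle X \rangle$ held by the participants, to determine the shares of the Hessian matrix $H = X^T \Lambda X$.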
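The following uses two participants to illustrate matrix multiplication on shares with SMM. A holds the share $\langle X^T \Lambda \rangle_A$ and the joint data share $\langle X \rangle_A$; B holds the share $\langle X^T \Lambda \rangle_B$ and the joint data share $\langle X \rangle_B$. The goal is to output $X^T \Lambda X$ such that A obtains $\langle X^T \Lambda X \rangle_A$, B obtains $\langle X^T \Lambda X \rangle_B$, and $\langle X^T \Lambda X \rangle_A + \langle X^T \Lambda X \rangle_B = X^T \Lambda X$.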
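The processing between A and B follows the diagram in Fig. 3. Replace A's data $\langle x \rangle_A$ in Fig. 3 with $\langle X^T \Lambda \rangle_A$ and $\langle m \rangle_A$ with $\langle X \rangle_A$; replace B's data $\langle x \rangle_B$ with $\langle X^T \Lambda \rangle_B$ and $\langle m \rangle_B$ with $\langle X \rangle_B$; and adjust the matrix dimensions of the quantities accordingly. The flow of Fig. 3 then gives A and B the Hessian matrix shares $\langle X^T \Lambda X \rangle_A$ and $\langle X^T \Lambda X \rangle_B$. In Fig. 3, $\langle X^T \Lambda X \rangle_A$ corresponds to $\langle Y \rangle_A$ and $\langle X^T \Lambda X \rangle_B$ corresponds to $\langle Y \rangle_B$.
The operations attributed to A and B are, in practice, performed by their respective participant devices.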
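Returning to step 2: when computing, from the participants' intermediate matrix shares $\langle H \rangle$, the shares $\langle H^{-1} \rangle$ of the inverse of the intermediate matrix and thereby obtaining the covariance matrix shares $\langle Cov \rangle$, the secret-sharing matrix inverse (Secure Matrix Inverse, SMI) algorithm can be used to obtain the covariance matrix shares $\langle Cov \rangle$ through iterative computation based on the shares $\langle H \rangle$. The covariance matrix equals the inverse of the intermediate matrix, $Cov = H^{-1}$.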
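For example, given A's intermediate matrix share $\langle H \rangle_A$ and B's intermediate matrix share $\langle H \rangle_B$, SMI can be used iteratively to compute $\langle H^{-1} \rangle_A$ and $\langle H^{-1} \rangle_B$. The shares $\langle H \rangle_A$ and $\langle H \rangle_B$ would yield the intermediate matrix $H$ if hypothetically reconstructed, and $H^{-1}$ is the inverse of $H$, but A and B do not reconstruct $H$. Therefore A and B must determine $\langle H^{-1} \rangle_A$ and $\langle H^{-1} \rangle_B$ from $\langle H \rangle_A$ and $\langle H \rangle_B$ without reconstructing them. Not reconstructing the intermediate matrix $H$ avoids leaking private data.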
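The following uses two participants to illustrate the iterative computation of the covariance matrix shares with SMI. A holds the intermediate matrix share $\langle H \rangle_A$ and B holds $\langle H \rangle_B$, with $H = \langle H \rangle_A + \langle H \rangle_B$. The goal is for A to obtain $\langle H^{-1} \rangle_A$ and B to obtain $\langle H^{-1} \rangle_B$, with $H^{-1} = \langle H^{-1} \rangle_A + \langle H^{-1} \rangle_B$.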
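At initialization, A and B jointly compute $L_0$:

$$L_0 = \mathrm{tr}(H)^{-1} = [\mathrm{tr}(\langle H \rangle_A) + \mathrm{tr}(\langle H \rangle_B)]^{-1}$$

where tr denotes the trace of a matrix.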
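In each iteration, the participants use SMM and compute according to the following iteration formula:

$$L_{k+1} = L_k (2I - H L_k) = (\langle L_k \rangle_A + \langle L_k \rangle_B)\,[\,2I - (\langle H \rangle_A + \langle H \rangle_B)(\langle L_k \rangle_A + \langle L_k \rangle_B)\,]$$

where $I$ is the identity matrix. Each iteration requires two SMM operations. The number of iterations can be set in advance, for example to between 20 and 32; $k$ is the iteration index.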
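The iteration can be checked in plaintext before thinking about shares. The sketch below assumes H is symmetric positive definite and takes the initial value to be the identity scaled by 1/tr(H); the function name `newton_inverse` and the toy matrix are hypothetical. In the secure version, each of the two matrix products per round would be an SMM on shares.

```python
import numpy as np

def newton_inverse(H, iters=32):
    """Approximate the inverse of a symmetric positive definite matrix H."""
    I = np.eye(H.shape[0])
    L = I / np.trace(H)              # L_0 = tr(H)^{-1}, as a scaled identity
    for _ in range(iters):
        L = L @ (2 * I - H @ L)      # two matrix products per round
    return L

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 3))
H = A.T @ A                          # stand-in positive definite matrix
cov = newton_inverse(H)
```

With this starting value every eigenvalue of I − H·L₀ lies strictly between 0 and 1, and the residual is squared on each round, which is why a few dozen rounds are enough.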
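Returning to step S230: when determining, from the participants' model parameter shares and covariance matrix shares, the effective value of the feature item corresponding to a model parameter in improving the performance of the business prediction model, the significance test value (also called the significance level value) of the $k$-th model parameter can be computed with formula (2) of the Wald test or with formula (8). The effective value of the corresponding feature item is then determined from the significance test value and the initial hypothesis.
In determining $\mathrm{Wald}_k$ or $z_k$, the numerator $\beta_k$ is the model parameter and the denominator $SE(\beta_k)$ is its standard deviation. The standard deviation is the square root of the variance of the model parameter, and the diagonal elements of the covariance matrix are the variances of the corresponding model parameters. The secret-sharing inverse square root (Secure Number Sqrt Invert, SNSI) algorithm can therefore be used, based on the participants' model parameter shares and covariance matrix shares, to determine the effective value of the feature item corresponding to each model parameter. This can include the following steps 1b and 2b.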
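Step 1b: the participant devices take the diagonal elements of the participants' covariance matrix shares as the variance shares corresponding to the respective model parameters. The diagonal elements here are the main diagonal elements. In the covariance matrix, the main diagonal elements are the variances of the feature items; correspondingly, in a covariance matrix share, the main diagonal elements are the variance shares of the feature items.
Step 2b: for any model parameter, the first participant device uses the SNSI algorithm and the significance test, based on A's corresponding model parameter share and the participants' corresponding variance shares and through interaction among the participant devices, to jointly perform a secure inverse square root operation and determine A's significance test value share for that model parameter. The effective value of the feature item corresponding to that model parameter is then determined from the participants' significance test value shares for it.
Likewise, for any model parameter, the second participant device uses the SNSI algorithm and the significance test, based on B's corresponding model parameter share and the participants' corresponding variance shares and through interaction among the participant devices, to jointly perform the secure inverse square root operation and determine B's significance test value share for that model parameter.
In one implementation, the participants' significance test value shares can be sent to one participant device or to a third-party device, which reconstructs the significance test value; the effective value of the corresponding feature item can then be determined from that value by a predetermined transformation. In another implementation, the participants' significance test value shares can be used directly as effective value shares, and the effective value can be reconstructed from them.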
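The significance test value can be computed from formula (2), formula (8), or the p_value formula above; the resulting significance test value share may be, but is not limited to, a share of the $\mathrm{Wald}_k$ value, the $z_k$ value, or the p value.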
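The participants' model parameter shares would yield the model parameter if hypothetically reconstructed. For example, for any model parameter $\beta_1$, A's model parameter share $\langle \beta_1 \rangle_A$ and B's model parameter share $\langle \beta_1 \rangle_B$ would yield $\beta_1$ if hypothetically reconstructed. The model parameter shares are not actually reconstructed; this only illustrates the relationship between the shares and the model parameter.
Thus, when computing the significance test value, this embodiment uses only the diagonal elements of the participants' covariance matrix shares and does not reconstruct the data in the covariance matrix, which protects the private data in the covariance matrix well.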
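The following describes in detail how, in step 2b, for any model parameter $\beta_k$, the first participant device uses the SNSI algorithm and the significance test, through interaction among the participant devices and based on A's model parameter share $\langle \beta_k \rangle_A$ and the participants' variance shares, to jointly perform the secure inverse square root operation and determine A's significance test value share for $\beta_k$. The second participant device determines B's significance test value share for $\beta_k$ in the same way. A's share is

$$\langle z_k \rangle_A = \frac{\langle \beta_k \rangle_A}{\sqrt{\langle Cov_{kk} \rangle_A + \langle Cov_{kk} \rangle_B}} \qquad (12)$$

where $\langle z_k \rangle_A$ is A's significance test value share for the model parameter $\beta_k$ and the numerator is A's model parameter share. In the denominator, $\langle Cov_{kk} \rangle_A$ is the variance share for $\beta_k$ held by A, which is the $kk$-th (diagonal) element of A's covariance matrix share, and $\langle Cov_{kk} \rangle_B$ is the variance share for $\beta_k$ held by B, which is the $kk$-th element of B's covariance matrix share.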
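The numerator is held by A, while the denominator is held jointly by A and B. The key question is therefore how to compute the inverse square root in formula (12). In this embodiment, the SNSI algorithm determines the inverse square root of the sum of A's and B's variance shares for $\beta_k$; multiplying that inverse square root by A's model parameter share $\langle \beta_k \rangle_A$ gives A's significance test value share for $\beta_k$. The inverse square root in formula (12) is

$$(\langle Cov_{kk} \rangle_A + \langle Cov_{kk} \rangle_B)^{-1/2}$$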
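Steps 1c to 3c below describe how the SNSI algorithm computes the inverse square root $(\langle Cov_{kk} \rangle_A + \langle Cov_{kk} \rangle_B)^{-1/2}$. For convenience, let $n_a = \langle Cov_{kk} \rangle_A$ and $n_b = \langle Cov_{kk} \rangle_B$, and let $n$ denote the variance of the model parameter $\beta_k$, so that $n = n_a + n_b$. The goal of the computation is for the first participant device to obtain $c_a$ and the second participant device to obtain $c_b$ such that $c_a + c_b = (n_a + n_b)^{-1/2} = n^{-1/2}$.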
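Step 1c: the first and second participant devices interact to convert the additive shares into multiplicative shares.
The first participant device computes $x_{ba} = x_{ba1} + x_{ba2}$ and sends $x_{ba}$ to the second participant device ($x_{ba1}$ and $x_{ba2}$ must not be sent individually). The second participant device computes $x_b = x_{ba} + x_{bb}$, at which point $n = x_a \times x_b$; the additive sharing $n = n_a + n_b$ has been converted into the multiplicative sharing $n = x_a \times x_b$. A now holds $x_a$ and B holds $x_b$.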
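Step 2c: the two participant devices each initialize the iterative estimate locally.
Taking A as an example, the first participant device reads the stored bits of the 64-bit floating-point number $x_a$ as a 64-bit integer and shifts it right by one bit (dividing by 2 and rounding down), calling the result $int_a$. It then computes 0x5fe6eb50c7b537a9 $- int_a$ and reads the result as a 64-bit floating-point number, calling it $y_a$. This initializes $x_a$ to $y_a$.
The second participant device performs the same initialization, turning $x_b$ into $y_b$. A now holds $y_a$ and B holds $y_b$.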
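Step 3c: the two participants jointly compute $n^{-1/2}$ iteratively using Newton's method.
The initial iterate is $Y_0 = Y_{0a} \times Y_{0b} = y_a \times y_b$, whose two factors are held by the two participants respectively. The iteration follows the Newton update for $n^{-1/2}$. Each iteration uses two secret-sharing matrix multiplications, with 1 iteration in total, after which A and B obtain the floating-point numbers $c_a$ and $c_b$ respectively.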
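Steps 2c and 3c can be sketched in plaintext as follows. The bit-level initialization uses the constant given in the text; the Newton update y ← y(1.5 − 0.5·n·y²) is the standard form for the inverse square root and is an assumption here, as are the helper name `initial_guess`, the sample values, and the choice of three iterations. The secure version would carry out the multiplications on shares.

```python
import struct

MAGIC = 0x5FE6EB50C7B537A9  # constant given in the text

def initial_guess(x):
    """Bit-level initial estimate of x**-0.5 for a positive 64-bit float."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))  # float bits as int64
    (y,) = struct.unpack("<d", struct.pack("<q", MAGIC - (bits >> 1)))
    return y

x_a, x_b = 3.0, 0.25          # multiplicative shares, n = x_a * x_b
n = x_a * x_b
y_a, y_b = initial_guess(x_a), initial_guess(x_b)  # local initialization

y = y_a * y_b                 # Y_0 = y_a * y_b
for _ in range(3):            # assumed number of Newton iterations
    y = y * (1.5 - 0.5 * n * y * y)
```

Because the inverse square root of a product is the product of the inverse square roots, initializing the two multiplicative shares separately already gives a starting value within a few percent of the target, and each Newton step roughly squares the relative error.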
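Step 2b can also be implemented in other ways. For example, A's and B's variance shares can first be securely normalized, an initial iterate can then be obtained by linear approximation, and the iteration can finally be carried out with the Goldschmidt algorithm. In that implementation, secret-sharing matrix multiplication can be performed on A's and B's variance shares before the other operations.
In this specification, "first" in terms such as first participant and first feature item, and "second" elsewhere in the text, are used only for distinction and ease of description and carry no limiting meaning.
In this specification, the number of participants may be two, three, or more. Each participant performs its operations through its participant device, which may be implemented by any apparatus, device, platform, or device cluster with computing and processing capabilities.
The embodiments of this specification are mostly described with two participants. For example, in the descriptions of the secure multi-party computation algorithms for secret-sharing matrix multiplication, secret-sharing inverse square root, and secret-sharing matrix inverse, the two-participant implementation can be readily extended to scenarios with more participants; the specific process is not repeated.
The foregoing describes specific embodiments of this specification; other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.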
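Fig. 4 is a schematic block diagram of a privacy-preserving apparatus for determining effective values of business data features provided by an embodiment. The business data is distributed among multiple participants, and the participants' respective business data would form joint data if hypothetically concatenated; the joint data includes feature values of multiple objects for multiple feature items. The apparatus 400 is deployed in any first participant device, which may be implemented by any apparatus, device, platform, or device cluster with computing and processing capabilities. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2. The apparatus 400 includes: an acquisition module 410, configured to acquire the first participant's joint data share, the predicted value shares corresponding to the multiple objects, and the model parameter shares corresponding to the multiple feature items, where the predicted value shares and the model parameter shares are obtained from the trained business prediction model; an interaction module 420, configured to use secure multi-party computation, through interaction among the participant devices and based on the participants' joint data shares and predicted value shares, to determine the correlation data shares corresponding to the respective participants, which include correlation data between the feature items; and a test module 430, configured to apply a significance test, through secure interaction among the participant devices and based on the participants' model parameter shares and the corresponding data in the correlation data shares, to determine the effective value of the feature item corresponding to each model parameter in improving the performance of the business prediction model.
In one implementation, when acquiring the first participant's joint data share, the acquisition module 410 uses additive secret sharing and, through interaction with the other participant devices, performs splitting and concatenation operations on the participants' business data so that each participant obtains a joint data share; the participants' joint data shares would yield the joint data if hypothetically reconstructed.
In one implementation, the business prediction model is obtained through secure joint training based on the participants' respective joint data shares; the business prediction model is used to make business predictions about objects.
In one implementation, when acquiring the predicted value shares corresponding to the multiple objects and the model parameter shares corresponding to the multiple feature items, the acquisition module 410 acquires the trained business prediction model's model parameter share held locally by the first participant device and, through interaction among the participant devices and based on the participants' joint data shares and the trained business prediction model, enables each participant to determine the predicted value shares of the objects.
In one implementation, the correlation data includes covariance matrix data, and the correlation data shares include covariance matrix shares. The interaction module 420 includes: a determination sub-module 421, configured to determine the intermediate matrix shares corresponding to the respective participants based on the participants' joint data shares and predicted value shares and on the functional relation in the business prediction model; and a calculation sub-module 422, configured to compute, based on the participants' intermediate matrix shares, shares of the inverse of the intermediate matrix for the respective participants, thereby obtaining the covariance matrix shares corresponding to the respective participants.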
在一种实施方式中,所述确定子模块421,具体配置为:基于多个参与方的联合数据分片和预测值分片,以及基于所述业务预测模型中的函数关系式得到的海森矩阵表达式,确定多个参与方分别对应的海森矩阵分片,作为中间矩阵分片;所述海森矩阵表达式中包括联合数据矩阵和预测值矩阵。
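In one implementation, when determining the Hessian matrix shards respectively corresponding to the multiple participants, the determination submodule 421 is configured to: use secret-sharing multiplication, based on the expression of the predicted value matrix, to multiply the corresponding vector elements of the predicted value shards of the multiple participants, so that each of the multiple participants obtains an intermediate vector shard; construct the diagonalized predicted value matrix shard of the first participant by using the elements of the first participant's intermediate vector shard as diagonal elements; and determine the Hessian matrix shards respectively corresponding to the multiple participants based on the joint data shards and predicted value matrix shards of the multiple participants and on the Hessian matrix expression.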
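One common way to realize secret-sharing multiplication is with Beaver triples; the source does not state which construction is used, so the sketch below is an assumption for illustration. It simulates both parties and a trusted triple dealer in a single process, works on integers in a hypothetical ring of size `MODULUS` (a real system would use a fixed-point encoding of the predicted values), and multiplies two shared vectors element by element.

```python
import secrets

MODULUS = 2 ** 64  # hypothetical ring size


def share(value):
    """Split an integer into two additive shares mod MODULUS."""
    s0 = secrets.randbelow(MODULUS)
    return s0, (value - s0) % MODULUS


def reconstruct(pair):
    return (pair[0] + pair[1]) % MODULUS


def beaver_mul(x_sh, y_sh):
    """Multiply two shared values; returns shares of x * y."""
    # A trusted dealer (simulated here) hands out shares of a, b, c = a * b.
    a, b = secrets.randbelow(MODULUS), secrets.randbelow(MODULUS)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % MODULUS)
    # The parties open the masked differences d = x - a and e = y - b.
    d = (x_sh[0] - a_sh[0] + x_sh[1] - a_sh[1]) % MODULUS
    e = (y_sh[0] - b_sh[0] + y_sh[1] - b_sh[1]) % MODULUS
    # Local recombination: x*y = c + d*b + e*a + d*e (d*e added once).
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % MODULUS
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % MODULUS
    return z0, z1


def elementwise_mul(xs_sh, ys_sh):
    """Element-wise product of two shared vectors."""
    return [beaver_mul(x, y) for x, y in zip(xs_sh, ys_sh)]


u = [share(v) for v in [3, 5]]
w = [share(v) for v in [7, 11]]
product = [reconstruct(p) for p in elementwise_mul(u, w)]
```

Each party's output share then plays the role of its intermediate vector shard, whose entries can be placed on the diagonal of that party's predicted value matrix shard.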
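In one implementation, when determining the Hessian matrix shards respectively corresponding to the multiple participants based on the joint data shards and predicted value matrix shards of the multiple participants and on the Hessian matrix expression, the determination submodule 421 is configured to: when computing the secure multiplication of the joint data shards and the predicted value matrix shards of the multiple participants, securely multiply each column vector in the joint data shard by the corresponding diagonal element in the predicted value matrix shard.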
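To make the diagonal shortcut concrete, here is a plaintext sketch for the logistic regression case, where the Hessian is usually written H = X^T A X with A = diag(p_i × (1 - p_i)) and p_i the predicted value of object i. This form is the standard one for logistic regression and is assumed here; the sketch performs no secret sharing, and the function name is hypothetical. Instead of building the full n-by-n matrix A, each object's feature vector is scaled by its single diagonal entry.

```python
def hessian(X, y_hat):
    """Plaintext H = X^T diag(p * (1 - p)) X, one row of X per object."""
    n_features = len(X[0])
    diag = [p * (1.0 - p) for p in y_hat]
    H = [[0.0] * n_features for _ in range(n_features)]
    for x_row, a_i in zip(X, diag):
        # Column i of X^T times the i-th diagonal entry: a vector scaling,
        # not a full matrix product.
        scaled = [a_i * v for v in x_row]
        for j in range(n_features):
            for k in range(n_features):
                H[j][k] += scaled[j] * x_row[k]
    return H
```

Scaling by diagonal entries needs on the order of n × d multiplications for n objects and d features, whereas multiplying by a dense n-by-n matrix would need on the order of n^2 × d, which matters when every multiplication is a secure one.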
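In one implementation, the calculation submodule 422 is specifically configured to: use the secret-sharing matrix inversion algorithm SMI to obtain, through iterative calculation based on the intermediate matrix shards of the multiple participants, the covariance matrix shards respectively corresponding to the multiple participants.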
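This passage does not spell out the iteration used by SMI. One iteration that suits secret sharing, because it needs only matrix multiplications and additions, is Newton-Schulz: X_(k+1) = X_k × (2I - A × X_k). The plaintext sketch below uses it purely as an assumed stand-in; the function names, the starting value, and the iteration count are choices made for this illustration.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def newton_schulz_inverse(A, iterations=20):
    """Approximate the inverse of a square, invertible matrix A."""
    n = len(A)
    norm_1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in A)
    scale = 1.0 / (norm_1 * norm_inf)
    # Classic starting point X0 = A^T / (||A||_1 * ||A||_inf).
    X = [[A[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        AX = matmul(A, X)
        R = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)]
             for i in range(n)]
        X = matmul(X, R)
    return X
```

In a secret-shared version, each `matmul` would become a secret-sharing matrix multiplication, and the inverse of the Hessian would serve as the covariance matrix of the model parameters.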
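In one implementation, the test module 430 is specifically configured to: take the diagonal elements of the covariance matrix shards of the multiple participants as the variance shards respectively corresponding to the multiple model parameters; for any one model parameter, use SNSI and the significance test method, based on the first participant's shard of that model parameter and the multiple participants' corresponding variance shards, to jointly perform a secure inverse square root operation through interaction among the multiple participant devices and determine the first participant's shard of the significance test value for that model parameter; and determine the effective value of the feature item corresponding to that model parameter based on the multiple participants' shards of the significance test value for that model parameter.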
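The reason an inverse square root is needed becomes clear from a plaintext example. Taking the Wald test as one example of a significance test (an assumption for this sketch), the statistic for a parameter is the coefficient divided by its standard error, that is, the coefficient times variance^(-1/2). The function names are hypothetical, and mapping the statistic to a two-sided p-value under a standard normal distribution is likewise an illustrative choice.

```python
import math


def wald_statistic(coef, variance):
    """z = coef / sqrt(variance), written as a product with variance ** -0.5."""
    return coef * variance ** -0.5


def two_sided_p_value(z):
    """Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Writing the division as a multiplication by variance^(-1/2) is what allows the secure protocol to reuse secret-sharing multiplication instead of requiring a secure division.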
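In one implementation, the apparatus 400 further includes a reconstruction module (not shown in the figure), configured to: for any first feature item, acquire the effective value shards of the first feature item from the other participant devices; and determine the reconstructed effective value of the first feature item based on the local effective value shard of the first feature item and the acquired effective value shards.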
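In one implementation, the apparatus 400 further includes a removal module (not shown in the figure), configured to: based on the effective values, remove from the multiple feature items those whose effective values do not satisfy a preset condition, so that the multiple participants use the business data with those feature items removed to perform secure joint training of the business prediction model.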
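The removal step amounts to a filter over the reconstructed effective values. The sketch below assumes, purely for illustration, that the effective value is a p-value and that the preset condition is "p-value at most `alpha`"; the source leaves the condition open, and the function name and the threshold 0.05 are hypothetical.

```python
def select_features(names, p_values, alpha=0.05):
    """Split feature names into (kept, removed) by a p-value threshold."""
    kept = [n for n, p in zip(names, p_values) if p <= alpha]
    removed = [n for n, p in zip(names, p_values) if p > alpha]
    return kept, removed


kept, removed = select_features(["age", "clicks", "region"],
                                [0.001, 0.30, 0.04])
```

The participants would then retrain the model on the kept feature items only.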
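In one implementation, the object includes one of a user, a commodity, and an event; the feature items include at least one of the following: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used to make business predictions for objects.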
In one implementation, the business prediction model is obtained based on a logistic regression model.
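The above apparatus embodiment corresponds to the method embodiment; for details, refer to the description of the method embodiment, which is not repeated here. The apparatus embodiment is derived from the corresponding method embodiment and has the same technical effects.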
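The embodiments of this specification further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method described in any of FIG. 1 to FIG. 3.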
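The embodiments of this specification further provide a computing device, including a memory and a processor, where executable code is stored in the memory, and the processor, when executing the executable code, implements the method described in any of FIG. 1 to FIG. 3.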
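The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the storage medium and computing device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the corresponding parts of the method embodiments.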
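Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the embodiments of the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.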
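The specific implementations described above further explain in detail the objectives, technical solutions, and beneficial effects of the embodiments of the present invention. It should be understood that the above are merely specific implementations of the embodiments of the present invention and are not intended to limit the protection scope of the present invention; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.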
Claims (21)
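- A privacy-protecting method for determining effective values of business data features, where the business data is distributed among multiple participants, the business data of the multiple participants would constitute joint data if concatenated, and the joint data includes feature values of multiple objects for multiple feature items; the method is performed by any first participant device and includes: acquiring the joint data shard of the first participant, and acquiring the predicted value shards respectively corresponding to the multiple objects and the model parameter shards respectively corresponding to the multiple feature items, where the predicted value shards and the model parameter shards are obtained based on a trained business prediction model; using secure multi-party computation, through interaction among the multiple participant devices, to determine, based on the joint data shards and predicted value shards of the multiple participants, the correlation data shards respectively corresponding to the multiple participants, which include correlation data between the multiple feature items; and using a significance test method, through secure interaction among the multiple participant devices, to determine, based on the model parameter shards of the multiple participants and the corresponding data in the correlation data shards, the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model.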
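- The method according to claim 1, where the step of acquiring the joint data shard of the first participant includes: using secret-sharing addition, through interaction with the other participant devices, to perform splitting and concatenation operations on the business data of the multiple participants, so that each of the multiple participants obtains a joint data shard; the joint data shards of the multiple participants would yield the joint data if reconstructed.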
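- The method according to claim 1, where the business prediction model is obtained through secure joint training based on the respective joint data shards of the multiple participants; the business prediction model is used to make business predictions for objects.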
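- The method according to claim 3, where the step of acquiring the predicted value shards respectively corresponding to the multiple objects and the model parameter shards respectively corresponding to the multiple feature items includes: acquiring the model parameter shard of the trained business prediction model that is local to the first participant device; and, through interaction among the multiple participant devices, based on the joint data shards of the multiple participants and the trained business prediction model, enabling each of the multiple participants to determine the predicted value shards of the objects.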
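- The method according to claim 1, where the correlation data includes covariance matrix data, and the correlation data shards include covariance matrix shards; the step of determining the correlation data shards respectively corresponding to the multiple participants includes: determining the intermediate matrix shards respectively corresponding to the multiple participants, based on the joint data shards and predicted value shards of the multiple participants and on the functional relation in the business prediction model; and calculating, based on the intermediate matrix shards of the multiple participants, the shards of the inverse of the intermediate matrix respectively corresponding to the multiple participants, thereby obtaining the covariance matrix shards respectively corresponding to the multiple participants.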
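- The method according to claim 5, where the step of determining the intermediate matrix shards respectively corresponding to the multiple participants includes: determining the Hessian matrix shards respectively corresponding to the multiple participants, as the intermediate matrix shards, based on the joint data shards and predicted value shards of the multiple participants and on a Hessian matrix expression derived from the functional relation in the business prediction model; the Hessian matrix expression includes a joint data matrix and a predicted value matrix.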
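- The method according to claim 6, where the step of determining the Hessian matrix shards respectively corresponding to the multiple participants includes: using secret-sharing multiplication, based on the expression of the predicted value matrix, to multiply the corresponding vector elements of the predicted value shards of the multiple participants, so that each of the multiple participants obtains an intermediate vector shard; constructing the diagonalized predicted value matrix shard of the first participant by using the elements of the first participant's intermediate vector shard as diagonal elements; and determining the Hessian matrix shards respectively corresponding to the multiple participants based on the joint data shards and predicted value matrix shards of the multiple participants and on the Hessian matrix expression.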
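- The method according to claim 7, where the step of determining the Hessian matrix shards respectively corresponding to the multiple participants based on the joint data shards and predicted value matrix shards of the multiple participants and on the Hessian matrix expression includes: when computing the secure multiplication of the joint data shards and the predicted value matrix shards of the multiple participants, securely multiplying each column vector in the joint data shard by the corresponding diagonal element in the predicted value matrix shard.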
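- The method according to claim 5, where the step of calculating, based on the intermediate matrix shards of the multiple participants, the shards of the inverse of the intermediate matrix respectively corresponding to the multiple participants to obtain the covariance matrix shards respectively corresponding to the multiple participants includes: using the secret-sharing matrix inversion algorithm SMI to obtain, through iterative calculation based on the intermediate matrix shards of the multiple participants, the covariance matrix shards respectively corresponding to the multiple participants.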
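- The method according to claim 5, where the step of determining the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model includes: taking the diagonal elements of the covariance matrix shards of the multiple participants as the variance shards respectively corresponding to the multiple model parameters; for any one model parameter, using the secret-sharing inverse square root algorithm SNSI and the significance test method, based on the first participant's shard of that model parameter and the multiple participants' corresponding variance shards, to jointly perform a secure inverse square root operation through interaction among the multiple participant devices and determine the first participant's shard of the significance test value for that model parameter; and determining the effective value of the feature item corresponding to that model parameter based on the multiple participants' shards of the significance test value for that model parameter.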
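- The method according to claim 10, further including: for any first feature item, acquiring the effective value shards of the first feature item from the other participant devices; and determining the reconstructed effective value of the first feature item based on the local effective value shard of the first feature item and the acquired effective value shards.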
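- The method according to claim 1, further including: based on the effective values, removing from the multiple feature items those whose effective values do not satisfy a preset condition, so that the multiple participants use the business data with those feature items removed to perform secure joint training of the business prediction model.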
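- The method according to claim 1, where the object includes one of a user, a commodity, and an event; the feature items include at least one of the following: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used to make business predictions for objects.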
- The method according to claim 1, where the business prediction model is obtained based on a logistic regression model.
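- A privacy-protecting apparatus for determining effective values of business data features, where the business data is distributed among multiple participants, the business data of the multiple participants would constitute joint data if concatenated, and the joint data includes feature values of multiple objects for multiple feature items; the apparatus is deployed in any first participant device and includes: an acquisition module, configured to acquire the joint data shard of the first participant, and to acquire the predicted value shards respectively corresponding to the multiple objects and the model parameter shards respectively corresponding to the multiple feature items, where the predicted value shards and the model parameter shards are obtained based on a trained business prediction model; an interaction module, configured to use secure multi-party computation, through interaction among the multiple participant devices, to determine, based on the joint data shards and predicted value shards of the multiple participants, the correlation data shards respectively corresponding to the multiple participants, which include correlation data between the multiple feature items; and a test module, configured to use a significance test method, through secure interaction among the multiple participant devices, to determine, based on the model parameter shards of the multiple participants and the corresponding data in the correlation data shards, the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model.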
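- The apparatus according to claim 15, where the acquisition module, when acquiring the joint data shard of the first participant, is configured to: use secret-sharing addition, through interaction with the other participant devices, to perform splitting and concatenation operations on the business data of the multiple participants, so that each of the multiple participants obtains a joint data shard; the joint data shards of the multiple participants would yield the joint data if reconstructed.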
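- The apparatus according to claim 15, where the business prediction model is obtained through secure joint training based on the respective joint data shards of the multiple participants; the business prediction model is used to make business predictions for objects.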
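- The apparatus according to claim 17, where the acquisition module, when acquiring the predicted value shards respectively corresponding to the multiple objects and the model parameter shards respectively corresponding to the multiple feature items, is configured to: acquire the model parameter shard of the trained business prediction model that is local to the first participant device; and, through interaction among the multiple participant devices, based on the joint data shards of the multiple participants and the trained business prediction model, enable each of the multiple participants to determine the predicted value shards of the objects.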
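- The apparatus according to claim 15, where the correlation data includes covariance matrix data, and the correlation data shards include covariance matrix shards; the interaction module includes: a determination submodule, configured to determine the intermediate matrix shards respectively corresponding to the multiple participants, based on the joint data shards and predicted value shards of the multiple participants and on the functional relation in the business prediction model; and a calculation submodule, configured to calculate, based on the intermediate matrix shards of the multiple participants, the shards of the inverse of the intermediate matrix respectively corresponding to the multiple participants, thereby obtaining the covariance matrix shards respectively corresponding to the multiple participants.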
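- A computer-readable storage medium on which a computer program is stored, where the computer program, when executed in a computer, causes the computer to perform the method according to any one of claims 1-14.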
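- A computing device, including a memory and a processor, where executable code is stored in the memory, and the processor, when executing the executable code, implements the method according to any one of claims 1-14.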
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/517,425 US20240095647A1 (en) | 2021-05-24 | 2023-11-22 | Privacy-protecting methods and apparatuses for determining feature effective value of business data |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110564443.3A CN113407987B (zh) | 2021-05-24 | 2021-05-24 | Privacy-protecting method and apparatus for determining effective values of business data features |
CN202110564443.3 | 2021-05-24 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/517,425 Continuation US20240095647A1 (en) | 2021-05-24 | 2023-11-22 | Privacy-protecting methods and apparatuses for determining feature effective value of business data |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022247620A1 (zh) | 2022-12-01 |
Family
ID=77674529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/091637 WO2022247620A1 (zh) | 2021-05-24 | 2022-05-09 | Privacy-protecting method and apparatus for determining effective values of business data features |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240095647A1 (zh) |
CN (1) | CN113407987B (zh) |
WO (1) | WO2022247620A1 (zh) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
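CN113407987B (zh) * | 2021-05-24 | 2023-10-20 | 支付宝(杭州)信息技术有限公司 | Privacy-protecting method and apparatus for determining effective values of business data features |
CN114781000B (zh) * | 2022-06-21 | 2022-09-20 | 支付宝(杭州)信息技术有限公司 | Method and apparatus for determining correlation between object features for large-scale data |
CN115396101B (zh) * | 2022-10-26 | 2022-12-27 | 华控清交信息科技(北京)有限公司 | Secret-sharing-based oblivious shuffling method and system |
CN118585542A (zh) * | 2023-03-01 | 2024-09-03 | 脸萌有限公司 | Data query method, apparatus, device, and storage medium |
CN117195060B (zh) * | 2023-11-06 | 2024-02-02 | 上海零数众合信息科技有限公司 | Telecom fraud identification method and model training method based on secure multi-party computation |
CN117521150B (zh) * | 2024-01-04 | 2024-04-09 | 极术(杭州)科技有限公司 | Collaborative data processing method based on secure multi-party computation |
CN118504038A (zh) * | 2024-07-17 | 2024-08-16 | 蚂蚁科技集团股份有限公司 | Privacy-protecting hypothesis testing method and apparatus for generalized linear models |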
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
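CN111506922A (zh) * | 2020-04-17 | 2020-08-07 | 支付宝(杭州)信息技术有限公司 | Method and apparatus for multi-party joint significance testing on private data |
CN112182649A (zh) * | 2020-09-22 | 2021-01-05 | 上海海洋大学 | Data privacy protection system based on a secure two-party computation linear regression algorithm |
CN112464287A (zh) * | 2020-12-12 | 2021-03-09 | 同济大学 | Multi-party XGBoost secure prediction model training method based on secret sharing and federated learning |
US20210133569A1 (en) * | 2019-11-04 | 2021-05-06 | Tsinghua University | Methods, computing devices, and storage media for predicting traffic matrix |
CN113407987A (zh) * | 2021-05-24 | 2021-09-17 | 支付宝(杭州)信息技术有限公司 | Privacy-protecting method and apparatus for determining effective values of business data features |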
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11429915B2 (en) * | 2017-11-30 | 2022-08-30 | Microsoft Technology Licensing, Llc | Predicting feature values in a matrix |
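CN116232582A (zh) * | 2019-05-22 | 2023-06-06 | 妙泰公司 | Distributed data storage method and system with enhanced security, resilience, and control |
CN110555315B (zh) * | 2019-08-09 | 2021-04-09 | 创新先进技术有限公司 | Model parameter updating method, apparatus, and electronic device based on a secret sharing algorithm |
CN110889447B (zh) * | 2019-11-26 | 2022-05-17 | 支付宝(杭州)信息技术有限公司 | Method and apparatus for testing model feature significance based on secure multi-party computation |
CN111160573B (zh) * | 2020-04-01 | 2020-06-30 | 支付宝(杭州)信息技术有限公司 | Method and apparatus for two-party joint training of a business prediction model with data privacy protection |
CN111931241B (zh) * | 2020-09-23 | 2021-04-09 | 支付宝(杭州)信息技术有限公司 | Privacy-protection-based method and apparatus for testing linear regression feature significance |
CN112434026B (zh) * | 2020-10-29 | 2024-06-28 | 暨南大学 | Hash-chain-based secure intellectual property pledge financing method |
CN112818290B (zh) * | 2021-01-21 | 2023-11-14 | 支付宝(杭州)信息技术有限公司 | Method and apparatus for multi-party joint determination of object feature correlation in private data |
CN112597540B (zh) * | 2021-01-28 | 2021-10-01 | 支付宝(杭州)信息技术有限公司 | Privacy-protection-based multicollinearity detection method, apparatus, and system |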
- 2021-05-24: CN application CN202110564443.3A, patent CN113407987B (zh), status: Active
- 2022-05-09: WO application PCT/CN2022/091637, patent WO2022247620A1 (zh), status: Application Filing
- 2023-11-22: US application US18/517,425, patent US20240095647A1 (en), status: Pending
Also Published As
Publication number | Publication date |
---|---|
CN113407987B (zh) | 2023-10-20 |
CN113407987A (zh) | 2021-09-17 |
US20240095647A1 (en) | 2024-03-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22810347; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 22810347; Country of ref document: EP; Kind code of ref document: A1 |