CN113407988A - Method and device for determining effective value of service data characteristic of control traffic - Google Patents

Publication number
CN113407988A
Authority
CN
China
Prior art keywords
data
participant
participants
fragments
predicted value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110580162.7A
Other languages
Chinese (zh)
Inventor
刘颖婷
陈超超
王力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110580162.7A
Publication of CN113407988A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a method and a device for determining effective values of features of service data for controlling traffic. The service data is private data, and the service data of a plurality of participants, if spliced together, would form joint data comprising the feature values of a plurality of objects for a plurality of feature items. Each participant obtains a joint data fragment, predicted value fragments of the samples, and model parameter fragments. A selected participant reconstructs the complete predicted value data from the predicted value fragments of the participants. Through secure multi-party computation and interaction among the participants, correlation data fragments respectively corresponding to the participants are determined based on the joint data fragments of the participants and the predicted value data of the selected participant, where the correlation data fragments contain correlation data among the feature items. Then, using a significance test and secure interaction among the participants, the effective value of the feature item corresponding to each model parameter in improving the effect of the service prediction model is determined based on the model parameter fragments and the corresponding data in the correlation data fragments of the participants.

Description

Method and device for determining effective value of service data characteristic of control traffic
Technical Field
One or more embodiments of the present specification relate to the technical field of data security, and in particular, to a method and an apparatus for determining an effective value of a service data characteristic for controlling traffic.
Background
The data required for machine learning often spans multiple platforms and multiple domains. For example, in a machine-learning-based merchant classification scenario, an electronic payment platform holds the merchants' transaction flow data, an e-commerce platform stores the merchants' sales data, and a banking institution holds the merchants' loan data. To improve their services, multiple parties often jointly train a service prediction model while preserving the privacy and security of their service data.
As the amount of data grows, the feature dimension of the data also grows. High-dimensional feature data often contains redundant information, which may degrade the effect of machine learning and reduce the stability of the model. Therefore, the feature data can be reduced in dimension according to feature effectiveness: redundant features that contribute little to model performance are removed while losing as little information as possible, so that the data is converted into lower-dimensional features.
Therefore, an improved scheme is desired, which can improve the processing efficiency in the process of determining the effective value of the feature as much as possible while ensuring the privacy of the data.
Disclosure of Invention
One or more embodiments of the present specification describe a method and an apparatus for determining effective values of characteristics of service data for controlling traffic, which can determine effective values of characteristic items for service data distributed in multiple parties without revealing privacy data, and improve processing efficiency. The specific technical scheme is as follows.
In a first aspect, an embodiment provides a method for determining effective values of features of service data for controlling traffic, where the service data is distributed among multiple participants, and the service data of the participants would form joint data if spliced together, the joint data including feature values of multiple objects for multiple feature items; the method is performed by the device of any first participant and comprises:
acquiring a joint data fragment of the first participant, and acquiring predicted value fragments respectively corresponding to the multiple objects and model parameter fragments respectively corresponding to the multiple feature items; the predicted value fragments and the model parameter fragments are obtained based on a trained service prediction model;
through transmission of predicted value fragments between the participant devices and a selected participant device, reconstructing complete predicted value data in the selected participant device from the multiple predicted value fragments;
using secure multi-party computation and through interaction among the participant devices, determining correlation data fragments respectively corresponding to the participants based on the joint data fragments of the participants and the predicted value data of the selected participant, where the correlation data fragments contain correlation data among the multiple feature items;
and using a significance test and through secure interaction among the participant devices, determining, based on the model parameter fragments of the participants and the corresponding data in the correlation data fragments, the effective value of the feature item corresponding to each model parameter in improving the effect of the service prediction model.
In one embodiment, the step of acquiring the joint data fragment of the first participant includes:
using secret-sharing addition and through interaction with the other participant devices, performing splitting and splicing operations on the service data of the participants, so that each participant obtains a joint data fragment; the joint data fragments of the participants would yield the joint data if reconstructed.
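The splitting and splicing step can be illustrated with a small sketch. It assumes plain additive secret sharing over a ring and vertically split data; the ring size `MODULUS` and the helper names `split` and `joint_fragments` are hypothetical and do not come from the patent. All parties are simulated in one process, whereas a real deployment would exchange the fragments over the network.

```python
import numpy as np

MODULUS = 2**32  # hypothetical ring size, for illustration only


def split(matrix, n_parties, rng):
    """Split an integer matrix into n additive fragments that sum to it mod MODULUS."""
    parts = [rng.integers(0, MODULUS, size=matrix.shape, dtype=np.int64)
             for _ in range(n_parties - 1)]
    parts.append((matrix - sum(parts)) % MODULUS)
    return parts


def joint_fragments(local_matrices, rng, axis=1):
    """Each participant splits its own matrix; participant i then splices the
    i-th fragment of every matrix (axis=1: vertical split, axis=0: horizontal)."""
    n = len(local_matrices)
    split_all = [split(m, n, rng) for m in local_matrices]
    return [np.concatenate([split_all[owner][i] for owner in range(n)], axis=axis)
            for i in range(n)]
```

No single joint data fragment reveals the underlying matrices; only the sum of all fragments modulo `MODULUS` equals the spliced joint data.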
In one embodiment, the service prediction model is obtained by secure joint training based on the respective joint data fragments of the participants; the service prediction model is used to make service predictions for the objects.
In one embodiment, the step of acquiring the predicted value fragments corresponding to the multiple objects and the model parameter fragments corresponding to the multiple feature items includes:
acquiring the model parameter fragment of the trained service prediction model held locally by the first participant device;
through interaction among the participants, having each participant determine its predicted value fragments of the objects based on the joint data fragments of the participants and the trained service prediction model.
In one embodiment, the selected participant includes a participant that holds label data; the step of reconstructing the complete predicted value data comprises:
when the first participant is not the selected participant, sending the predicted value fragments of the first participant to the selected participant device, so that the selected participant device reconstructs the complete predicted value data from the multiple predicted value fragments;
and when the first participant is the selected participant, receiving the predicted value fragments sent by the other participants, and reconstructing the complete predicted value data from the first participant's own predicted value fragments and the received ones.
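A minimal sketch of the reconstruction at the selected participant follows. It assumes the predicted values are real numbers encoded in fixed point and shared additively; `SCALE`, `MODULUS`, and the function names are hypothetical choices made for this illustration.

```python
import numpy as np

SCALE = 2**16    # hypothetical fixed-point scale
MODULUS = 2**32  # hypothetical ring size


def encode(x):
    """Map real values to ring elements using fixed-point encoding."""
    return np.round(np.asarray(x) * SCALE).astype(np.int64) % MODULUS


def decode(v):
    """Map ring elements back to real values (upper half of the ring is negative)."""
    v = np.asarray(v, dtype=np.int64) % MODULUS
    v = np.where(v >= MODULUS // 2, v - MODULUS, v)
    return v / SCALE


def share(x, n_parties, rng):
    """Produce one additive predicted value fragment per participant."""
    enc = encode(x)
    parts = [rng.integers(0, MODULUS, size=enc.shape, dtype=np.int64)
             for _ in range(n_parties - 1)]
    parts.append((enc - sum(parts)) % MODULUS)
    return parts


def reconstruct_at_selected(own_fragment, received_fragments):
    """Selected participant adds its own fragment to the fragments it received."""
    total = own_fragment.copy()
    for frag in received_fragments:
        total = (total + frag) % MODULUS
    return decode(total)
```

Only the selected participant learns the plaintext predicted values; the other participants see nothing beyond their own fragments.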
In one embodiment, the correlation data comprises covariance matrix data, and the correlation data fragments comprise covariance matrix fragments;
the step of determining the correlation data fragments respectively corresponding to the participants includes:
determining intermediate matrix fragments respectively corresponding to the participants based on the joint data fragments of the participants, the predicted value data of the selected participant, and a functional relation in the service prediction model;
and computing, based on the intermediate matrix fragments of the participants, fragments of the inverse of the intermediate matrix for each participant, which serve as the covariance matrix fragments respectively corresponding to the participants.
In one embodiment, the step of determining the intermediate matrix fragments respectively corresponding to the participants includes:
dividing a Hessian matrix expression, obtained from the functional relation in the service prediction model, into multiple blocks according to the assumed splicing relation of the participants' service data; the Hessian matrix expression contains the joint data and the predicted value data;
and, according to which participants' data each block involves, having the participants determine data fragments of the blocks using the joint data fragments of the participants and the corresponding data in the predicted value data of the selected participant, and determine the corresponding Hessian matrix fragments from the data fragments of the blocks, to serve as the intermediate matrix fragments.
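The block division can be pictured in plaintext. The sketch below assumes the usual logistic regression Hessian, X^T diag(y_hat * (1 - y_hat)) X, and a vertical split into two column groups; this section of the patent does not spell out its exact Hessian expression, so the formula and the names `hessian` and `hessian_blocks` are assumptions.

```python
import numpy as np


def hessian(X, y_hat):
    """Logistic-regression Hessian X^T diag(y_hat * (1 - y_hat)) X (assumed form)."""
    w = y_hat * (1.0 - y_hat)
    return X.T @ (w[:, None] * X)


def hessian_blocks(X_sel, X_other, y_hat):
    """Assemble the same Hessian block by block for joint data [X_sel, X_other]."""
    w = (y_hat * (1.0 - y_hat))[:, None]
    first = X_sel.T @ (w * X_sel)      # involves only the selected participant's data
    cross = X_sel.T @ (w * X_other)    # involves data of several participants
    other = X_other.T @ (w * X_other)  # other participant's data plus predicted values
    return np.block([[first, cross], [cross.T, other]])
```

The block that depends only on the selected participant's columns and the predicted values can be computed locally, while the remaining blocks mix data from several participants and therefore need a secure protocol.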
In one embodiment, the service data of any participant comprises the feature values of some of the feature items for all objects;
the step of dividing the Hessian matrix expression obtained from the functional relation in the service prediction model into multiple blocks includes:
dividing the Hessian matrix expression into a first block associated with the data of the selected participant and a second block associated with multiple participants;
the step of having the participants determine the data fragments of the blocks and determine the corresponding Hessian matrix fragments from them includes:
when the first participant is the selected participant, determining the data fragment of the first block using the joint data fragment of the first participant and the predicted value data; when the first participant is not the selected participant, filling the first block with zeros to obtain the data fragment of the first block;
using secret-sharing matrix multiplication (SMM) and through interaction among the participants, determining the data fragments of the second block respectively corresponding to the participants from the joint data fragments of the participants and the corresponding data in the predicted value data of the selected participant;
and splicing the first participant's data fragment of the first block with its data fragment of the second block to obtain the Hessian matrix fragment of the first participant.
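One common way to multiply secret-shared matrices is with Beaver triples, sketched below for two parties. This is an assumption for illustration: the patent's own SMM flow is shown in its FIG. 3 and is not reproduced in this section. The sketch uses real-valued masks for readability (a real protocol would work over a finite ring) and simulates a hypothetical trusted dealer inside the function.

```python
import numpy as np


def share(M, rng):
    """Split M into two additive fragments (real-valued masks, illustration only)."""
    r = rng.normal(size=M.shape)
    return r, M - r


def beaver_matmul(A_sh, B_sh, rng):
    """Return additive fragments of A @ B given additive fragments of A and B."""
    (A0, A1), (B0, B1) = A_sh, B_sh
    # Hypothetical dealer prepares a random triple U, V, W = U @ V and shares it.
    U, V = rng.normal(size=A0.shape), rng.normal(size=B0.shape)
    U_sh, V_sh, W_sh = share(U, rng), share(V, rng), share(U @ V, rng)
    # The parties open the masked matrices E = A - U and F = B - V.
    E = (A0 - U_sh[0]) + (A1 - U_sh[1])
    F = (B0 - V_sh[0]) + (B1 - V_sh[1])
    # Local combination; only party 0 adds the public term E @ F.
    C0 = E @ F + E @ V_sh[0] + U_sh[0] @ F + W_sh[0]
    C1 = E @ V_sh[1] + U_sh[1] @ F + W_sh[1]
    return C0, C1
```

The result is correct because A @ B = (E + U)(F + V) = E F + E V + U F + U V, and each of the four terms is either public or available as fragments.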
In one embodiment, the service data of any participant comprises the feature values of all feature items for some of the objects;
the step of dividing the Hessian matrix expression obtained from the functional relation in the service prediction model into multiple blocks includes: dividing the Hessian matrix expression into multiple blocks respectively associated with the data of the participants;
the step of having the participants determine the data fragments of the blocks and determine the corresponding Hessian matrix fragments from them includes:
determining the data fragment of the block associated with the first participant's data using the joint data fragment of the first participant and the predicted value data, and filling the blocks associated with the other participants' data with zeros to obtain their data fragments;
and splicing the data fragments of the multiple blocks to obtain the Hessian matrix fragment of the first participant.
In one embodiment, the step of computing, based on the intermediate matrix fragments of the participants, the fragments of the inverse of the intermediate matrix to obtain the covariance matrix fragments respectively corresponding to the participants includes:
using a secret-sharing matrix inversion algorithm (SMI), obtaining the covariance matrix fragments respectively corresponding to the participants through iterative computation based on the intermediate matrix fragments of the participants.
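Iterative inversion is attractive in a secret-sharing setting because it needs only matrix multiplication and addition. The plaintext sketch below uses the Newton-Schulz iteration X <- X (2I - H X) as a stand-in; whether the patent's SMI uses exactly this iteration is an assumption, and the function name and iteration count are hypothetical.

```python
import numpy as np


def iterative_inverse(H, n_iter=30):
    """Approximate H^{-1} with the Newton-Schulz iteration (plaintext sketch)."""
    n = H.shape[0]
    # Standard starting point that guarantees convergence for nonsingular H.
    X = H.T / (np.linalg.norm(H, 1) * np.linalg.norm(H, np.inf))
    for _ in range(n_iter):
        # Only products and sums: each step maps to secure multiplication
        # plus local addition when H and X are held as fragments.
        X = X @ (2.0 * np.eye(n) - H @ X)
    return X
```

In a shared setting the norm-based starting point cannot be computed in the clear, so a public bound on the matrix norm would be needed instead.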
In one embodiment, the step of determining the effective value of the feature item corresponding to a model parameter in improving the effect of the service prediction model includes:
taking the diagonal elements of the participants' covariance matrix fragments as variance fragments respectively corresponding to the model parameters;
for any model parameter, using a secret-sharing inverse square root algorithm (SNSI) and a significance test, jointly performing a secure inverse square root operation through interaction among the participant devices based on the first participant's fragment of that model parameter and the participants' corresponding variance fragments, and determining the first participant's fragment of the significance test value for the model parameter; and determining the effective value of the feature item corresponding to the model parameter based on the participants' significance test value fragments for the model parameter.
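In plaintext, the computation resembles a Wald-type test: each parameter is multiplied by the inverse square root of its variance, and the resulting statistic is converted to a p-value. The sketch below assumes that form (the section only says "significance test method") and computes the inverse square root with a Newton iteration that uses multiplication and addition only; the function names are hypothetical.

```python
import math

import numpy as np


def inverse_sqrt(a, n_iter=60):
    """Newton iteration y <- y * (3 - a * y^2) / 2, which converges to 1 / sqrt(a)."""
    y = 1.0 / max(a, 1.0)  # starting point at or below 1 / sqrt(a); chosen in plaintext here
    for _ in range(n_iter):
        y = y * (3.0 - a * y * y) / 2.0
    return y


def wald_test(weights, variances):
    """Return z statistics w / sqrt(var) and two-sided normal p-values."""
    z = np.array([w * inverse_sqrt(v) for w, v in zip(weights, variances)])
    p = np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])
    return z, p
```

A secret-shared protocol would need a public bound on the variance to pick the starting point, since the comparison in `inverse_sqrt` is done in the clear in this sketch.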
In one embodiment, the method further comprises:
for any first feature item, obtaining effective value fragments of the first feature item from the other participant devices;
and determining the reconstructed effective value of the first feature item based on the local effective value fragment of the first feature item and the obtained effective value fragments.
In one embodiment, the method further comprises:
based on the effective values, removing from the multiple feature items those whose effective values do not meet a preset condition, so that the participants perform secure joint training of the service prediction model using service data from which those feature items have been removed.
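As a small illustration of the removal step, the sketch below assumes that the effective value is a p-value and that the preset condition is a threshold `alpha`; both are assumptions, since the section does not specify the condition, and the function name is hypothetical.

```python
def keep_significant(feature_names, p_values, alpha=0.05):
    """Keep feature items whose p-value is below the (assumed) threshold alpha."""
    return [name for name, p in zip(feature_names, p_values) if p < alpha]
```

The participants would then retrain the model on the retained feature items only.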
In one embodiment, the object comprises one of a user, a commodity, and an event; the feature items include at least one of: basic attribute information, association relation information, interaction information, and historical behavior information; the service prediction model is used to make service predictions for the object.
In one embodiment, the business prediction model is based on a logistic regression model.
In a second aspect, an embodiment provides an apparatus for determining effective values of features of service data for controlling traffic, where the service data is distributed among multiple participants, and the service data of the participants would form joint data if spliced together, the joint data including feature values of multiple objects for multiple feature items; the apparatus is deployed in the device of any first participant and comprises:
an acquisition module configured to acquire a joint data fragment of the first participant, and to acquire predicted value fragments respectively corresponding to the multiple objects and model parameter fragments respectively corresponding to the multiple feature items; the predicted value fragments and the model parameter fragments are obtained based on a trained service prediction model;
a reconstruction module configured to reconstruct, through transmission of predicted value fragments between the participant devices and a selected participant device, complete predicted value data in the selected participant device from the multiple predicted value fragments;
an interaction module configured to determine, using secure multi-party computation and through interaction among the participant devices, correlation data fragments respectively corresponding to the participants based on the joint data fragments of the participants and the predicted value data of the selected participant, where the correlation data fragments contain correlation data among the multiple feature items;
and a verification module configured to determine, using a significance test and through secure interaction among the participant devices, the effective value of the feature item corresponding to each model parameter in improving the effect of the service prediction model, based on the model parameter fragments of the participants and the corresponding data in the correlation data fragments.
In one embodiment, the acquisition module, when acquiring the joint data fragment of the first participant, is configured to:
use secret-sharing addition and, through interaction with the other participant devices, perform splitting and splicing operations on the service data of the participants, so that each participant obtains a joint data fragment; the joint data fragments of the participants would yield the joint data if reconstructed.
In one embodiment, the selected participant includes a participant that holds label data; the reconstruction module is specifically configured to:
when the first participant is not the selected participant, send the predicted value fragments of the first participant to the selected participant device, so that the selected participant device reconstructs the complete predicted value data from the multiple predicted value fragments;
and when the first participant is the selected participant, receive the predicted value fragments sent by the other participants, and reconstruct the complete predicted value data from the first participant's own predicted value fragments and the received ones.
In one embodiment, the correlation data comprises covariance matrix data, and the correlation data fragments comprise covariance matrix fragments;
the interaction module comprises:
a determining submodule configured to determine intermediate matrix fragments respectively corresponding to the participants based on the joint data fragments of the participants, the predicted value data of the selected participant, and a functional relation in the service prediction model;
and a calculation submodule configured to compute, based on the intermediate matrix fragments of the participants, fragments of the inverse of the intermediate matrix for each participant, which serve as the covariance matrix fragments respectively corresponding to the participants.
In a third aspect, embodiments provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any of the first aspect.
In a fourth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first aspect.
According to the method and the device provided by the embodiments of this specification, through interaction among the participants, the participants obtain correlation data fragments by secure multi-party computation, based on the joint data fragment and predicted value data of the selected participant and the joint data fragments and predicted value fragments of the other participants; the effective value of each feature item in improving the model effect is then determined from the model parameter fragments and the correlation data fragments. The participants perform secure multi-party computation on joint data fragments, the resulting correlation data also takes the form of fragments, and the effective values are computed from model parameter fragments as well. Because the effective values are determined from data fragments throughout, data privacy is well protected against leakage, and the privacy and security of the data during processing are improved. At the same time, reconstructing the predicted value data substantially reduces the amount of data exchanged during secure multi-party computation and improves the efficiency of the overall process.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 is a schematic flowchart of a method for determining effective values of features of service data for controlling traffic according to an embodiment;
FIG. 3 is a schematic diagram illustrating a calculation flow of secret-sharing matrix multiplication according to an embodiment;
FIG. 4 is a schematic block diagram of an apparatus for determining effective values of features of service data for controlling traffic according to an embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. As shown in fig. 1, in a shared learning scenario, a data set is provided by a plurality of participants 1,2, …, W in common (W is a natural number), and each participant possesses a part of data in the data set, and forms business data (i.e., an original matrix) of the participant. The data set may be a training data set for training a model, a testing data set for testing a model, or a data set to be predicted. The data set may include characteristic data of an object, and the object may be one of various business objects to be analyzed, such as a user, a commodity, an event, and the like. The model may comprise a business prediction model trained in a machine learning manner.
There are at least two ways the data set can be distributed. In one, each participant holds different feature data for all objects. For example, every participant has samples of the same N objects, the private data of each sample contains D features, and these features are distributed among W participants, each holding D/W features. As another example, two platforms have the same set of users but hold different user features in their service data. The participants hold different kinds of features, and the number of features each holds may be the same (for example, D/W each) or different. N, D, and W are all natural numbers. This is the vertical slicing scenario for the data set, and Table 1 shows the distribution of service data in this scenario.
TABLE 1
[Table 1 appears as an image in the original publication and is not reproduced here.]
Where xx represents a specific characteristic value, belonging to the private data of the participant. Each row in table 1 represents one sample data, each column represents the feature value of a feature item of N objects, and D feature items belong to W participants. The feature values of the D feature items of the N objects constitute the entire business data.
Another distribution is that each participant has all the characteristic data of the different objects. For example, there are N samples of the object, the business data of each sample includes D feature items, the N pieces of business data are distributed in W participants, each participant has a part of the samples in all N samples, and the feature items included in each sample are the same. The number of object samples stored by different participants may be the same or different. As another example, there are two banks that serve different groups of users, but they both have the same user credit characteristics. This is a scenario of data horizontal slicing in the data set, and table 2 is service data distribution of the data horizontal slicing scenario.
TABLE 2
[Table 2 appears as images in the original publication and is not reproduced here.]
Where xx represents a specific characteristic value, belonging to the private data of the participant. Each row in table 2 represents one sample data, each column represents the feature value of a certain feature item of N objects, and N sample data belong to W participants. Different participants have different object samples. The feature values of the D feature items of the N objects constitute the entire business data.
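The two distributions can be pictured with a toy matrix. The sketch below assumes an N x D array of placeholder values and an even split among W participants; the sizes and variable names are hypothetical, and the text notes that the split need not be even.

```python
import numpy as np

N, D, W = 6, 4, 2                   # hypothetical sizes: objects, feature items, participants
X = np.arange(N * D).reshape(N, D)  # placeholder values standing in for the whole service data

# Vertical slicing: every participant holds all N objects but only some feature items.
vertical = np.array_split(X, W, axis=1)

# Horizontal slicing: every participant holds all D feature items but only some objects.
horizontal = np.array_split(X, W, axis=0)
```

Splicing the vertical parts column-wise, or the horizontal parts row-wise, recovers the same N x D matrix, which is the joint data referred to throughout.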
The business data owned by the participants may include a plurality of characteristic items. The feature item of the object may include at least one of: basic attribute information, association relation information, interaction information, historical behavior information and the like of the object. For example, when the object is a user, the basic attribute information may include gender, age, income, and the like of the user, the association information of the user may include other users, companies, regions, and the like, which have an association with the user, the interaction information of the user may include information of clicking, viewing, participating in a certain activity, and the like of the user at a certain website, and the historical behavior information of the user may include historical transaction behavior, payment behavior, purchase behavior, and the like of the user.
When the object is a commodity, the basic attribute information may include a category, a place of production, a price, and the like of the commodity, the association relationship information of the commodity may include a user, a shop, or other commodities, and the like, which have an association relationship with the commodity, the interaction information of the commodity may include interaction characteristics between the user, the shop, and the commodity, and the historical behavior information of the commodity may include information that the commodity is purchased, transferred, returned, and the like.
When the object is an event, the event may include a transaction event, a login event, a purchase event, a social event, and the like. The basic attribute information of the event may be text information for describing the event, the association relation information may include text having a contextual relation with the event, other event information having an association with the event, and the like, and the historical behavior information may include record information of the event developing and changing in a time dimension, and the like.
The participants may correspond to different service platforms, which may include various enterprises, institutions, organizations, and the like. The service data is often the private data of the service platform, and a high level of privacy and security must be maintained during processing. Regardless of the data distribution, the feature values (i.e., the feature data) corresponding to the feature items of the objects are private data and can be stored as a private data matrix. To keep the private data secure, each participant must keep its private data local, must not output plaintext data, and must not aggregate data in plaintext.
To protect each participant's private data from leakage, in one embodiment the participants can use secure multi-party computation: using the predicted values and each participant's original matrix, and through interaction with a third party, they enable the third party to obtain covariance matrix data that represents the correlation among the feature items. The third party then uses the covariance matrix data and the model parameters, together with a significance test, to determine the effective value of the feature item corresponding to each model parameter in improving the effect of the service prediction model.
The covariance matrix data itself contains some private information, so the security of the private data can be improved further by protecting it as well. In another embodiment, no third party is used; instead, the participants alone determine correlation data fragments from the various data fragments by secure multi-party computation, and from these determine the effectiveness of the feature items.
With this approach, however, all data exists as fragments distributed among the participants, and the participants must exchange a large amount of data when determining the correlation data fragments by secure multi-party computation. The security of the private data is guaranteed to the greatest extent, but the processing efficiency is very low.
To balance processing efficiency against privacy protection, that is, to improve processing efficiency while slightly relaxing the privacy requirement where this is permissible, the embodiments of this specification provide a corresponding implementation scheme. Referring to fig. 1, in the embodiments of this specification each participant stores its own data fragments, including its joint data fragment, predicted value fragments corresponding to a plurality of objects, model parameter fragments corresponding to a plurality of feature items, and the like. A selected participant device reconstructs the complete predicted value data using the predicted value fragments of the plurality of participants. The plurality of participant devices then interact based on secure multi-party computation and, using the joint data fragments and the predicted value data of the selected participant, determine the relevance data fragment corresponding to each participant, where the relevance data fragments contain relevance data among the plurality of feature items. Finally, using a significance test method and through secure interaction among the participant devices, each participant determines the effective values of the feature items for the effect of the service prediction model based on the corresponding data in the model parameter fragments and the relevance data fragments of the plurality of participants.
In this embodiment, a plurality of participants perform multi-party security calculation by using the joint data fragment, the obtained correlation data is also a fragment, and a model parameter fragment is also used when calculating an effective value. In the process, joint data, correlation data, model parameters and other data containing important privacy are all represented by fragments, data reconstruction is not performed, and data privacy can be well protected from being leaked. And the selected participants are used for reconstructing the predicted value data, so that the interactive data volume in the multi-party security calculation process can be well reduced, the communication traffic between the participants is controlled, and the overall processing efficiency can be improved. Therefore, the purpose of giving consideration to privacy protection and processing efficiency is achieved.
In this specification, a plurality of participants have corresponding participant apparatuses, respectively, and the operations in the embodiments of the specification are performed using the corresponding participant apparatuses. Participant devices include, but are not limited to, any apparatus, device, platform, cluster of devices, etc. having computing, processing capabilities. The following describes embodiments of the present invention with reference to specific examples.
Fig. 2 is a flowchart illustrating a method for determining an effective value of a service data feature while controlling traffic according to this embodiment. The service data is distributed among a plurality of participants, and the service data of the participants constitutes joint data under assumed splicing. The service data of the participants is highly private data; it cannot be sent in plaintext among the participants and cannot actually be spliced into joint data. The joint data is only a hypothetical data set composed of the service data of the participants. For example, tables 1 and 2 above are specific forms of the joint data in the vertical segmentation and horizontal segmentation scenarios, respectively. The joint data includes feature values of a plurality of objects for a plurality of feature items; for example, it may include feature values of N objects for D feature items, where N and D are both natural numbers.
For convenience of description, two participants are exemplified in the following examples. For example, the two parties are a first party a and a second party B, respectively, the first party a corresponding to a first party device and the second party B corresponding to a second party device. The participator device is used for executing the operation of the participator and storing the data of the participator. In particular embodiments, the participant device may also obtain data of the participant from other devices. The method of the present embodiment specifically includes the following steps S210 to S240.
Step S210, the first participant device obtains the joint data segment of the first participant a, obtains the predicted value segments corresponding to the plurality of objects, and obtains the model parameter segments corresponding to the plurality of feature items. And the second participant equipment acquires the joint data fragments of the second participant B, acquires the predicted value fragments corresponding to the plurality of objects respectively, and acquires the model parameter fragments corresponding to the plurality of characteristic items respectively.
Each of the multiple participants has its own service data, which is the original data and is also private data. In the vertical segmentation scenario, the feature items of the participants are different and the objects are the same. The participants may each represent their original data as an original matrix. For example, the original matrices of the first participant A and the second participant B may be denoted $X_A$ and $X_B$, the numbers of feature items $d_A$ and $d_B$, and the numbers of objects $n_A$ and $n_B$; the total number of feature items of the joint data is then $D = d_A + d_B$, and the total number of objects or samples is $N = n_A = n_B$. When the columns of the original matrix represent feature items and the rows represent objects or samples, an assumed horizontal splicing of the service data of the participants, such as the first participant A and the second participant B, yields joint data of the form $X = (X_A, X_B)$. This is the case in which the columns of the original matrix represent feature items and the rows represent samples, and it corresponds to the data distribution in table 1. In other embodiments, the columns of the original matrix may represent objects and the rows may represent feature items, in which case an assumed vertical splicing of the service data of the participants, such as the first participant A and the second participant B, yields joint data of the form

$$X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}$$
In the horizontal segmentation scenario, the feature items of the participants are the same and the objects are different. The original matrices of the first participant A and the second participant B are $X_A$ and $X_B$ respectively, the numbers of feature items are $d_A = d_B$, and the numbers of objects are $n_A$ and $n_B$; the total number of feature items of the joint data is then $D = d_A = d_B$, and the total number of objects or samples is $N = n_A + n_B$. When the rows of the original matrices of the participants represent objects and the columns represent feature items, an assumed vertical splicing of the service data of the participants, such as the first participant A and the second participant B, yields joint data of the form

$$X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}$$

The above may correspond to the data distribution scenario in table 2. When the rows of the original matrix represent feature items and the columns represent objects, an assumed horizontal splicing of the service data of the participants, such as the first participant A and the second participant B, yields joint data of the form $X = (X_A, X_B)$.
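The two assumed splicing modes can be illustrated with a minimal Python sketch. The toy matrices and the helper names `hstack` and `vstack` are hypothetical, and rows are taken to represent objects; in the method itself the service data is never actually spliced in plaintext, so the sketch only shows how the shapes N and D arise.

```python
# Toy original matrices; rows are objects, columns are feature items (assumption).

def hstack(left, right):
    """Assumed horizontal splice X = (X_A, X_B): same objects, columns appended."""
    assert len(left) == len(right), "vertical segmentation needs the same objects"
    return [row_l + row_r for row_l, row_r in zip(left, right)]

def vstack(top, bottom):
    """Assumed vertical splice: same feature items, rows appended."""
    assert len(top[0]) == len(bottom[0]), "horizontal segmentation needs the same feature items"
    return top + bottom

# Vertical segmentation (table 1 style): n_A = n_B = 3, d_A = 2, d_B = 1.
X_A_vert = [[1, 2], [3, 4], [5, 6]]
X_B_vert = [[7], [8], [9]]
X_vert = hstack(X_A_vert, X_B_vert)      # N = 3, D = d_A + d_B = 3

# Horizontal segmentation (table 2 style): d_A = d_B = 3, n_A = 2, n_B = 1.
X_A_horiz = [[1, 2, 3], [4, 5, 6]]
X_B_horiz = [[7, 8, 9]]
X_horiz = vstack(X_A_horiz, X_B_horiz)   # N = n_A + n_B = 3, D = 3
```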
To enable the participants to obtain joint data fragments, additive secret sharing may be used among the participants: the service data of each participant is split into random values, and the fragmentation is completed by transmitting random values among the participants. Specifically, when the first participant device acquires the joint data fragment of the first participant A, it may use additive secret sharing and, through interaction with the other participant devices, perform splitting and splicing operations based on the service data of the participants, so that each participant obtains its joint data fragment. Similarly, the second participant B also obtains its joint data fragment.
Additive secret sharing splits an original matrix into random matrices, and the fragmentation is completed by transmitting the random matrices among the participants. Taking two participants as an example, the first participant A and the second participant B hold original matrices $X_A$ and $X_B$ of service data, respectively. The first participant device may generate a random matrix $R_A$ in the finite field and compute $X_A - R_A = X_2$; it may then send either of the two random matrices $R_A$ and $X_2$, e.g. $X_2$, to the second participant device. The second participant device likewise generates a random matrix $R_B$ in the finite field and computes $X_B - R_B = X_3$; it may then send either of the two random matrices $R_B$ and $X_3$, e.g. $X_3$, to the first participant device.
The first participant device may splice $R_A$ and the $X_3$ received from the second participant device into its joint data fragment, and the second participant device may splice $R_B$ and the $X_2$ received from the first participant device into its joint data fragment. Of course, in practical application scenarios the number of participants is usually 3 or more, and the implementation of additive secret sharing can easily be extended to three or more parties. The data sent among the participants are random matrices, so the private data of the original matrices is not disclosed.
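The two-party exchange above can be sketched in Python as follows. This is only an illustration under stated assumptions: the prime modulus `P`, the toy matrices, and all helper names are hypothetical, a vertical segmentation (same objects, different feature items) is assumed, and both parties are simulated in one process.

```python
import secrets

P = 2**61 - 1  # hypothetical prime modulus defining the finite field

def random_matrix(rows, cols):
    """Uniformly random matrix over the field."""
    return [[secrets.randbelow(P) for _ in range(cols)] for _ in range(rows)]

def sub_mod(a, b):
    return [[(x - y) % P for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add_mod(a, b):
    return [[(x + y) % P for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def hstack(left, right):
    return [row_l + row_r for row_l, row_r in zip(left, right)]

# Private original matrices of A and B (2 objects; A has 2 feature items, B has 1).
X_A = [[1, 2], [3, 4]]
X_B = [[5], [6]]

# A keeps R_A and sends X_2 = X_A - R_A to B.
R_A = random_matrix(2, 2)
X_2 = sub_mod(X_A, R_A)
# B keeps R_B and sends X_3 = X_B - R_B to A.
R_B = random_matrix(2, 1)
X_3 = sub_mod(X_B, R_B)

# Each party splices what it kept with what it received, using the same
# column order (A's feature items first, then B's).
share_A = hstack(R_A, X_3)   # <X>_A
share_B = hstack(X_2, R_B)   # <X>_B

# Shown only to check correctness; the method never performs this sum in plaintext.
X_joint = add_mod(share_A, share_B)
```

Each matrix that crosses the wire (`X_2`, `X_3`) is masked by a uniformly random matrix, which is why neither party learns the other's original data from it.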
The joint data fragments of the multiple participants yield the joint data under assumed reconstruction. Reconstruction may be implemented by adding the data fragments of the parties; a specific reconstruction may also apply other matrix transformation operations on top of the addition, such as multiplication by a preset value. The joint data contains private data, and the participants do not directly aggregate the private data in plaintext; the joint data is only a representation under an assumed condition, and in practice the data fragments of the participants are not directly reconstructed together. The same meaning of reconstruction applies throughout this description.
The joint data fragment of the first participant A may be denoted $\langle X \rangle_A$ and the joint data fragment of the second participant B may be denoted $\langle X \rangle_B$, so that the joint data $X = \langle X \rangle_A + \langle X \rangle_B$. Here $\langle X \rangle$ denotes a fragment of the parameter X, with the subscript indicating the participant to which the fragment belongs. For uniformity of expression, the fragment of a piece of data held by a given participant is denoted hereinafter in the form "angle brackets + subscript".
In this embodiment, the federated data segments of the participants are obtained based on the business data of the multiple participants, and the sum of the federated data segments of the multiple participants is conceptually or theoretically equal to the federated data.
In step S210, the predicted value segment and the model parameter segment are data obtained based on the trained service prediction model. The service prediction model is obtained by performing safe joint training based on joint data fragments of a plurality of participants. The business prediction model can be obtained by pre-training. The business prediction model can be a model obtained by training based on a logistic regression model, and can also be obtained by training based on other types of models. The business prediction model is used for performing business prediction on the object, for example, classification prediction or regression prediction can be performed on input feature data of the object.
And the plurality of participant devices can obtain the predicted value fragments and the model parameter fragments through the trained service prediction model. For example, the first participant device may obtain a model parameter fragment of the trained business prediction model local to the first participant device, and respectively enable the multiple participants to determine the predicted value fragments of the object based on the joint data fragments of the multiple participants and the trained business prediction model through secure interaction between the multiple participant devices.
And the plurality of participant devices take N objects in the joint data fragments as samples to train a service prediction model. After training, the model parameter fragment of the service prediction model in the present participant device can be obtained. Through the safe interaction among a plurality of participant devices, the joint data fragments of the participants are input into a service prediction model, and each participant device can determine the predicted value fragments of a plurality of objects of the participant.
Therefore, for a participant, in the acquired data, one object corresponds to one predicted value fragment, N objects correspond to N predicted value fragments respectively, and the N predicted value fragments can be used as vector elements to form a vector, that is, the vector is represented; when the service data contains D characteristic items, the trained service prediction model contains a plurality of model parameters which respectively correspond to the D characteristic items. For any predicted value data, the corresponding predicted value segments owned by a plurality of participants obtain the predicted value data under the condition of supposing reconstruction. For any model parameter, the corresponding model parameter slices owned by multiple participants obtain the model parameter under the condition of supposing reconstruction.
Step S220, reconstructing complete predicted value data in the selected participant device using the plurality of predicted value slices through transmission of the predicted value slices between the participant device and the selected participant device.
The selected party C may be any one of a plurality of parties, or may be selected from a plurality of parties according to a certain selection rule. The selected participant may be preselected or set by an administrator, and the identity of the selected participant may be pre-sent to other participants.
In one embodiment, the selected participant may be the participant who owns the tag data. Taking the first party a as an example, the step of reconstructing the complete predicted value data may include:
and when the first participant A is not the selected participant, transmitting the predicted value fragment of the first participant A to the selected participant equipment, so that the selected participant equipment reconstructs complete predicted value data by using the plurality of predicted value fragments.
And when the first participant A is the selected participant, receiving the predicted value fragments sent by other participants, and reconstructing the predicted value fragments of the first participant A and the received predicted value fragments to obtain complete predicted value data.
Tag data may be included in the business data of the participants, for example, at least one tag data exists in the business data corresponding to each object. For example, when the object is a user, the business data of user 1 may include tag data of whether user 1 is a high-risk user, and tag data 1 and 0 may represent yes or no, respectively.
The participant having the tag data may be one participant or a plurality of participants. In the service data distribution shown in table 1, one of the participants 1 to W may possess the tag data of all objects; for example, participant 1 possesses the tag data of all N objects. When reconstructing the predicted value data, the selected participant can obtain the predicted value fragments of the other participants and reconstruct all predicted value fragments of the N objects to obtain the predicted value data. For example, there are 3 participants A, B and C, each having service data for 10 objects, and participant A, which has the tag data, is the selected participant. Participant B and participant C each send their 10 predicted value fragments to participant A. The device of participant A then holds all 30 predicted value fragments, 3 for each object; the 3 fragments of an object reconstruct the predicted value data of that object, giving 10 predicted value data in total for the 10 objects.
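The 3-participant, 10-object example can be sketched as follows. The fixed-point scale `SCALE`, the modulus `P`, the toy predicted values, and the helper names are assumptions made for illustration; the source does not specify how predicted values are encoded in the fragments.

```python
import secrets

P = 2**61 - 1      # hypothetical prime modulus
SCALE = 10**6      # hypothetical fixed-point scale for predicted probabilities

def share(value, n_parties):
    """Split a fixed-point encoded value into n additive fragments mod P."""
    encoded = round(value * SCALE) % P
    parts = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    parts.append((encoded - sum(parts)) % P)
    return parts

def reconstruct(parts):
    """Add the fragments and decode the fixed-point value."""
    return (sum(parts) % P) / SCALE

# Toy predicted values for 10 objects; used here only to create the fragments.
true_predictions = [i / 10 for i in range(10)]
fragments = [share(p, 3) for p in true_predictions]   # fragments[obj][party]

frag_A = [f[0] for f in fragments]   # held by selected participant A
frag_B = [f[1] for f in fragments]   # sent by B to A (10 fragments)
frag_C = [f[2] for f in fragments]   # sent by C to A (10 fragments)

# A now holds 30 fragments, 3 per object, and reconstructs 10 predicted values.
reconstructed = [reconstruct(triple) for triple in zip(frag_A, frag_B, frag_C)]
```

Only the 20 fragments from B and C travel to A, which is the traffic saving the scheme relies on compared with keeping the predicted values in fragment form during the later computation.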
In the service data distribution as shown in table 2, the participating parties 1 to W respectively possess tag data of partial objects. In this case, the predictive value data is reconstructed in the different parties for the different traffic data, with the different parties acting as selected parties. For example, there are 3 participants A, B and C, each having business data for 5 objects, for a total of 15 objects. The business data of party a includes tag data of objects 1 to 5, the business data of party B includes tag data of objects 6 to 10, and the business data of party C includes tag data of objects 11 to 15. When reconstructing the predicted value data, for 5 objects from the object 1 to the object 5, the participant B and the participant C transmit the predicted value fragments of the 5 objects to the participant a, and the participant a reconstructs the predicted value data corresponding to the object 1 to the object 5 by using the 5 predicted value fragments of the participant a and the received 10 predicted value fragments transmitted by other participants, and the total number of the predicted value data is 5. Similarly, for the objects 6 to 10, the participant B serves as a selected participant, and the predicted value data of the part of the objects are reconstructed; for the objects 11 to 15, the participant C serves as a selected participant, and the prediction value data of the part of the objects is reconstructed.
The participant having the tag data serves as a selected participant, and the step of reconstructing the predicted value data is executed, so that the predicted value data is kept secret relative to the non-tag owner, and other participants cannot acquire the predicted value data, thereby protecting the privacy of the predicted value data.
Step S230, determining, by using multi-party security computation, through interaction among multiple participant devices, relevance data segments corresponding to multiple participants respectively based on joint data segments of the multiple participants and predicted value data of a selected participant, where the relevance data segments include relevance data among multiple feature items.
When reconstructed, the relevance data fragments of the multiple participants yield the relevance data, i.e., the relevance data among feature items. This includes relevance data among feature items owned by the same participant and among feature items owned by different participants, as well as relevance data between different feature items and between a feature item and itself.
When this step is implemented, the relevance data fragments corresponding to the respective participants can be determined by secure multi-party computation, using the joint data fragments and the predicted value data of the selected participant, based on an existing formula for calculating the relevance data among feature items. Formulas that can express the relevance data among feature items include a covariance matrix formula, a correlation coefficient formula, and the like.
Secure multi-party computation (MPC) is an existing data privacy protection technology that allows multiple parties to compute jointly; specific implementations include homomorphic encryption, garbled circuits, oblivious transfer, secret sharing, and the like. Using secure multi-party computation, secure interactive computation over the joint data fragments and the predicted value data can be carried out among the participant devices, so that each participant can determine its corresponding relevance data fragment.
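As one illustration of a secret-sharing-based building block, the sketch below multiplies two additively shared values using a Beaver multiplication triple. This technique is not named in the source and is shown only as an assumed example of how products of shared entries might be computed; the trusted dealer, the modulus `P`, and all function names are hypothetical, and both parties are simulated in one process.

```python
import secrets

P = 2**61 - 1  # hypothetical prime modulus

def share2(value):
    """Split a value into two additive shares mod P."""
    r = secrets.randbelow(P)
    return r, (value - r) % P

def beaver_multiply(x_sh, y_sh):
    """Return additive shares of x*y given additive shares of x and y."""
    # Assumption: a trusted dealer distributes shares of a random triple (a, b, c = a*b).
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    a_sh, b_sh, c_sh = share2(a), share2(b), share2(a * b % P)
    # Each party publishes its share of d = x - a and e = y - b; d and e are
    # uniformly random, so opening them reveals nothing about x or y.
    d = (x_sh[0] - a_sh[0] + x_sh[1] - a_sh[1]) % P
    e = (y_sh[0] - b_sh[0] + y_sh[1] - b_sh[1]) % P
    # Local computation; only party 0 adds the public term d*e.
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % P
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % P
    return z0, z1

x_sh = share2(6)
y_sh = share2(7)
z_sh = beaver_multiply(x_sh, y_sh)
product = sum(z_sh) % P   # opened here only to check the result
```

The identity behind the protocol is $xy = c + d\,b + e\,a + d\,e$ with $d = x - a$ and $e = y - b$, so the sum of the two output shares equals the product.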
Step S240, through secure interaction among the devices of the multiple participants, and based on the corresponding data in the model parameter fragments and the relevance data fragments of the multiple participants, determining, by a significance test method, the effective value of the feature item corresponding to each model parameter for improving the effect of the service prediction model.
The significance test method may include a Wald test, a Likelihood Ratio (LR) test, a Lagrange Multiplier (LM) test, and the like. After the existing formula provided by the significance test method is transformed, secure computation is performed on the model parameter fragments and the relevance data fragments of the participants through secure interaction among the participant devices, and the effective value fragment corresponding to each participant is determined.
In this embodiment, the feature items correspond to model parameters, and data corresponding to the feature items exist in both the model parameter patches and the correlation data patches. By using the corresponding data in the model parameter fragment and the correlation data fragment and adopting a significance test method, the significance test value fragments corresponding to the plurality of model parameters respectively, namely the significance test value fragments of the corresponding plurality of feature items, can be determined, and the effective value fragment can be determined based on the significance test value fragments.
When the valid value of a certain feature item needs to be determined, for example, for an arbitrary first feature item, the first participant device may obtain a valid value fragment of the first feature item from other participant devices, and determine a reconstructed valid value of the first feature item based on the local valid value fragment of the first feature item and the obtained valid value fragment. The valid value of the feature item may also be reconstructed in the second participant device or in another participant device, and this embodiment is described only by taking the reconstruction of the valid value in the first participant device as an example.
After obtaining the effective values of the plurality of feature items, the first participant device may further remove, from the plurality of feature items, the feature item whose effective value does not satisfy the preset condition based on the plurality of effective values, so that the plurality of participants perform safe joint training on the service prediction model by using the service data from which the feature item is removed. The service data after the feature items are removed realizes the dimension reduction processing of the original matrix, so that the feature items are more refined, and the safety of the private data is ensured without leakage.
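The removal of feature items can be sketched as a simple filter. The feature item names, the effective values, and the preset condition (effective value not lower than `min_value`) are all hypothetical; the source leaves the preset condition open.

```python
def select_feature_items(effective_values, min_value=1):
    """Keep feature items whose effective value meets the assumed preset condition."""
    return [name for name, value in effective_values.items() if value >= min_value]

# Hypothetical reconstructed effective values of three feature items.
effective_values = {"age": 1, "monthly_spend": 0, "login_count": 1}

kept = select_feature_items(effective_values)
removed = [name for name in effective_values if name not in kept]
```

The participants would then retrain the service prediction model using only the columns listed in `kept`.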
One embodiment is described in detail below. For the case in which the service prediction model includes a logistic regression model and the significance test method is the Wald test, a manner of determining the relevance data fragments in step S230 and a specific implementation of determining the effective values of the feature items in step S240 are provided.
The application of the Wald test to logistic regression is first explained in detail below. When a logistic regression model is used to perform regression on the feature data of a sample, the predicted value is calculated as:

$$\pi(X) = \frac{e^{\beta^{T} X}}{1 + e^{\beta^{T} X}} = \frac{1}{1 + e^{-\beta^{T} X}} \qquad (1)$$

where X is the feature data of the sample and serves as the independent variable; $\pi(X)$ is the predicted value function of the sample and serves as the dependent variable; $\beta$ is the model parameter, i.e., the feature item coefficients; and e is the natural constant.
The null and alternative hypotheses of the Wald test are:
$H_0: \omega_j = 0$ $(j = 1, 2, \ldots, k)$, i.e., the independent variable is assumed to have no influence on the probability of the dependent variable occurring, that is, no influence on the estimated value of the dependent variable;
$H_1: \omega_j \neq 0$,
where $\omega_j$ denotes the coefficient of the j-th independent variable, i.e., the j-th component of the model parameter $\beta$. If the null hypothesis is rejected, the dependent variable is taken to change depending on the independent variable j.
The test statistic of the Wald test is

$$\mathrm{Wald}_k = \left(\frac{\hat{\beta}_k}{SE(\hat{\beta}_k)}\right)^{2} \qquad (2)$$

$\mathrm{Wald}_k$ is the significance test value, which follows a chi-square distribution with 1 degree of freedom. Here $\hat{\beta}_k$ is the estimate of the k-th model parameter and $SE(\hat{\beta}_k)$ is its standard deviation, which is also equal to the square root of the corresponding diagonal element of the covariance matrix:

$$SE(\hat{\beta}_k) = \sqrt{\mathrm{Cov}(\hat{\beta})_{kk}}$$
The diagonal elements of the covariance matrix are the variances corresponding to the feature items. The covariance matrix of the model parameters is the inverse of the negative Hessian matrix of the log-likelihood function $\ell(\beta)$, evaluated at the estimated model parameters $\hat{\beta}$. Denoting this negated Hessian matrix by $H$, the covariance matrix is $H^{-1}$, where

$$H_{kr} = -\frac{\partial^{2} \ell(\beta)}{\partial \beta_k\,\partial \beta_r} = \sum_{i=1}^{N} x_{ik}\,x_{ir}\,\pi(X_i)\,[1 - \pi(X_i)]$$

is the expression of the element in the k-th row and r-th column of the Hessian matrix H; the indices k and r are natural numbers not greater than D, $x_{ik}$ and $x_{ir}$ are elements of the joint data X, and $X_i$ represents the feature data of the i-th sample.
By derivation from the above formula, the H matrix can be expressed as $H = X^{T} M X$, where

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1D} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{ND} \end{pmatrix}, \qquad M = \mathrm{diag}\big(\pi(X_1)[1 - \pi(X_1)],\ \ldots,\ \pi(X_N)[1 - \pi(X_N)]\big)$$

Here N is the total number of samples, i.e., the total number of objects, D is the dimension of the feature data, and $\pi(X_N)$ is the predicted value of the logistic regression model for sample $X_N$. M is a diagonal matrix obtained from the predicted values and may also be referred to as the predicted value matrix.
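The relation $H = X^{T} M X$ can be checked numerically in plaintext. The sketch below assumes the usual logistic-regression form $M = \mathrm{diag}(\pi(X_i)[1 - \pi(X_i)])$; the toy data, the parameter values, and the variable names are hypothetical, and in the method itself X exists only as fragments.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Toy joint data: N = 3 samples, D = 2 feature items.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
beta = [0.2, -0.4]                       # toy model parameters
N, D = len(X), len(X[0])

# Predicted values pi(X_i) and the diagonal of M.
pi = [sigmoid(sum(b * x for b, x in zip(beta, row))) for row in X]
m = [p * (1.0 - p) for p in pi]

# Element-wise form: H_kr = sum_i x_ik * x_ir * pi(X_i) * (1 - pi(X_i)).
H_elem = [[sum(X[i][k] * X[i][r] * m[i] for i in range(N)) for r in range(D)]
          for k in range(D)]

# Matrix form: H = X^T (M X), with M applied as a row scaling of X.
MX = [[m[i] * X[i][j] for j in range(D)] for i in range(N)]
H_mat = [[sum(X[i][k] * MX[i][r] for i in range(N)) for r in range(D)]
         for k in range(D)]
```

Both forms give the same symmetric D × D matrix, which is the matrix whose inverse supplies the variances used by the Wald test.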
From equation (2) above,

$$\mathrm{Wald}_k = \left(\frac{\hat{\beta}_k}{SE(\hat{\beta}_k)}\right)^{2}$$

it can be seen that, for the k-th model parameter, the larger the standard deviation of the model parameter, that is, the larger the value in the k-th row and k-th column of the covariance matrix, the greater the oscillation that the model parameter causes in the logistic regression model, and the smaller the Wald test value corresponding to the model parameter.
After the significance test value $\mathrm{Wald}_k$ of the k-th model parameter is determined, the $z_k$ statistic can also be obtained according to

$$z_k = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)}$$

and the p-value can be obtained according to $p\_value = 2\,[1 - \mathrm{norm.cdf}(|z_k|)]$, where norm.cdf is the cumulative distribution function of the normal distribution. When the p-value is smaller than the significance level threshold, the null hypothesis is rejected, the model parameter can be retained for modeling, and the effective value of the feature item corresponding to the model parameter may be 1 or another higher value; when the p-value is not smaller than the significance level threshold, the null hypothesis is accepted, the model parameter is not retained, and the effective value of the feature item corresponding to the model parameter may be 0 or another lower value. The significance level threshold may typically be 0.05, 0.01, or the like.
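The decision rule can be sketched in plaintext with the Python standard library. The function name `wald_test`, the example parameter values, and the mapping of the outcome to effective values 1 and 0 are illustrative assumptions; in the method itself these quantities are computed on fragments.

```python
from statistics import NormalDist

def wald_test(beta_k, se_k, alpha=0.05):
    """Return (Wald value, z statistic, p-value, effective value) for one parameter."""
    z = beta_k / se_k
    wald = z * z
    # Two-sided p-value under the standard normal distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    effective = 1 if p_value < alpha else 0
    return wald, z, p_value, effective

# A parameter that is large relative to its standard deviation is retained ...
strong = wald_test(beta_k=0.8, se_k=0.2)
# ... while one that is small relative to its standard deviation is not.
weak = wald_test(beta_k=0.1, se_k=0.2)
```

Because the square of a standard normal variable follows a chi-square distribution with 1 degree of freedom, the two-sided p-value computed from $z_k$ is the same as the chi-square tail probability of $\mathrm{Wald}_k$.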
Logistic regression analysis is a statistical method that distinguishes independent variables from dependent variables and characterizes the relationship between them. The regression equation that is built is meaningful only if the independent and dependent variables are indeed related. Therefore, regression analysis must address whether an independent variable is related to the dependent variable being predicted, how strong the correlation is, and how reliable that correlation is. Logistic regression analysis may use the Wald test to test the regression coefficients one by one. If the Wald test indicates that certain independent variables are significant, they should be included in the model; if it indicates that they are not significant, they may be omitted from the model. The model parameters of the service prediction model can thus be evaluated using logistic regression analysis and the Wald test, and the feature items of the object samples can then be screened based on the evaluation results, so as to reduce the dimension of the service data.
In this embodiment, in step S230, the correlation data includes covariance matrix data, and the correlation data slices include covariance matrix slices. Covariance matrix patches of multiple participants can constitute a covariance matrix assuming reconstruction. The covariance matrix is a matrix formed by the covariance between two feature items in a plurality of feature items in the joint data, wherein the elements on the main diagonal are the variances of the plurality of feature items, and the elements on the off-diagonal are the covariance between the two feature items. The covariance matrix is a symmetric matrix, and when there are D feature entries in the joint data, the covariance matrix may be a symmetric matrix in D × D dimensions.
When determining the pieces of correlation data corresponding to the plurality of participants, respectively, in step S230, that is, determining the pieces of covariance matrix corresponding to the plurality of participants, respectively, the participant devices of the plurality of participants may perform the following steps 1 and 2.
Step 1, determining the intermediate matrix fragments corresponding to the respective participants based on the joint data fragments of the participants, the predicted value data of the selected participant, and the functional relation in the service prediction model. For example, the first participant A obtains the intermediate matrix fragment $\langle H \rangle_A$ and the second participant B obtains the intermediate matrix fragment $\langle H \rangle_B$; the intermediate matrix fragments yield the intermediate matrix H under assumed reconstruction. The participants do not actually reconstruct the intermediate matrix fragments; this only expresses the relationship among the fragments.
Step 2, calculating the fragments of the inverse of the intermediate matrix corresponding to the respective participants based on the intermediate matrix fragments of the participants, to obtain the covariance matrix fragments corresponding to the respective participants. For example, the first participant A obtains the fragment $\langle H^{-1} \rangle_A$ of the inverse of the intermediate matrix and the second participant B obtains the fragment $\langle H^{-1} \rangle_B$; these fragments yield the inverse $H^{-1}$ of the intermediate matrix under assumed reconstruction. The participants do not actually reconstruct the fragments of the inverse of the intermediate matrix; this only expresses the relationship among the fragments.
In step 1, determining the intermediate matrix shards respectively corresponding to the multiple participants may include the following steps 1_1 and 1_2.
Step 1_1: divide the Hessian matrix expression obtained from the functional relation in the service prediction model into a plurality of blocks according to the assumed splicing relation of the service data of the multiple participants. The Hessian matrix expression involves the joint data and the predicted value data.
Step 1_2: according to the participant data associated with each block, the multiple participants respectively determine the data shards of the blocks by using the joint data shards of the multiple participants and the corresponding data in the predicted value data of the selected participant, and respectively determine, based on the data shards of the blocks, the corresponding Hessian matrix shards, which serve as the intermediate matrix shards.
When the service prediction model is a logistic regression model, the functional relation of the service prediction model, that is, the functional relation of the model prediction value, is shown in formula (1) above. After the logistic regression model is trained, the corresponding model parameters, such as β, are obtained. The Hessian matrix expression is a second derivative with respect to the model parameter β, i.e., formula (5)
Figure BDA0003085753250000151
where H is the Hessian matrix obtained from the functional relation in the service prediction model, H_kr is the element in row k and column r of H, x_ik and x_ir are elements of the joint data X, X_i denotes the feature data of the i-th sample, π(X_i) is the predicted value data, and i ranges from 1 to N.
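As a plaintext illustration of the Hessian expression, the sketch below sums the term x_ik π(X_i)[π(X_i) - 1] x_ir (the summand quoted in this description) over i = 1..N. It assumes that π(X_i) is the usual logistic function of X_i β without an intercept; the function names and the data are hypothetical, and no privacy protection is modeled.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian(X, beta):
    """H[k, r] = sum_i X[i, k] * pi_i * (pi_i - 1) * X[i, r]."""
    pi = sigmoid(X @ beta)            # predicted values, one per sample
    w = pi * (pi - 1.0)               # per-sample weight in the summand
    return X.T @ (w[:, None] * X)     # D x D matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))           # N = 6 samples, D = 3 feature items
beta = np.array([0.5, -1.0, 0.25])    # hypothetical trained parameters
H = hessian(X, beta)

# Element-wise check of one entry against the explicit sum.
pi = sigmoid(X @ beta)
h01 = sum(X[i, 0] * pi[i] * (pi[i] - 1.0) * X[i, 1] for i in range(6))
```

The resulting matrix is symmetric, which is what allows one off-diagonal block to be obtained from the other by transposition.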
In this embodiment, the Hessian matrix H may be divided into a plurality of blocks according to the assumed splicing relation of the service data of the multiple participants. This assumed splicing relation is the relation by which the service data of the multiple participants would be spliced into the joint data, horizontally or vertically, in a given participant order.
In the application scenario of data vertical slicing, the business data of any one participant includes feature values of part of feature items of all objects. Referring to table 1, where the service data of any one participant includes feature values of partial feature items of N objects, the service data of multiple participants may be combined based on horizontal concatenation to obtain joint data. Table 1 is merely an example of data distribution in which rows represent objects and columns represent feature items. In other embodiments, rows may be used to represent feature items and columns may be used to represent objects. In the following examples, the division of the hessian matrix in the data vertical slicing scenario is illustrated by taking the data distribution shown in table 1 as an example.
In such an application scenario, when the hessian matrix expression is partitioned, the hessian matrix expression may be partitioned into a first partition associated with data of a selected participant and a second partition associated with a plurality of participants.
For example, taking two participants as an example, the business data of the first participant a and the second participant B are conceptually spliced into joint data as shown in table 3. It should be noted that the splicing here is a virtual splicing and is not executed in practice.
TABLE 3
Figure BDA0003085753250000152
Figure BDA0003085753250000161
Here, the service data of the first participant A contains d_A feature items; the first participant A is assumed to be the selected participant, whose service data includes the tag data. The service data of the second participant B contains d_B feature items; the second participant B is not the selected participant. Expressed in matrix form, the service data of the first participant A and the second participant B can be hypothetically spliced into a joint data matrix X similar to Table 3, with N rows in total and d_A + d_B columns in total, that is, X has dimensions N × (d_A + d_B).
The Hessian matrix H is divided according to the expression of the formula (5) to obtain
Figure BDA0003085753250000162
Here, the Hessian matrix H is divided into 4 blocks, each of which is a submatrix. Note that the joint data has D = d_A + d_B columns in total; for x_ik and x_ir in formula (5), i takes values in rows 1 to N of the joint data, and k and r take values in columns 1 to D of the joint data. This is the theoretical case.
For the upper-left block [A], the data required to determine its elements H_kr is taken entirely from the data of the first participant A, and the predicted value data π(X_i) is also owned by the first participant A. That is, when k and r both take values in columns 1 to d_A, x_ik and x_ir in formula (5) both take values in the service data of the first participant A. Therefore, the upper-left block [A] can be computed independently by the first participant A. The upper-left block [A] is the first block, associated with the data of the selected participant.
In formula (9), the upper-right block [AB] and the lower-left block [BA] are transposes of each other; once one of them is determined, the other can be obtained by transposition. The block [AB] is taken as an example below. The data required for the elements H_kr in block [AB] is taken from the service data of both the first participant A and the second participant B. That is, when one of k and r takes values in columns 1 to d_A and the other takes values in columns d_A+1 to d_A+d_B, one of x_ik and x_ir in formula (5) takes values in the service data X_A of the first participant A and the other takes values in the service data X_B of the second participant B. Therefore, the upper-right block [AB] can be computed jointly by the first participant A and the second participant B. The upper-right block [AB] belongs to the second block, associated with the data of both the first participant A and the second participant B.
The data required to determine the elements H_kr in the lower-right block [B] is also taken from the data of both the first participant A and the second participant B. That is, when k and r both take values in columns d_A+1 to d_A+d_B, x_ik and x_ir in formula (5) both take values in the service data X_B of the second participant B, but the predicted value data is owned by the first participant A, so the lower-right block [B] also requires joint computation by the first participant A and the second participant B. The lower-right block [B] belongs to the second block, associated with the data of both the first participant A and the second participant B.
Thus, the Hessian matrix H is divided into 4 blocks. Accordingly, the Hessian matrix shard <H>_A of the first participant A and the Hessian matrix shard <H>_B of the second participant B are each divided into 4 blocks, and the corresponding blocks in the two Hessian matrix shards constitute the blocks of the Hessian matrix under hypothetical reconstruction.
In the Hessian matrix shard <H>_A of the first participant A: the data shard <[A]>_A of the upper-left block (belonging to the first block) is determined from data in the joint data shard <X>_A and the predicted value data π(X_i) owned by the first participant A; the data shard <[AB]>_A of the upper-right block (belonging to the second block) is obtained through secure multi-party computation from data in the joint data shard <X>_A and the predicted value data π(X_i) owned by the first participant A and the joint data shard <X>_B owned by the second participant B; the data shard <[BA]>_A of the lower-left block (belonging to the second block) is obtained by transposing the data shard <[AB]>_A of the upper-right block; and the data shard <[B]>_A of the lower-right block (belonging to the second block) is obtained through secure multi-party computation from the predicted value data π(X_i) owned by the first participant A and data in the joint data shard <X>_B owned by the second participant B.
In the Hessian matrix shard <H>_B of the second participant B: the data shard <[A]>_B of the upper-left block (belonging to the first block) is obtained by filling with 0 values; the data shard <[AB]>_B of the upper-right block (belonging to the second block) is obtained through secure multi-party computation from data in the joint data shard <X>_A and the predicted value data π(X_i) owned by the first participant A and the joint data shard <X>_B owned by the second participant B; the data shard <[BA]>_B of the lower-left block (belonging to the second block) is obtained by transposing the data shard <[AB]>_B of the upper-right block; and the data shard <[B]>_B of the lower-right block (belonging to the second block) is obtained through secure multi-party computation from the predicted value data π(X_i) owned by the first participant A and data in the joint data shard <X>_B owned by the second participant B.
To summarize, for the first block, the device of the selected participant (e.g., the first participant A) may determine its data shard of the first block using its joint data shard and the predicted value data, while the devices of the other participants (e.g., the second participant B) may fill the first block with 0 values to obtain their data shards.
For the second block, Secret sharing Matrix Multiplication (SMM) from secure multi-party computation may be used: through interaction among the multiple participants, the data shards of the second block respectively corresponding to the multiple participants are determined using the joint data shards of the multiple participants and the corresponding data in the predicted value data of the selected participant.
The first participant device may splice the first participant A's data shards of the first block and of the second block to obtain the Hessian matrix shard <H>_A of the first participant A. The second participant device may splice the second participant B's data shards of the first block and of the second block to obtain the Hessian matrix shard <H>_B of the second participant B.
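The block layout for the vertical slicing scenario can be sketched in plaintext as follows. This is an illustration under stated assumptions, not a secure protocol: the helper `split` is a hypothetical stand-in that merely produces two additive shards of a block that SMM would compute jointly, and all data values are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d_a, d_b = 8, 2, 3
X_a = rng.normal(size=(N, d_a))       # feature items held by participant A
X_b = rng.normal(size=(N, d_b))       # feature items held by participant B
pi = rng.uniform(0.1, 0.9, size=N)    # predicted values, held by A
w = pi * (pi - 1.0)

blk_A = X_a.T @ (w[:, None] * X_a)    # first block: A computes it locally
blk_AB = X_a.T @ (w[:, None] * X_b)   # second block: needs both participants
blk_B = X_b.T @ (w[:, None] * X_b)    # second block: pi from A, X_b from B

def split(m):
    """Stand-in for an SMM result: two additive shards of m."""
    r = rng.normal(size=m.shape)
    return r, m - r

ab_a, ab_b = split(blk_AB)
b_a, b_b = split(blk_B)

# A keeps [A] in full; B fills [A] with zeros. [BA] shards are transposes.
H_a = np.block([[blk_A, ab_a], [ab_a.T, b_a]])
H_b = np.block([[np.zeros((d_a, d_a)), ab_b], [ab_b.T, b_b]])

# Plaintext reference, computed only to check the shards.
X = np.hstack([X_a, X_b])
H = X.T @ (w[:, None] * X)
```

Only the [AB] and [B] blocks involve interaction; the [A] block and the transposition for [BA] are local operations.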
The following describes how SMM is used, through interaction among the multiple participants, to determine the data shards of the second block respectively corresponding to the multiple participants from the joint data shards of the multiple participants and the corresponding data in the predicted value data of the selected participant. The example used is determining, by SMM, the data shards <[AB]>_A and <[AB]>_B of the upper-right block for the first participant A and the second participant B from the joint data shard <X>_A and the predicted value data π(X_i) owned by the first participant A and the joint data shard <X>_B owned by the second participant B.
The calculation follows the expression x_ik π(X_i)[π(X_i) - 1] * x_ir. It requires partial data of the joint data shard <X>_A of the first participant A, the predicted value data π(X_i), and the joint data shard <X>_B of the second participant B; through joint SMM computation, the first participant A and the second participant B determine the data shards <[AB]>_A and <[AB]>_B respectively.
To make the description more concise, the quantity to be calculated is simplified as follows. The first participant A owns matrix data shards <x>_A and <y>_A, the second participant B owns matrix data shards <x>_B and <y>_B, and the target to be calculated is xy, where x = <x>_A + <x>_B and y = <y>_A + <y>_B. Through joint SMM computation, the first participant A obtains <Y>_A and the second participant B obtains <Y>_B, with xy = <Y>_A + <Y>_B. The specific calculation process may follow the flow shown in Fig. 3.
Fig. 3 is a schematic diagram of a calculation flow of secret sharing matrix multiplication application in this embodiment, which includes the following specific steps.
Step 1: both participants respectively obtain a random matrix triple. The first participant A obtains <u>_A, <v>_A, <z>_A and the second participant B obtains <u>_B, <v>_B, <z>_B, satisfying z = u × v, where z = <z>_A + <z>_B, u = <u>_A + <u>_B, and v = <v>_A + <v>_B.
Step 2: the first participant A masks its private data with the random numbers to obtain its secret matrices, computing <d>_A = <x>_A - <u>_A and <e>_A = <y>_A - <v>_A. The second participant B likewise masks its private data with the random numbers to obtain its secret matrices, computing <d>_B = <x>_B - <u>_B and <e>_B = <y>_B - <v>_B.
Step 3: the participants exchange their secret matrices, and each combines its own secret matrices with those received. The first participant A sends <d>_A and <e>_A to the second participant B, and the second participant B sends <d>_B and <e>_B to the first participant A. The first participant A calculates d = <d>_A + <d>_B and e = <e>_A + <e>_B, and the second participant B calculates the same d = <d>_A + <d>_B and e = <e>_A + <e>_B.
Step 4: the participants respectively calculate their data shards. The first participant A calculates <Y>_A = <z>_A + <u>_A * e + d * <v>_A + d * e, and the second participant B calculates <Y>_B = <z>_B + <u>_B * e + d * <v>_B. In theory, <Y>_A + <Y>_B = x * y, since x * y = (d + u)(e + v) = d * e + d * v + u * e + z.
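Steps 1 to 4 can be checked numerically with the following sketch. It is a single-process simulation under stated assumptions: the triple is generated in the clear by a hypothetical dealer, shares are floating-point rather than ring elements, and the `share` helper is illustrative. The masked values are combined by addition, d = <d>_A + <d>_B and e = <e>_A + <e>_B, which is what makes x = d + u and y = e + v hold.

```python
import numpy as np

rng = np.random.default_rng(3)

def share(m):
    """Split m into two additive shards (illustrative helper)."""
    r = rng.normal(size=m.shape)
    return r, m - r

x = rng.normal(size=(2, 3))
y = rng.normal(size=(3, 2))
x_a, x_b = share(x)
y_a, y_b = share(y)

# Step 1: random triple with z = u @ v, dealt as additive shards.
u = rng.normal(size=x.shape)
v = rng.normal(size=y.shape)
u_a, u_b = share(u)
v_a, v_b = share(v)
z_a, z_b = share(u @ v)

# Step 2: each participant masks its private shards.
d_a, e_a = x_a - u_a, y_a - v_a
d_b, e_b = x_b - u_b, y_b - v_b

# Step 3: masked values are exchanged and summed by both participants.
d = d_a + d_b          # equals x - u
e = e_a + e_b          # equals y - v

# Step 4: local computation of the output shards.
Y_a = z_a + u_a @ e + d @ v_a + d @ e
Y_b = z_b + u_b @ e + d @ v_b
```

Only d and e cross the network; because u and v are random, they reveal nothing about x and y in an exact-arithmetic ring setting.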
In this way, based on partial data of the joint data shard <X>_A and the predicted value data of the first participant A, and the joint data shard <X>_B of the second participant B, SMM can be used to determine the shards, respectively corresponding to the multiple participants, of the product between the parts of the joint data and the predicted value data.
This embodiment provides an implementation for determining the Hessian matrix shards block by block in the data vertical slicing scenario. Only some of the blocks require data interaction among the multiple participants, while the others are computed locally, which greatly reduces the amount of data exchanged and streamlines the overall processing flow.
In the application scenario of data horizontal slicing, the business data of any one participant includes the feature values of all feature items for a subset of the objects. Referring to Table 2, where the service data of any one participant includes the feature values of the D feature items for some of the objects, the service data of the multiple participants may be combined by vertical concatenation to obtain the joint data. Table 2 is merely an example of data distribution in which rows represent objects and columns represent feature items. In other embodiments, rows may be used to represent feature items and columns may be used to represent objects. In the following examples, the division of the Hessian matrix in the data horizontal slicing scenario is illustrated by taking the data distribution shown in Table 2 as an example.
In such an application scenario, when the hessian matrix expression is divided, it may be divided into a plurality of blocks respectively associated with data of a plurality of parties.
For example, taking two participants as an example, the business data of the first participant a and the second participant B are conceptually spliced into joint data as shown in table 4. It should be noted that the splicing here is a virtual splicing and is not executed in practice.
TABLE 4
Figure BDA0003085753250000191
Here, the service data of the first participant A contains n_A objects, and the first participant A is the selected participant for these objects; its service data includes the tag data of these objects. The service data of the second participant B contains n_B objects, and the second participant B is the selected participant for these objects; its service data includes the tag data of these objects. Expressed in matrix form, the service data of the first participant A and the second participant B can be hypothetically spliced into a joint data matrix X similar to Table 4, with N = n_A + n_B rows in total and D columns in total, that is, X has dimensions (n_A + n_B) × D.
The Hessian matrix H is divided according to the expression of the formula (5) to obtain
Figure BDA0003085753250000192
Here, the Hessian matrix H is divided into 2 blocks, each of which is a submatrix. Note that the joint data has n_A + n_B rows in total; for x_ik and x_ir in formula (5), i takes values in rows 1 to N of the joint data, and k and r take values in columns 1 to D of the joint data. This is the theoretical case.
For an element H_kr of the Hessian matrix, the term
Figure BDA0003085753250000193
is calculated by the first participant A, and the term
Figure BDA0003085753250000194
is calculated by the second participant B; each is stored locally, and H_kr = <H_kr>_A + <H_kr>_B. Likewise, <H>_A and <H>_B can each be calculated locally by the corresponding participant.
Specifically, according to formula (5), the data required to determine the elements of the upper block [A] in formula (10) is all held by the first participant A. That is, when i takes values in rows 1 to n_A, x_ik and x_ir in formula (5) both take values in the service data of the first participant A, and the corresponding predicted value data π(X_i) is also owned by the first participant A. Therefore, the upper block [A] can be computed independently by the first participant A. The upper block [A] is the block associated with the data of the first participant A.
According to formula (5), the data required to determine the elements of the lower block [B] in formula (10) is all held by the second participant B. That is, when i takes values in rows n_A+1 to n_A+n_B, x_ik and x_ir in formula (5) both take values in the service data of the second participant B, and the corresponding predicted value data π(X_i) is also owned by the second participant B. Therefore, the lower block [B] can be computed independently by the second participant B. The lower block [B] is the block associated with the data of the second participant B.
Thus, the Hessian matrix H is divided into 2 blocks. Accordingly, the Hessian matrix shard <H>_A of the first participant A and the Hessian matrix shard <H>_B of the second participant B are each divided into 2 blocks, and the corresponding blocks in the two Hessian matrix shards constitute the blocks of the Hessian matrix under hypothetical reconstruction.
In the Hessian matrix shard <H>_A of the first participant A, the data shard <[A]>_A of the upper block is determined from data in the joint data shard <X>_A and the predicted value data π(X_i) owned by the first participant A, and the data shard <[B]>_A of the lower block is obtained by filling with 0 values.
In the Hessian matrix shard <H>_B of the second participant B, the data shard <[B]>_B of the lower block is determined from data in the joint data shard <X>_B and the predicted value data π(X_i) owned by the second participant B, and the data shard <[A]>_B of the upper block is obtained by filling with 0 values.
In summary, for the first participant A, the first participant device determines the data shard of the block associated with the data of the first participant A by using the joint data shard and the predicted value data of the first participant A, fills 0 values into the blocks associated with the data of the other participants to obtain the data shards of those blocks, and splices the data shards of the blocks to obtain the Hessian matrix shard of the first participant A.
For the second participant B, the second participant device determines the data shard of the block associated with the data of the second participant B by using the joint data shard and the predicted value data of the second participant B, fills 0 values into the blocks associated with the data of the other participants to obtain the data shards of those blocks, and splices the data shards of the blocks to obtain the Hessian matrix shard of the second participant B.
This embodiment provides an implementation for dividing the Hessian matrix in the data horizontal slicing scenario and determining the Hessian matrix shards block by block. The Hessian matrix shards can be determined by each participant without interaction among the multiple participants, which reduces the amount of data exchanged among the participants over the whole process.
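The additive relation H_kr = <H_kr>_A + <H_kr>_B in the horizontal slicing scenario can be checked with a short plaintext sketch. It assumes that each participant sums the expression x_ik π(X_i)[π(X_i) - 1] x_ir over its own rows only; the function name `local_hessian` and all data values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n_a, n_b, D = 5, 7, 3
X_a = rng.normal(size=(n_a, D))            # objects held by participant A
X_b = rng.normal(size=(n_b, D))            # objects held by participant B
pi_a = rng.uniform(0.1, 0.9, size=n_a)     # A's predicted values
pi_b = rng.uniform(0.1, 0.9, size=n_b)     # B's predicted values

def local_hessian(X, pi):
    """Sum of x_ik * pi_i * (pi_i - 1) * x_ir over the rows of X."""
    w = pi * (pi - 1.0)
    return X.T @ (w[:, None] * X)

H_a = local_hessian(X_a, pi_a)             # computed by A without interaction
H_b = local_hessian(X_b, pi_b)             # computed by B without interaction

# Plaintext reference over the vertically spliced joint data.
H = local_hessian(np.vstack([X_a, X_b]), np.concatenate([pi_a, pi_b]))
```

Because the sum over i separates by rows, no message needs to be exchanged at this stage.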
Returning to step 2, the step of computing, based on the intermediate matrix shards <H> of the multiple participants, the shards <H^{-1}> of the inverse of the intermediate matrix respectively corresponding to the multiple participants, so as to obtain the covariance matrix shards <Cov> respectively corresponding to the multiple participants, may use a Secret sharing Matrix Inversion (SMI) algorithm: based on the intermediate matrix shards <H> of the multiple participants, the covariance matrix shards <Cov> respectively corresponding to the multiple participants are obtained through iterative computation. Here the covariance matrix equals the inverse of the intermediate matrix, Cov = H^{-1}.
For example, given the intermediate matrix shard <H>_A of the first participant A and the intermediate matrix shard <H>_B of the second participant B, the results <H^{-1}>_A and <H^{-1}>_B can be calculated iteratively using SMI. The intermediate matrix shards <H>_A and <H>_B yield the intermediate matrix H under hypothetical reconstruction, and H^{-1} is the inverse of H, but the first participant A and the second participant B do not reconstruct H. Therefore, given <H>_A and <H>_B and without reconstructing H, the first participant A and the second participant B need to determine <H^{-1}>_A and <H^{-1}>_B respectively. Because the intermediate matrix H is not reconstructed, leakage of private data can be avoided.
The following describes the process of iteratively calculating the covariance matrix shards using SMI, taking two participants as an example. Given: the first participant A owns the intermediate matrix shard <H>_A, the second participant B owns the intermediate matrix shard <H>_B, and H = <H>_A + <H>_B. Goal: the first participant A obtains <H^{-1}>_A and the second participant B obtains <H^{-1}>_B, with H^{-1} = <H^{-1}>_A + <H^{-1}>_B.
During initialization, the first participant A and the second participant B jointly compute L_0:
L_0 = tr(H)^{-1} = [tr(<H>_A) + tr(<H>_B)]^{-1}
where tr denotes the trace of a matrix.
In any iteration, the multiple participants use SMM and compute according to the following iterative formula
L_{k+1} = L_k(2I - H L_k) = (<L_k>_A + <L_k>_B)[2I - (<H>_A + <H>_B)(<L_k>_A + <L_k>_B)]
where I is the identity matrix. Each iteration requires 2 SMM operations. The number of iteration rounds may be preset, for example to 20 to 32, where k is the iteration index.
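The iteration L_{k+1} = L_k(2I - H L_k), commonly known as the Newton-Schulz iteration, can be tried in plaintext as sketched below. The sketch makes two assumptions that the text does not spell out: the initial value is read as the scalar tr(H)^{-1} times the identity matrix, and H is taken to be a symmetric positive definite example matrix so that convergence is guaranteed. In the secret-shared setting each matrix product would be replaced by an SMM call.

```python
import numpy as np

def newton_schulz_inverse(H, rounds=32):
    """Approximate H^{-1} with L_{k+1} = L_k (2I - H L_k)."""
    identity = np.eye(H.shape[0])
    L = identity / np.trace(H)        # assumed reading of L_0 = tr(H)^{-1}
    for _ in range(rounds):
        # Two matrix products per round, matching the 2 SMM calls in the text.
        L = L @ (2.0 * identity - H @ L)
    return L

H = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])       # hypothetical well-conditioned matrix
H_inv = newton_schulz_inverse(H)
```

The residual I - H L_k is squared at every round, which is why a few dozen rounds suffice for a well-conditioned matrix.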
Returning to step S230, when determining, based on the model parameter shards and the covariance matrix shards of the multiple participants, the effective value of the feature item corresponding to a model parameter in improving the effect of the service prediction model, formula (2) of the Wald test may be used
Figure BDA0003085753250000211
or formula (8)
Figure BDA0003085753250000212
to calculate a significance test value (or significance level value) of the k-th model parameter; the effective value of the feature item corresponding to the model parameter in improving the effect of the service prediction model is then determined based on the significance test value and the initial hypothesis.
When determining Wald_k or z_k, the numerator
Figure BDA0003085753250000213
is the model parameter, and the denominator
Figure BDA0003085753250000214
is the standard deviation of the model parameter, which can be obtained as the square root of the variance of the model parameter; the diagonal elements of the covariance matrix are the variances of the corresponding model parameters. Next, a secret sharing root-number inverse (SNSI) algorithm, that is, a secret-shared inverse square root, may be used to determine the effective value of the feature item corresponding to the model parameter based on the model parameter shards and the covariance matrix shards of the multiple participants. Specifically, the following steps 1b and 2b may be included.
Step 1b: the participant devices take the diagonal elements of the covariance matrix shards of the multiple participants as the variance shards respectively corresponding to the model parameters. The diagonal elements here are the main diagonal elements. In the covariance matrix, the main diagonal elements are the variances of the feature items; correspondingly, in a covariance matrix shard, the main diagonal elements are the variance shards of the feature items.
Step 2b: for any model parameter, the first participant device, using the SNSI algorithm and the significance test method, jointly performs a secure inverse square root operation through interaction among the participant devices, based on the corresponding model parameter shard of the first participant A and the corresponding variance shards of the multiple participants, to determine the significance test value shard of the first participant A for that model parameter. The effective value of the feature item corresponding to the model parameter is then determined based on the significance test value shards of the multiple participants for that model parameter.
Similarly, for any model parameter, the second participant device, using the SNSI algorithm and the significance test method, jointly performs the secure inverse square root operation through interaction among the participant devices, based on the corresponding model parameter shard of the second participant B and the corresponding variance shards of the multiple participants, to determine the significance test value shard of the second participant B for that model parameter.
In one embodiment, the significance test value shards of the multiple participants may be sent to one of the participant devices or to a third-party device, which reconstructs the significance test value; based on the significance test value, the effective value of the corresponding feature item may be determined according to a predetermined transformation. In another embodiment, the significance test value shards of the multiple participants may be used directly as effective value shards, and the effective value is obtained by reconstructing these shards.
The significance test value may be calculated based on formula (2) or formula (8) above, or on the p_value formula, and the resulting significance test value shard may be, but is not limited to, a Wald_k value shard, a z_k value shard, or a p-value shard.
The model parameter shards of the multiple participants yield the model parameter under hypothetical reconstruction. For example, for any model parameter β_1, the model parameter shard <β_1>_A of the first participant A and the model parameter shard <β_1>_B of the second participant B yield the model parameter β_1 under hypothetical reconstruction. The model parameter shards are not actually reconstructed; the reconstruction is mentioned only to illustrate the relationship between the model parameter shards and the model parameter.
It can be seen that, in the embodiment, when the significance test value is calculated, diagonal elements in covariance matrix fragments of multiple participants are used, and data in the covariance matrix is not reconstructed, so that security of private data in the covariance matrix can be well protected.
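For orientation, the plaintext quantity that the shards would reconstruct to can be sketched as follows: the model parameter divided by the square root of the corresponding diagonal element of the covariance matrix. The parameter values and the covariance matrix are hypothetical, and the two-sided p-value under a standard normal reference distribution is an assumption about the p_value formula, which is not reproduced in this section.

```python
import math
import numpy as np

beta = np.array([0.8, -0.1, 1.5])        # hypothetical model parameters
cov = np.diag([0.04, 0.09, 0.25])        # hypothetical covariance matrix

variances = np.diag(cov)                 # main diagonal = parameter variances
std_dev = np.sqrt(variances)             # denominator of the test statistic
z = beta / std_dev                       # numerator / denominator

# Two-sided p-value, assuming a standard normal reference distribution:
# p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
```

A large |z| (small p) suggests that the corresponding feature item contributes to the model, while a small |z| suggests that it may not.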
For step 2b, the following describes, for any model parameter β_k, how the first participant device, using the SNSI algorithm and the significance test method and through interaction among the participant devices, jointly performs the secure inverse square root operation based on the model parameter shard <β_k>_A of the first participant A and the variance shards of the multiple participants to determine the significance test value shard of the first participant A for β_k. In the same way, the second participant device can determine the significance test value shard of the second participant B for β_k.
Take formula (8) of the significance test method
Figure BDA0003085753250000221
as an example. For the first participant, formula (8) can be rewritten as
Figure BDA0003085753250000222
where <z_k>_A is the significance test value shard of the first participant A for the model parameter β_k; the numerator is the model parameter shard of the first participant A; and in the denominator, <Cov_kk>_A is the variance shard corresponding to β_k owned by the first participant A, which is also the k-th diagonal element of the covariance matrix shard of the first participant A, and <Cov_kk>_B is the variance shard corresponding to β_k owned by the second participant B, which is also the k-th diagonal element of the covariance matrix shard of the second participant B.
The numerator is owned by the first participant A, while the denominator involves data owned by both the first participant A and the second participant B. The problem therefore reduces to how to compute the inverse square root in formula (11). In this embodiment, the SNSI algorithm is used to determine the inverse square root of the sum of the variance shard of the first participant A for β_k and the variance shard of the second participant B for β_k; based on this inverse square root and the model parameter shard <β_k>_A of the first participant A, the significance test value shard of the first participant A for β_k can be obtained. The inverse square root in formula (11) is as follows
Figure BDA0003085753250000231
Next, steps 1c to 3c describe how to calculate the inverse square root (<Cov_kk>_A + <Cov_kk>_B)^{-1/2} using the SNSI algorithm. For convenience, let n_a = <Cov_kk>_A and n_b = <Cov_kk>_B, and let n denote the variance of the model parameter β_k, i.e., n = n_a + n_b. The goal is for the first participant device to obtain c_a and the second participant device to obtain c_b such that c_a + c_b = (n_a + n_b)^{-1/2} = n^{-1/2}.
Step 1c: the first participant device and the second participant device convert the additive shards into multiplicative shards through interaction.
The first participant device locally generates a random number x_a, and then lets
Figure BDA0003085753250000232
Figure BDA0003085753250000233
The first party device and the second party device jointly calculate through secret sharing matrix multiplication
Figure BDA0003085753250000234
to obtain x_ba2 and x_bb respectively.
The first participant device calculates x_ba = x_ba1 + x_ba2 and sends x_ba to the second participant device (x_ba1 and x_ba2 must not be sent separately);
the second participant device calculates x_b = x_ba + x_bb, so that n = x_a × x_b, which converts the additive sharing n = n_a + n_b into the multiplicative sharing n = x_a × x_b. At this point, the first participant A owns x_a and the second participant B owns x_b.
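The conversion in step 1c can be sketched as below. Because the intermediate formulas are given only as figures, the sketch rests on an assumed reading that is consistent with the stated result n = x_a × x_b: x_ba1 is taken to be n_a / x_a, computed locally by the first participant, and x_ba2 and x_bb are taken to be additive shards of n_b × (1 / x_a) produced by the secret sharing multiplication. The multiplication itself is replaced by a plaintext stand-in, and the numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n_a, n_b = 0.03, 0.01             # additive variance shards, n = n_a + n_b

x_a = rng.uniform(0.5, 2.0)       # random factor generated locally by A
x_ba1 = n_a / x_a                 # assumed local term at A

# Stand-in for the secret sharing multiplication of n_b (held by B) and
# 1 / x_a (held by A). In a real protocol neither side would see `prod`;
# only the two shards x_ba2 (at A) and x_bb (at B) would exist.
prod = n_b * (1.0 / x_a)
x_ba2 = rng.normal()
x_bb = prod - x_ba2

x_ba = x_ba1 + x_ba2              # A sends only this sum to B
x_b = x_ba + x_bb                 # B's multiplicative shard
```

After the exchange, A holds x_a, B holds x_b, and their product equals the variance n that neither participant knows in full.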
Step 2c: the two participant devices each initialize their iteration estimate locally.
Taking the first participant A as an example, the first participant device reads the 64-bit floating-point number $x_a$ as a 64-bit integer and shifts it right by one bit (i.e., divides by 2 and rounds down), denoting the result $int_a$; it then computes 0x5fe6eb50c7b537a9 - $int_a$ and reinterprets the result according to the 64-bit floating-point storage format, denoting it $y_a$. In this way, $x_a$ is initialized to $y_a$.
Similarly, the second participant device performs the same initialization, so that $x_b$ is initialized to $y_b$. At this point, the first participant A holds $y_a$ and the second participant B holds $y_b$.
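The bit-level initialization of step 2c can be sketched with Python's standard `struct` module. The function name `initial_inverse_sqrt` is hypothetical, and the sketch assumes the input is a positive, finite 64-bit float; the result is only a rough first guess for $x^{-1/2}$ that the later Newton iteration refines.

```python
import struct

MAGIC = 0x5FE6EB50C7B537A9  # constant quoted in step 2c


def initial_inverse_sqrt(x):
    """Bit-level first guess for x ** -0.5, for a positive finite 64-bit float x."""
    # Read the float's 8 bytes as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Shift right by one bit and subtract from the constant.
    guess_bits = MAGIC - (bits >> 1)
    # Read the resulting integer back as a 64-bit float.
    (y,) = struct.unpack("<d", struct.pack("<q", guess_bits))
    return y


for x in (0.25, 1.0, 2.0, 100.0):
    print(x, initial_inverse_sqrt(x), x ** -0.5)
```

Each participant would apply this locally to its own multiplicative fragment, so no communication is needed in this step.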
Step 3c: the two participants jointly use Newton's method to iteratively compute $n^{-1/2}$.
The initial value of the iteration is $Y_0 = Y_{0a} \times Y_{0b} = y_a \times y_b$, whose two factors are held by the two participants respectively. The iteration formula is as follows:
Figure BDA0003085753250000235
The iteration process uses two secret sharing matrix multiplications and performs one iteration in total, after which the first participant A and the second participant B obtain the floating-point numbers $c_a$ and $c_b$, respectively.
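A minimal sketch of one such iteration is given below. The iteration formula itself is an equation image that is not recoverable from the text, so the sketch assumes the standard Newton update for the inverse square root, $Y_{i+1} = Y_i (3 - n Y_i^2) / 2$. With $n = x_a \times x_b$ and $Y_0 = y_a \times y_b$ this expands to $Y_1 = \left(3\, y_a y_b - (x_a y_a^3)(x_b y_b^3)\right) / 2$, which needs two cross-participant products and is therefore consistent with the two secret sharing multiplications mentioned above. The helper `smm_scalar` is a hypothetical, insecure stand-in for secret sharing multiplication.

```python
import random


def smm_scalar(u, v, rng):
    """Single-process stand-in for secret sharing multiplication (not secure)."""
    share_a = rng.uniform(-1.0, 1.0)
    return share_a, u * v - share_a


def newton_step_shared(x_a, y_a, x_b, y_b, rng):
    """One assumed Newton step for n ** -0.5 with n = x_a * x_b and Y0 = y_a * y_b.

    Returns additive shares (c_a, c_b) of Y1 = Y0 * (3 - n * Y0**2) / 2.
    """
    # First product: additive shares of Y0 = y_a * y_b.
    p_a, p_b = smm_scalar(y_a, y_b, rng)
    # Second product: additive shares of n * Y0**3 = (x_a * y_a**3) * (x_b * y_b**3).
    q_a, q_b = smm_scalar(x_a * y_a ** 3, x_b * y_b ** 3, rng)
    # Each participant combines its own shares locally.
    c_a = (3.0 * p_a - q_a) / 2.0
    c_b = (3.0 * p_b - q_b) / 2.0
    return c_a, c_b


rng = random.Random(1)
x_a, x_b = 1.6, 0.625                                # multiplicative fragments of n = 1.0
y_a, y_b = 1.03 * x_a ** -0.5, 0.97 * x_b ** -0.5    # deliberately rough initial guesses
c_a, c_b = newton_step_shared(x_a, y_a, x_b, y_b, rng)
print(c_a + c_b)  # approximates (x_a * x_b) ** -0.5
```

Because both products come back as additive shares, the result of the step is already in the additive form $c_a + c_b \approx n^{-1/2}$ that the significance test needs.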
Step 2b may also be implemented in other ways. For example, the variance fragment of the first participant A and the variance fragment of the second participant B are first securely normalized, an initial iteration value is then obtained by linear approximation, and the iteration is finally performed based on the Goldschmidt algorithm. In this embodiment, the secret sharing matrix multiplication may be performed based on the variance fragment of the first participant A and the variance fragment of the second participant B before other operations are performed.
In this specification, "first" in terms such as the first participant and the first feature item, and "second" in the corresponding terms, are used only for convenience of distinction and description and have no limiting meaning.
In this specification, the number of the plurality of participants may be 2, 3 or more, each participant performs various operations through a corresponding participant device, and the participant device may be implemented by any device, platform, device cluster, etc. having computing and processing capabilities.
In the embodiments of the present specification, the case of two participants is described in more detail as an example. For instance, the secure multi-party computation algorithms such as secret sharing matrix multiplication, secret sharing inverse square root, and secret sharing matrix inversion are described for two participants; the two-participant implementation can readily be extended to scenarios with more participants, and the detailed process is not repeated.
The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown or in sequential order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 4 is a schematic block diagram of an apparatus for determining an effective value of a service data characteristic of control traffic according to an embodiment. The business data are distributed among a plurality of participants; the respective business data of the participants form joint data when hypothetically spliced together, and the joint data comprise feature values of a plurality of objects for a plurality of feature items. The apparatus 400 is deployed in any first participant device, which may be implemented by any apparatus, device, platform, or device cluster having computing and processing capabilities. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2. The apparatus 400 comprises:
an obtaining module 410, configured to obtain the joint data fragment of the first participant, and to obtain predicted value fragments respectively corresponding to the plurality of objects and model parameter fragments respectively corresponding to the plurality of feature items; the predicted value fragments and the model parameter fragments are obtained based on a trained service prediction model;
a reconstruction module 420, configured to reconstruct complete predicted value data in a selected participant device from a plurality of predicted value fragments, through transmission of predicted value fragments between participant devices and the selected participant device;
an interaction module 430, configured to determine, using secure multi-party computation and through interaction among the plurality of participant devices, relevance data fragments respectively corresponding to the plurality of participants based on the joint data fragments of the plurality of participants and the predicted value data of the selected participant, where the relevance data fragments comprise relevance data among the plurality of feature items;
a verification module 440, configured to determine, using a significance test method and through secure interaction among the plurality of participant devices, the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model, based on the model parameter fragments of the plurality of participants and the corresponding data in the relevance data fragments.
In one embodiment, the obtaining module 410, when obtaining the joint data fragment of the first participant, includes:
using additive secret sharing, performing splitting and splicing operations on the service data of the plurality of participants through interaction with the other participant devices, so that each of the plurality of participants obtains a joint data fragment; the joint data fragments of the plurality of participants yield the joint data when hypothetically reconstructed.
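The splitting and splicing operations can be illustrated with a small two-participant sketch. It assumes integer-encoded data, a ring of size $2^{32}$, and feature-wise (column) splicing; none of these choices is stated in this passage, and the function names are hypothetical.

```python
import random

MODULUS = 2 ** 32  # assumed ring size for this sketch


def split(matrix, rng):
    """Split an integer matrix into two additive shares modulo MODULUS."""
    kept = [[rng.randrange(MODULUS) for _ in row] for row in matrix]
    sent = [[(v - k) % MODULUS for v, k in zip(row, krow)]
            for row, krow in zip(matrix, kept)]
    return kept, sent


def splice_columns(left, right):
    """Concatenate two share matrices column-wise."""
    return [l + r for l, r in zip(left, right)]


def reconstruct(frag_a, frag_b):
    """Add two fragments element-wise to recover the joint data."""
    return [[(a + b) % MODULUS for a, b in zip(ra, rb)]
            for ra, rb in zip(frag_a, frag_b)]


rng = random.Random(0)
data_a = [[1, 2], [3, 4]]  # participant A: 2 objects, 2 feature items
data_b = [[5], [6]]        # participant B: same 2 objects, 1 feature item

a_keep, a_send = split(data_a, rng)  # A keeps a_keep and sends a_send to B
b_keep, b_send = split(data_b, rng)  # B keeps b_keep and sends b_send to A

frag_a = splice_columns(a_keep, b_send)  # A's joint data fragment
frag_b = splice_columns(a_send, b_keep)  # B's joint data fragment
joint = reconstruct(frag_a, frag_b)      # performed here only to check the sketch
print(joint)
```

In the scheme described above, the reconstruction is only hypothetical: neither participant ever holds both fragments, so neither sees the joint data in the clear.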
In one embodiment, the service prediction model is obtained by secure joint training based on the respective joint data fragments of the plurality of participants; the business prediction model is used for performing business prediction on the objects.
In an embodiment, when obtaining the predicted value fragments corresponding to the plurality of objects and the model parameter fragments corresponding to the plurality of feature items, the obtaining module 410 includes:
obtaining a model parameter fragment of the trained service prediction model in the local first participant device;
through interaction of a plurality of participants, the plurality of participants are enabled to determine predicted value fragments of the object respectively based on joint data fragments of the plurality of participants and the trained service prediction model.
In one embodiment, the selected participant comprises a participant that holds label data; the reconstruction module 420 is specifically configured to:
when the first participant is not the selected participant, the predicted value fragments of the first participant are sent to the selected participant equipment, so that the selected participant equipment utilizes the plurality of predicted value fragments to reconstruct complete predicted value data;
and when the first participant is the selected participant, receiving the predicted value fragments sent by other participants, and reconstructing the predicted value fragments of the first participant and the received predicted value fragments to obtain complete predicted value data.
In one embodiment, the relevance data comprises covariance matrix data, and the relevance data fragments comprise covariance matrix fragments; the interaction module 430 includes:
a determining submodule 431, configured to determine intermediate matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments of the plurality of participants, the predicted value data of the selected participant, and a functional relation in the service prediction model;
a calculating submodule 432, configured to calculate, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants, thereby obtaining covariance matrix fragments respectively corresponding to the plurality of participants.
In one embodiment, the determining submodule 431 is specifically configured to:
dividing a Hessian matrix expression, obtained based on the functional relation in the service prediction model, into a plurality of blocks according to the hypothetical splicing relation of the service data of the plurality of participants; the Hessian matrix expression comprises the joint data and the predicted value data;
and, according to which participants' data each of the blocks involves, having the plurality of participants respectively determine data fragments of the blocks using the joint data fragments of the plurality of participants and the corresponding data in the predicted value data of the selected participant, and respectively determine corresponding Hessian matrix fragments, based on the data fragments of the blocks, to serve as the intermediate matrix fragments.
In one embodiment, the business data of any one participant comprises characteristic values of part of characteristic items of all objects;
the determining submodule 431, when dividing the Hessian matrix expression obtained based on the functional relation in the service prediction model into a plurality of blocks, includes:
dividing the Hessian matrix expression into a first block associated with the data of the selected participant and a second block associated with a plurality of participants;
the determining submodule 431, when having the plurality of participants respectively determine the data fragments of the blocks and respectively determine the corresponding Hessian matrix fragments based on the data fragments of the blocks, includes:
when the first participant is the selected participant, determining the data fragment of the first block using the joint data fragment of the first participant and the predicted value data; when the first participant is not the selected participant, filling the first block with the value 0 to obtain the data fragment of the first block;
determining data fragments of the second block respectively corresponding to the plurality of participants using secret sharing matrix multiplication (SMM), through interaction among the plurality of participants, based on the joint data fragments of the plurality of participants and the corresponding data in the predicted value data of the selected participant;
and splicing the first participant's data fragment of the first block and data fragment of the second block to obtain the Hessian matrix fragment of the first participant.
In one embodiment, the business data of any one participant comprises feature values of all feature items of a part of objects;
the determining submodule 431, when dividing the Hessian matrix expression obtained based on the functional relation in the service prediction model into a plurality of blocks, includes: dividing the Hessian matrix expression into a plurality of blocks respectively associated with the data of the plurality of participants;
the determining submodule 431, when having the plurality of participants respectively determine the data fragments of the blocks and respectively determine the corresponding Hessian matrix fragments based on the data fragments of the blocks, includes:
determining the data fragment of the block associated with the data of the first participant using the joint data fragment and the predicted value data of the first participant, and filling the blocks associated with the data of the other participants with the value 0 to obtain their data fragments;
and splicing the data fragments of the plurality of blocks to obtain the Hessian matrix fragment of the first participant.
In one embodiment, the calculation submodule 432 is specifically configured to:
and obtaining covariance matrix fragments respectively corresponding to the multiple participants through iterative computation based on the intermediate matrix fragments of the multiple participants by using a secret sharing matrix inverse algorithm (SMI).
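This passage does not spell out the iteration that SMI uses. One well-known iteration that needs only matrix multiplications and additions, and could therefore in principle be built on secret sharing matrix multiplication, is the Newton-Schulz iteration $X_{k+1} = X_k (2I - A X_k)$. The plaintext sketch below shows that iteration as an assumption, not as the patent's algorithm; the example matrix and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def newton_schulz_inverse(a, iterations=20):
    """Approximate the inverse of a nonsingular matrix by Newton-Schulz iteration."""
    n = len(a)
    # Initial guess X0 = A^T / (||A||_1 * ||A||_inf), a standard convergent choice.
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = 1.0 / (norm_1 * norm_inf)
    x = [[a[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        # Build 2I - A X_k, then update X_{k+1} = X_k (2I - A X_k).
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x


hessian = [[4.0, 1.0], [1.0, 3.0]]          # hypothetical 2x2 intermediate matrix
covariance = newton_schulz_inverse(hessian)  # approximates its inverse
print(covariance)
```

In a secret-shared setting each matrix product would be replaced by a secret sharing matrix multiplication, so that every participant ends up with a fragment of the inverse rather than the inverse itself.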
In one embodiment, the verification module 440 is specifically configured to:
taking the diagonal elements of the covariance matrix fragments of the plurality of participants as variance fragments respectively corresponding to the plurality of model parameters;
for any model parameter, using the secret sharing inverse square root algorithm (SNSI) and a significance test method, jointly performing a secure inverse square root operation through interaction among the plurality of participant devices, based on the corresponding model parameter fragment of the first participant and the corresponding variance fragments of the plurality of participants, to determine the significance test value fragment of the first participant for the model parameter; and determining the effective value of the feature item corresponding to the model parameter based on the significance test value fragments of the plurality of participants for the model parameter.
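As a plaintext illustration of the arithmetic, the sketch below assumes that the significance test value is the Wald statistic $z = \beta_k \cdot \mathrm{Cov}_{kk}^{-1/2}$ and that a two-sided p-value under a standard normal distribution is derived from it. Neither the exact test nor the mapping from test value to effective value is stated in this passage, so both are assumptions, and the share values are hypothetical. The shares are summed in the clear here only to show the computation.

```python
import math


def reconstruct(shares):
    """Sum additive shares to recover the underlying value."""
    return sum(shares)


def wald_test(beta_shares, inv_sqrt_var_shares):
    """Return (z, p) for one model parameter from its additive shares.

    z = beta_k * Cov_kk ** -0.5; p is the two-sided p-value under a
    standard normal reference distribution (an assumption of this sketch).
    """
    beta_k = reconstruct(beta_shares)
    inv_sqrt_var = reconstruct(inv_sqrt_var_shares)
    z = beta_k * inv_sqrt_var
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical shares: beta_k = 0.98 and Cov_kk ** -0.5 = 2.0, so z = 1.96.
z, p = wald_test([0.50, 0.48], [1.25, 0.75])
print(z, p)
```

In the protocol described above the product is itself computed on shares, so each participant holds only a fragment of the test value until the fragments are combined.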
In one embodiment, the apparatus 400 further comprises a determining module (not shown in the figures) configured to:
for any first feature item, obtaining the effective value fragment of the first feature item from the other participant devices;
and determining the reconstructed effective value of the first feature item based on the local effective value fragment of the first feature item and the obtained effective value fragment.
In one embodiment, the apparatus 400 further comprises a removal module (not shown) configured to:
and, based on the effective values, removing from the plurality of feature items those whose effective values do not satisfy a preset condition, so that the plurality of participants perform secure joint training of the service prediction model using the service data with those feature items removed.
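The removal step amounts to a simple filter once the effective values are known. The sketch below assumes the preset condition is "effective value no greater than a threshold", with the effective value behaving like a p-value; the condition, the threshold 0.05, and the feature names are all hypothetical.

```python
def keep_significant(effective_values, threshold=0.05):
    """Return the feature items whose value satisfies the assumed condition value <= threshold."""
    return [name for name, value in effective_values.items() if value <= threshold]


values = {"age": 0.001, "city": 0.40, "history": 0.03}  # hypothetical feature items and values
kept = keep_significant(values)
print(kept)
```

The remaining feature items would then be used for the next round of secure joint training.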
In one embodiment, the object comprises one of a user, a commodity, and an event; the feature items include at least one of: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used for performing business prediction on the object.
In one embodiment, the business prediction model is based on a logistic regression model.
The above device embodiments correspond to the method embodiments, and specific descriptions may refer to descriptions of the method embodiments, which are not repeated herein. The device embodiment is obtained based on the corresponding method embodiment, has the same technical effect as the corresponding method embodiment, and for the specific description, reference may be made to the corresponding method embodiment.
Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of fig. 1 to 3.
The embodiment of the present specification further provides a computing device, which includes a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in any one of fig. 1 to 3.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the storage medium and the computing device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments further describe the objects, technical solutions and advantages of the embodiments of the present invention in detail. It should be understood that the above description is only exemplary of the embodiments of the present invention, and is not intended to limit the scope of the present invention, and any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (21)

1. A method for determining effective values of characteristics of service data for controlling traffic, wherein the service data are distributed among a plurality of participants, the respective service data of the participants form joint data when hypothetically spliced together, and the joint data comprise feature values of a plurality of objects for a plurality of feature items; the method is performed by any first participant device and comprises:
acquiring the joint data fragment of the first participant, and acquiring predicted value fragments respectively corresponding to the plurality of objects and model parameter fragments respectively corresponding to the plurality of feature items; the predicted value fragments and the model parameter fragments are obtained based on a trained service prediction model;
reconstructing complete predicted value data in a selected participant device from a plurality of predicted value fragments, through transmission of predicted value fragments between participant devices and the selected participant device;
determining, using secure multi-party computation and through interaction among the plurality of participant devices, relevance data fragments respectively corresponding to the plurality of participants based on the joint data fragments of the plurality of participants and the predicted value data of the selected participant, wherein the relevance data fragments comprise relevance data among the plurality of feature items;
and determining, using a significance test method and through secure interaction among the plurality of participant devices, the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model, based on the model parameter fragments of the plurality of participants and the corresponding data in the relevance data fragments.
2. The method of claim 1, wherein the step of obtaining the joint data fragment of the first participant comprises:
using additive secret sharing, performing splitting and splicing operations on the service data of the plurality of participants through interaction with the other participant devices, so that each of the plurality of participants obtains a joint data fragment; the joint data fragments of the plurality of participants yield the joint data when hypothetically reconstructed.
3. The method according to claim 1, wherein the service prediction model is obtained by secure joint training based on the respective joint data fragments of the plurality of participants; the business prediction model is used for performing business prediction on the objects.
4. The method according to claim 3, wherein the step of obtaining the predicted value fragments corresponding to the plurality of objects and the model parameter fragments corresponding to the plurality of feature items comprises:
obtaining a model parameter fragment of the trained service prediction model in the local first participant device;
through interaction of a plurality of participants, the plurality of participants are enabled to determine predicted value fragments of the object respectively based on joint data fragments of the plurality of participants and the trained service prediction model.
5. The method of claim 1, the selected participant comprising a participant that holds label data; the step of reconstructing complete predicted value data comprises:
when the first participant is not the selected participant, the predicted value fragments of the first participant are sent to the selected participant equipment, so that the selected participant equipment utilizes the plurality of predicted value fragments to reconstruct complete predicted value data;
and when the first participant is the selected participant, receiving the predicted value fragments sent by other participants, and reconstructing the predicted value fragments of the first participant and the received predicted value fragments to obtain complete predicted value data.
6. The method of claim 1, the relevance data comprising covariance matrix data, the relevance data fragments comprising covariance matrix fragments;
the step of determining the relevance data fragments respectively corresponding to the plurality of participants includes:
determining intermediate matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments of the plurality of participants, the predicted value data of the selected participant, and a functional relation in the service prediction model;
and calculating, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants, thereby obtaining covariance matrix fragments respectively corresponding to the plurality of participants.
7. The method according to claim 6, wherein the step of determining the intermediate matrix fragments respectively corresponding to the plurality of participants comprises:
dividing a Hessian matrix expression, obtained based on the functional relation in the service prediction model, into a plurality of blocks according to the hypothetical splicing relation of the service data of the plurality of participants; the Hessian matrix expression comprises the joint data and the predicted value data;
and, according to which participants' data each of the blocks involves, having the plurality of participants respectively determine data fragments of the blocks using the joint data fragments of the plurality of participants and the corresponding data in the predicted value data of the selected participant, and respectively determine corresponding Hessian matrix fragments, based on the data fragments of the blocks, to serve as the intermediate matrix fragments.
8. The method of claim 7, wherein the business data of any one participant comprises feature values of part of feature items of all objects;
the step of dividing the Hessian matrix expression obtained based on the functional relation in the service prediction model into a plurality of blocks includes:
dividing the Hessian matrix expression into a first block associated with the data of the selected participant and a second block associated with a plurality of participants;
the step of having the plurality of participants respectively determine the data fragments of the blocks and respectively determine the corresponding Hessian matrix fragments based on the data fragments of the blocks includes:
when the first participant is the selected participant, determining the data fragment of the first block using the joint data fragment of the first participant and the predicted value data; when the first participant is not the selected participant, filling the first block with the value 0 to obtain the data fragment of the first block;
determining data fragments of the second block respectively corresponding to the plurality of participants using secret sharing matrix multiplication (SMM), through interaction among the plurality of participants, based on the joint data fragments of the plurality of participants and the corresponding data in the predicted value data of the selected participant;
and splicing the first participant's data fragment of the first block and data fragment of the second block to obtain the Hessian matrix fragment of the first participant.
9. The method of claim 7, wherein the business data of any one participant comprises feature values of all feature items of a part of the objects;
the step of dividing the Hessian matrix expression obtained based on the functional relation in the service prediction model into a plurality of blocks includes: dividing the Hessian matrix expression into a plurality of blocks respectively associated with the data of the plurality of participants;
the step of having the plurality of participants respectively determine the data fragments of the blocks and respectively determine the corresponding Hessian matrix fragments based on the data fragments of the blocks includes:
determining the data fragment of the block associated with the data of the first participant using the joint data fragment and the predicted value data of the first participant, and filling the blocks associated with the data of the other participants with the value 0 to obtain their data fragments;
and splicing the data fragments of the plurality of blocks to obtain the Hessian matrix fragment of the first participant.
10. The method of claim 6, wherein the step of calculating, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants to obtain covariance matrix fragments respectively corresponding to the plurality of participants comprises:
and obtaining covariance matrix fragments respectively corresponding to the multiple participants through iterative computation based on the intermediate matrix fragments of the multiple participants by using a secret sharing matrix inverse algorithm (SMI).
11. The method according to claim 6, wherein the step of determining the effective value of the feature item corresponding to the model parameter in improving the effect of the business prediction model comprises:
taking the diagonal elements of the covariance matrix fragments of the plurality of participants as variance fragments respectively corresponding to the plurality of model parameters;
for any model parameter, using the secret sharing inverse square root algorithm (SNSI) and a significance test method, jointly performing a secure inverse square root operation through interaction among the plurality of participant devices, based on the corresponding model parameter fragment of the first participant and the corresponding variance fragments of the plurality of participants, to determine the significance test value fragment of the first participant for the model parameter; and determining the effective value of the feature item corresponding to the model parameter based on the significance test value fragments of the plurality of participants for the model parameter.
12. The method of claim 11, further comprising:
for any first feature item, obtaining the effective value fragment of the first feature item from the other participant devices;
and determining the reconstructed effective value of the first feature item based on the local effective value fragment of the first feature item and the obtained effective value fragment.
13. The method of claim 1, further comprising:
and, based on the effective values, removing from the plurality of feature items those whose effective values do not satisfy a preset condition, so that the plurality of participants perform secure joint training of the service prediction model using the service data with those feature items removed.
14. The method of claim 1, the object comprising one of a user, a commodity, and an event; the feature items include at least one of: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used for performing business prediction on the object.
15. The method of claim 1, wherein the business prediction model is derived based on a logistic regression model.
16. An apparatus for determining effective values of characteristics of service data for controlling traffic, wherein the service data are distributed among a plurality of participants, the respective service data of the participants form joint data when hypothetically spliced together, and the joint data comprise feature values of a plurality of objects for a plurality of feature items; the apparatus is deployed in any first participant device, and comprises:
an obtaining module, configured to obtain the joint data fragment of the first participant, and to obtain predicted value fragments respectively corresponding to the plurality of objects and model parameter fragments respectively corresponding to the plurality of feature items; the predicted value fragments and the model parameter fragments are obtained based on a trained service prediction model;
a reconstruction module, configured to reconstruct complete predicted value data in a selected participant device from a plurality of predicted value fragments, through transmission of predicted value fragments between participant devices and the selected participant device;
an interaction module, configured to determine, using secure multi-party computation and through interaction among the plurality of participant devices, relevance data fragments respectively corresponding to the plurality of participants based on the joint data fragments of the plurality of participants and the predicted value data of the selected participant, wherein the relevance data fragments comprise relevance data among the plurality of feature items;
and a verification module, configured to determine, using a significance test method and through secure interaction among the plurality of participant devices, the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model, based on the model parameter fragments of the plurality of participants and the corresponding data in the relevance data fragments.
17. The apparatus of claim 16, wherein the obtaining module, when obtaining the joint data fragment of the first participant, includes:
using additive secret sharing, performing splitting and splicing operations on the service data of the plurality of participants through interaction with the other participant devices, so that each of the plurality of participants obtains a joint data fragment; the joint data fragments of the plurality of participants yield the joint data when hypothetically reconstructed.
18. The apparatus of claim 16, the selected participant comprising a participant that holds label data; the reconstruction module is specifically configured to:
when the first participant is not the selected participant, the predicted value fragments of the first participant are sent to the selected participant equipment, so that the selected participant equipment utilizes the plurality of predicted value fragments to reconstruct complete predicted value data;
and when the first participant is the selected participant, receiving the predicted value fragments sent by other participants, and reconstructing the predicted value fragments of the first participant and the received predicted value fragments to obtain complete predicted value data.
19. The apparatus of claim 16, the relevance data comprising covariance matrix data, the relevance data fragments comprising covariance matrix fragments;
the interaction module comprises:
a determining submodule, configured to determine intermediate matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments of the plurality of participants, the predicted value data of the selected participant, and a functional relation in the service prediction model;
and a calculating submodule, configured to calculate, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants, thereby obtaining covariance matrix fragments respectively corresponding to the plurality of participants.
20. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-15.
21. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-15.
CN202110580162.7A 2021-05-26 2021-05-26 Method and device for determining effective value of service data characteristic of control traffic Pending CN113407988A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110580162.7A CN113407988A (en) 2021-05-26 2021-05-26 Method and device for determining effective value of service data characteristic of control traffic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110580162.7A CN113407988A (en) 2021-05-26 2021-05-26 Method and device for determining effective value of service data characteristic of control traffic

Publications (1)

Publication Number Publication Date
CN113407988A true CN113407988A (en) 2021-09-17

Family

ID=77675317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110580162.7A Pending CN113407988A (en) 2021-05-26 2021-05-26 Method and device for determining effective value of service data characteristic of control traffic

Country Status (1)

Country Link
CN (1) CN113407988A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114781000A (en) * 2022-06-21 2022-07-22 Alipay Hangzhou Information Technology Co Ltd Method and device for determining correlation between object features of large-scale data
CN114781000B (en) * 2022-06-21 2022-09-20 Alipay Hangzhou Information Technology Co Ltd Method and device for determining correlation between object features of large-scale data

Similar Documents

Publication Publication Date Title
CN113407987B (en) Method and device for determining effective value of service data characteristic for protecting privacy
EP3627759B1 (en) Method and apparatus for encrypting data, method and apparatus for training machine learning model, and electronic device
CN111178549B (en) Method and device for protecting business prediction model of data privacy joint training by two parties
US20220092413A1 (en) Method and system for relation learning by multi-hop attention graph neural network
CN112818290B (en) Method and device for determining object feature correlation in privacy data by multiparty combination
CN111400766A (en) Method and device for multi-party joint dimension reduction processing aiming at private data
CN111931241B (en) Linear regression feature significance testing method and device based on privacy protection
CN114936650A (en) Method and device for jointly training business model based on privacy protection
CN112597540B (en) Multiple collinearity detection method, device and system based on privacy protection
CN111506922A (en) Method and device for carrying out significance check on private data by multi-party union
CN114186263A (en) Data regression method based on vertical federated learning and electronic device
CN113407988A (en) Method and device for determining effective value of service data characteristic of control traffic
CN116432040B (en) Model training method, device and medium based on federated learning and electronic equipment
WO2022048107A1 (en) Multi-dimensional statistical analysis system and method for sales amounts of seller users on e-commerce platform
CN110443061A (en) A kind of data ciphering method and device
CN116579852A (en) Financial service providing method, device and storage medium based on meta universe
CN112101609A (en) Prediction system, method and device for timeliness of payment of user and electronic equipment
Foroughi et al. A new AHP‐prioritization method based on linear programming for crisp and interval preference relations
US20210357955A1 (en) User search category predictor
CN114723239A (en) Multi-party collaborative modeling method, device, equipment, medium and program product
WO2018184463A1 (en) Statistics-based multidimensional data cloning
Zhang et al. Understanding deep gradient leakage via inversion influence functions
CN114781000B (en) Method and device for determining correlation between object features of large-scale data
US20230385446A1 (en) Privacy-preserving clustering methods and apparatuses
CN114996449B (en) Clustering method and device based on privacy protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination