CN116975905A - Data processing method, device, computer equipment and readable storage medium - Google Patents

Data processing method, device, computer equipment and readable storage medium

Info

Publication number
CN116975905A
Authority
CN
China
Prior art keywords
node
boolean
feature
fragment
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310645453.9A
Other languages
Chinese (zh)
Inventor
张凡
蒋杰
刘煜宏
陈鹏
黄晨宇
程勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310645453.9A priority Critical patent/CN116975905A/en
Publication of CN116975905A publication Critical patent/CN116975905A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound


Abstract

The embodiment of the application provides a data processing method, a data processing device, computer equipment and a readable storage medium, wherein the method comprises the following steps: acquiring first feature data of a first service identifier of a first participant, and a first data fragment of a second service identifier of a second participant participating in longitudinal federal learning; inputting the first feature data and the first data fragment into a first decision tree of the first participant; and acquiring a first feature Boolean fragment associated with a split feature. The first feature Boolean fragment and a second feature Boolean fragment are jointly used for acquiring node feature data associated with the split feature from the first feature data, the first data fragment and the second data fragment; the node feature data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier; and the predicted value is used for determining the service processing result of the intersection service identifier. By adopting the method and the device, the security of the data owned by the participants can be improved.

Description

Data processing method, device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a data processing method, a data processing device, a computer device, and a readable storage medium.
Background
A longitudinal federal learning algorithm obtains a first service identifier of a first participant engaged in longitudinal federal learning and a second service identifier of a second participant engaged in longitudinal federal learning, and directly compares the two (e.g., the first participant sends the first service identifier to the second participant, and the second participant compares it with the second service identifier) to generate an intersection service identifier between the first service identifier and the second service identifier. The algorithm then determines a predicted value of the intersection service identifier based on the feature data of the service features of the intersection service identifier at the first participant and the feature data of the service features of the intersection service identifier at the second participant.
However, the first service identifier and the second service identifier may be private data (e.g., a cell phone number), and generating the predicted value of the intersection service identifier directly from that private data inevitably exposes it (e.g., exposes the first service identifier of the first participant to the second participant), thereby reducing the security of the data owned by the participants in longitudinal federal learning.
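The remedy the disclosure pursues is to operate on secret-shared fragments rather than on the identifiers and feature data themselves. As background, a minimal sketch of two-party additive secret sharing over a ring (the 32-bit modulus and helper names are illustrative, not taken from the patent):

```python
import random

MOD = 2**32  # illustrative ring size for additive secret sharing

def share(value: int) -> tuple[int, int]:
    """Split `value` into two additive shares; either share alone is uniformly random."""
    s1 = random.randrange(MOD)
    s2 = (value - s1) % MOD
    return s1, s2

def reconstruct(s1: int, s2: int) -> int:
    """Only the sum of both shares reveals the original value."""
    return (s1 + s2) % MOD

s1, s2 = share(42)
assert reconstruct(s1, s2) == 42
```

Each participant holds one share, so neither a service identifier nor its feature data is ever present in the clear at the other party; the fragment/slice/shard operations in the claims below all follow this pattern.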
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, computer equipment and a readable storage medium, which can improve the security of data owned by a participant.
In one aspect, an embodiment of the present application provides a data processing method, where the method is performed by a first participant participating in longitudinal federal learning, including:
acquiring first characteristic data of a first service identifier of a first participant and first data fragments of a second service identifier of a second participant participating in longitudinal federal learning; the first data fragment and the second data fragment of the second service identifier held by the second participant are fragments of second characteristic data of the second service identifier;
inputting the first characteristic data and the first data fragment into a first decision tree of the first participant; the first decision tree comprises a first split feature fragment corresponding to the first partition node; the first partition node corresponds to a second partition node of a second decision tree of the second party; the second split feature fragments corresponding to the first split feature fragments and the second partition nodes are fragments of feature identifiers of split features commonly corresponding to the first partition nodes and the second partition nodes;
acquiring a first feature Boolean fragment associated with a split feature; the first feature boolean shard and a second feature boolean shard associated with the split feature held by the second participant are shards of a feature boolean vector; the first characteristic Boolean fragment and the second characteristic Boolean fragment are obtained by vector processing of the first split characteristic fragment and the second split characteristic fragment; the feature Boolean vector is used for representing split features in a first service feature of a first service identifier and a second service feature of a second service identifier;
The first feature Boolean fragment and the second feature Boolean fragment are jointly used for acquiring node feature data associated with the split feature from the first feature data, the first data fragment and the second data fragment; the node feature data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier; the predicted value is used for determining the service processing result of the intersection service identifier.
In one aspect, an embodiment of the present application provides a data processing apparatus, where the apparatus operates on a first participant participating in longitudinal federal learning, including:
the data acquisition module is used for acquiring first characteristic data of a first service identifier of a first participant and first data fragments of a second service identifier of a second participant participating in longitudinal federal learning; the first data fragment and the second data fragment of the second service identifier held by the second participant are fragments of second characteristic data of the second service identifier;
the data input module is used for inputting the first characteristic data and the first data fragments into a first decision tree of the first participant; the first decision tree comprises a first split feature fragment corresponding to the first partition node; the first partition node corresponds to a second partition node of a second decision tree of the second party; the second split feature fragments corresponding to the first split feature fragments and the second partition nodes are fragments of feature identifiers of split features commonly corresponding to the first partition nodes and the second partition nodes;
The feature characterization module is used for acquiring a first feature Boolean fragment associated with the split feature; the first feature boolean shard and a second feature boolean shard associated with the split feature held by the second participant are shards of a feature boolean vector; the first characteristic Boolean fragment and the second characteristic Boolean fragment are obtained by vector processing of the first split characteristic fragment and the second split characteristic fragment; the feature Boolean vector is used for representing split features in a first service feature of a first service identifier and a second service feature of a second service identifier;
the first feature Boolean fragment and the second feature Boolean fragment are jointly used for acquiring node feature data associated with the split feature from the first feature data, the first data fragment and the second data fragment; the node feature data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier; the predicted value is used for determining the service processing result of the intersection service identifier.
Wherein, the data acquisition module includes:
the Hash mapping unit is used for carrying out Hash mapping on the first service identifier of the first participant to obtain a first Hash table corresponding to the first service identifier;
A first acquisition unit configured to acquire a first Boolean intersection slice associated with the first hash table and the second hash table; the first Boolean intersection slice and a second Boolean intersection slice held by the second participant are slices of a Boolean intersection vector; the first hash table and the second hash table are used for carrying out hash table matching through an oblivious programmable pseudo-random function (OPPRF) to generate the first Boolean intersection slice and the second Boolean intersection slice; the second hash table is obtained by the second participant participating in longitudinal federal learning performing hash mapping on the second service identifier of the second participant; the Boolean intersection vector is used for indicating the intersection state of the first service identifier with respect to the second service identifier;
the second acquisition unit is used for acquiring first original data of the first service identifier and first original fragments of the second service identifier; the first original shard and a second original shard held by the second participant are shards of second original data; the first original data are used for representing characteristic data of the first service identifier under the first service characteristic; the second original data is used for representing characteristic data of the second service identifier under the second service characteristic; the first Boolean intersection fragment and the second Boolean intersection fragment are jointly used for carrying out data screening on first original data to obtain first characteristic data of a first service identifier held by a first participant; the first Boolean intersection fragment and the second Boolean intersection fragment are commonly used for carrying out data screening on the first original fragment and the second original fragment to obtain a first data fragment of a second service identifier held by the first participant and a second data fragment of a second service identifier held by the second participant.
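To make the hash-table matching concrete, here is a toy sketch under strong simplifying assumptions: the identifiers, the bucket count, and the trusted-dealer shortcut that computes the intersection flags in the clear are all illustrative. The actual claim performs the matching obliviously via the OPPRF so neither side sees the other's identifiers; only the XOR-sharing of the resulting Boolean intersection vector is faithful here.

```python
import hashlib
import secrets

def hash_bucket(identifier: str, num_buckets: int) -> int:
    """Deterministically map a service identifier to a hash-table bucket."""
    digest = hashlib.sha256(identifier.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

ids_a = {"alice", "bob", "carol"}   # first participant's service identifiers (toy)
ids_b = {"bob", "carol", "dave"}    # second participant's service identifiers (toy)
first_hash_table = {x: hash_bucket(x, 16) for x in ids_a}

# Trusted-dealer stand-in for the OPPRF matching: compute the intersection flags,
# then XOR-share them so neither party holds the plain Boolean intersection vector.
order = sorted(ids_a)
flags = [int(x in ids_b) for x in order]           # Boolean intersection vector
share_a = [secrets.randbelow(2) for _ in flags]    # first Boolean intersection slice
share_b = [f ^ s for f, s in zip(flags, share_a)]  # second Boolean intersection slice
assert [a ^ b for a, b in zip(share_a, share_b)] == flags
```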
Wherein the total number of the first service features and the second service features is Q, where Q is an integer greater than 1; the vector dimension of the feature Boolean vector is equal to Q, and the feature Boolean vector comprises Q feature Boolean parameters; each of the Q feature Boolean parameters characterizes one of the Q service features; the feature Boolean parameter corresponding to the feature identifier of the split feature in the feature Boolean vector is the matched Boolean parameter, and the feature Boolean parameters not corresponding to the feature identifier are unmatched Boolean parameters.
The node characteristic data comprises a first node fragment held by the first participant and a second node fragment held by the second participant; if the split feature belongs to the first service feature, the first node fragment and the second node fragment are determined by the first characteristic data; if the split feature belongs to the second service feature, the first node fragment is acquired from the first data fragment, and the second node fragment is acquired from the second data fragment; the first decision tree further comprises a first split value slice corresponding to the first partition node, and the second decision tree further comprises a second split value slice corresponding to the second partition node; the first split value slice and the second split value slice are slices of a split value commonly corresponding to the first partition node and the second partition node;
The apparatus further comprises:
the first acquisition module is used for acquiring a first node Boolean fragment associated with the first partition node; the first node Boolean shard and a second node Boolean shard associated with a second partition node held by a second participant are shards of a first node Boolean vector; the first node Boolean vector is used for representing the relation between node characteristic data and split values; the first node boolean fragment and the second node boolean fragment are obtained by comparing the first node fragment and the second node fragment together with the first split value fragment and the second split value fragment.
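The node Boolean captures the outcome of comparing the node feature data with the split value. In the claim this comparison runs on the slices themselves via secure comparison; the sketch below cheats by comparing in the clear and only then XOR-sharing the resulting bit (the names and values are illustrative):

```python
import secrets

node_value, split_value = 7.25, 5.0
go_right = int(node_value >= split_value)  # node Boolean: which child the record takes
b_a = secrets.randbelow(2)                 # first node Boolean fragment (XOR share)
b_b = go_right ^ b_a                       # second node Boolean fragment
assert b_a ^ b_b == go_right               # XOR of the two fragments recovers the bit
```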
Wherein the apparatus further comprises:
the second acquisition module is used for acquiring the first feature fragments corresponding to the Q service features respectively; the Q service features include a service feature V_d, where d is a non-negative integer less than Q; the first feature fragment corresponding to the service feature V_d and the second feature fragment corresponding to the service feature V_d held by the second participant are both obtained by performing data selection on the feature data of the service feature V_d using the first feature Boolean parameter for V_d in the first feature Boolean fragment and the second feature Boolean parameter for V_d in the second feature Boolean fragment; if the split feature belongs to the first service feature, the feature data of the service feature V_d is obtained from the first feature data; if the split feature belongs to the second service feature, the feature data of the service feature V_d is obtained from the first data fragment and the second data fragment;
the second acquisition module is used for carrying out summation processing on the Q first characteristic fragments to obtain first node fragments; the first node shard and the second node shard are shards of node characteristic data; the second node fragments are obtained by summing second characteristic fragments corresponding to the Q service characteristics respectively by a second participant; the node characteristic data is used for representing characteristic data of the intersection service identification under the split characteristic.
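The per-feature selection and summation in these two modules amount to an inner product between the one-hot feature Boolean vector and the record's Q feature values, carried out on fragments. A sketch under simplifying assumptions (the products b_d·x_d are formed in the clear before sharing, whereas the claim selects obliviously; the ring size and values are illustrative):

```python
import secrets

MOD = 2**16  # illustrative ring for additive shares

def add_share(v: int) -> tuple[int, int]:
    """Split an integer into two additive shares modulo MOD."""
    s1 = secrets.randbelow(MOD)
    return s1, (v - s1) % MOD

Q, split_feature = 4, 2
bool_vec = [int(d == split_feature) for d in range(Q)]  # one-hot feature Boolean vector
feature_row = [3, 1, 7, 5]                              # one record's Q feature values (toy)

# Per-feature fragments: the Boolean parameter b_d selects x_d; each party's
# node fragment is the sum of its Q feature fragments.
frag_a, frag_b = zip(*(add_share(b * x) for b, x in zip(bool_vec, feature_row)))
node_a = sum(frag_a) % MOD  # first node fragment
node_b = sum(frag_b) % MOD  # second node fragment
assert (node_a + node_b) % MOD == 7  # reconstructs the split feature's value
```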
Wherein the apparatus further comprises:
the third acquisition module is used for acquiring the child node weight fragments of the child nodes of the first partition node if the child nodes of the first partition node are leaf nodes; the child node weight slices comprise a first child node weight slice of a first child node of the first partition node and a second child node weight slice of a second child node of the first partition node; the first sub-node weight slice and a third sub-node weight slice of a third sub-node of the second partition node held by the second participant are slices of the first sub-node weight; the first child node weight is used for representing a weight parameter of the first child node; the second sub-node weight slice and the fourth sub-node weight slice of the fourth sub-node of the second partition node held by the second participant are slices of the second sub-node weight; the second child node weight is used for representing a weight parameter of the second child node;
The first node Boolean fragment and the second node Boolean fragment are commonly used for selecting the first sub-node weight slice, the second sub-node weight slice, the third sub-node weight slice and the fourth sub-node weight slice, so as to obtain a first candidate weight slice for the first partition node held by the first participant and a second candidate weight slice for the second partition node held by the second participant; the first candidate weight slice and the second candidate weight slice are slices of a candidate weight vector; the candidate weight vector is used to characterize the weight parameters of the first child node and the second child node.
The first candidate weight slice is determined by the first sub-node weight slice and the second sub-node weight slice, and the second candidate weight slice is determined by the third sub-node weight slice and the fourth sub-node weight slice; if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is the matched Boolean parameter, the candidate weight parameter slices corresponding to the node Boolean parameter in the first candidate weight slice and the second candidate weight slice are obtained by selecting the first sub-node weight slice and the third sub-node weight slice; if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is the unmatched Boolean parameter, the candidate weight parameter slices corresponding to the node Boolean parameter in the first candidate weight slice and the second candidate weight slice are obtained by selecting the second sub-node weight slice and the fourth sub-node weight slice.
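The weight selection described here is an oblivious multiplexer: the node Boolean picks between the two child weights. Ignoring the sharing layer, the selection rule reduces to the arithmetic identity below (the function name and weight values are illustrative):

```python
def mux(bit: int, w_if_one: float, w_if_zero: float) -> float:
    """Select between two candidate weights by a Boolean bit: b*w1 + (1-b)*w0."""
    return bit * w_if_one + (1 - bit) * w_if_zero

assert mux(1, 0.8, -0.3) == 0.8   # matched Boolean parameter: first/third slices chosen
assert mux(0, 0.8, -0.3) == -0.3  # unmatched Boolean parameter: second/fourth slices chosen
```

On fragments, the product b·w would be evaluated with a secure multiplication (e.g., Beaver triples); the identity itself is unchanged.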
Wherein the apparatus further comprises:
a fourth obtaining module, configured to obtain a first node boolean sub-slice associated with the first sub-node and a second node boolean sub-slice associated with the second sub-node; the first node Boolean sub-segment and a third node Boolean sub-segment associated with a third sub-node held by the second party are segments of a first node Boolean sub-vector; the first node Boolean subvector is used for representing the dividing result of the intersection service identifier at the first subnode; the second node boolean sub-segment and a fourth node boolean sub-segment associated with a fourth sub-node held by the second party are segments of a second node boolean sub-vector; the second node Boolean subvector is used for representing the dividing result of the intersection service identifier in the second subnode;
a fifth obtaining module, configured to obtain a first target weight slice associated with the first child node; the first target weight segment associated with the first child node and the second target weight segment associated with the third child node held by the second participant are segments of a target weight vector; the target weight vector is used for representing weight parameters of the service identification of the first child node; the first target weight slice associated with the first sub-node and the second target weight slice associated with the third sub-node are obtained by jointly carrying out weight selection on the first candidate weight slice and the second candidate weight slice by the first node Boolean sub-slice and the third node Boolean sub-slice;
The summation processing module is used for carrying out summation processing on the first target weight fragments associated with the leaf nodes of the first decision tree to obtain a first result fragment corresponding to the intersection service identifier in the first decision tree; the first result fragment and a second result fragment corresponding to the intersection service identifier in the second decision tree are fragments of a sub-predicted value of the intersection service identifier; the number of decision trees of the first participant and the number of decision trees of the second participant are both K, where K is a positive integer; the sub-predicted value represents the common output of the first decision tree and the second decision tree, and the predicted value represents the common output of the K decision trees of the first participant and the K decision trees of the second participant; the K decision trees of the first participant comprise the first decision tree, and the K decision trees of the second participant comprise the second decision tree; the second result fragment is obtained by the second participant summing the second target weight fragments associated with the leaf nodes of the second decision tree.
The first node Boolean sub-slice and the third node Boolean sub-slice are determined by the first node Boolean slice, the second node Boolean slice, the third node Boolean slice which is held by the first participant and is associated with the father node of the first partition node, and the fourth node Boolean slice which is held by the second participant and is associated with the father node of the second partition node; the second node Boolean sub-segment and the fourth node Boolean sub-segment are determined by the first node Boolean sub-segment and the third node Boolean sub-segment through segment exclusive OR operation; the third node Boolean segment and the fourth node Boolean segment are segments of the second node Boolean vector; the second node boolean vector is used to characterize a relationship between feature data associated with a parent node split feature corresponding to the parent node and a parent node split value corresponding to the parent node.
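The node Boolean sub-vectors track which child of each partition node a record's path passes through. The propagation rule is: a child's on-path bit is the AND of the parent's on-path bit with the edge decision, and the two siblings' bits are related by the XOR operation the claim mentions. In the clear (the bit values are illustrative):

```python
parent_on_path = 1  # the record's path passes through the parent partition node
went_right = 1      # node Boolean at the parent's split

right_child_on_path = parent_on_path & went_right       # AND with the edge decision
left_child_on_path = parent_on_path & (went_right ^ 1)  # the sibling takes the complement

# Exactly one child continues the path, so the sibling bit is the XOR with the parent.
assert left_child_on_path == parent_on_path ^ right_child_on_path
```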
The number of decision trees of the first participant and the number of decision trees of the second participant are both K, where K is a positive integer;
the apparatus further comprises:
the predicted value determining module is used for obtaining first result fragments corresponding to K-1 decision trees of the intersection service identification in the first participant respectively; the K-1 decision trees of the first participant are the decision trees except the first decision tree in the K decision trees of the first participant;
the prediction value determining module is used for carrying out summation processing on the first result fragments corresponding to the K-1 decision trees of the first participant and the first result fragments corresponding to the first decision trees respectively to generate first output fragments corresponding to the intersection service identifiers; the first output fragment and the second output fragment corresponding to the intersection service identifier held by the second participant are fragments of the output vector; the second output fragments are generated by the second party by summing the second result fragments corresponding to the K-1 decision trees of the second party and the second result fragments corresponding to the second decision trees respectively; k-1 decision trees of the second participant are decision trees except the second decision tree in the K decision trees of the second participant; the first result fragment corresponding to the first decision tree and the second result fragment corresponding to the second decision tree are determined by the node characteristic data;
Wherein the output vector is used to characterize the predicted value of the intersection business identification in longitudinal federal learning.
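Putting the ensemble together: each party sums its per-tree result fragments into an output fragment, and only the sum of the two output fragments yields the plaintext predicted value. A toy sketch with K = 3 trees (all share values are illustrative):

```python
K = 3
# each party's additive share of each tree's sub-predicted value (toy values)
tree_shares_a = [0.2, -0.1, 0.4]
tree_shares_b = [0.1, 0.3, -0.2]

out_a = sum(tree_shares_a)  # first output fragment
out_b = sum(tree_shares_b)  # second output fragment
prediction = out_a + out_b  # reconstructed only when the result may be revealed
assert abs(prediction - 0.7) < 1e-9
```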
In one aspect, an embodiment of the present application provides a computer device, including: a processor and a memory;
the processor is connected to the memory, wherein the memory is configured to store a computer program, and when the computer program is executed by the processor, the computer device is caused to execute the method provided by the embodiment of the application.
In one aspect, the present application provides a computer readable storage medium storing a computer program adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method provided by the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program stored on a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device performs the method provided by the embodiment of the present application.
Therefore, the embodiment of the application can acquire the first feature data of the first service identifier, together with the first data fragment and the second data fragment of the second service identifier, which are stored in fragment form at the first participant and the second participant respectively. The feature data of the intersection service identifier therefore does not need to be acquired directly, and the first service identifier does not need to be compared with the second service identifier (i.e., the intersection service identifier between them is never obtained in the clear), which ensures the security of the first service identifier and the second service identifier. Further, when the split feature corresponding to the partition nodes in the first decision tree and the second decision tree is in a fragmented state, the embodiment of the application can determine the first feature Boolean fragment and the second feature Boolean fragment associated with the split feature through the first split feature fragment held by the first participant and the second split feature fragment held by the second participant, and then acquire, based on these two feature Boolean fragments, the node feature data associated with the partition nodes from the first feature data, the first data fragment and the second data fragment, so as to determine the predicted value of the intersection service identifier based on the node feature data.
Therefore, the embodiment of the application can securely select the feature data of the split feature (namely the node feature data) while the split features of the non-leaf nodes of the first decision tree and the second decision tree obtained through longitudinal federal training remain hidden, so that the predicted value of the intersection service identifier is determined in an oblivious longitudinal federal prediction process, both the split features and the intersection service identifier are prevented from being exposed, and the security of the data owned by the participants is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for determining a predicted value according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a hash map according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 8 is a schematic view of a predicted scenario of a tree provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Specifically, referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a service server 2000, a service server 5000, a terminal device cluster 3000, and a terminal device cluster 4000. Wherein the terminal device cluster 3000 may in particular comprise one or more terminal devices, the number of terminal devices in the terminal device cluster 3000 will not be limited here. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 3000a, a terminal device 3000b, terminal devices 3000c, …, a terminal device 3000n; the terminal devices 3000a, 3000b, 3000c, …, 3000n may be directly or indirectly connected to the service server 2000 through a wired or wireless communication manner, so that each terminal device may interact with the service server 2000 through the network connection.
Wherein the terminal device cluster 4000 may in particular comprise one or more terminal devices, the number of terminal devices in the terminal device cluster 4000 will not be limited here. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 4000a, a terminal device 4000b, terminal devices 4000c, …, a terminal device 4000n; the terminal devices 4000a, 4000b, 4000c, …, 4000n may be directly or indirectly connected to the service server 5000 through wired or wireless communication, respectively, so that each terminal device may interact with the service server 5000 through the network connection.
Wherein each terminal device in the terminal device cluster 3000 and the terminal device cluster 4000 may include: smart phones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices, intelligent home appliances (e.g., smart televisions), wearable devices, vehicle terminals, aircraft and other intelligent terminals with data processing functions. For easy understanding, in the embodiment of the present application, one terminal device may be selected from the terminal device cluster 3000 shown in fig. 1 as a first terminal device, where the first terminal device may be a terminal device that participates in longitudinal federal learning (i.e., a first participant), and one terminal device may be selected from the terminal device cluster 4000 shown in fig. 1 as a second terminal device, where the second terminal device may be a terminal device that participates in longitudinal federal learning (i.e., a second participant). For example, in the embodiment of the present application, the terminal device 3000c shown in fig. 1 may be used as a first terminal device, and the terminal device 4000n shown in fig. 1 may be used as a second terminal device, where the first terminal device and the second terminal device may be directly or indirectly connected to each other through a network through a wired or wireless communication manner, so that data interaction may be performed through the network connection.
The service server 2000 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like. The service server 5000 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms.
The first terminal device may acquire a first service identifier and feature data (i.e., first feature data) of a first service feature for the first service identifier from the service server 2000, where the first service identifier may be an identifier (Identity Document, ID) that uniquely identifies service data in the first terminal device, and the first service identifier and the first feature data may be collectively referred to as service data; the second terminal device may obtain a second service identifier and feature data (i.e. second feature data) of a second service feature for the second service identifier from the service server 5000, where the second service identifier may be an identifier that uniquely identifies service data in the second terminal device, and the second service identifier and the second feature data may be collectively referred to as service data. For ease of understanding, the embodiment of the present application may refer to the first party as a Guest, and the second party as a Host, where Guest has feature data and Host has feature data. In addition, the number of the second terminal devices (i.e., the second participants) may be one or more, and for convenience of understanding, the embodiment of the present application will be described by taking the number of the second terminal devices (i.e., the second participants) as one example.
Federal learning (Federated Learning, FL) is a privacy-preserving distributed machine learning technique that addresses the problem of how to co-train a global model on virtually "aggregated" data while preserving data privacy when sensitive data is in the hands of multiple independent institutions, communities, individuals (e.g., first and second parties). Wherein federal learning may include longitudinal federal learning (Vertical Federated Learning) and transverse federal learning (Horizontal Federated Learning).
The longitudinal federal learning may include trace-hiding longitudinal federal learning, which refers to training and predicting a longitudinal federal learning model without exposing the post-intersection ID set, i.e., longitudinal federal learning in which the sample alignment process leaks no information. In federal practice, the longitudinal federal XGBoost (eXtreme Gradient Boosting) algorithm (i.e., the trace-hiding longitudinal federal XGBoost training algorithm and the trace-hiding longitudinal federal XGBoost prediction algorithm) is among the most widely applied federal machine learning algorithms (i.e., longitudinal federal learning model training algorithms and longitudinal federal learning model prediction algorithms, such as longitudinal federal logistic regression prediction and longitudinal federal neural network prediction). XGBoost is an enhanced version of the gradient boosting decision tree: a boosting method that uses classification trees or regression trees as base classifiers, with strong scalability and high speed and efficiency.
Secret sharing may include arithmetic secret sharing (Arithmetic Secret Sharing) and Boolean secret sharing (Boolean Secret Sharing); arithmetic secret sharing may be used to generate arithmetic slices, Boolean secret sharing may be used to generate Boolean slices, and arithmetic slices and Boolean slices may be collectively referred to as secret slices (slices for short). Arithmetic secret sharing: over the integer ring Z_p (p (i.e., 2^λ) is the modulus, e.g., p equals 2^128 (i.e., modulo-2^128 arithmetic), meaning an integer can represent a maximum value of 2^128 - 1; in this case the addition operation refers to addition over Z_p and the subtraction operation refers to subtraction over Z_p), a number x is split into two slices (i.e., arithmetic slices) <x>_0^A and <x>_1^A (<x>_0^A and <x>_1^A may be collectively denoted <x>^A), held by the two parties respectively, i.e., x = <x>_0^A + <x>_1^A mod p. Boolean secret sharing: based on exclusive or (XOR), a Boolean value x is split into two slices (i.e., Boolean slices) <x>_0^B and <x>_1^B (<x>_0^B and <x>_1^B may be collectively denoted <x>^B), held by the two parties respectively, i.e., x = <x>_0^B XOR <x>_1^B. In other words, when secret sharing is performed in exclusive-or mode, the random value obtained by a computing party is called a Boolean slice; when secret sharing is performed in arithmetic-addition mode, the random value obtained by a computing party is called an arithmetic slice.
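The two sharing schemes above can be illustrated with a short plaintext simulation (a sketch for understanding only, not the patent's implementation; the function names such as `arith_share` are invented for this illustration):

```python
import secrets

P = 2 ** 128  # the ring modulus, p = 2^lambda with lambda = 128

def arith_share(x):
    """Split integer x into two arithmetic slices that sum to x mod p."""
    s0 = secrets.randbelow(P)
    s1 = (x - s0) % P
    return s0, s1

def arith_reconstruct(s0, s1):
    """Recover x from its two arithmetic slices."""
    return (s0 + s1) % P

def bool_share(x):
    """Split bit x into two Boolean slices whose XOR is x."""
    s0 = secrets.randbelow(2)
    s1 = s0 ^ x
    return s0, s1

x = 42
a0, a1 = arith_share(x)
assert arith_reconstruct(a0, a1) == x  # x = <x>_0^A + <x>_1^A mod p

b0, b1 = bool_share(1)
assert b0 ^ b1 == 1                    # x = <x>_0^B XOR <x>_1^B
```

Each slice on its own is a uniformly random value, so neither party learns anything about x from its own slice; only combining both slices recovers x.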
It should be appreciated that embodiments of the present application may use <.> to denote slices for the first participant and the second participant: a slice with superscript A is an arithmetic slice, a slice with superscript B is a Boolean slice, a slice with subscript 0 is held by the first participant, and a slice with subscript 1 is held by the second participant; when a slice is written without a subscript, this indicates that the computation needs to be performed jointly by the first participant and the second participant. In other words, <x>^A denotes the arithmetic sharing of x, i.e., the two parties each hold a random value, and the sum of the two random values over Z_p is x; <x>^B denotes the Boolean sharing of x, i.e., the two parties each hold a random bit, and the exclusive or of the two bits is x.
Multiparty secure computation operators: (1) Two-party slice addition (Add): "<x>^A + <y>^A" means: input <x>^A and <y>^A, output <x+y>^A. (2) Two-party slice subtraction (Sub): "<x>^A - <y>^A" means: input <x>^A and <y>^A, output <x-y>^A. (3) Two-party slice comparison (Compare): "Compare(<x>^A, <y>^A)" means: input <x>^A and <y>^A, output <x<=y>^B, i.e., output 1 if x <= y, otherwise output 0; alternatively, input <x>^A and a public value v, output <x<=v>^B, i.e., output 1 if x <= v holds, otherwise output 0. (4) Two-party slice AND (And): "And(<x>^B, <y>^B)" means: input <x>^B and <y>^B, output <x&y>^B; alternatively, input <x>^B and a public value y, output <x&y>^B; when x and y are vectors, And is performed component-wise. (5) Two-party selection (Multiplex): "MUX(<x>^B, <y>^A)" or "Multiplex(<x>^B, <y>^A)" means: input <x>^B and <y>^A, output <x·y>^A, i.e., if x = 1 the output slices correspond to y, otherwise they correspond to 0; when x and y are both vectors, Multiplex is performed component-wise; when x is a vector and y is a scalar, each element of x performs Multiplex with y. (6) Two-party slice exclusive or (Xor): "XOR(<x>^B, <y>^B)" means: input <x>^B and <y>^B, output <x XOR y>^B. (7) Two-party conditional selection (CMux): "CMux(<x>^B, <y_0>^A, <y_1>^A)" means: input <x>^B, <y_0>^A and <y_1>^A, output <x·y_0 + (1-x)·y_1>^A, i.e., if x = 1 the output slices correspond to y_0, otherwise the output slices correspond to y_1; when x, y_0 and y_1 are all vectors, CMux is performed component-wise. (8) Two-party slice Sigmoid: input <x>^A, output <sigmoid(x)>^A, where sigmoid(x) = 1/(1+exp(-x)).
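The input/output semantics of the selection operators above can be sketched as a plaintext reference model (illustrative only: a real protocol computes these outputs without either party reconstructing x or y; all function names here are invented for the sketch):

```python
import secrets

P = 2 ** 128  # ring Z_p, p = 2^128

def share_a(v):
    """Produce fresh arithmetic slices of v."""
    r = secrets.randbelow(P)
    return (r, (v - r) % P)

def reconstruct_a(sh):  # arithmetic: sum of slices mod p
    return (sh[0] + sh[1]) % P

def reconstruct_b(sh):  # Boolean: XOR of slices
    return sh[0] ^ sh[1]

def mux(x_b, y_a):
    """Reference semantics of MUX(<x>^B, <y>^A): slices of x*y."""
    x, y = reconstruct_b(x_b), reconstruct_a(y_a)
    return share_a(x * y % P)

def cmux(x_b, y0_a, y1_a):
    """Reference semantics of CMux: slices of y0 if x == 1, else y1."""
    x = reconstruct_b(x_b)
    y0, y1 = reconstruct_a(y0_a), reconstruct_a(y1_a)
    return share_a((x * y0 + (1 - x) * y1) % P)

x_b = (1, 0)            # Boolean slices of x = 1 (XOR is 1)
y_a = share_a(10)
assert reconstruct_a(mux(x_b, y_a)) == 10                        # x = 1 selects y
assert reconstruct_a(cmux((0, 0), share_a(3), share_a(9))) == 9  # x = 0 selects y1
```

In an actual deployment these operators would be realized with secure primitives (e.g., oblivious transfer or multiplication triples) so that the intermediate values x and y are never revealed; the sketch only pins down what each operator must output.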
It is to be appreciated that the above network framework can be applied in business scenarios such as finance (e.g., assisting credit services), medical care (e.g., cross-hospital medical research), autonomous driving (e.g., improving the driving experience), and multimedia data recommendation (e.g., video recommendation). For example, in a medical scenario, rare diseases are frequently encountered in medical research, and practical difficulties such as samples being dispersed across different hospitals currently hinder diagnosis and treatment; samples from different hospitals can be fused through longitudinal federal learning to generate auxiliary advice information for rare diseases, helping doctors diagnose and treat them. For example, in a financial scenario, credit management for small and micro enterprises can use federal transfer learning, transferring from models the financial institution has already applied (e.g., a credit model for medium and large enterprises, or a marketing model for small and micro enterprises) to improve the application effect. For example, in an autonomous driving scenario, interactive learning between the vehicle and its environment can, by way of longitudinal federation with other city information sources (such as urban cameras, traffic lights and future intelligent roads), fuse information from different sources under privacy protection, improving the driving experience. For another example, in a multimedia data recommendation scenario, data from different platforms can be learned interactively, improving the accuracy of multimedia data recommendation.
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application. The terminal device 20a shown in fig. 2 may be a first participant in the embodiment corresponding to fig. 1, and the terminal device 20b shown in fig. 2 may be a second participant in the embodiment corresponding to fig. 1, where the first participant and the second participant may participate in the vertical federal learning together. The terminal device 20a may include a first service identifier and feature data of the first service identifier in the first service feature (i.e., the first feature data), and the terminal device 20b may include a second service identifier and feature data of the second service identifier in the second service feature (i.e., the second feature data).
As shown in fig. 2, the terminal device 20a may obtain the first feature data of the first service identifier and the first data fragment of the second service identifier, and the terminal device 20b may obtain the second data fragment of the second service identifier. Wherein the first data slice and the second data slice are slices (i.e. arithmetic slices) of the second characteristic data of the second service identity, i.e. the second characteristic data are present in the terminal device 20a and the terminal device 20b in the form of slices.
Wherein, the terminal device 20a may include a first decision tree (i.e., decision tree 21 a), and the terminal device 20b may include a second decision tree (i.e., decision tree 21 b), the first decision tree and the second decision tree corresponding (i.e., the first decision tree and the second decision tree have the same structure, and the nodes in the first decision tree and the nodes in the second decision tree correspond). As shown in fig. 2, the terminal device 20a may input the first feature data and the first data slice into a first decision tree, and the terminal device 20b may input the second data slice into a second decision tree.
The first decision tree may include a first partition node, and the second decision tree may include a second partition node, where the first partition node corresponds to the second partition node; the first decision tree may further comprise other nodes than the first partition node, and the second decision tree may further comprise other nodes than the second partition node, the nodes in the first decision tree and the nodes in the second decision tree being in one-to-one correspondence (i.e., the first decision tree and the second decision tree are synchronized). As shown in fig. 2, the first partition node may include a first split feature slice, and the second partition node may include a second split feature slice, where the first split feature slice and the second split feature slice are slices (i.e., arithmetic slices) of the feature identifier of the split feature that the first partition node and the second partition node commonly correspond to. For example, if the feature identifier of the split feature commonly corresponding to the first partition node and the second partition node is 0, the first split feature slice and the second split feature slice are slices of 0; if that feature identifier is 1, the first split feature slice and the second split feature slice are slices of 1.
As shown in fig. 2, the terminal device 20a may obtain a first feature boolean fragment associated with the split feature, and the terminal device 20b may obtain a second feature boolean fragment associated with the split feature, where the first feature boolean fragment and the second feature boolean fragment are fragments of a feature boolean vector (i.e., boolean fragments), and the first feature boolean fragment and the second feature boolean fragment are each obtained by vector processing the first split feature fragment and the second split feature fragment.
It can be understood that the first feature Boolean slice and the second feature Boolean slice are obtained by vector processing on the first split feature slice and the second split feature slice through DPFSS (Distributed Point Function Secret Sharing). Suppose terminal device 20a inputs i_0 (i.e., the first split feature slice) and terminal device 20b inputs i_1 (i.e., the second split feature slice), where i_0 and i_1 satisfy i_0 XOR i_1 = i, with i equal to the feature identifier of the split feature. Here i_0 and i_1 may be in binary form, and i is the result of exclusive-or processing of the binary digits at the same positions in i_0 and i_1; e.g., i_0 = 01 and i_1 = 10 give i = 11, the binary value of 3; i_0 = 11 and i_1 = 10 give i = 01, the binary value of 1. Further, after executing the distributed point function secret sharing protocol, terminal device 20a and terminal device 20b respectively obtain Boolean vectors beta_0 (i.e., the first feature Boolean slice) and beta_1 (i.e., the second feature Boolean slice) of length s (i.e., the total number of the first service features and the second service features), such that beta_0 XOR beta_1 = e_i, where e_i (i.e., the feature Boolean vector) denotes the unit vector whose i-th position is 1 and whose remaining positions are 0.
Thus, the feature boolean vector is used to characterize the split feature in the first traffic feature of the first traffic identity and the second traffic feature of the second traffic identity. For example, the number of the first service features is 1, the number of the second service features is 2, the total number of the first service features and the second service features is 3, the feature identifier of the first service features can be 0, and the feature identifiers of the second service features can be 1 and 2; if the split feature corresponding to the first partition node and the second partition node is the first service feature, the feature identifier of the split feature is 0, and at this time, the feature boolean vector may be (1, 0), that is, the feature boolean vector (1, 0) may be used to characterize the split feature in the first service feature and the second service feature as the service feature with the feature identifier equal to 0. Similarly, a feature boolean vector (0, 1, 0) may be used to characterize split features in the first and second traffic features as traffic features with feature identities equal to 1; the feature boolean vector (0, 1) may be used to characterize split features in the first and second traffic features as traffic features with feature identities equal to 2.
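The relationship beta_0 XOR beta_1 = e_i described above can be checked with a small plaintext simulation (for illustration only; in a real DPFSS protocol the two parties derive their Boolean slices without either party learning i, and `share_unit_vector` is an invented helper name):

```python
import secrets

def share_unit_vector(i, s):
    """Split the unit vector e_i of length s into two Boolean slices."""
    e = [1 if j == i else 0 for j in range(s)]
    beta0 = [secrets.randbelow(2) for _ in range(s)]   # uniformly random slice
    beta1 = [b ^ v for b, v in zip(beta0, e)]          # complementary slice
    return beta0, beta1

s = 3  # total number of first + second service features
beta0, beta1 = share_unit_vector(1, s)
e = [b0 ^ b1 for b0, b1 in zip(beta0, beta1)]
assert e == [0, 1, 0]  # characterizes the service feature with identifier 1
```

Because beta_0 alone is uniformly random, it reveals nothing about which position holds the 1, which is exactly why the feature Boolean vector can mark the split feature without disclosing it to either party.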
As shown in fig. 2, the first feature Boolean slice and the second feature Boolean slice are used together to obtain node feature data associated with the split feature from the first feature data, the first data slice, and the second data slice. If the split feature belongs to the first service features, the node feature data is determined by the first feature data; optionally, if the split feature belongs to the second service features, the node feature data is determined by the first data slice and the second data slice.
Further, the node characteristic data may be used to determine a sub-predictor of an intersection service identity between the first service identity and the second service identity, the sub-predictor may represent a common output of the first decision tree and the second decision tree. Further, the sub-predictors may be used to determine a predictor of the intersection business identity, the predictor representing a common output of all decision trees of the first party and all decision trees of the second party.
Wherein the predicted value may be used to determine the traffic handling result of the intersection traffic identity. For example, the service data may be symptom information in a medical scenario, the first decision tree and the second decision tree may be used for performing service processing on the symptom information, the service processing result may be auxiliary suggestion information corresponding to the symptom information, and the auxiliary suggestion information may be used for providing the doctor with the auxiliary suggestion information, so that the doctor performs diagnosis and treatment on the disease. For another example, the service data may be vehicle running data in an unmanned scenario, the first decision tree and the second decision tree may be used to perform service processing (e.g., vehicle running state analysis) on the vehicle running data, and the service processing result may be a running state corresponding to the vehicle running data, so as to implement unmanned driving of the vehicle. For another example, the service data may be operation data in a multimedia data recommendation scenario, the first decision tree and the second decision tree may be used to perform service processing (e.g. classification tag identification) on the operation data, and the service processing result may be a classification tag corresponding to the operation data, so as to implement multimedia data recommendation for the classification tag.
For easy understanding, the embodiment of the application is illustrated by taking a multimedia data recommendation scene as an example. For example, the first service identifier and the second service identifier may be object identifiers (i.e., user identifiers), the first service feature may be a video service feature (e.g., hobbies, video viewing duration) of the object identifier in the video client (i.e., first participant), the second service feature may be a news service feature (e.g., comment number) of the object identifier in the news client (i.e., second participant), longitudinal federal learning may be performed by the feature data of the object identifier in the video client and the feature data in the news client, and a common output (i.e., sub-prediction value) of the first decision tree and the second decision tree may be generated, and the sub-prediction value may determine a classification label for the service data, thereby implementing multimedia data recommendation for the service object (i.e., user) corresponding to the service data. In particular embodiments of the present application, where video service features and news service features are involved, user permissions or consents need to be obtained when embodiments of the present application are applied to particular products or technologies, and the collection, use and processing of relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions.
Therefore, when splitting at a partition node of the decision trees (i.e., the first decision tree and the second decision tree) according to its split feature, the embodiment of the present application can obtain, from the first feature data, the first data slice and the second data slice, the node feature data associated with the split feature according to the slices of the feature identifier of the split feature (i.e., the first split feature slice and the second split feature slice), so as to determine the predicted value of the intersection service identifier based on the node feature data. In this way, the intersection service identifier between the first service identifier and the second service identifier does not need to be obtained during acquisition of the node feature data, so the first service identifier does not need to be compared with the second service identifier, which improves the security of the data respectively owned by the first participant and the second participant.
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the application. The method may be performed by a first terminal device (i.e., a first participant) participating in longitudinal federal learning, or may be performed by a second terminal device (i.e., a second participant) participating in longitudinal federal learning, or may be performed by both the first terminal device and the second terminal device, where the first terminal device may be the terminal device 20a in the embodiment corresponding to fig. 2, and the second terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For easy understanding, the embodiment of the present application will be described by taking the method performed by the first terminal device as an example. The data processing method may include the following steps S101 to S103:
Step S101, first characteristic data of a first service identifier of a first participant and first data fragments of a second service identifier of a second participant participating in longitudinal federal learning are obtained;
wherein the first data fragment and the second data fragment of the second service identity held by the second party are fragments of second characteristic data of the second service identity.
For a specific process of the first participant obtaining the first feature data and the first data slicing, refer to the following description of step S1011 to step S1013 in the embodiment corresponding to fig. 5.
Step S102, inputting first characteristic data and first data fragments into a first decision tree of a first participant;
the first decision tree comprises a first split feature fragment corresponding to a first partition node, and the first partition node corresponds to a second partition node of a second decision tree of the second participant; the second split feature segment corresponding to the first split feature segment and the second partition node is a segment of the feature identification of the split feature that the first partition node and the second partition node commonly correspond to (i.e.i may represent the feature identity of the split feature, i 0 Can represent a first split feature slice, i 1 A second split feature slice may be represented). Similarly, the second participant may input the second data slice into a second decision tree.
The number of decision trees of the first participant and the number of decision trees of the second participant are K, where K may be a positive integer. The K decision trees of the first party correspond to the K decision trees of the second party, the K decision trees of the first party may include a first decision tree, the K decision trees of the second party may include a second decision tree, and the first decision tree corresponds to the second decision tree. It will be appreciated that the correspondence of the first decision tree and the second decision tree means that the first decision tree and the second decision tree are co-generated during the vertical federal learning model training process, the first decision tree and the second decision tree are identical in structure, the first decision tree and the second decision tree can be used to store complementary information (e.g., the first split feature slice and the second split feature slice), and the first decision tree and the second decision tree can be commonly applied to vertical federal learning model prediction. Similarly, the first partition node and the second partition node may be two corresponding nodes in the first decision tree and the second decision tree, that is, the first partition node and the second partition node may be two nodes storing complementary information in the first decision tree and the second decision tree.
The first partition node may be a non-leaf node in the first decision tree, i.e., the first partition node may be a root node or an intermediate node in the first decision tree; similarly, the second partition node may be a non-leaf node in the second decision tree, i.e. the second partition node may be a root node or an intermediate node in the second decision tree.
Step S103, acquiring a first feature boolean fragment associated with the split feature.
Wherein the first feature Boolean slice and the second feature Boolean slice (the latter held by the second participant) associated with the split feature are slices of the feature Boolean vector (i.e., beta_0 XOR beta_1 = e_i, where e_i may represent the feature Boolean vector, beta_0 may represent the first feature Boolean slice, and beta_1 may represent the second feature Boolean slice); the first feature Boolean slice and the second feature Boolean slice are both obtained by vector processing on the first split feature slice and the second split feature slice; the feature Boolean vector is used to characterize the split feature among the first service features of the first service identifier and the second service features of the second service identifier, and the feature Boolean vector is a unit vector.
Wherein the total number of the first service features and the second service features is Q, where Q may be an integer greater than 1; the vector dimension of the feature boolean vector is equal to Q (i.e., the dimension of the first feature boolean slice is equal to Q and the dimension of the second feature boolean slice is equal to Q), the feature boolean vector comprising Q feature boolean parameters; one of the Q characteristic boolean parameters may be used to characterize one of the Q service features; the characteristic boolean parameter belonging to the characteristic identifier in the characteristic boolean vector is a matching boolean parameter (e.g., the matching boolean parameter may be 1), and the characteristic boolean parameter not belonging to the characteristic identifier in the characteristic boolean vector is a non-matching boolean parameter (e.g., the non-matching boolean parameter may be 0). In other words, the value of the feature boolean vector at the feature identifier of the split feature is 1, and the value of the feature boolean vector at other identifiers than the feature identifier of the split feature is 0.
The specific process that the first feature boolean fragmentation and the second feature boolean fragmentation are used together to obtain the node feature data associated with the split feature from the first feature data, the first data fragmentation and the second data fragmentation, and the first participant and the second participant are used together to obtain the node feature data associated with the split feature from the first feature data, the first data fragmentation and the second data fragmentation can be seen in the following description of step S201-step S202 in the embodiment corresponding to fig. 7.
The node characteristic data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier, and the intersection service identifier can represent an intersection between the first service identifier and the second service identifier; the predicted value is used to determine the traffic processing result of the intersection traffic identity.
It can be understood that the first participant can acquire the first result fragments respectively corresponding, at the first participant, to the K-1 decision trees of the intersection service identifier; the K-1 decision trees of the first participant are the decision trees other than the first decision tree among the K decision trees of the first participant. Further, the first participant may sum the first result fragments respectively corresponding to the K-1 decision trees of the first participant and the first result fragment corresponding to the first decision tree (i.e., sum the first result fragments respectively corresponding to the K decision trees of the first participant) to generate the first output fragment corresponding to the intersection service identifier. The first output fragment and the second output fragment corresponding to the intersection service identifier held by the second participant are fragments of the output vector (i.e., arithmetic fragments), i.e., the first output fragment and the second output fragment are fragments of the predicted value; the second output fragment is generated by the second participant summing the second result fragments respectively corresponding to the K-1 decision trees of the second participant and the second result fragment corresponding to the second decision tree (i.e., summing the second result fragments respectively corresponding to the K decision trees of the second participant), where the K-1 decision trees of the second participant are the decision trees other than the second decision tree among the K decision trees of the second participant. The first result fragment corresponding to the first decision tree and the second result fragment corresponding to the second decision tree are determined by the node feature data, and are fragments of the sub-predicted value of the intersection service identifier (i.e., arithmetic fragments); the sub-predicted value represents the common output of the first decision tree and the second decision tree, and the predicted value represents the common output of the K decision trees of the first participant and the K decision trees of the second participant.
Optionally, the first participant may obtain the first result fragments respectively corresponding, at the first participant, to the K decision trees of the intersection service identifier; similarly, the second participant may obtain the second result fragments respectively corresponding, at the second participant, to the K decision trees of the intersection service identifier. Among the K first result fragments and the K second result fragments, the first result fragment and the second result fragment corresponding to the same decision tree may be used to determine the sub-predicted value corresponding to that decision tree (e.g., the first result fragment corresponding to the first decision tree and the second result fragment corresponding to the second decision tree may be used to determine the sub-predicted value commonly output by the first decision tree and the second decision tree), and the sub-predicted values respectively corresponding to the K decision trees may be used to determine the predicted value of the intersection service identifier. It can be understood that the second participant may send the second result fragments corresponding to the K decision trees to the first participant, so that the first participant can sum, for each of the K decision trees, the corresponding first result fragment and second result fragment to obtain the sub-predicted values respectively corresponding to the K decision trees, and further sum the sub-predicted values respectively corresponding to the K decision trees to obtain the predicted value of the intersection service identifier.
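The per-tree result-fragment summation and the final reconstruction of the predicted value can be illustrated with additive secret shares over the ring Z_{2^32}; the helper names and the K = 3 example values are assumptions for illustration:

```python
import secrets

MOD = 2 ** 32  # ring for arithmetic (additive) fragments

def share(value: int) -> tuple:
    """Additively share a value: the two fragments sum to the value mod 2^32."""
    a = secrets.randbelow(MOD)
    return a, (value - a) % MOD

def reconstruct(a: int, b: int) -> int:
    return (a + b) % MOD

# Sub-predicted values jointly output by the K decision-tree pairs.
sub_predictions = [3, 7, 5]            # K = 3 decision trees
shares = [share(v) for v in sub_predictions]

# Each participant locally sums its K result fragments into its output fragment.
first_output_share = sum(a for a, _ in shares) % MOD
second_output_share = sum(b for _, b in shares) % MOD

# Reconstructing the two output fragments yields the predicted value.
assert reconstruct(first_output_share, second_output_share) == sum(sub_predictions)
```

Because addition is local on additive shares, neither participant learns any sub-predicted value before the final reconstruction.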
The prediction tree in the embodiment of the present application may be a Regression Tree or a Classification Tree. It can be understood that, if the prediction tree in the embodiment of the present application is a regression tree, the embodiment of the present application may directly use the predicted value of the intersection service identifier; optionally, if the prediction tree in the embodiment of the present application is a classification tree (i.e., a classification model), the embodiment of the present application further needs to take the predicted value of the intersection service identifier (the predicted value is represented by the first output slice and the second output slice, i.e., <ŷ>^A) as input and execute the two-party sliced Sigmoid function <y'>^A = Sigmoid(<ŷ>^A), obtaining the function output slices <y'>^A, and updating the first output slice and the second output slice with the function output slices <y'>^A.
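The input/output behaviour of the two-party sliced Sigmoid step can be sketched as follows. This models only the functionality (the real secure protocol evaluates the Sigmoid without ever reconstructing y); the re-randomization range is an arbitrary choice for illustration:

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def shared_sigmoid(share_a: float, share_b: float) -> tuple:
    """Functionality of the two-party Sigmoid: from additive fragments of y,
    produce fresh additive fragments of Sigmoid(y).  Shown in the clear;
    a secure protocol would compute this without reconstructing y."""
    y = share_a + share_b
    out = sigmoid(y)
    new_a = random.uniform(-1.0, 1.0)   # fresh randomness re-hides the output
    return new_a, out - new_a

a, b = 1.7, -0.5                        # fragments of y = 1.2
new_a, new_b = shared_sigmoid(a, b)
assert abs((new_a + new_b) - sigmoid(1.2)) < 1e-9
```

The updated fragments replace the first and second output slices, so downstream steps keep operating on shares of the classification score.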
It should be appreciated that after the first participant acquires the first output slice and the second participant acquires the second output slice, the first participant may acquire the first virtual slice and the second participant may acquire the second virtual slice. The first virtual slice held by the first participant and the second virtual slice held by the second participant may be slices of a virtual vector (i.e., arithmetic slices), and the virtual vector may include virtual parameters, which may be used to distinguish real predicted values; for example, the virtual parameter may be a very small negative number, i.e., the virtual vector may be a vector of very small negative numbers, e.g., the virtual parameter may be equal to -2^31.
It should be appreciated that when a first participant obtains first characteristic data and a first data slice, a second participant obtains a second data slice, the first participant may obtain a first boolean intersection slice, and the second participant may obtain a second boolean intersection slice. Wherein the first boolean intersection slice and the second boolean intersection slice are slices of boolean intersection vectors (i.e., boolean slices), the boolean intersection vectors being used to indicate the intersection state of the first service identity for the second service identity.
It is to be appreciated that the first Boolean intersection slice and the second Boolean intersection slice may be jointly used for slice selection among the first output slice, the second output slice, the first virtual slice and the second virtual slice (i.e., <y>^A = MUX(<Q>^B, <ŷ>^A, <τ>^A)), obtaining the selected first output slice held by the first participant and the selected second output slice held by the second participant. Wherein <Q>^B represents the first Boolean intersection slice and the second Boolean intersection slice, <ŷ>^A represents the first output slice and the second output slice, <τ>^A represents the first virtual slice and the second virtual slice, and <y>^A represents the selected first output slice and the selected second output slice. Further, the first participant may update the first output slice with the selected first output slice (i.e., take the selected first output slice as the first output slice), and the second participant may update the second output slice with the selected second output slice (i.e., take the selected second output slice as the second output slice).
Wherein the selected first output slice is determined by the first output slice and the first virtual slice, and the selected second output slice is determined by the second output slice and the second virtual slice. If the merged Boolean intersection parameter indicated by the first Boolean intersection slice and the second Boolean intersection slice is a successfully matched parameter, the parameter of the selected first output slice corresponding to that merged Boolean intersection parameter is obtained by selecting from the first output slice, and the parameter of the selected second output slice corresponding to that merged Boolean intersection parameter is obtained by selecting from the second output slice; if the merged Boolean intersection parameter indicated by the first Boolean intersection slice and the second Boolean intersection slice is an unsuccessfully matched parameter, the parameter of the selected first output slice corresponding to that merged Boolean intersection parameter is obtained by selecting from the first virtual slice, and the parameter of the selected second output slice corresponding to that merged Boolean intersection parameter is obtained by selecting from the second virtual slice. In other words, when Q_i = 1, y_i is the predicted value (i.e., y_i = ŷ_i); otherwise, y_i is the virtual parameter (i.e., y_i = τ_i).
The output vector is used for representing a predicted value of the intersection business identifier in longitudinal federation learning, and the second participant can send the second output fragment to the first participant, so that the first participant can perform fragment summation on the first output fragment and the second output fragment to obtain an output vector. It may be appreciated that if the output parameter in the output vector is equal to the virtual parameter, the first participant may determine that the output parameter is not a predicted value of the intersection service identity; optionally, if the output parameter in the output vector is not equal to the virtual parameter, the first participant may determine that the output parameter is a predicted value of the intersection service identifier.
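The selection semantics (real predicted value at intersection positions, virtual parameter elsewhere) and the first participant's final filtering can be shown in the clear; in the actual protocol both steps operate on slices, and the virtual parameter value below follows the -2^31 example:

```python
TAU = -2 ** 31   # virtual parameter marking non-intersection positions

def select_outputs(q_vec: list, predictions: list, tau: int = TAU) -> list:
    """Oblivious-selection semantics: y_i = prediction_i when Q_i == 1,
    otherwise y_i = the virtual parameter tau."""
    return [p if q == 1 else tau for q, p in zip(q_vec, predictions)]

def filter_predictions(output_vector: list, tau: int = TAU) -> list:
    """The first participant keeps only positions whose output parameter
    differs from the virtual parameter - the real predicted values."""
    return [(i, y) for i, y in enumerate(output_vector) if y != tau]

q_vec = [0, 1, 0, 1, 0, 0]             # Boolean intersection vector Q
predictions = [11, 22, 33, 44, 55, 66]
y = select_outputs(q_vec, predictions)
assert y == [TAU, 22, TAU, 44, TAU, TAU]
assert filter_predictions(y) == [(1, 22), (3, 44)]
```

Because non-intersection positions carry only the virtual parameter, the recovered output vector reveals predicted values solely for intersection service identifiers.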
Therefore, the embodiment of the present application can prevent the intersection service identifier from being revealed to the second participant (i.e., the second participant cannot learn whether a second service identifier is in the intersection), protecting the intersection service identifier so that it is known only to the sender, thereby enhancing security. Since the first participant may obtain either the predicted value of an intersection service identifier (i.e., a true predicted value) or the virtual parameter of a non-intersection service identifier, the first participant can obtain the intersection service identifiers (i.e., whether each first service identifier is in the intersection).
For ease of understanding, please refer to fig. 4, which is a flowchart illustrating a method for determining a predicted value according to an embodiment of the present application. As shown in fig. 4, the longitudinal XGBoost federal prediction process starts: the first participant and the second participant may jointly execute step S11, where the first participant may act as the initiator and the second participant as the receiver, and circuit privacy set intersection is performed through step S11. The specific process of performing circuit privacy set intersection through step S11 may be seen in the following description of step S1011-step S1012 in the embodiment corresponding to fig. 5.
Further, as shown in fig. 4, the first participant and the second participant may jointly execute step S12, and federated tree prediction is performed through step S12, that is, the sub-predicted value commonly output by the first prediction tree and the second prediction tree is obtained. Further, the first participant and the second participant may jointly execute step S13, and it is determined through step S13 whether the first participant and the second participant have predicted all the trees, i.e., whether the first participant has inputted the first feature data and the first data slices into the K decision trees of the first participant, and whether the second participant has inputted the second data slices into the K decision trees of the second participant.
As shown in fig. 4, if the first participant and the second participant have predicted all the trees, i.e., the first participant has inputted the first feature data and the first data slice into K decision trees of the first participant, and the second participant has inputted the second data slice into K decision trees of the second participant, the first participant and the second participant may jointly execute step S14, and the predicted value is recovered through step S14, i.e., the predicted value of the intersection service identifier is obtained through step S14.
Optionally, if the first participant and the second participant have not yet predicted all the trees, i.e., the first participant has not inputted the first feature data and the first data slices into all K decision trees of the first participant, and the second participant has not inputted the second data slices into all K decision trees of the second participant, the first participant and the second participant may continue to execute step S12, and the sub-predicted values output by the prediction trees other than the first prediction tree and the second prediction tree are obtained through step S12.
Therefore, the embodiment of the present application can acquire the first feature data of the first service identifier, and the first data fragment and the second data fragment of the second service identifier, where the first data fragment and the second data fragment are stored in fragment form at the first participant and the second participant respectively, so that the feature data of the intersection service identifier need not be directly acquired and the first service identifier and the second service identifier need not be compared in the clear (i.e., the intersection service identifier between the first service identifier and the second service identifier need not be revealed), ensuring the security of the first service identifier and the second service identifier. Further, when the split features corresponding to the partition nodes in the first decision tree and the second decision tree are in a hidden state, the embodiment of the present application can determine the first feature Boolean fragment and the second feature Boolean fragment associated with the split feature through the first split feature fragment held by the first participant and the second split feature fragment held by the second participant, and further acquire the node feature data associated with the partition nodes in the first decision tree and the second decision tree from the first feature data, the first data fragment and the second data fragment based on the first feature Boolean fragment and the second feature Boolean fragment, so as to determine the predicted value of the intersection service identifier based on the node feature data.
Therefore, when the split features of the non-leaf nodes of the first decision tree and the second decision tree obtained by longitudinal federated training are hidden, the embodiment of the present application can still securely select the feature data of the split features (i.e., the node feature data), so that the predicted value of the intersection service identifier is determined in a trace-hiding longitudinal federated prediction process, avoiding exposure of both the split features and the intersection service identifier, and improving the security of the data owned by the participants.
Further, referring to fig. 5, fig. 5 is a flow chart of a data processing method according to an embodiment of the application. The data processing method may include the following steps S1011-S1013, and steps S1011-S1013 are one embodiment of step S101 in the embodiment corresponding to fig. 3.
Step S1011, performing cuckoo hash mapping on a first service identifier of a first participant to obtain a first hash table corresponding to the first service identifier;
wherein Cuckoo Hash mapping (Cuckoo Hash) represents a method of mapping m elements (i.e., the first service identifiers) to n positions through k hashes (i.e., k hash functions), requiring different elements to be mapped to different positions. Briefly, if the i-th element has been mapped to a certain position under the action of the t-th (t < k) hash, and the j-th (j > i) element is also mapped to that position, a conflict arises at that position; the j-th element is then placed at that position, and the i-th element is remapped to a new position under the action of the (t+1)-th hash. If the new position again conflicts, the process is repeated until all m elements are mapped.
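A minimal cuckoo-insertion sketch, assuming two hash functions and a bounded number of evictions (real implementations add a stash or resize on failure); the hash functions below are toy choices for illustration:

```python
import random

def cuckoo_insert(table: list, hashes: list, element: int, max_kicks: int = 50) -> bool:
    """Insert one element into the cuckoo table: if every candidate position
    is occupied, a resident element is evicted and reinserted with its hashes."""
    for _ in range(max_kicks):
        for h in hashes:
            pos = h(element)
            if table[pos] is None:
                table[pos] = element
                return True
        # all candidate positions occupied: evict one resident and retry with it
        pos = hashes[random.randrange(len(hashes))](element)
        table[pos], element = element, table[pos]
    return False  # table too full; real implementations resize or use a stash

n = 8
hashes = [lambda x: x % n, lambda x: (x * 7 + 3) % n]
table = [None] * n
for ident in [2, 3, 7, 10]:            # the fig. 6 first service identifiers
    assert cuckoo_insert(table, hashes, ident)
assert sorted(x for x in table if x is not None) == [2, 3, 7, 10]
```

Each bucket holds at most one identifier, which is exactly the property the initiator's hash table needs.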
Similarly, the second participant participating in the longitudinal federation learning may perform hash mapping on the second service identifier of the second participant to obtain a second hash table (i.e., the second hash table is obtained by performing hash mapping on the second service identifier of the second participant by the second participant participating in the longitudinal federation learning).
In the embodiment of the present application, the hash mapping may be a Simple Hash mapping (Simple Hash), which indicates a method of mapping m elements (i.e., the second service identifiers) to n positions through k hashes (i.e., k hash functions), where positions may conflict, i.e., different elements may be mapped to the same position. Since different elements can be mapped to the same position through simple hash mapping, and the same element is mapped to at least two positions based on at least two hash functions, no additional processing is needed when hash conflicts occur (i.e., no hash conflict resolution is required), which improves the efficiency of hash mapping.
It should be understood that, in the embodiment of the present application, the number of hash functions is not limited, and for convenience of understanding, the embodiment of the present application is described by taking the number of hash functions as 2 as an example, and 2 hash functions in the simple hash map indicate that each second service identifier may be mapped to 2 positions.
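Simple hashing with 2 hash functions can be sketched as follows: each identifier lands in every bucket its hashes point to, and collisions are simply accumulated (hash functions are the same toy choices as above, assumed for illustration):

```python
def simple_hash_table(ids: list, hashes: list, n: int) -> list:
    """Each identifier is inserted at every one of its hash positions;
    collisions are allowed, so a bucket may hold several identifiers."""
    table = [[] for _ in range(n)]
    for ident in ids:
        for h in hashes:
            pos = h(ident)
            if ident not in table[pos]:  # both hashes may point to one bucket
                table[pos].append(ident)
    return table

n = 8
hashes = [lambda x: x % n, lambda x: (x * 7 + 3) % n]
table = simple_hash_table([2, 4, 5, 6, 7, 9], hashes, n)
# every identifier appears in up to 2 buckets (one per hash function)
assert all(sum(ident in b for b in table) in (1, 2) for ident in [2, 4, 5, 6, 7, 9])
```

Placing each second service identifier at all of its hash positions guarantees that whichever hash the initiator's cuckoo table used, the matching bucket on the receiver's side contains the identifier.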
Step S1012, acquiring a first Boolean intersection slice associated with the first hash table and the second hash table;
wherein the first Boolean intersection slice and the second Boolean intersection slice held by the second participant are slices of the Boolean intersection vector, which is used for indicating the intersection state of the first service identifiers with respect to the second service identifiers (Q_i = 1 when the i-th service identifier of the first participant matches some service identifier of the second participant, otherwise Q_i = 0); the first hash table and the second hash table are used for carrying out hash table matching through an oblivious programmable pseudo-random function to generate the first Boolean intersection slice and the second Boolean intersection slice.
The hash table dimensions of the first hash table and the second hash table are the same; the first hash table comprises a first hash mapping bucket, and the second hash table comprises a second hash mapping bucket with the same hash table dimension as the first hash mapping bucket. If the first hash mapping bucket comprises a first service identifier, the first random number corresponding to the first service identifier in the first hash mapping bucket is obtained by randomly processing the first service identifier in the first hash mapping bucket based on an oblivious programmable pseudo-random function. If the second hash mapping bucket comprises a second service identifier, the second random number corresponding to the second service identifier in the second hash mapping bucket is obtained by randomly processing the second service identifier in the second hash mapping bucket based on an oblivious programmable pseudo-random function. The first random number and the second random number are used for carrying out random number matching between the first participant and the second participant, generating a first Boolean intersection parameter corresponding to the first hash mapping bucket and a second Boolean intersection parameter corresponding to the second hash mapping bucket; the first Boolean intersection parameter and the second Boolean intersection parameter are slices of a merged Boolean intersection parameter (i.e., Boolean slices), the merged Boolean intersection parameter being used to indicate the Boolean matching result of the first hash mapping bucket and the second hash mapping bucket (i.e., the first Boolean intersection parameter and the second Boolean intersection parameter may be result slices representing whether the first service identifier is in the intersection or not); the first Boolean intersection parameters corresponding to each hash table dimension in the first hash table are used for forming the first Boolean intersection slice corresponding to the first service identifiers, and the second Boolean intersection parameters corresponding to each hash table dimension in the second hash table are used for forming the second Boolean intersection slice corresponding to the first service identifiers.
Optionally, if the first hash mapping bucket does not include the first service identifier, the first service identifier does not need to be randomly processed; if the second hash mapping bucket does not contain the second service identifier, the second service identifier does not need to be randomly processed.
It should be appreciated that the Oblivious Programmable Pseudo-Random Function may be the OPPRF protocol, which represents the following functionality: one party inputs a set of points X = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, and the other party inputs x; after executing the protocol, the other party obtains the value y corresponding to x, satisfying the condition that when x = x_i, y = y_i; otherwise y is a random number. Optionally, the embodiment of the present application may also perform random processing on the service identifiers (including the first service identifiers and the second service identifiers) through an Oblivious Pseudo-Random Function (OPRF).
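The input/output behaviour of the OPPRF can be modelled with a plain dictionary mock. This captures only the functionality described above; the real protocol additionally hides the programmed points from the receiver and the query from the sender:

```python
import secrets

def opprf_functionality(programmed_points: list, x: int, out_bits: int = 32) -> int:
    """Ideal OPPRF functionality: the sender programs points
    (x_1, y_1), ..., (x_n, y_n); on input x the receiver learns y_i
    when x == x_i, and an unrelated pseudorandom value otherwise."""
    lookup = dict(programmed_points)
    if x in lookup:
        return lookup[x]
    return secrets.randbits(out_bits)   # indistinguishable from a programmed value

points = [(2, 12), (7, 17), (9, 19)]
assert opprf_functionality(points, 7) == 17    # programmed input hits its value
_ = opprf_functionality(points, 5)             # any other input yields a random value
```

Because non-programmed inputs yield random values, equality of the two parties' outputs reveals exactly one bit, membership, and nothing else.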
It will be appreciated that the same first service identity and second service identity may be randomly processed to obtain the same random number and different first service identity and second service identity may be randomly processed to obtain different random numbers. If the first random number is the same as the second random number, the embodiment of the application can determine a matching success parameter (for example, the matching success parameter can be 1) as a merging boolean intersection parameter corresponding to the first hash mapping bucket and the second hash mapping bucket; optionally, if the first random number and the second random number are different, the embodiment of the present application may determine a parameter that is unsuccessful in matching (for example, the parameter that is unsuccessful in matching may be 0) as a merged boolean intersection parameter corresponding to the first hash map bucket and the second hash map bucket.
It may be understood that if the first hash mapping bucket includes the first service identifier and the second hash mapping bucket includes the second service identifier, the number of the first random numbers may be one, the number of the second random numbers may be one or more, and in the embodiment of the present application, the random number matching may be performed between the first random number and the one or more second random numbers. Thus, if the one or more second random numbers have the same random number as the first random number, determining that the first random number and the second random number are the same; optionally, if the one or more second random numbers do not have the same random number as the first random number, it is determined that the first random number and the second random number are different.
Optionally, if the first hash mapping bucket includes the first service identifier and the second hash mapping bucket does not include the second service identifier, the embodiment of the present application may determine the unsuccessful matching parameter as a merged boolean intersection parameter corresponding to the first hash mapping bucket and the second hash mapping bucket. Optionally, if the first hash mapping bucket does not include the first service identifier and the second hash mapping bucket includes the second service identifier, the embodiment of the present application may determine the unsuccessful matching parameter as a merged boolean intersection parameter corresponding to the first hash mapping bucket and the second hash mapping bucket.
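The bucket-wise random-number matching and the Boolean sharing of the merged parameters can be put together as follows (bucket layout and random numbers are illustrative, loosely following the fig. 6 example; empty buckets never match):

```python
import secrets

def match_buckets(first_rands: list, second_rands: list) -> list:
    """Per bucket: the merged Boolean intersection parameter is 1 when the
    initiator's random number appears among the receiver's random numbers
    for the same bucket, otherwise 0 (None marks an empty initiator bucket)."""
    return [1 if r1 is not None and r1 in r2_list else 0
            for r1, r2_list in zip(first_rands, second_rands)]

def xor_share_bits(bits: list) -> tuple:
    """Split the merged parameters into the first and second Boolean
    intersection slices: their XOR recovers the merged vector."""
    a = [secrets.randbelow(2) for _ in bits]
    b = [x ^ v for x, v in zip(a, bits)]
    return a, b

# Random numbers per bucket (bucket 2 of fig. 6: first random number 17
# versus second random numbers {14, 17} - a match).
first_rands = [None, 17, None, 12, None, 13]
second_rands = [[21], [14, 17], [], [12], [16], []]
merged = match_buckets(first_rands, second_rands)
assert merged == [0, 1, 0, 1, 0, 0]            # the Boolean intersection vector Q
share_a, share_b = xor_share_bits(merged)
assert [x ^ y for x, y in zip(share_a, share_b)] == merged
```

Each participant keeps one share, so neither sees the merged vector, and hence neither learns the intersection on its own.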
For ease of understanding, please refer to fig. 6, which is a schematic diagram of a scenario for performing hash mapping according to an embodiment of the present application. As shown in fig. 6, Circuit-PSI (Private Set Intersection Circuit, circuit privacy set intersection) implements service identifier secrecy; the circuit privacy set intersection protocol includes a first participant and a second participant, where the first participant may be the initiator and the second participant may be the receiver. The initiator maps the first service identifiers to different sub-buckets (bins, i.e., hash mapping buckets) through cuckoo hashing, with each sub-bucket holding at most one first service identifier; the receiver maps the second service identifiers to different sub-buckets through simple hashing, and each sub-bucket may hold one or more second service identifiers.
Circuit privacy set intersection means that the participating parties each input a service identifier set (i.e., an ID set), and finally the parties can only obtain fragment information about the intersection (i.e., the intersection service identifiers), i.e., Boolean fragments of whether the initiator's data (i.e., the first service identifiers) are in the intersection, so that neither the intersection service identifiers (i.e., the intersection IDs) nor the non-intersection service identifiers (i.e., the non-intersection IDs) are revealed. When a first service identifier is in the intersection, the initiator has a certain hash mapping bucket containing that service identifier, and the corresponding hash mapping bucket of the receiver also contains that service identifier.
As shown in fig. 6, the first participant may perform circuit privacy set intersection as the initiator; at this time, the second participant inserts its own data (i.e., the second service identifiers 61b) into the simple hash table (i.e., the second hash table 62b), and the first participant fills its own data (i.e., the first service identifiers 61a) into the cuckoo hash table (i.e., the first hash table 62a). For ease of understanding, the number of first service identifiers 61a is described here as 4; for example, the first service identifiers 61a may include 2, 3, 7 and 10. The number of second service identifiers 61b is described here as 6; for example, the second service identifiers 61b may include 2, 4, 5, 6, 7 and 9.
As shown in fig. 6, the hash table dimensions of the first hash table 62a and the second hash table 62b are the same; the first hash table 62a may include first hash mapping buckets, and the second hash table 62b may include second hash mapping buckets. For ease of understanding, the embodiment of the present application is described with the hash table dimension of the first hash table 62a and the second hash table 62b being 6. In the embodiment of the present application, a hash mapping bucket (bin) not filled with data (i.e., a service identifier) may be filled with a placeholder, where the placeholder indicates a random number referring to a null sample, and the feature data corresponding to the placeholder may be 0.
As shown in fig. 6, for example, the first hash-map bucket may be a second hash-map bucket in the first hash table 62a, the second hash-map bucket may be a second hash-map bucket in the second hash table 62b, the first hash-map bucket may include 7, and the second hash-map bucket may include 4 and 7. At this time, the first random number corresponding to 7 may be 17, the second random number corresponding to 4 may be 14, and the second random number corresponding to 7 may be 17. When the first random number and the second random number are matched, the first random number 17 is equal to the second random number 17, and the embodiment of the application can determine the successful matching parameter as the combined boolean intersection parameter corresponding to the first hash mapping barrel and the second hash mapping barrel, namely, the successful matching parameter is determined as the combined boolean intersection parameter corresponding to the second hash mapping barrel; similarly, the embodiment of the application can determine the successful matching parameter as the merging boolean intersection parameter corresponding to the fourth hash mapping bucket, and determine the unsuccessful matching parameter as the merging boolean intersection parameter corresponding to the first hash mapping bucket, the third hash mapping bucket, the fifth hash mapping bucket and the sixth hash mapping bucket respectively.
As shown in fig. 6, the Boolean intersection vector (i.e., Q) may be {0,1,0,1,0,0}, and the dimension of the Boolean intersection vector is associated with the number of first service identifiers 61a. Accordingly, the first Boolean intersection slice 63a may include 6 first Boolean intersection parameters, and the second Boolean intersection slice 63b may include 6 second Boolean intersection parameters. The exclusive-or result of the first parameter of the first Boolean intersection slice and the first parameter of the second Boolean intersection slice is the first merged Boolean intersection parameter (i.e., 0); for example, the first parameter of the first Boolean intersection slice may be 0 and the first parameter of the second Boolean intersection slice may be 0, and the exclusive-or result of 0 and 0 is 0 (i.e., the merged Boolean intersection parameter).
Step S1013, obtain the first original data of the first service identifier and the first original slice of the second service identifier.
The first original fragments and the second original fragments held by the second participant are fragments (i.e. arithmetic fragments) of second original data, and the first original data is used for representing characteristic data of the first service identifier under the first service characteristic; the second original data is used for representing characteristic data of the second service identifier under the second service characteristic; the first Boolean intersection fragment and the second Boolean intersection fragment are jointly used for carrying out data screening on first original data to obtain first characteristic data of a first service identifier held by a first participant, and the first Boolean intersection fragment and the second Boolean intersection fragment are jointly used for carrying out data screening on the first original fragment and the second original fragment to obtain a first data fragment of a second service identifier held by the first participant and a second data fragment of a second service identifier held by the second participant.
It may be appreciated that the first raw data may comprise characteristic data of a first service identity, and the first characteristic data may comprise characteristic data of an intersection service identity in the first service identity; similarly, the second original data may include feature data of a second service identifier, and the second feature data may include feature data of an intersection service identifier in the second service identifier. Thus, the data filtering means that the characteristic data of the non-intersection service identity are deleted from the first original data and the second original data according to the first boolean intersection fragment and the second boolean intersection fragment, respectively, i.e. the first characteristic data and the second characteristic data each represent the characteristic data of the intersection service identity.
It should be appreciated that the receiver of the Circuit-PSI may also take the feature data corresponding to the second service identifier as input; then, after executing the Circuit-PSI, in addition to the fragments of the Boolean intersection vector (i.e., Q), namely the first Boolean intersection fragment and the second Boolean intersection fragment, arithmetic fragments <X>^A of the second feature data (i.e., the first data fragment and the second data fragment) are obtained. <X>^A is equal in length to the Boolean intersection vector; when the Boolean intersection vector takes the value true at the i-th bin, the value of <X>^A at that bin is a fragment of the feature data corresponding to the second service identifier filled into that bin; otherwise, <X>^A holds a fragment of 0 at that bin.
Similarly, the sender of the Circuit-PSI may also take the feature data corresponding to the first service identifier as input; then, after executing the Circuit-PSI, the first feature data is obtained in addition to the fragments of the Boolean intersection vector (i.e., the first Boolean intersection fragment and the second Boolean intersection fragment). The first feature data is equal in length to the Boolean intersection vector; when the Boolean intersection vector takes the value true at the i-th bin, the value of the first feature data at that bin is the feature data corresponding to the first service identifier filled into that bin.
Therefore, according to the first boolean intersection fragment and the second boolean intersection fragment, non-matching feature data can be automatically filtered out, and matching feature data (i.e., feature data of the intersection service identification) is reserved. Wherein the characteristic data of the intersection service identity may comprise first characteristic data at the first party and second characteristic data at the second party.
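The screening semantics above can be illustrated with a minimal plaintext simulation. All values are hypothetical, and a real Circuit-PSI derives the shares via oblivious PRFs between the two parties rather than via a local dealer as sketched here:

```python
import random

MOD = 2**32  # ring for additive (arithmetic) secret shares

def arith_share(x):
    """Split x into two additive shares modulo MOD."""
    s0 = random.randrange(MOD)
    return s0, (x - s0) % MOD

# Hypothetical Boolean intersection vector Q: True where the bin holds an
# intersection service identifier.
q = [True, False, True]
# Hypothetical feature data of the second service identifier filled into bins.
features = [10, 99, 25]

# Circuit-PSI screening semantics: where q[i] is True the parties obtain
# shares of the feature value; elsewhere they obtain shares of 0, so
# non-intersection feature data never appears in the clear on either side.
shares_a, shares_b = [], []
for qi, x in zip(q, features):
    a, b = arith_share(x if qi else 0)
    shares_a.append(a)
    shares_b.append(b)

# Reconstruction, done here only to check the semantics.
recon = [(a + b) % MOD for a, b in zip(shares_a, shares_b)]
print(recon)  # [10, 0, 25]
```

Neither party's share list alone reveals which bins are in the intersection; only the sum of the two lists does.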
It should be understood that the first identification number corresponding to the first service identifier may be smaller than, equal to, or greater than the second identification number corresponding to the second service identifier. For ease of understanding, the embodiment of the present application is described by taking the case where the first identification number is smaller than the second identification number as an example; in this case, the embodiment of the present application may refer to the participant with the smaller number of identifiers as the first participant, and the participant with the larger number of identifiers as the second participant. When the number of samples of the first participant (i.e., the first identification number) is much smaller than the number of samples of the second participant (i.e., the second identification number), the performance of the embodiment of the present application will be greatly improved.
Therefore, the embodiment of the present application can combine circuit private set intersection (Circuit-PSI) to design an efficient prediction method for a two-party longitudinal federal XGBoost model, and the method can protect the post-intersection service identifiers (i.e., the intersection service identifiers) through the Circuit-PSI method. In addition, in the sample alignment stage, the embodiment of the present application can compress the data volume of the second participant participating in the intersection to be consistent with that of the first participant (namely, the sample number of the second participant is compressed to the sample number of the first participant); further, federal prediction is performed on the first participant and the second participant according to the aligned feature data. At this time, the calculation/communication complexity of federal model prediction is independent of the sample number of the second participant, so the performance can be greatly improved and practical application requirements can be met.
Further, referring to fig. 7, fig. 7 is a flow chart of a data processing method according to an embodiment of the application. The method may be performed by a first terminal device (i.e., a first participant) participating in longitudinal federal learning, or may be performed by a second terminal device (i.e., a second participant) participating in longitudinal federal learning, or may be performed by both the first terminal device and the second terminal device, where the first terminal device may be the terminal device 20a in the embodiment corresponding to fig. 2, and the second terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For easy understanding, the embodiment of the present application will be described by taking the method performed by the first terminal device as an example. The data processing method may include the following steps S201 to S207:
Step S201, obtaining first feature fragments corresponding to Q service features respectively;
wherein the Q service features include service feature V_d, where Q may be an integer greater than 1 and d may be a non-negative integer less than Q; the first feature fragment corresponding to service feature V_d and the second feature fragment corresponding to service feature V_d held by the second participant are both obtained by performing data selection on the feature data of service feature V_d using the first feature Boolean parameter of service feature V_d in the first feature Boolean fragment and the second feature Boolean parameter of service feature V_d in the second feature Boolean fragment. Wherein, if the split feature belongs to the first service feature, the feature data of service feature V_d is obtained from the first feature data; optionally, if the split feature belongs to the second service feature, the feature data of service feature V_d is obtained from the first data fragment and the second data fragment;
wherein the first feature Boolean parameter of service feature V_d in the first feature Boolean fragment represents the feature Boolean parameter of the first feature Boolean fragment under the identifier of service feature V_d, and the second feature Boolean parameter of service feature V_d in the second feature Boolean fragment represents the feature Boolean parameter of the second feature Boolean fragment under the identifier of service feature V_d.
Wherein the data selection may be expressed as <z_j>^A = Multiplex(<e_{i,j}>^B, t), where <e_{i,j}>^B may represent the first feature Boolean parameter (i.e., β_{0,j}) and the second feature Boolean parameter (i.e., β_{1,j}) (i.e., β_{0,j} and β_{1,j} are Boolean fragments of e_{i,j}), t may represent the feature data of the j-th feature (i.e., the feature identified as j), and the resulting first feature fragment and second feature fragment may be represented as <z_j>^A. It will be appreciated that when e_{i,j} = 0, z_j is an all-zero vector; when e_{i,j} = 1, z_j is the feature data of the j-th feature. After the data selection is performed on the feature data of the j-th feature, z_j is in fragment form regardless of whether the feature data of the j-th feature is obtained from the first feature data or from the first data fragment and the second data fragment.
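The Multiplex semantics can be sketched as follows. This is a toy, insecure simulation with hypothetical inputs: `multiplex_shared` reconstructs the shared selection bit for clarity, which a real two-party protocol never does:

```python
import random

MOD = 2**32

def multiplex_plain(e_bit, t_vec):
    """Multiplex semantics: z_j = t when e_{i,j} = 1, else the all-zero
    vector (plaintext reference, not the secure protocol)."""
    return list(t_vec) if e_bit == 1 else [0] * len(t_vec)

def multiplex_shared(beta_0, beta_1, t_vec):
    """Toy two-party Multiplex on an XOR-shared bit (beta_0 ^ beta_1 =
    e_{i,j}), producing additive shares of z_j. Reconstructing e here is
    for illustration only."""
    e = beta_0 ^ beta_1
    z = multiplex_plain(e, t_vec)
    share_a, share_b = [], []
    for v in z:
        r = random.randrange(MOD)
        share_a.append(r)
        share_b.append((v - r) % MOD)
    return share_a, share_b

t = [7, 3, 12]  # hypothetical feature data of the j-th feature
za, zb = multiplex_shared(1, 0, t)  # e_{i,j} = 1, so z_j = t
recon = [(a + b) % MOD for a, b in zip(za, zb)]
print(recon)  # [7, 3, 12]
```

When the shared bit reconstructs to 0, the same call yields shares of the all-zero vector instead.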
Step S202, summing the Q first characteristic fragments to obtain a first node fragment;
the first node fragments held by the first participant and the second node fragments held by the second participant are fragments of node characteristic data (namely arithmetic fragments), and the second node fragments are obtained by the second participant summing the second feature fragments corresponding to the Q service features respectively; the node characteristic data is used for representing the characteristic data of the intersection service identifier under the split feature. In the Q first feature fragments and the Q second feature fragments, the z_j jointly corresponding to the first feature fragment and the second feature fragment of the split feature is the feature data of the split feature, and the z_j jointly corresponding to the first feature fragment and the second feature fragment of any feature other than the split feature is an all-zero vector.
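Because additive sharing commutes with addition, each party can sum its Q feature fragments locally and the local sums are shares of the node feature data. A small sketch with hypothetical share values:

```python
MOD = 2**32

# Hypothetical fragments for Q = 3 features; only the split feature
# (index 1) carries real data, the others reconstruct to all-zero vectors.
first_fragments  = [[5, 9], [1, 2], [8, 7]]
second_fragments = [[MOD - 5, MOD - 9], [3, 4], [MOD - 8, MOD - 7]]

# Each party sums its own Q fragments locally, with no communication.
first_node  = [sum(col) % MOD for col in zip(*first_fragments)]
second_node = [sum(col) % MOD for col in zip(*second_fragments)]

# Reconstruction, only to check: equals the split feature's data [4, 6].
node_feature_data = [(a + b) % MOD for a, b in zip(first_node, second_node)]
print(node_feature_data)  # [4, 6]
```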
In other words, the node characteristic data includes the first node fragment held by the first participant and the second node fragment held by the second participant. If the split feature belongs to the first service feature, the first node fragment and the second node fragment are determined by the first feature data; optionally, if the split feature belongs to the second service feature, the first node fragment is obtained from the first data fragment and the second node fragment is obtained from the second data fragment (i.e., the first node fragment and the second node fragment are determined by the first data fragment and the second data fragment).
It should be appreciated that the first decision tree further includes a first split value slice corresponding to the first partition node, and the second decision tree further includes a second split value slice corresponding to the second partition node; the first split value slice and the second split value slice are slices (i.e., arithmetic slices) of split values that the first partition node and the second partition node commonly correspond to. Thus, the first partition node and the second partition node may collectively represent a split value (i.e., an optimal split value) of the split feature (i.e., the optimal split feature).
For example, the optimal split feature may be a video service feature (e.g., video viewing duration) in a multimedia data recommendation scene, the optimal split value may be 60 minutes, at this time, the left subtree (i.e., the tree formed by the left child nodes) may be an object identifier with a video viewing duration less than 60 minutes, and the right subtree (i.e., the tree formed by the right child nodes) may be an object identifier with a video viewing duration greater than or equal to 60 minutes. For another example, the optimal splitting feature may be a news service feature (e.g., the number of comments) in the multimedia data recommendation scene, the optimal splitting value may be 5, at this time, the left subtree may be an object identifier with the number of comments less than 5, and the right subtree may be an object identifier with the number of comments greater than or equal to 5.
Step S203, obtaining a first node Boolean fragment associated with a first partition node;
the first node Boolean fragments and the second node Boolean fragments which are held by the second participant and are associated with the second divided nodes are fragments (namely Boolean fragments) of a first node Boolean vector, and the first node Boolean vector is used for representing the relation between node characteristic data and split values; the first node boolean fragment and the second node boolean fragment are obtained by comparing the first node fragment and the second node fragment together with the first split value fragment and the second split value fragment.
Wherein the fragment comparison may be expressed as <b>^B = Compare(<x>^A, <y>^A), where <x>^A may represent the first node fragment and the second node fragment, <y>^A may represent the first split value fragment and the second split value fragment, and <b>^B may represent the first node Boolean fragment and the second node Boolean fragment. b may represent whether the feature data of the split feature is less than the split value; that is, the fragment comparison may obtain, from the feature data of the split feature, the feature data that is less than the split value.
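The fragment comparison can be illustrated with a toy simulation of its input/output relation. Values are hypothetical; a secure comparison (e.g. via garbled circuits or oblivious transfer) never reconstructs x or y as this sketch does:

```python
import random

MOD = 2**32

def compare_shared(xa, xb, ya, yb):
    """Toy fragment comparison: from additive shares of x (node feature
    data) and y (split value), emit XOR shares of b = (x < y).
    Reconstructing x and y is for illustration only."""
    x = (xa + xb) % MOD
    y = (ya + yb) % MOD
    b = int(x < y)
    b0 = random.randrange(2)  # random mask for the Boolean sharing
    return b0, b ^ b0

xa, xb = 100, 20  # hypothetical shares of x = 120
ya, yb = 150, 50  # hypothetical shares of the split value y = 200
b0, b1 = compare_shared(xa, xb, ya, yb)
print(b0 ^ b1)  # 1, since 120 < 200
```

Each party holds one of b0, b1; neither bit alone reveals the comparison result.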
Step S204, if the child node of the first partition node is a leaf node, acquiring a child node weight fragment of the child node of the first partition node;
the child node weight fragments comprise a first child node weight fragment of a first child node (namely the left child node) of the first partition node and a second child node weight fragment of a second child node (namely the right child node) of the first partition node; the first child node weight fragment and the third child node weight fragment of the second partition node held by the second participant are fragments of the first child node weight (i.e., arithmetic fragments), the first child node and the third child node correspond to each other, and the first child node weight is used for representing the weight parameter of the first child node (i.e., the leaf node weight <wt_l>^A; the leaf node weight is in the state of a secret fragment); the second child node weight fragment and the fourth child node weight fragment of the second partition node held by the second participant are fragments of the second child node weight (i.e., arithmetic fragments), the second child node and the fourth child node correspond to each other, and the second child node weight is used for representing the weight parameter of the second child node (i.e., the leaf node weight <wt_r>^A; the leaf node weight is in the state of a secret fragment).
The first node boolean fragment and the second node boolean fragment (the first node boolean fragment may be referred to as a node boolean fragment of the first partition node, the second node boolean fragment may be referred to as a node boolean fragment of the second partition node) are commonly used for performing fragment selection on the first sub-node weight fragment, the second sub-node weight fragment, the third sub-node weight fragment and the fourth sub-node weight fragment, so as to obtain a first candidate weight fragment for the first partition node held by the first participant and a second candidate weight fragment for the second partition node held by the second participant; the first candidate weight patch and the second candidate weight patch are patches (i.e., arithmetic patches) of candidate weight vectors, which are used to characterize the weight parameters of the first child node and the second child node.
Wherein the fragment selection can be expressed as <w>^A = CMux(<b>^B, <wt_l>^A, <wt_r>^A), where <b>^B may represent the first node Boolean fragment and the second node Boolean fragment, <wt_l>^A may represent the first child node weight fragment and the third child node weight fragment, <wt_r>^A may represent the second child node weight fragment and the fourth child node weight fragment, and <w>^A may represent the first candidate weight fragment and the second candidate weight fragment.
Wherein the first candidate weight fragment is determined by the first child node weight fragment and the second child node weight fragment (i.e., it is obtained from those two fragments), and the second candidate weight fragment is determined by the third child node weight fragment and the fourth child node weight fragment (i.e., it is obtained from those two fragments). It may be understood that, if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is a matching Boolean parameter (for example, the matching Boolean parameter may be 1), the candidate weight parameter fragments corresponding to the first candidate weight fragment and the second candidate weight fragment at the node Boolean parameter are obtained by selecting the first child node weight fragment and the third child node weight fragment, respectively; optionally, if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is a non-matching Boolean parameter (for example, the non-matching Boolean parameter may be 0), the candidate weight parameter fragments corresponding to the first candidate weight fragment and the second candidate weight fragment at the node Boolean parameter are obtained by selecting the second child node weight fragment and the fourth child node weight fragment, respectively.
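The CMux selection can be sketched in the same toy model, with hypothetical leaf weights; a real CMux operates directly on the fragments without the reconstruction shown here:

```python
import random

MOD = 2**32

def cmux_shared(b0, b1, wl_shares, wr_shares):
    """Toy CMux: given an XOR-shared bit b and additive shares of wt_l and
    wt_r, output fresh additive shares of w = wt_l if b = 1 else wt_r.
    Reconstruction inside is for illustration only."""
    b = b0 ^ b1
    w = sum(wl_shares) % MOD if b == 1 else sum(wr_shares) % MOD
    r = random.randrange(MOD)
    return r, (w - r) % MOD

# Hypothetical leaf weights wt_l = 30 and wt_r = 70, each additively shared
# between the two participants.
wl_shares = (12, 18)
wr_shares = (50, 20)
wa, wb = cmux_shared(1, 0, wl_shares, wr_shares)  # b = 1 selects wt_l
print((wa + wb) % MOD)  # 30
```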
Optionally, if the child node of the first partition node is an intermediate node, the first participant may obtain the node Boolean fragment of the child node of the first partition node, and the second participant may obtain the node Boolean fragment of the child node of the second partition node. Wherein, the node Boolean fragment of the left child node of the first partition node may be <b_l>^B, and the node Boolean fragment of the right child node of the first partition node may be <b_r>^B. For the specific process of the first participant obtaining the node Boolean fragment of the child node of the first partition node, reference may be made to the above description of obtaining the node Boolean fragment of the first partition node; for the specific process of the second participant obtaining the node Boolean fragment of the child node of the second partition node, reference may be made to the above description of obtaining the node Boolean fragment of the second partition node, which will not be described herein again.
Step S205, a first node Boolean sub-fragment associated with a first sub-node and a second node Boolean sub-fragment associated with a second sub-node are obtained;
the first node boolean sub-slice and a third node boolean sub-slice associated with a third sub-node held by the second party are slices of a first node boolean sub-vector (i.e., the first node boolean sub-slice may be referred to as a node boolean sub-slice of the first sub-node, the third node boolean sub-slice may be referred to as a node boolean sub-slice of the third sub-node), and the first node boolean sub-vector is used for representing a division result of the intersection service identifier in the first sub-node, i.e., the matching boolean parameter in the first node boolean sub-vector represents that the intersection service identifier corresponding to the matching boolean parameter is divided into the first sub-node; the second node boolean sub-slice and a fourth node boolean sub-slice associated with the fourth sub-node held by the second party are slices of a second node boolean sub-vector (i.e., the second node boolean sub-slice may be referred to as a node boolean sub-slice of the second sub-node, the fourth node boolean sub-slice may be referred to as a node boolean sub-slice of the fourth sub-node), and the second node boolean sub-vector is used for characterizing a division result of the intersection service identifier at the second sub-node, i.e., the matching boolean parameter in the second node boolean sub-vector indicates that the intersection service identifier corresponding to the matching boolean parameter is divided into the second sub-node.
The first node Boolean sub-fragment and the third node Boolean sub-fragment are determined by the first node Boolean fragment, the second node Boolean fragment, a third node Boolean fragment associated with the parent node of the first partition node held by the first participant, and a fourth node Boolean fragment associated with the parent node of the second partition node held by the second participant; the second node Boolean sub-fragment and the fourth node Boolean sub-fragment are determined by performing a fragment exclusive-OR operation (i.e., XOR) on the first node Boolean sub-fragment and the third node Boolean sub-fragment.
The third node boolean slice (i.e., node boolean slice of the parent node of the first partition node) and the fourth node boolean slice (i.e., node boolean slice of the parent node of the second partition node) are slices of a second node boolean vector (i.e., boolean slice), which is used to characterize a relationship between feature data associated with the parent node split feature corresponding to the parent node and the parent node split value corresponding to the parent node. The specific process of the first participant obtaining the node boolean slice of the parent node of the first partition node may be referred to the above description of obtaining the node boolean slice of the first partition node, and the specific process of the second participant obtaining the node boolean slice of the parent node of the second partition node may be referred to the above description of obtaining the node boolean slice of the second partition node, which will not be described herein.
If the first partition node and the second partition node are left child nodes, the fragment AND operation can be expressed as <b_ll>^B = And(<b>^B, <b_l>^B), where <b>^B may represent the third node Boolean fragment and the fourth node Boolean fragment, <b_l>^B may represent the first node Boolean fragment and the second node Boolean fragment, and <b_ll>^B may represent the first node Boolean sub-fragment and the third node Boolean sub-fragment; the fragment exclusive-OR operation can be expressed as <b_lr>^B = Xor(<b_ll>^B, 1), where <b_lr>^B may represent the second node Boolean sub-fragment and the fourth node Boolean sub-fragment.
Alternatively, if the first partition node and the second partition node are right child nodes, the fragment AND operation can be expressed as <b_rl>^B = And(<b>^B, <b_r>^B), where <b>^B may represent the third node Boolean fragment and the fourth node Boolean fragment, <b_r>^B may represent the first node Boolean fragment and the second node Boolean fragment, and <b_rl>^B may represent the first node Boolean sub-fragment and the third node Boolean sub-fragment; the fragment exclusive-OR operation can be expressed as <b_rr>^B = Xor(<b_rl>^B, 1), where <b_rr>^B may represent the second node Boolean sub-fragment and the fourth node Boolean sub-fragment.
For ease of understanding, please refer to fig. 8, which is a schematic diagram of a tree prediction scenario according to an embodiment of the present application. The first prediction tree shown in fig. 8 may include a node 80a, a node 80b, a node 80c, a node 82a, a node 82b, a node 81a, and a node 81b. Wherein, node 80b may be the first partition node and node 80a may be the parent node of the first partition node; the node Boolean fragment of node 80a may be <b>^B, and the node Boolean fragment of node 80b may be <b_l>^B. Thus, the node Boolean fragment of node 82a may be <b_ll>^B = And(<b>^B, <b_l>^B) (i.e., b_ll = b·b_l), and the node Boolean fragment of node 82b may be <b_lr>^B = Xor(<b_ll>^B, 1) (i.e., b_lr = b_ll ⊕ 1).
Alternatively, node 80c may be the first partition node and node 80a may be the parent node of the first partition node; the node Boolean fragment of node 80a may be <b>^B, and the node Boolean fragment of node 80c may be <b_r>^B. Thus, the node Boolean fragment of node 81a may be <b_rl>^B = And(<b>^B, <b_r>^B) (i.e., b_rl = b·b_r), and the node Boolean fragment of node 81b may be <b_rr>^B = Xor(<b_rl>^B, 1) (i.e., b_rr = b_rl ⊕ 1).
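The AND/XOR relations behind these path indicator bits can be checked in plaintext. On Boolean fragments the XOR with a constant is local, while a secure AND would additionally need a protocol such as Beaver triples; here only the relations themselves are verified:

```python
# Plaintext check of the path indicator relations sketched in fig. 8:
# for the left child, b_ll = b AND b_l; its sibling gets b_lr = b_ll XOR 1.
def path_bits(b, b_l):
    b_ll = b & b_l   # identifier falls into the left-left grandchild
    b_lr = b_ll ^ 1  # otherwise it falls into the left-right grandchild
    return b_ll, b_lr

print(path_bits(1, 1))  # (1, 0)
print(path_bits(1, 0))  # (0, 1)
```

Exactly one of the two sibling bits is 1 for any input, so each identifier is routed to a single grandchild.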
Step S206, obtaining a first target weight slice associated with the first child node;
wherein the first target weight slice associated with the first child node and the second target weight slice associated with the third child node held by the second participant are slices (i.e., arithmetic slices) of a target weight vector (i.e., a first target weight vector) used to characterize weight parameters of the service identification of the first child node; the first target weight slice associated with the first sub-node and the second target weight slice associated with the third sub-node are obtained by jointly selecting weights of the first candidate weight slice and the second candidate weight slice by the first node Boolean sub-slice and the third node Boolean sub-slice.
Similarly, the first participant may obtain a first target weight slice associated with the second child node. Wherein the first target weight slice associated with the second child node and the second target weight slice associated with the fourth child node held by the second participant are slices (i.e., arithmetic slices) of a target weight vector (i.e., a second target weight vector) used to characterize weight parameters of the service identification of the second child node; the first target weight slice associated with the second sub-node and the second target weight slice associated with the fourth sub-node are obtained by jointly selecting weights of the first candidate weight slice and the second candidate weight slice by the second node Boolean sub-slice and the fourth node Boolean sub-slice.
Step S207, summing the first target weight slices associated with the leaf nodes of the first decision tree to obtain a first result slice corresponding to the intersection service identifier in the first decision tree.
The first result fragment and the second result fragment, which corresponds to the intersection service identifier in the second decision tree, are fragments (i.e., arithmetic fragments) of the sub-predicted value of the intersection service identifier, and the second result fragment is obtained by the second participant summing the second target weight fragments associated with the leaf nodes of the second decision tree. Wherein, the first result fragment and the second result fragment may be the first output fragment and the second output fragment of the t-th tree, i.e., arithmetic fragments of the output of the t-th tree.
The number of decision trees of the first participant and the number of decision trees of the second participant are K, wherein K can be a positive integer; the sub-predicted value represents the common output of the first decision tree and the second decision tree, and the predicted value represents the common output of the K decision trees of the first participant and the K decision trees of the second participant; the K decision trees of the first party comprise a first decision tree and the K decision trees of the second party comprise a second decision tree.
Wherein the leaf nodes of the first decision tree may include a first child node and a second child node, and the first target weight patches associated with the leaf nodes of the first decision tree may include a first target weight patch associated with the first child node and a first target weight patch associated with the second child node; similarly, the leaf nodes of the second decision tree may include a third child node and a fourth child node, and the second target weight slices associated with the leaf nodes of the second decision tree may include a second target weight slice associated with the third child node and a second target weight slice associated with the fourth child node.
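The leaf-wise and tree-wise summations can be sketched with hypothetical share values; as with step S202, each party only sums locally:

```python
MOD = 2**32

# Hypothetical first target weight fragments at the two leaves of one
# decision tree; the second participant holds the complements.
first_leaf_fragments  = [11, 29]
second_leaf_fragments = [9, MOD - 29]  # leaves reconstruct to 20 and 0

# Each party sums locally; the result fragments are additive shares of
# the sub-predicted value output by this tree.
first_result  = sum(first_leaf_fragments) % MOD
second_result = sum(second_leaf_fragments) % MOD
sub_prediction = (first_result + second_result) % MOD
print(sub_prediction)  # 20

# Across K trees the predicted value is the sum of the sub-predicted values.
per_tree = [20, 5, 0]  # hypothetical reconstructed sub-predictions, K = 3
print(sum(per_tree) % MOD)  # 25
```

Only the leaf the identifier is routed to contributes a nonzero weight, so the sum over all leaves equals that leaf's weight.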
Therefore, the embodiment of the application can acquire the node Boolean fragments corresponding to each node in the first decision tree and the second decision tree, and determine the first target weight fragments associated with the leaf nodes in the first decision tree and the second target weight fragments associated with the leaf nodes in the second decision tree based on the node Boolean fragments corresponding to each node and the sub-node weight fragments corresponding to each leaf node respectively, so that the first result fragments corresponding to the first decision tree and the second result fragments corresponding to the second decision tree are determined based on the first target weight fragments associated with the leaf nodes in the first decision tree and the second target weight fragments associated with the leaf nodes in the second decision tree. The first result fragment and the second result fragment can be used for determining a sub-predicted value commonly output by the first decision tree and the second decision tree, so that the sub-predicted value of the intersection service identifier can be accurately obtained.
Further, referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application, where the data processing apparatus 1 operates on a first participant participating in longitudinal federal learning, the data processing apparatus 1 may include: the device comprises a data acquisition module 11, a data input module 12 and a characteristic characterization module 13; further, the data processing apparatus 1 may further include: the device comprises a first acquisition module 14, a second acquisition module 15, a third acquisition module 16, a fourth acquisition module 17, a fifth acquisition module 18, a summation processing module 19 and a predicted value determination module 20;
The data acquisition module 11 is configured to acquire first feature data of a first service identifier of a first participant and a first data fragment of a second service identifier of a second participant participating in longitudinal federal learning; the first data fragment and the second data fragment of the second service identifier held by the second participant are fragments of second characteristic data of the second service identifier;
wherein the data acquisition module 11 comprises: a hash mapping unit 111, a first acquisition unit 112, a second acquisition unit 113;
the hash mapping unit 111 is configured to perform cuckoo hash mapping on a first service identifier of a first participant to obtain a first hash table corresponding to the first service identifier;
A first obtaining unit 112, configured to obtain a first Boolean intersection fragment associated with the first hash table and the second hash table; the first Boolean intersection fragment and a second Boolean intersection fragment held by the second participant are fragments of the Boolean intersection vector; the first hash table and the second hash table are used for performing hash table matching through an oblivious programmable pseudo-random function (OPPRF) to generate the first Boolean intersection fragment and the second Boolean intersection fragment; the second hash table is obtained by the second participant participating in longitudinal federal learning performing hash mapping on a second service identifier of the second participant; the Boolean intersection vector is used for indicating the intersection state of the first service identifier with respect to the second service identifier;
A second obtaining unit 113, configured to obtain first original data of a first service identifier and a first original fragment of a second service identifier; the first original fragment and a second original fragment held by the second participant are fragments of second original data; the first original data is used for representing characteristic data of the first service identifier under the first service feature; the second original data is used for representing characteristic data of the second service identifier under the second service feature; the first Boolean intersection fragment and the second Boolean intersection fragment are jointly used for performing data screening on the first original data to obtain first feature data of the first service identifier held by the first participant; the first Boolean intersection fragment and the second Boolean intersection fragment are jointly used for performing data screening on the first original fragment and the second original fragment to obtain a first data fragment of the second service identifier held by the first participant and a second data fragment of the second service identifier held by the second participant.
The specific implementation manner of the hash mapping unit 111, the first obtaining unit 112, and the second obtaining unit 113 may be referred to the description of step S1011 to step S1013 in the embodiment corresponding to fig. 5, which will not be described herein.
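The cuckoo hash mapping performed by the hash mapping unit can be sketched as follows. This is a minimal generic variant with illustrative hash functions and bin count; PSI implementations fix the number of hash functions and bins and typically add a stash:

```python
import random

SIZE = 8  # number of bins (illustrative; real parameters are larger)

# Two deterministic toy hash functions standing in for the real ones.
hash_fns = [
    lambda x: sum(map(ord, x)) % SIZE,
    lambda x: (sum(map(ord, x)) // 7 + 3) % SIZE,
]

def cuckoo_insert(table, item, max_evictions=100):
    """Place item in one of its candidate bins, evicting the current
    occupant into one of that occupant's own candidate bins if needed."""
    for _ in range(max_evictions):
        for h in hash_fns:
            pos = h(item)
            if table[pos] is None:
                table[pos] = item
                return True
        # all candidate bins occupied: evict one occupant and retry
        pos = random.choice([h(item) for h in hash_fns])
        table[pos], item = item, table[pos]
    return False  # a real PSI implementation would move item to a stash

table = [None] * SIZE
ids = ["id_a", "id_b", "id_c"]  # hypothetical first service identifiers
placed = all(cuckoo_insert(table, i) for i in ids)
print(placed, sorted(x for x in table if x is not None))
```

Each identifier lands in exactly one bin, which is what lets the OPPRF-based matching compare one bin of the first hash table against the candidate entries of the second.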
A data input module 12 for inputting the first characteristic data and the first data slice into a first decision tree of the first participant; the first decision tree comprises a first split feature fragment corresponding to the first partition node; the first partition node corresponds to a second partition node of a second decision tree of the second party; the second split feature fragments corresponding to the first split feature fragments and the second partition nodes are fragments of feature identifiers of split features commonly corresponding to the first partition nodes and the second partition nodes;
a feature characterization module 13, configured to obtain a first feature boolean slice associated with the split feature; the first feature boolean shard and a second feature boolean shard associated with the split feature held by the second participant are shards of a feature boolean vector; the first characteristic Boolean fragment and the second characteristic Boolean fragment are obtained by vector processing of the first split characteristic fragment and the second split characteristic fragment; the feature Boolean vector is used for representing split features in a first service feature of a first service identifier and a second service feature of a second service identifier;
wherein the total number of the first service features and the second service features is Q, and Q is an integer greater than 1; the vector dimension of the feature Boolean vector is equal to Q, and the feature Boolean vector comprises Q feature Boolean parameters; each of the Q feature Boolean parameters is used to characterize one of the Q service features; in the feature Boolean vector, the feature Boolean parameter corresponding to the feature identifier is a matched Boolean parameter, and the feature Boolean parameters not corresponding to the feature identifier are unmatched Boolean parameters.
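As a concrete illustration of the vector just described, the following sketch builds a feature Boolean vector with Q = 5 service features and XOR-shares it into the two feature Boolean fragments (all values are invented for the example):

```python
import numpy as np

Q = 5                 # total number of service features across both participants
feature_id = 2        # hypothetical feature identifier of the split feature

# Feature Boolean vector: the matched Boolean parameter (1) marks the split
# feature; every other entry is an unmatched Boolean parameter (0).
feature_bool_vector = np.zeros(Q, dtype=np.uint8)
feature_bool_vector[feature_id] = 1

rng = np.random.default_rng(1)
first_feature_bool = rng.integers(0, 2, size=Q, dtype=np.uint8)   # first participant
second_feature_bool = feature_bool_vector ^ first_feature_bool    # second participant

# The two fragments jointly encode the vector; each alone reveals nothing.
assert np.array_equal(first_feature_bool ^ second_feature_bool, feature_bool_vector)
assert feature_bool_vector.sum() == 1   # exactly one matched parameter
```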
The first feature Boolean fragment and the second feature Boolean fragment are jointly used for acquiring node feature data associated with the split feature from the first feature data, the first data fragment and the second data fragment; the node feature data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier; the predicted value is used for determining a service processing result of the intersection service identifier.
The node feature data comprises a first node fragment held by the first participant and a second node fragment held by the second participant; if the split feature belongs to the first service feature, the first node fragment and the second node fragment are determined by the first feature data; if the split feature belongs to the second service feature, the first node fragment is acquired from the first data fragment, and the second node fragment is acquired from the second data fragment; the first decision tree further comprises a first split value fragment corresponding to the first partition node, and the second decision tree further comprises a second split value fragment corresponding to the second partition node; the first split value fragment and the second split value fragment are fragments of a split value jointly corresponding to the first partition node and the second partition node;
Optionally, the first obtaining module 14 is configured to obtain a first node boolean partition associated with the first partition node; the first node Boolean shard and a second node Boolean shard associated with a second partition node held by a second participant are shards of a first node Boolean vector; the first node Boolean vector is used for representing the relation between node characteristic data and split values; the first node boolean fragment and the second node boolean fragment are obtained by comparing the first node fragment and the second node fragment together with the first split value fragment and the second split value fragment.
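The comparison that yields the node Boolean fragments can be sketched as follows. This is a plaintext illustration with invented values; a real deployment would compute the bit with a secure comparison protocol directly over the fragments, and the convention that a value below the split value routes to the first child is an assumption of the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical plaintext values (in the scheme both remain fragmented).
node_feature_value = 4.2   # node feature data of one intersection identifier
split_value = 5.0          # split value of the partition node

# Node Boolean parameter: 1 when the sample is routed to the first child.
node_bool = np.uint8(node_feature_value < split_value)

# XOR sharing of the comparison result between the two participants.
first_node_bool = rng.integers(0, 2, dtype=np.uint8)
second_node_bool = node_bool ^ first_node_bool

assert first_node_bool ^ second_node_bool == node_bool
```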
Optionally, the second obtaining module 15 is configured to obtain first feature fragments respectively corresponding to the Q service features; the Q service features include a service feature V_d, where d is a non-negative integer less than Q; the first feature fragment corresponding to the service feature V_d and the second feature fragment, held by the second participant, corresponding to the service feature V_d are both obtained by performing data selection on the feature data of the service feature V_d by using the first feature Boolean parameter of the service feature V_d in the first feature Boolean fragment and the second feature Boolean parameter of the service feature V_d in the second feature Boolean fragment; if the split feature belongs to the first service feature, the feature data of the service feature V_d is obtained from the first feature data; if the split feature belongs to the second service feature, the feature data of the service feature V_d is obtained from the first data fragment and the second data fragment;
the second obtaining module 15 is configured to sum the Q first feature fragments to obtain a first node fragment; the first node shard and the second node shard are shards of node characteristic data; the second node fragments are obtained by summing second characteristic fragments corresponding to the Q service characteristics respectively by a second participant; the node characteristic data is used for representing characteristic data of the intersection service identification under the split characteristic.
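The summation just described can be illustrated with additive fragments. In this hypothetical sketch, the per-feature fragments are additive shares of the split feature's data selected by a one-hot vector (a plaintext stand-in for the feature Boolean vector), so each participant's local sum is one fragment of the node feature data:

```python
import numpy as np

rng = np.random.default_rng(3)

Q = 4
# Feature data of one intersection identifier under each of the Q features.
feature_values = np.array([3.0, 7.0, 1.5, 9.0])
split_feature_id = 1
one_hot = np.eye(Q)[split_feature_id]   # plaintext stand-in for the feature
                                        # Boolean vector

# First feature fragments: additive shares of one_hot[d] * feature_values[d].
first_feature_fragments = rng.normal(size=Q)
second_feature_fragments = one_hot * feature_values - first_feature_fragments

# Each participant sums its own Q fragments to obtain its node fragment.
first_node_fragment = first_feature_fragments.sum()
second_node_fragment = second_feature_fragments.sum()

# Jointly, the node fragments encode the selected feature value.
assert np.isclose(first_node_fragment + second_node_fragment,
                  feature_values[split_feature_id])
```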
Optionally, the third obtaining module 16 is configured to obtain a child node weight slice of the child node of the first partition node if the child node of the first partition node is a leaf node; the child node weight slices comprise a first child node weight slice of a first child node of the first partition node and a second child node weight slice of a second child node of the first partition node; the first sub-node weight slice and a third sub-node weight slice of a third sub-node of the second partition node held by the second participant are slices of the first sub-node weight; the first child node weight is used for representing a weight parameter of the first child node; the second sub-node weight slice and the fourth sub-node weight slice of the fourth sub-node of the second partition node held by the second participant are slices of the second sub-node weight; the second child node weight is used for representing a weight parameter of the second child node;
The first node Boolean fragment and the second node Boolean fragment are commonly used for selecting the first sub-node weight fragment, the second sub-node weight fragment, the third sub-node weight fragment and the fourth sub-node weight fragment, so as to obtain a first candidate weight fragment aiming at the first partition node held by the first participant and a second candidate weight fragment aiming at the second partition node held by the second participant; the first candidate weight patch and the second candidate weight patch are patches of candidate weight vectors; the candidate weight vector is used to characterize the weight parameters of the first child node and the second child node.
The first candidate weight fragment is determined by the first sub-node weight fragment and the second sub-node weight fragment, and the second candidate weight fragment is determined by the third sub-node weight fragment and the fourth sub-node weight fragment; if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is the matched Boolean parameter, the candidate weight parameter fragments, corresponding to the node Boolean parameter, of the first candidate weight fragment and the second candidate weight fragment are obtained by selecting from the first sub-node weight fragment and the third sub-node weight fragment; if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is the unmatched Boolean parameter, the candidate weight parameter fragments, corresponding to the node Boolean parameter, of the first candidate weight fragment and the second candidate weight fragment are obtained by selecting from the second sub-node weight fragment and the fourth sub-node weight fragment.
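The selection rule above amounts to a multiplexer driven by the node Boolean parameter. A minimal plaintext sketch (the function name and weight values are hypothetical; the scheme performs the same selection on fragments so that neither participant learns the chosen branch):

```python
def select_candidate_weight(node_bool, first_child_weight, second_child_weight):
    """Matched Boolean parameter (1) selects the first child's weight;
    unmatched Boolean parameter (0) selects the second child's weight."""
    return node_bool * first_child_weight + (1 - node_bool) * second_child_weight

# Hypothetical child-node weights of one partition node.
assert select_candidate_weight(1, 0.7, -0.3) == 0.7
assert select_candidate_weight(0, 0.7, -0.3) == -0.3
```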
Optionally, a fourth obtaining module 17 is configured to obtain a first node boolean sub-slice associated with the first sub-node and a second node boolean sub-slice associated with the second sub-node; the first node Boolean sub-segment and a third node Boolean sub-segment associated with a third sub-node held by the second party are segments of a first node Boolean sub-vector; the first node Boolean subvector is used for representing the dividing result of the intersection service identifier at the first subnode; the second node boolean sub-segment and a fourth node boolean sub-segment associated with a fourth sub-node held by the second party are segments of a second node boolean sub-vector; the second node Boolean subvector is used for representing the dividing result of the intersection service identifier in the second subnode;
the first node Boolean sub-slice and the third node Boolean sub-slice are determined by the first node Boolean slice, the second node Boolean slice, the third node Boolean slice which is held by the first participant and is associated with the father node of the first partition node, and the fourth node Boolean slice which is held by the second participant and is associated with the father node of the second partition node; the second node Boolean sub-segment and the fourth node Boolean sub-segment are determined by the first node Boolean sub-segment and the third node Boolean sub-segment through segment exclusive OR operation; the third node Boolean segment and the fourth node Boolean segment are segments of the second node Boolean vector; the second node boolean vector is used to characterize a relationship between feature data associated with a parent node split feature corresponding to the parent node and a parent node split value corresponding to the parent node.
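The fragment exclusive OR mentioned above relies on a convenient property of XOR sharing: one participant can complement a shared bit locally by flipping only its own fragment, which is how the Boolean sub-fragments of the sibling child can be derived without interaction. A hypothetical two-bit sketch:

```python
# XOR fragments of a hypothetical division bit b for the first child.
b = 1
first_fragment, second_fragment = 0, 1    # first_fragment ^ second_fragment == b
assert first_fragment ^ second_fragment == b

# Only the first participant flips its fragment; no communication is needed.
sibling_first_fragment = first_fragment ^ 1
assert sibling_first_fragment ^ second_fragment == 1 - b   # complement of b
```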
A fifth obtaining module 18, configured to obtain a first target weight slice associated with the first child node; the first target weight segment associated with the first child node and the second target weight segment associated with the third child node held by the second participant are segments of a target weight vector; the target weight vector is used for representing weight parameters of the service identification of the first child node; the first target weight slice associated with the first sub-node and the second target weight slice associated with the third sub-node are obtained by jointly carrying out weight selection on the first candidate weight slice and the second candidate weight slice by the first node Boolean sub-slice and the third node Boolean sub-slice;
the summation processing module 19 is configured to perform summation processing on a first target weight slice associated with a leaf node of the first decision tree, so as to obtain a first result slice corresponding to the intersection service identifier in the first decision tree; the first result fragments and the second result fragments corresponding to the intersection service identifiers in the second decision tree are fragments of sub-predicted values of the intersection service identifiers; the number of decision trees of the first participant and the number of decision trees of the second participant are K, and K is a positive integer; the sub-predicted value represents the common output of the first decision tree and the second decision tree, and the predicted value represents the common output of the K decision trees of the first participant and the K decision trees of the second participant; the K decision trees of the first participant comprise a first decision tree, and the K decision trees of the second participant comprise a second decision tree; the second resulting patches are obtained by the second participant summing second target weight patches associated with leaf nodes of the second decision tree.
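The leaf summation can be pictured as follows. Because the node Boolean sub-vectors mark exactly the one leaf each intersection identifier reaches, masking every leaf weight by its path indicator and then summing over the leaves keeps only the reached leaf's weight (all numbers are invented for the example):

```python
import numpy as np

# Hypothetical one-hot path indicators over a tree's 4 leaves: the sample
# reaches exactly one leaf, so exactly one indicator is 1.
path_indicators = np.array([0, 0, 1, 0])
leaf_weights = np.array([0.5, -0.2, 0.9, 0.1])

# Target weight per leaf: the leaf weight masked by the path indicator.
target_weights = path_indicators * leaf_weights

# Summing over all leaves yields the tree's sub-predicted value, since
# only the reached leaf contributes a non-zero term.
sub_predicted_value = target_weights.sum()
assert np.isclose(sub_predicted_value, 0.9)
```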
Optionally, the number of decision trees of the first participant and the number of decision trees of the second participant are K, and K is a positive integer;
the predicted value determining module 20 is configured to obtain first result slices corresponding to K-1 decision trees of the intersection service identifier in the first participant, respectively; the K-1 decision trees of the first participant are the decision trees except the first decision tree in the K decision trees of the first participant;
the predicted value determining module 20 is configured to sum the first result slices corresponding to the K-1 decision trees of the first participant and the first result slices corresponding to the first decision trees respectively, so as to generate first output slices corresponding to the intersection service identifier; the first output fragment and the second output fragment corresponding to the intersection service identifier held by the second participant are fragments of the output vector; the second output fragments are generated by the second party by summing the second result fragments corresponding to the K-1 decision trees of the second party and the second result fragments corresponding to the second decision trees respectively; k-1 decision trees of the second participant are decision trees except the second decision tree in the K decision trees of the second participant; the first result fragment corresponding to the first decision tree and the second result fragment corresponding to the second decision tree are determined by the node characteristic data;
Wherein the output vector is used to characterize the predicted value of the intersection business identification in longitudinal federal learning.
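Finally, the aggregation over the K trees is a plain sum of additive result fragments, performed locally by each participant; only the two output fragments together reveal the predicted value. A hypothetical sketch with K = 3:

```python
import numpy as np

rng = np.random.default_rng(4)

K = 3
# Hypothetical sub-predicted values of one intersection identifier, one per
# pair of corresponding first and second decision trees.
sub_predicted_values = np.array([0.2, -0.1, 0.4])

# Additive result fragments: one per tree and per participant.
first_result_fragments = rng.normal(size=K)
second_result_fragments = sub_predicted_values - first_result_fragments

# Each participant sums its K result fragments into its output fragment.
first_output_fragment = first_result_fragments.sum()
second_output_fragment = second_result_fragments.sum()

# The two output fragments jointly encode the predicted value.
assert np.isclose(first_output_fragment + second_output_fragment,
                  sub_predicted_values.sum())
```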
The specific implementation manners of the data obtaining module 11, the data input module 12, the feature characterization module 13 and the predicted value determining module 20 may refer to the descriptions of the steps S101-S103 in the embodiment corresponding to fig. 3 and the steps S1011-S1013 in the embodiment corresponding to fig. 5, which will not be repeated here. The specific implementation manners of the first obtaining module 14, the second obtaining module 15, the third obtaining module 16, the fourth obtaining module 17, the fifth obtaining module 18 and the summation processing module 19 may be referred to the description of step S201 to step S207 in the embodiment corresponding to fig. 7, and will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device may be a terminal device or a server. As shown in fig. 10, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; in addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection communication between these components. In some embodiments, the user interface 1003 may include a display (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), for example, at least one disk memory. Optionally, the memory 1005 may alternatively be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 10, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function, the user interface 1003 is mainly used for providing an input interface for a user, and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement:
acquiring first characteristic data of a first service identifier of a first participant and first data fragments of a second service identifier of a second participant participating in longitudinal federal learning; the first data fragment and the second data fragment of the second service identifier held by the second participant are fragments of second characteristic data of the second service identifier;
inputting the first characteristic data and the first data fragment into a first decision tree of the first participant; the first decision tree comprises a first split feature fragment corresponding to a first partition node; the first partition node corresponds to a second partition node of a second decision tree of the second participant; the first split feature fragment and a second split feature fragment corresponding to the second partition node are fragments of a feature identifier of a split feature jointly corresponding to the first partition node and the second partition node;
acquiring a first feature Boolean fragment associated with a split feature; the first feature boolean shard and a second feature boolean shard associated with the split feature held by the second participant are shards of a feature boolean vector; the first characteristic Boolean fragment and the second characteristic Boolean fragment are obtained by vector processing of the first split characteristic fragment and the second split characteristic fragment; the feature Boolean vector is used for representing split features in a first service feature of a first service identifier and a second service feature of a second service identifier;
The first feature Boolean fragment and the second feature Boolean fragment are jointly used for acquiring node feature data associated with the split feature from the first feature data, the first data fragment and the second data fragment; the node feature data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier; the predicted value is used for determining a service processing result of the intersection service identifier.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the description of the data processing method in the embodiments corresponding to fig. 3, 5 and 7, and may also perform the description of the data processing apparatus 1 in the embodiments corresponding to fig. 9, which are not described herein. In addition, the description of the beneficial effects of the same method is omitted.
Furthermore, it should be noted here that: the embodiment of the present application further provides a computer readable storage medium, in which the computer program executed by the aforementioned data processing apparatus 1 is stored, and when the processor executes the computer program, the description of the data processing method in the embodiment corresponding to fig. 3, 5 and 7 can be executed, and therefore, the description will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
In addition, it should be noted that: embodiments of the present application also provide a computer program product, which may include a computer program, which may be stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer readable storage medium, and the processor may execute the computer program, so that the computer device performs the description of the data processing method in the embodiments corresponding to fig. 3, 5 and 7, and thus, a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer program product according to the present application, reference is made to the description of the method embodiments of the present application.
Those skilled in the art will appreciate that all or part of the procedures of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed, the procedures of the above method embodiments may be performed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (14)

1. A data processing method, the method performed by a first participant engaged in longitudinal federal learning, comprising:
acquiring first characteristic data of a first service identifier of the first participant and first data fragments of a second service identifier of a second participant participating in longitudinal federal learning; the first data fragment and the second data fragment of the second service identifier held by the second participant are fragments of second characteristic data of the second service identifier;
inputting the first characteristic data and the first data slice into a first decision tree of the first participant; the first decision tree comprises a first split feature fragment corresponding to a first partition node; the first partition node corresponds to a second partition node of a second decision tree of the second participant; the first split feature fragment and a second split feature fragment corresponding to the second partition node are fragments of a feature identifier of a split feature jointly corresponding to the first partition node and the second partition node;
Acquiring a first feature Boolean fragment associated with the split feature; the first feature boolean shard and a second feature boolean shard associated with the split feature held by the second participant are shards of a feature boolean vector; the first characteristic Boolean fragments and the second characteristic Boolean fragments are obtained by vector processing of the first split characteristic fragments and the second split characteristic fragments; the feature boolean vector is used for characterizing the split feature in a first service feature of the first service identifier and a second service feature of the second service identifier;
the first feature Boolean fragment and the second feature Boolean fragment are commonly used for acquiring node feature data associated with the split feature from the first feature data, the first data fragment and the second data fragment; the node characteristic data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier; the predicted value is used for determining a service processing result of the intersection service identifier.
2. The method of claim 1, wherein the obtaining the first characteristic data of the first service identification of the first party and the first data shard of the second service identification of the second party participating in the vertical federal learning comprises:
Performing cuckoo hash mapping on a first service identifier of the first participant to obtain a first hash table corresponding to the first service identifier;
acquiring a first Boolean intersection fragment associated with the first hash table and the second hash table; the first boolean intersection slice and a second boolean intersection slice held by the second participant are slices of boolean intersection vectors; the first hash table and the second hash table are used for carrying out hash table matching through an inadvertently programmable pseudo-random function to generate the first Boolean intersection fragment and the second Boolean intersection fragment; the second hash table is obtained by hash mapping of a second service identifier of a second participant participating in longitudinal federal learning by the second participant; the boolean intersection vector is used for indicating the intersection state of the first service identifier for the second service identifier;
acquiring first original data of the first service identifier and first original fragments of the second service identifier; the first original shard and a second original shard held by the second participant are shards of second original data; the first original data is used for representing characteristic data of the first service identifier under a first service characteristic; the second original data is used for representing characteristic data of the second service identifier under a second service characteristic; the first Boolean intersection fragment and the second Boolean intersection fragment are jointly used for carrying out data screening on the first original data to obtain first characteristic data of the first service identifier held by the first participant; the first boolean intersection slice and the second boolean intersection slice are used together for data screening of the first original slice and the second original slice, so as to obtain a first data slice of the second service identifier held by the first participant and a second data slice of the second service identifier held by the second participant.
3. The method of claim 1, wherein the total number of the first service features and the second service features is Q, the Q being an integer greater than 1; the vector dimension of the feature Boolean vector is equal to the Q, and the feature Boolean vector comprises Q feature Boolean parameters; each of the Q feature Boolean parameters is used to characterize one of the Q service features; in the feature Boolean vector, the feature Boolean parameter corresponding to the feature identifier is a matched Boolean parameter, and the feature Boolean parameters not corresponding to the feature identifier are unmatched Boolean parameters.
4. The method according to claim 3, wherein the node feature data comprises a first node fragment held by the first participant and a second node fragment held by the second participant; if the split feature belongs to the first service feature, the first node fragment and the second node fragment are determined by the first feature data; if the split feature belongs to the second service feature, the first node fragment is acquired from the first data fragment, and the second node fragment is acquired from the second data fragment; the first decision tree further comprises a first split value fragment corresponding to the first partition node, and the second decision tree further comprises a second split value fragment corresponding to the second partition node; the first split value fragment and the second split value fragment are fragments of a split value jointly corresponding to the first partition node and the second partition node;
The method further comprises the steps of:
acquiring a first node Boolean fragment associated with the first partition node; the first node boolean shard and a second node boolean shard associated with the second partition node held by the second party are shards of a first node boolean vector; the first node boolean vector is used for characterizing the relation between the node characteristic data and the split value; the first node Boolean fragment and the second node Boolean fragment are obtained by comparing the first node fragment and the second node fragment with the first split value fragment and the second split value fragment together.
5. The method according to claim 4, wherein the method further comprises:
acquiring first feature fragments respectively corresponding to the Q service features; the Q service features include a service feature V_d, where the d is a non-negative integer less than the Q; the first feature fragment corresponding to the service feature V_d and the second feature fragment, held by the second participant, corresponding to the service feature V_d are both obtained by performing data selection on the feature data of the service feature V_d by using a first feature Boolean parameter of the service feature V_d in the first feature Boolean fragment and a second feature Boolean parameter of the service feature V_d in the second feature Boolean fragment; if the split feature belongs to the first service feature, the feature data of the service feature V_d is obtained from the first feature data; if the split feature belongs to the second service feature, the feature data of the service feature V_d is obtained from the first data slice and the second data slice;
summing the Q first characteristic fragments to obtain the first node fragments; the first node shard and the second node shard are shards of the node characteristic data; the second node fragments are obtained by summing second characteristic fragments corresponding to the Q service characteristics respectively by the second participator; the node characteristic data is used for representing characteristic data of the intersection service identification under the split characteristic.
6. The method according to claim 4, wherein the method further comprises:
if the child node of the first partition node is a leaf node, acquiring a child node weight fragment of the child node of the first partition node; the child node weight shards comprise a first child node weight shard of a first child node of the first partition node and a second child node weight shard of a second child node of the first partition node; the first sub-node weight slice and a third sub-node weight slice of a third sub-node of the second partition node held by the second participant are slices of the first sub-node weight; the first sub-node weight is used for representing a weight parameter of the first sub-node; the second sub-node weight slice and the fourth sub-node weight slice of the fourth sub-node of the second partition node held by the second participant are slices of the second sub-node weight; the second sub-node weight is used for representing a weight parameter of the second sub-node;
The first node boolean fragment and the second node boolean fragment are used together for selecting the first sub-node weight fragment, the second sub-node weight fragment, the third sub-node weight fragment and the fourth sub-node weight fragment, so as to obtain a first candidate weight fragment for the first partition node held by the first participant and a second candidate weight fragment for the second partition node held by the second participant; the first candidate weight patch and the second candidate weight patch are patches of candidate weight vectors; the candidate weight vector is used to characterize weight parameters of the first child node and the second child node.
7. The method of claim 6, wherein the first candidate weight fragment is determined by the first sub-node weight slice and the second sub-node weight slice, and the second candidate weight fragment is determined by the third sub-node weight slice and the fourth sub-node weight slice; if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is the matched Boolean parameter, the candidate weight parameter fragments, corresponding to the node Boolean parameter, of the first candidate weight fragment and the second candidate weight fragment are obtained by selecting from the first sub-node weight slice and the third sub-node weight slice; if the node Boolean parameter indicated by the first node Boolean fragment and the second node Boolean fragment is the unmatched Boolean parameter, the candidate weight parameter fragments, corresponding to the node Boolean parameter, of the first candidate weight fragment and the second candidate weight fragment are obtained by selecting from the second sub-node weight slice and the fourth sub-node weight slice.
8. The method of claim 6, wherein the method further comprises:
acquiring a first node Boolean sub-fragment associated with the first child node and a second node Boolean sub-fragment associated with the second child node; the first node Boolean sub-fragment and a third node Boolean sub-fragment associated with the third child node held by the second participant are fragments of a first node Boolean sub-vector; the first node Boolean sub-vector is used for representing the division result of the intersection service identifier at the first child node; the second node Boolean sub-fragment and a fourth node Boolean sub-fragment associated with the fourth child node held by the second participant are fragments of a second node Boolean sub-vector; the second node Boolean sub-vector is used for representing the division result of the intersection service identifier at the second child node;
acquiring a first target weight fragment associated with the first child node; the first target weight fragment associated with the first child node and a second target weight fragment associated with the third child node held by the second participant are fragments of a target weight vector; the target weight vector is used for representing a weight parameter of the service identifier at the first child node; the first target weight fragment associated with the first child node and the second target weight fragment associated with the third child node are obtained by jointly performing weight selection on the first candidate weight fragment and the second candidate weight fragment by means of the first node Boolean sub-fragment and the third node Boolean sub-fragment;
summing the first target weight fragments associated with the leaf nodes of the first decision tree to obtain a first result fragment corresponding to the intersection service identifier in the first decision tree; the first result fragment and a second result fragment corresponding to the intersection service identifier in the second decision tree are fragments of a sub-predicted value of the intersection service identifier; the number of decision trees of the first participant and the number of decision trees of the second participant are each K, K being a positive integer; the sub-predicted value represents a common output of the first decision tree and the second decision tree, and the predicted value represents a common output of the K decision trees of the first participant and the K decision trees of the second participant; the K decision trees of the first participant comprise the first decision tree, and the K decision trees of the second participant comprise the second decision tree; the second result fragment is obtained by the second participant summing second target weight fragments associated with the leaf nodes of the second decision tree.
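The leaf-summation step can be illustrated with additive shares: the per-leaf target weights are already masked by the path indicators, so summing each participant's fragments locally and then combining the two sums reconstructs the tree's sub-predicted value. This sketch uses hypothetical leaf values and a toy modulus.

```python
import secrets

P = 2**61 - 1  # illustrative modulus

# Masked leaf weights: the path indicators zero out every leaf except
# the one the record actually reaches (hypothetical values).
masked_leaf_weights = [0, 9, 0, 0]

shares_a, shares_b = [], []
for w in masked_leaf_weights:
    r = secrets.randbelow(P)
    shares_a.append(r)
    shares_b.append((w - r) % P)

# Each participant sums its own leaf fragments locally; neither sum
# reveals anything on its own.
result_a = sum(shares_a) % P
result_b = sum(shares_b) % P

# Reconstruction yields the tree's sub-predicted value.
assert (result_a + result_b) % P == 9
```

The local sums require no communication, which is why per-tree aggregation is cheap relative to the per-node comparisons.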
9. The method of claim 8, wherein the first node Boolean sub-fragment and the third node Boolean sub-fragment are each determined by the first node Boolean fragment, the second node Boolean fragment, a third node Boolean fragment associated with a parent node of the first partition node held by the first participant, and a fourth node Boolean fragment associated with a parent node of the second partition node held by the second participant; the second node Boolean sub-fragment and the fourth node Boolean sub-fragment are determined by a fragment exclusive-OR operation on the first node Boolean sub-fragment and the third node Boolean sub-fragment; the third node Boolean fragment and the fourth node Boolean fragment are fragments of a second node Boolean vector; the second node Boolean vector is used for representing a relationship between feature data associated with a parent node split feature corresponding to the parent node and a parent node split value corresponding to the parent node.
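A useful property behind the exclusive-OR step in claim 9: with XOR sharing, complementing a shared bit (e.g. deriving the second child's indicator from the first child's) is a purely local operation. The sketch below assumes a single shared bit and two participants.

```python
import secrets

# XOR-shared path indicator at the first child: b = b0 ^ b1
b = 1
b0 = secrets.randbelow(2)
b1 = b ^ b0

# Complementing an XOR-shared bit is local: exactly one participant
# flips its fragment, no communication needed.
nb0, nb1 = b0 ^ 1, b1

assert (nb0 ^ nb1) == 1 - b   # reconstructs NOT b
```

This is a simplification: combining the complement with the parent's path indicator (an AND of shared bits, as the claim describes) does require a joint protocol round.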
10. The method of claim 1, wherein the number of decision trees of the first participant and the number of decision trees of the second participant are each K, K being a positive integer;
the method further comprises the steps of:
acquiring first result fragments respectively corresponding to the intersection service identifier in K-1 decision trees of the first participant; the K-1 decision trees of the first participant are the decision trees other than the first decision tree among the K decision trees of the first participant;
summing the first result fragments corresponding to the K-1 decision trees of the first participant and the first result fragment corresponding to the first decision tree to generate a first output fragment corresponding to the intersection service identifier; the first output fragment and a second output fragment corresponding to the intersection service identifier held by the second participant are fragments of an output vector; the second output fragment is generated by the second participant summing the second result fragments corresponding to the K-1 decision trees of the second participant and the second result fragment corresponding to the second decision tree; the K-1 decision trees of the second participant are the decision trees other than the second decision tree among the K decision trees of the second participant; the first result fragment corresponding to the first decision tree and the second result fragment corresponding to the second decision tree are determined by the node feature data;
wherein the output vector is used for characterizing the predicted value of the intersection service identifier in longitudinal federal learning.
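The ensemble aggregation in claim 10 has the same local-sum structure as the per-tree step: each participant adds its K result fragments into one output fragment, and only the combination of the two output fragments reveals the predicted value. The sketch assumes K = 3 hypothetical sub-predictions; a negative one is represented modulo P.

```python
import secrets

P = 2**61 - 1
K = 3  # hypothetical number of trees per participant

# Sub-predicted values of the K tree pairs: 4 + (-1) + 5 = 8
tree_subpredictions = [4, (-1) % P, 5]

frag_a, frag_b = [], []
for v in tree_subpredictions:
    r = secrets.randbelow(P)
    frag_a.append(r)
    frag_b.append((v - r) % P)

# Each participant sums its K result fragments locally into its
# output fragment ...
out_a = sum(frag_a) % P
out_b = sum(frag_b) % P

# ... and reconstruction yields the ensemble's predicted value.
assert (out_a + out_b) % P == 8
```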
11. A data processing apparatus, running on a first participant participating in longitudinal federal learning, the apparatus comprising:
the data acquisition module is used for acquiring first feature data of a first service identifier of the first participant and a first data fragment of a second service identifier of a second participant participating in longitudinal federal learning; the first data fragment and a second data fragment of the second service identifier held by the second participant are fragments of second feature data of the second service identifier;
the data input module is used for inputting the first feature data and the first data fragment into a first decision tree of the first participant; the first decision tree comprises a first split feature fragment corresponding to a first partition node; the first partition node corresponds to a second partition node of a second decision tree of the second participant; the first split feature fragment and a second split feature fragment corresponding to the second partition node are fragments of a feature identifier of a split feature commonly corresponding to the first partition node and the second partition node;
the feature characterization module is used for acquiring a first feature Boolean fragment associated with the split feature; the first feature Boolean fragment and a second feature Boolean fragment associated with the split feature held by the second participant are fragments of a feature Boolean vector; the first feature Boolean fragment and the second feature Boolean fragment are obtained by vector processing of the first split feature fragment and the second split feature fragment; the feature Boolean vector is used for characterizing the split feature among a first service feature of the first service identifier and a second service feature of the second service identifier;
the first feature Boolean fragment and the second feature Boolean fragment are used together for acquiring node feature data associated with the split feature from the first feature data, the first data fragment and the second data fragment; the node feature data is used for determining a predicted value of an intersection service identifier between the first service identifier and the second service identifier; the predicted value is used for determining a service processing result of the intersection service identifier.
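The feature Boolean vector named above is, in effect, a one-hot indicator of the split feature over the combined feature set. Selecting the node feature data then amounts to an inner product with that indicator; the sketch below shows this in the clear with hypothetical values, whereas in the protocol the indicator is secret-shared so neither participant learns which feature the partition node splits on.

```python
# A record's feature values and a one-hot indicator of the split
# feature (index 1), both hypothetical.
features = [3, 8, 5]
feature_boolean_vector = [0, 1, 0]

# Selecting the node feature data is an inner product with the
# indicator; in the protocol this product is computed on shares.
node_feature_value = sum(f * b for f, b in zip(features, feature_boolean_vector))

assert node_feature_value == 8
```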
12. A computer device, comprising: a processor and a memory;
The processor is connected to the memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-10.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-10.
14. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer readable storage medium and adapted to be read and executed by a processor to cause a computer device with the processor to perform the method of any of claims 1-10.
CN202310645453.9A 2023-06-01 2023-06-01 Data processing method, device, computer equipment and readable storage medium Pending CN116975905A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310645453.9A CN116975905A (en) 2023-06-01 2023-06-01 Data processing method, device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116975905A true CN116975905A (en) 2023-10-31


Legal Events

Date Code Title Description
PB01 Publication