CN110991552B - Isolated forest model construction and prediction method and device based on federal learning - Google Patents


Info

Publication number
CN110991552B
CN110991552B (application number CN201911288850.5A)
Authority
CN
China
Prior art keywords
node
party
data
feature
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911288850.5A
Other languages
Chinese (zh)
Other versions
CN110991552A (en)
Inventor
宋博文
叶捷明
陈帅
顾曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110462961.4A priority Critical patent/CN113065610B/en
Priority to CN201911288850.5A priority patent/CN110991552B/en
Publication of CN110991552A publication Critical patent/CN110991552A/en
Priority to TW109115727A priority patent/TWI780433B/en
Priority to PCT/CN2020/118009 priority patent/WO2021114821A1/en
Application granted
Publication of CN110991552B publication Critical patent/CN110991552B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The embodiments of the present specification provide a method and device for constructing an isolated forest model based on federated learning, the method comprising: obtaining a plurality of sample identifiers corresponding to a first node, the plurality of sample identifiers corresponding respectively to a plurality of samples, each sample comprising feature values of m features; randomly selecting one feature identifier from the m feature identifiers; in the case that the selected feature identifier is a first feature identifier, sending the identifier of the first node, the plurality of sample identifiers, and the first feature identifier to a first data party based on the locally stored correspondence between the first feature identifier and the first data party; recording the correspondence between the first node and the first data party; and receiving, from the first data party, information corresponding respectively to the two child nodes of the first node, so that the private data of each data party is protected while an isolated forest model is constructed for business processing.

Description

Isolated forest model construction and prediction method and device based on federated learning
Technical Field
The embodiments of the present specification relate to the technical field of machine learning, and in particular to a method and device for constructing an isolated forest model based on federated learning, and to a method and device for predicting object abnormalities through a federated-learning-based isolated forest model.
Background
At present, more and more internet enterprises, as data owners, are paying attention to data privacy and data security. The isolated forest model is an unsupervised learning model for predicting abnormal objects; it can be used to analyze user behavior and identify abnormal behavior so as to protect the safety of user funds, for example in theft risk control and fraud risk control. However, modeling in such scenarios is usually performed under data fusion (i.e., centralized, mutually visible storage of data), which often requires that data from different sources be fully exposed to the other parties to complete the modeling analysis, posing a great risk to private data. Therefore, there is a need for a more effective scheme for constructing and using an isolated forest model while protecting private data.
Disclosure of Invention
The embodiment of the specification aims to provide a more effective isolated forest model construction and use scheme for protecting private data so as to solve the defects in the prior art.
In order to achieve the above object, one aspect of the present specification provides a method for constructing an isolated forest model based on federated learning, where the participants in the federated learning include a computing party and at least two data parties, the method is performed by the computing party's device with respect to a first node in a first tree of the model, the at least two data parties include a first data party, correspondences between m feature identifiers and the respective data parties are stored in advance in the computing party's device, and the m feature identifiers are the predetermined identifiers of m respective features. The method includes:
obtaining a plurality of sample identifiers corresponding to the first node, where the plurality of sample identifiers correspond respectively to a plurality of samples, each sample including feature values of the m features;
randomly selecting one feature identifier from the m feature identifiers;
in the case that the selected feature identifier is a first feature identifier, sending the identifier of the first node, the plurality of sample identifiers, and the first feature identifier to the first data party based on the locally stored correspondence between the first feature identifier and the first data party;
recording the correspondence between the first node and the first data party;
and receiving, from the first data party, information corresponding respectively to the two child nodes of the first node, so as to construct an isolated forest model for business processing.
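As a rough illustration, the computing-party step above can be sketched as follows (all names and the message-passing callback are hypothetical; a real deployment would use an RPC or secure channel between the parties):

```python
import random

def build_node_at_computing_party(node_id, sample_ids, feature_ids,
                                  feature_owner, send_to_data_party):
    """One construction step at the computing party for a single node.

    feature_owner maps each feature identifier to the data party that
    holds the underlying feature; send_to_data_party stands in for the
    network call and returns the messages for the two child nodes.
    """
    # Randomly select one of the m feature identifiers.
    fid = random.choice(feature_ids)
    # Route the request to the data party that owns the selected feature.
    party = feature_owner[fid]
    left_info, right_info = send_to_data_party(party, node_id, sample_ids, fid)
    # Record the node -> data-party correspondence for later prediction.
    return {"node": node_id, "party": party,
            "children": (left_info, right_info)}
```

Note that the computing party only ever handles opaque feature identifiers and sample identifiers; feature names and feature values remain at the data parties.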
In one embodiment, the first node is a root node, and obtaining the plurality of sample identifiers corresponding to the first node includes obtaining N sample identifiers and randomly selecting n sample identifiers from the N sample identifiers, where N > n.
In an embodiment, the two child nodes include a second node, the information corresponding to the second node indicates that the second node is a leaf node, and the method further includes recording a correspondence between the identifier of the second node and the first data party.
In one embodiment, the two child nodes include a third node, and the information corresponding to the third node includes u sample identifiers assigned to the third node, where the u sample identifiers are part of the plurality of sample identifiers.
In one embodiment, the at least one data party is at least one network platform, and the plurality of samples respectively correspond to a plurality of objects in the network platform.
In one embodiment, the object is any one of: consumer, transaction, merchant, commodity.
Another aspect of the present specification provides a method for constructing an isolated forest model based on federated learning, where the participants in the federated learning include a computing party and at least two data parties, a first tree of the model includes a first node, the method is performed by the device of a first data party of the at least two data parties, and the first data party's device holds the feature values of a first feature of respective samples and stores a correspondence between the first feature and a predetermined first feature identifier. The method includes:
receiving, from the computing party's device, an identifier of the first node, a plurality of sample identifiers, and the first feature identifier, where the plurality of sample identifiers correspond respectively to a plurality of samples;
randomly selecting, based on the locally stored correspondence between the first feature identifier and the first feature, one feature value from the feature values of the first feature of the plurality of samples as the split value of the first node;
recording the correspondence between the first node and the first feature and the split value;
grouping the plurality of samples based on the split value to construct two child nodes of the first node;
determining, for each of the two child nodes, whether it is a leaf node;
and sending, based on the grouping and the determination results, information corresponding respectively to the two child nodes to the computing party's device, so as to construct an isolated forest model for business processing.
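A minimal single-machine sketch of the data-party handling above (parameter names and the leaf test are assumptions; the patent leaves the leaf criterion to a "predetermined rule"):

```python
import random

def split_node_at_data_party(sample_ids, feature_values, depth, max_depth):
    """Handle one split request at the data party.

    feature_values maps sample id -> this party's value of the requested
    feature. The split value is drawn at random from the observed values,
    as in the method above; a child counts as a leaf here when it holds
    at most one sample or the depth limit is reached (an illustrative
    stand-in for the predetermined rule).
    """
    split = random.choice([feature_values[s] for s in sample_ids])
    left = [s for s in sample_ids if feature_values[s] < split]
    right = [s for s in sample_ids if feature_values[s] >= split]

    def is_leaf(ids):
        return len(ids) <= 1 or depth + 1 >= max_depth

    # Only sample ids and leaf flags go back to the computing party;
    # the feature and the split value stay local.
    return split, ({"samples": left, "leaf": is_leaf(left)},
                   {"samples": right, "leaf": is_leaf(right)})
```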
In one embodiment, the two child nodes include a second node, where the information corresponding to the second node indicates that the second node is a leaf node, and the method further includes calculating and storing the node depth of the second node.
Another aspect of the present specification provides a method for predicting the abnormality of an object through a federated-learning-based isolated forest model, where the participants in the federated learning include a computing party and at least two data parties, the tree structure of a first tree in the model and the data parties corresponding to the nodes in the first tree are stored in the computing party's device, and the method is performed by the computing party's device. The method includes:
acquiring an object identifier of a first object;
sending the object identification to each data party;
receiving, from each data party's device, the results of partitioning the first object at the non-leaf node(s) corresponding to that data party;
determining, based on the tree structure of the first tree and the partitioning results for the first object at the respective non-leaf nodes received from the at least two data parties' devices, a first leaf node into which the first object falls;
sending, based on the data parties corresponding to the leaf nodes in the first tree, the identifier of the first leaf node to a first data party corresponding to the first leaf node;
receiving the node depth of the first leaf node from the first data party;
and predicting an abnormality of the first object based on the node depth, for business processing.
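The leaf-location logic at the computing party can be sketched as follows (the tree encoding is an assumption; per the scheme above, the computing party stores only the tree structure and node-to-party mapping, never split values):

```python
def locate_leaf(tree, partition_results, root=1):
    """Walk a tree at the computing party using the data parties'
    partition results.

    tree maps each non-leaf node id to its (left, right) child ids;
    partition_results maps each non-leaf node id to 'left' or 'right',
    as reported by the data party that owns that node's split.
    """
    node = root
    while node in tree:  # stop once we reach a leaf
        left, right = tree[node]
        node = left if partition_results[node] == "left" else right
    return node
```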
In one embodiment, the method further comprises, based on the prediction result for the first object, obtaining training samples for training a supervised learning model.
In one embodiment, the method further comprises optimizing sample features of the isolated forest model based on parameters of the trained supervised learning model.
Another aspect of the present specification provides a method for predicting abnormalities of an object through a federated-learning-based isolated forest model, where the participants in the federated learning include a computing party and at least two data parties, and a first data party of the at least two data parties has recorded in its device: a first feature and a split value corresponding to a first node in a first tree, as well as the feature values of the first feature of respective objects. The method is performed by the first data party's device and includes:
receiving an object identifier of a first object from the computing party's device;
obtaining, based on the locally stored first feature of the first node, the feature value of the first feature of the first object from local storage;
partitioning the first object at the first node based on the locally stored feature value of the first feature of the first object and the split value of the first node;
and sending the partitioning result to the computing party's device for predicting an abnormality of the first object, for business processing.
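At the data party, the partitioning step amounts to comparing the object's locally held feature value against the locally recorded split value at each node this party owns (a sketch; all names are illustrative):

```python
def partition_at_own_nodes(own_nodes, object_features):
    """Partition an object at every non-leaf node owned by this data
    party and return only the left/right results; the features and
    split values themselves are never sent to the computing party.

    own_nodes maps node id -> (feature name, split value);
    object_features holds this party's feature values for the object.
    """
    return {node: ("left" if object_features[feat] < split else "right")
            for node, (feat, split) in own_nodes.items()}
```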
In one embodiment, the node depth of a second node in the first tree is recorded in the first data party's device, and the method further includes receiving, from the computing party's device, an identifier of the second node into which the first object falls, and sending the node depth of the second node to the computing party's device.
Another aspect of the present specification provides an apparatus for constructing an isolated forest model based on federated learning, where the participants in the federated learning include a computing party and at least two data parties, the apparatus is deployed in the computing party's device with respect to a first node in a first tree of the model, the at least two data parties include a first data party, correspondences between m feature identifiers and the respective data parties are pre-stored in the computing party's device, and the m feature identifiers are the predetermined identifiers of m respective features. The apparatus includes:
an obtaining unit configured to obtain a plurality of sample identifiers corresponding to the first node, the plurality of sample identifiers corresponding to a plurality of samples, respectively, each sample including feature values of the m features;
a selecting unit configured to randomly select one feature identifier from the m feature identifiers;
a sending unit, configured to, in a case where the selected feature identifier is a first feature identifier, send the identifier of the first node, the plurality of sample identifiers, and the first feature identifier to a first data party based on a correspondence relationship between the locally stored first feature identifier and the first data party;
the first recording unit is configured to record the corresponding relation between the first node and the first data party;
a receiving unit configured to receive information corresponding to the two child nodes of the first node from the first data party, respectively, so as to construct an isolated forest model for business processing.
In an embodiment, the first node is a root node, and the obtaining unit is further configured to obtain N sample identifiers and randomly select n sample identifiers from the N sample identifiers, where N > n.
In an embodiment, the two child nodes include a second node, the information corresponding to the second node includes that the second node is a leaf node, and the apparatus further includes a second recording unit configured to record a correspondence between the second node identifier and the first data party.
Another aspect of the present specification provides an apparatus for constructing an isolated forest model based on federated learning, where the participants in the federated learning include a computing party and at least two data parties, a first tree of the model includes a first node, the apparatus is deployed in the device of a first data party of the at least two data parties, and the first data party's device holds the feature values of a first feature of respective samples and stores a correspondence between the first feature and a predetermined first feature identifier. The apparatus includes:
a receiving unit configured to receive, from the device of the computing side, an identifier of a first node, a plurality of sample identifiers, and a first feature identifier, where the plurality of sample identifiers correspond to a plurality of samples, respectively;
a selecting unit configured to randomly select one feature value from feature values of first features of the plurality of samples as a split value of a first node based on a correspondence relationship between a locally stored first feature identifier and the first feature;
a recording unit configured to record a correspondence relationship between the first node and the first feature and the split value;
a grouping unit configured to group the plurality of samples based on the split values to construct two child nodes of the first node;
a determining unit configured to determine whether the two child nodes are leaf nodes, respectively;
a sending unit configured to send information corresponding to the two child nodes, respectively, to the device of the computation side based on the grouping and the result of the determination, thereby constructing an isolated forest model for business processing.
In one embodiment, the two child nodes include a second node, where the information corresponding to the second node includes that the second node is a leaf node, and the apparatus further includes a calculation unit configured to calculate and store a node depth of the second node.
Another aspect of the present specification provides an apparatus for predicting the abnormality of an object through a federated-learning-based isolated forest model, where the participants in the federated learning include a computing party and at least two data parties, the tree structure of a first tree in the model and the data parties corresponding to the nodes in the first tree are stored in the computing party's device, and the apparatus is deployed in the computing party's device. The apparatus includes:
a first acquisition unit configured to acquire an object identification of a first object;
the first sending unit is configured to send the object identification to each data party;
a first receiving unit configured to receive, from each data party's device, the results of partitioning the first object at the non-leaf node(s) corresponding to that data party;
a first determining unit configured to determine, based on the tree structure of the first tree and the partitioning results for the first object at the respective non-leaf nodes received from the at least two data parties' devices, a first leaf node into which the first object falls;
a second sending unit, configured to send, based on the data parties corresponding to the leaf nodes in the first tree, the identifier of the first leaf node to the first data party corresponding to the first leaf node;
a second receiving unit configured to receive the node depth of the first leaf node from the first data side;
a prediction unit configured to predict an abnormality of the first object based on the node depth for business processing.
In one embodiment, the apparatus further comprises a second obtaining unit configured to obtain, based on the prediction result for the first object, training samples for training a supervised learning model.
In an embodiment, the apparatus further comprises a second determining unit configured to determine features comprised by the samples of the isolated forest model based on parameters of the trained supervised learning model.
Another aspect of the present specification provides an apparatus for predicting abnormalities of an object through a federated-learning-based isolated forest model, where the participants in the federated learning include a computing party and at least two data parties, and a first data party of the at least two data parties has recorded in its device: a first feature and a split value corresponding to a first node in a first tree, as well as the feature values of the first feature of respective objects. The apparatus is deployed in the first data party's device and includes:
a first receiving unit configured to receive an object identification of a first object from the device of the computing side;
an acquisition unit configured to acquire a feature value of a first feature of the first object locally based on a first feature of a first node stored locally;
the dividing unit is configured to divide the first object at a first node based on a feature value of a first feature of the first object stored locally and a split value of the first node;
a first sending unit configured to send a result of the division to the device of the computing side for predicting an abnormality of the first object for business processing.
In one embodiment, the node depth of a second node in the first tree is recorded in the first data party's device, and the apparatus further includes a second receiving unit configured to receive, from the computing party's device, an identifier of the second node into which the first object falls, and a second sending unit configured to send the node depth of the second node to the computing party's device.
Another aspect of the present specification provides a computer readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform any one of the above methods.
Another aspect of the present specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor implements any one of the above methods when executing the executable code.
According to the above schemes for constructing an isolated forest model based on federated learning and predicting abnormalities with that model, the isolated forest model can be constructed jointly from the data of multiple data parties, and the abnormality of an object can be predicted jointly from the data of the multiple data parties and the model, while the data of each data party is protected from leaking to the other parties. This expands the volume of data available for constructing the isolated forest model and increases the model's prediction accuracy, while protecting the data security of each data party.
Drawings
The embodiments of the present specification may be made more clear by describing the embodiments with reference to the attached drawings:
FIG. 1 illustrates a schematic view of a scene in which an isolated forest model is constructed and used according to an embodiment of the present description;
FIG. 2 is a diagram schematically illustrating the structure of a tree 1 in a constructed model that is made available to party B through the above construction process;
fig. 3 schematically shows a timing diagram of a method for constructing the node 1 in fig. 2 based on federal learning according to an embodiment of the present description;
FIG. 4 schematically shows a timing diagram of a method for constructing node 2 of FIG. 2 based on federated learning, in accordance with an embodiment of the present description;
FIG. 5 schematically illustrates a timing diagram of a method for predicting abnormalities of an object through a federated-learning-based isolated forest model, in accordance with an embodiment of the present description;
FIG. 6 is a diagram illustrating party B determining the leaf node into which object x falls according to the partitioning results of parties A and C;
FIG. 7 schematically illustrates a process of mutual optimization between a multi-party unsupervised learning model and a supervised learning model in accordance with an embodiment of the present description;
fig. 8 illustrates an apparatus 800 for constructing an isolated forest model based on federal learning in an embodiment of the present description;
fig. 9 illustrates an apparatus 900 for constructing an isolated forest model based on federal learning in accordance with an embodiment of the present description;
FIG. 10 illustrates an apparatus 1000 for predicting abnormalities of an object through a federated-learning-based isolated forest model, in accordance with an embodiment of the present description;
FIG. 11 illustrates an apparatus 1100 for predicting abnormalities of an object through a federated-learning-based isolated forest model, in accordance with an embodiment of the present description.
Detailed Description
The embodiments of the present specification will be described below with reference to the accompanying drawings.
FIG. 1 illustrates a schematic diagram of a scenario for constructing and using an isolated forest model according to an embodiment of the present description. As shown in FIG. 1, the scenario includes at least two data parties (only party A and party C are shown) and a computing party (party B); the case of two data parties is described below as an example. Parties A and C are, for example, a shopping platform and a payment platform, and the isolated forest model may be used, for example, to predict abnormalities of individual transactions common to the two platforms. Party A holds, for example, the commodity features and user purchasing-behavior features of each transaction, and party C holds, for example, the payment features and user payment-behavior features of each transaction; that is, the data of parties A and C together constitute the feature data of a transaction. Thus, parties A and C can build an isolated forest model from their respective data together with party B, where each sample used to build the model comprises the feature values of the features of one transaction. The computing party B may be a party with computing equipment for carrying out the model construction and prediction computations, and may also be either of parties A and C.
In the model building process, party B first obtains N transaction numbers associated with both party A and party C; the feature data of the N transactions corresponding to these numbers serves as the training sample set of the model. The feature data may be represented, for example, as a matrix X with N rows and m columns, where each row corresponds to one transaction and each column corresponds to one feature of a transaction; that is, each transaction has m features. Suppose party A holds one part X_A of the feature data of the N transactions and party C holds another part X_C, so that X = (X_A, X_C). Then n transaction numbers are randomly selected from the N transaction numbers, and the feature data of the corresponding n transactions serves as the training sample set of one tree in the model.
Before training begins, parties A, B, and C may jointly negotiate the feature identifiers of the respective features, while keeping party B unaware of which features parties A and C hold, and keeping parties A and C unaware of each other's features. For example, parties A and C each set feature identifiers for their own features and send them to party B, where parties A and C may negotiate so that there is no duplication between the two parties' feature identifiers. Party B thus records the m feature identifiers and the data party corresponding to each. In party B's device, one feature identifier (f1) is randomly selected from the m feature identifiers for the root node (node 1) of the model. Assume that f1 is recorded in party B's device as corresponding to party A; party B therefore records that node 1 corresponds to party A and sends, for example, "node 1, f1, n transaction numbers" to party A. After receiving this information, party A determines from its local records that f1 is the identifier of feature q1 (e.g., the commodity price), randomly selects one value from the values of feature q1 of the local n transactions corresponding to the n transaction numbers as the split value p1 of node 1, and splits the n transactions based on q1 and p1 to obtain the sets Sl and Sr of transaction numbers falling into the two child nodes 2 and 3 of node 1, respectively. After judging, based on a predetermined rule, that neither node 2 nor node 3 is a leaf node, party A sends Sl and Sr to party B, so that party B repeats the above process for nodes 2 and 3 respectively, thereby constructing an isolated tree as shown in the figure.
When node 3 corresponds to party A and party A determines that, for example, child node 7 of node 3 is a leaf node, party A notifies party B that node 7 is a leaf node, and at the same time calculates and stores the node depth of node 7. After receiving the notice that node 7 is a leaf node, party B constructs leaf node 7 in the tree and records that node 7 corresponds to party A. A plurality of isolated trees are constructed by the same method, thereby constructing the isolated forest. After construction is completed, party B records the tree structure of each tree and the data party corresponding to each node in each tree, and party A records its part ω_A of the model parameters, which comprises: the split features and split values of the non-leaf nodes corresponding to party A, and the node depths of the leaf nodes corresponding to party A. Similarly, party C records its part ω_C of the model parameters.
Fig. 2 schematically shows a structure diagram of a tree (for example, tree 1) in a constructed model obtained by the party B through the above construction process, where the structure diagram schematically shows 11 nodes and connection relationships between the nodes, where a number marked inside each node is a node identifier of the node, and a letter (such as a or C) marked outside each node is an identifier of a data party to indicate the data party corresponding to the node.
After the isolated forest model is constructed, it can be used to predict the abnormality of an object to be predicted. For example, to predict the abnormality of transaction 1, party B sends the number of transaction 1 to parties A and C. Parties A and C each divide transaction 1 at their corresponding nodes based on their respective partial model parameters and the feature values of their partial features of transaction 1, and send the division results to party B. Party B combines the division results of parties A and C to determine the leaf node into which transaction 1 falls, and receives the node depth of that leaf node from the party (e.g., party A) corresponding to it. Thus, based on the node depths of the leaf nodes into which transaction 1 falls in the various trees of the model, party B may calculate an average node depth for transaction 1 and determine the abnormality of transaction 1 based on that average depth.
It is to be understood that the above description with reference to fig. 1 is merely illustrative and not restrictive, for example, the at least two data parties may include more data parties, the samples are not limited to transaction samples, and the like, and the above model building process and model prediction process will be described in detail below.
Fig. 3 schematically shows a timing diagram of a method for constructing node 1 in fig. 2 based on federal learning according to an embodiment of the present disclosure. As described above, the participants in the federal learning include, for example, the three parties A, B, and C mentioned above, and the timing diagram shows the interaction sequence between party A as a data party and party B as the calculator during the construction process. It will be appreciated that the interaction between party B and the other data parties involved in the federal learning is similar. The construction process for node 1 will be described below in conjunction with figs. 2 and 3. The parties A and B shown in fig. 3 and referred to in the following steps represent the A-party device and the B-party device, respectively.
As described above with reference to fig. 1, party B stores in advance the correspondence between the m feature identifiers and the respective data parties. For example, the m features include a feature q1 (e.g., "commodity price") whose feature data is owned by party A, so party A can determine in advance that the feature identifier corresponding to q1 is f1, record the correspondence between q1 and f1 locally, and send f1 to party B, so that party B records that f1 corresponds to party A. In this way, party B cannot learn which features party A actually has.
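The identifier negotiation above can be sketched as follows. This is a minimal illustration, not the patent's protocol: the function and variable names (`register_features`, `registry`) are assumptions, and the identifier format is arbitrary; the key property is that party B receives only opaque identifiers and the party they belong to, never the feature names themselves.

```python
def register_features(local_features, party, registry):
    """A data party assigns opaque identifiers to its features, keeps the
    identifier-to-feature mapping locally, and registers only the
    identifiers (and its own name) with party B's registry."""
    local_map = {}
    for idx, feat in enumerate(local_features):
        fid = f"{party}_f{idx}"   # opaque identifier; reveals no feature name
        local_map[fid] = feat     # kept locally by the data party
        registry[fid] = party     # party B records: identifier -> data party
    return local_map
```

Prefixing identifiers with the party name is one simple way for parties A and C to guarantee no duplication between their identifier sets, as the negotiation in the text requires.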
After construction begins, referring to fig. 3, first, at step 302, party B acquires n sample identifications corresponding to node 1. Node 1 is the root node of tree 1, and as described above, its sample identifications are n sample identifications randomly selected from the N sample identifications. As mentioned above, the N sample identifications are, for example, transaction numbers associated with both parties A and C, and are not described in detail herein. Multiple sets of sample identifications can be determined by randomly selecting from the N sample identifications multiple times, each set comprising n sample identifications, so that each set can be used to train one tree of the model and the sets together train the whole isolated forest. By determining multiple sample sets in this manner to train each tree of the forest separately, each tree can be trained with less data while the prediction accuracy of the entire model is maintained.
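The per-tree subsampling described above can be sketched as follows. This is an illustrative helper, assuming nothing beyond the text: the names `sample_subsets`, `all_ids`, and `num_trees` are hypothetical, and the source of randomness is left to Python's standard library.

```python
import random

def sample_subsets(all_ids, n, num_trees, seed=0):
    """Randomly draw one set of n sample identifications per tree from the
    N shared identifications; each set is used to train one isolated tree."""
    rng = random.Random(seed)
    # random.Random.sample draws without replacement, so each set holds
    # n distinct identifications
    return [rng.sample(all_ids, n) for _ in range(num_trees)]
```

For example, with N = 1000 transaction numbers, `sample_subsets(list(range(1000)), n=256, num_trees=4)` yields four disjoint-per-tree training sets of 256 identifications each.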
In step 304, party B randomly selects one feature identifier from the m feature identifiers; for example, the randomly selected identifier is f1.
At step 306, party B determines that f1 corresponds to party A based on the locally stored correspondence. As described above, party B stores in advance the correspondence between the m feature identifiers and the respective data parties, including that f1 corresponds to party A. As described above, this correspondence is determined in advance through mutual negotiation among parties A, B, and C and is obtained by party B, which is not described in detail here.
At step 308, party B sends the identification of node 1 (i.e., "node 1"), the n sample identifications, and "f1" to party A.
At step 310, party B locally records the correspondence between node 1 and party A. This recording may be performed in various ways: for example, as shown in fig. 2, an "A" may be marked at node 1 of tree 1 in the figure, indicating that node 1 corresponds to party A, or "node 1" may be recorded in association with "A" in the form of a table, thereby establishing that node 1 corresponds to party A.
In step 312, after receiving "node 1", the n sample identifications, and "f1" sent by party B, party A determines that f1 corresponds to the feature q1 based on the locally stored correspondence, thereby taking q1 as the splitting feature of node 1.
At step 314, party A randomly selects one feature value from the values of q1 of the n samples corresponding to the n sample identifications as the split value of node 1; for example, the selected value is p1.
At step 316, having determined the splitting feature q1 and the split value p1 of node 1 through the above steps, party A records q1 and p1 for node 1.
At step 318, party A divides the n samples into the two child nodes of node 1, i.e., node 2 and node 3 in fig. 2, based on the split value p1. For example, it can be set that if the q1 value of a sample is < p1, the sample is divided to the left child node (node 2), and if the q1 value of a sample is ≥ p1, the sample is divided to the right child node (node 3).
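The division rule of step 318 can be sketched as follows. This is an illustrative sketch; the function name and the representation of feature data as a dictionary from sample identification to value are assumptions, not part of the patent.

```python
def split_samples(sample_ids, feature_values, split_value):
    """Divide samples at a node: a sample whose splitting-feature value is
    below split_value goes to the left child, otherwise to the right child.
    feature_values maps sample id -> value of the splitting feature."""
    left = [i for i in sample_ids if feature_values[i] < split_value]
    right = [i for i in sample_ids if feature_values[i] >= split_value]
    return left, right
```

Only the data party holding the splitting feature can perform this step, which is why party B must delegate it; party B later receives only the two lists of sample identifications.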
At step 320, party A determines whether nodes 2 and 3 are leaf nodes. This may be determined based on a predetermined rule: for example, if the node depth of a node reaches a predetermined depth (e.g., a maximum depth), the node is a leaf node; likewise, if a node contains only one sample, or contains multiple samples whose feature data are identical and thus cannot be distinguished, the node is a leaf node.
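The predetermined leaf rule can be sketched as a single predicate. This is a simplified illustration: it checks identity of a single feature's values rather than of full feature vectors, and the function and parameter names are assumptions.

```python
def is_leaf(depth, values, max_depth):
    """A node is a leaf when its depth reaches the maximum depth, it holds
    a single sample, or all of its samples have identical feature data."""
    return depth >= max_depth or len(values) <= 1 or len(set(values)) == 1
```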
After determining that neither node 2 nor node 3 is a leaf node, party A sends the sample identifications included in each of node 2 and node 3 to party B at step 322. Party B thus has u sample identifications for constructing node 2 and v sample identifications for constructing node 3, so that the process performed above for node 1 can be performed for node 2 and node 3, respectively, continuing the construction of those nodes and thus of the whole tree.
Fig. 4 schematically shows a timing diagram of a method for constructing the node 2 in fig. 2 based on federal learning according to an embodiment of the present disclosure. The timing diagram shows the interaction timing diagram between party C and party B as data parties in the construction process. The construction process with respect to node 2 will be described below in conjunction with fig. 2 and 4. Where, similarly to the above, the C-party in fig. 4 and in the following description denotes a C-party device.
The m features further include, for example, a feature q2 (e.g., "payment amount") whose feature data is owned by party C, so party C can determine in advance that the feature identifier corresponding to q2 is f2, locally record the correspondence between q2 and f2, and send f2 to party B, so that party B records that f2 corresponds to party C.
After the start of the build, referring to fig. 4, in step 402, party B obtains the u sample identifications corresponding to node 2, i.e., party B receives from party A the u sample identifications divided into node 2. At step 404, party B randomly selects a feature identifier, e.g., f2, from the m feature identifiers. At step 406, party B determines that f2 corresponds to party C based on the locally stored correspondence. At step 408, party B sends "node 2", the u sample identifications, and "f2" to party C. At step 410, party B records that node 2 corresponds to party C. In step 412, after receiving "node 2", the u sample identifications, and "f2" from party B, party C determines, based on the locally stored correspondence, the feature q2 corresponding to f2 as the splitting feature of node 2. At step 414, party C randomly selects one feature value, e.g., p2, from the values of q2 of the u samples as the split value of node 2. At step 416, party C records the splitting feature q2 and the split value p2 for node 2. At step 418, party C divides the u samples into node 4 and node 5 based on p2. At step 420, party C determines whether nodes 4 and 5 are leaf nodes. For steps 404 to 420, reference may be made to the above description of steps 304 to 320, which is not repeated here.
At step 422, party C determines, based on the determination of step 420, for example, that node 4 is not a leaf node and node 5 is a leaf node, so that party C sends g sample identifications assigned to node 4 to party B while informing party B that node 5 is a leaf node.
After party B receives "node 5 is a leaf node" at step 424, party B may mark node 5 as a leaf node so that no further sample division is performed on node 5, and party B locally records that node 5 corresponds to party C.
At step 426, party C, after determining that node 5 is a leaf node, calculates and stores the node depth of node 5. In one embodiment, the node depth of node 5 may be calculated by equation (1) as follows:
H=e+c(n) (1)
wherein, c (n) is shown in formula (2):
c(n)=2H(n-1)-2(n-1)/n (2)
where e is the number of edges between node 5 and the root node (node 1), i.e., 2, n is the number of training samples of the tree, and H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (the Euler constant). In the isolated forest model, the smaller the node depth of a leaf node, the greater the possibility that a sample divided into that leaf node is an abnormal sample.
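Formulas (1) and (2) can be sketched directly in code. This is a straightforward transcription under the stated estimate of the harmonic number; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Formula (2): c(n) = 2*H(n-1) - 2*(n-1)/n, where the harmonic number
    H(i) is estimated by ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    return 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n

def node_depth(e, n):
    """Formula (1): H = e + c(n), where e is the number of edges from the
    leaf to the root and n is the number of training samples of the tree."""
    return e + c(n)
```

The c(n) term grows with n, so leaves of trees trained on more samples receive a larger depth adjustment, keeping depths comparable across trees.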
After node 2 is constructed as described above, the remaining non-leaf nodes in tree 1 (node 3, node 4, and node 6) can be constructed in the same manner, thereby forming the structure of tree 1 shown in fig. 2. For example, through the random selection described above, it may be determined that node 1, node 3, and node 4 correspond to party A, and node 2 and node 6 correspond to party C; accordingly, leaf nodes 7, 8, and 9 correspond to party A, and leaf nodes 5, 10, and 11 correspond to party C, as shown in fig. 2. Parties A and C each record their corresponding nodes and the splitting features and split values of those nodes. That is, parties A, B, and C each possess part of the data of the isolated forest model, so the three parties must cooperate when the model is used for object prediction.
Fig. 5 schematically illustrates a timing diagram of a method for predicting abnormalities in a subject based on federal learning using an isolated forest model, in accordance with an embodiment of the present disclosure.
As shown in fig. 5, first, in step 502, party B obtains an object identifier x of the object to be predicted. The object identifier is, for example, a transaction number similar to the sample identifications described above, the object to be predicted is a transaction to be predicted, and, similarly, the feature data of transaction x is composed of data held by parties A and C. Party B may initiate the prediction of the abnormality of transaction x on its own initiative, or party B may begin executing the method as a server upon receiving a request from a client to predict the abnormality of transaction x.
In step 504, party B sends the object identifier x to party A and party C, respectively. Although party B is shown sending to parties A and C simultaneously, this embodiment is not limited thereto.
At step 506, party A and party C each divide the object x at their respective corresponding nodes. From the above, party A corresponds, for example, to node 1, node 3, and node 4; it holds the feature q1 and split value p1 of node 1, the feature q3 and split value p3 of node 3, and the feature q4 and split value p4 of node 4, as well as the object's value v1 of feature q1, value v3 of feature q3, and value v4 of feature q4. Thus, party A may divide object x at node 1 based on v1 and p1; for example, if v1 < p1, object x is divided into the left child node of node 1. Similarly, party A divides object x into the left child node of node 3 based on v3 and p3, and into the right child node of node 4 based on v4 and p4. Likewise, party C corresponds to node 2 and node 6, and divides object x into the left child node at node 2 and into the right child node at node 6.
At step 508, parties A and C send their division results for object x at each node to party B. It is to be understood that, although parties A and C are shown performing this step at the same time, the embodiment is not limited thereto.
In step 510, the B-party determines a leaf node, i.e., node 9, into which the object x falls based on the received division result. Fig. 6 is a schematic diagram illustrating that the B party determines that the object x falls into a leaf node according to the division result of the a party and the C party. As shown in fig. 6, the B-party merges the partitions of the object x by the a-party and the C-party at the respective nodes, so that a partition path of the object x can be found from the node 1, i.e., the node 1 → the node 2 → the node 4 → the node 9, so that it can be determined that the object x finally falls into the leaf node 9.
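The merge in step 510 can be sketched as a walk over the tree structure held by party B. This is an illustrative sketch: the data structures (a `children` mapping, a `leaves` set, and merged per-node directions `'L'`/`'R'`) are assumptions standing in for party B's recorded tree structure and the parties' division results.

```python
def find_leaf(root, children, leaves, directions):
    """Walk from the root, following at each non-leaf node the direction
    reported by whichever data party owns that node, until a leaf is hit.
    children maps node id -> (left child id, right child id)."""
    node = root
    while node not in leaves:
        left, right = children[node]
        node = left if directions[node] == 'L' else right
    return node
```

With the tree of fig. 2 and the divisions of step 506 (left at node 1, left at node 2, right at node 4), the walk yields the path node 1 → node 2 → node 4 → node 9.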
At step 512, party B determines that node 9 corresponds to party a based on the local correspondence, and sends "node 9" to party a.
At step 514, party a sends the node depth of node 9 to party B.
At step 516, party B predicts the abnormality of object x based on the node depth of node 9. In one embodiment, the abnormality of object x can be predicted from the average node depth. After obtaining the node depth of object x in each tree by the same method, party B may calculate the average node depth E(h(x)) of object x. The larger the average node depth, the farther the leaf nodes into which object x is divided are from the root node, and thus the smaller the abnormality of object x; conversely, the smaller the average node depth, the greater the abnormality of object x.
In one embodiment, the abnormality of object x can be predicted by the abnormality score shown in equation (3):
s(x,n)=2^(-E(h(x))/c(n)) (3)
wherein c(n) is as shown in formula (2) above. It can be verified that the value of s lies between 0 and 1; the larger s is, the greater the abnormality of the object, and the smaller s is, the lesser the abnormality of the object.
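The score of formula (3) can be sketched as follows, assuming the estimate of c(n) from formula (2); the function name and the use of the average depth as a plain argument are illustrative.

```python
import math

def anomaly_score(avg_depth, n):
    """Formula (3): s(x, n) = 2 ** (-E(h(x)) / c(n)), where avg_depth is
    the average node depth E(h(x)) over the trees of the forest and c(n)
    follows formula (2)."""
    c = 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    return 2 ** (-avg_depth / c)
```

When the average depth equals c(n), the score is exactly 0.5; shallower-than-average leaves push the score toward 1 (more abnormal), deeper ones push it toward 0.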
After the abnormality of the object is acquired, various business processes can be performed. For example, where the object is a transaction, after determining that the transaction is abnormal, a manual audit of the transaction can be conducted to prevent a fraud event from occurring. Alternatively, the data and the label value of the transaction may be used as a training sample for training a multi-party supervised learning model, such as an anti-fraud multi-party supervised learning model.
Fig. 7 schematically illustrates a mutual optimization process between a multi-party unsupervised learning model and a multi-party supervised learning model according to an embodiment of the present specification. As shown in fig. 7, a set of training samples can be obtained semi-automatically by combining manually (e.g., expert) labeled samples with samples labeled by the isolated forest according to the present specification, thereby training a multi-party supervised learning model; the sample features for training the isolated forest model can be determined semi-automatically by combining manually determined features with features determined based on the parameters of the multi-party supervised learning model, thereby optimizing the training of the isolated forest model. Specifically, after the plurality of features of a sample used for training the isolated forest model are determined, the feature identifiers corresponding respectively to those features may be sent to party B, so that party B performs the method shown in fig. 3 or 4 based on those feature identifiers when training the multi-party isolated forest model again. Meanwhile, the abnormality of an object can be predicted automatically through the trained multi-party supervised learning model, for example, to perform risk identification based on the abnormality of the object to be predicted.
Fig. 8 shows an apparatus 800 for constructing an isolated forest model based on federal learning, in which the participants of the federal learning include a computation party and at least two data parties, the apparatus is deployed in a device of the computation party with respect to a first node in a first tree in the model, the at least two data parties include a first data party, a correspondence relationship between m feature identifiers and each data party is stored in the computation party device in advance, the m feature identifiers are predetermined identifiers of m features, respectively, in an embodiment of the present specification, the apparatus includes:
an obtaining unit 81 configured to obtain a plurality of sample identifiers corresponding to the first node, the plurality of sample identifiers corresponding to a plurality of samples, respectively, each sample including feature values of the m features;
a selecting unit 82 configured to randomly select one feature identifier from the m feature identifiers;
a sending unit 83, configured to, in a case where the selected feature identifier is a first feature identifier, send the identifier of the first node, the plurality of sample identifiers, and the first feature identifier to a first data party based on a correspondence relationship between the locally stored first feature identifier and the first data party;
a first recording unit 84 configured to record a correspondence relationship between the first node and the first data party;
a receiving unit 85 configured to receive information corresponding to the two child nodes of the first node from the first data party, so as to construct an isolated forest model for business processing.
In an embodiment, the first node is a root node, wherein the obtaining unit 81 is further configured to obtain N sample identifiers, and randomly obtain N sample identifiers from the N sample identifiers, where N > N.
In an embodiment, the two child nodes include a second node, the information corresponding to the second node includes that the second node is a leaf node, and the apparatus further includes a second recording unit 86 configured to record a corresponding relationship between the second node identifier and the first data party.
Fig. 9 shows an apparatus 900 for constructing an isolated forest model based on federal learning according to an embodiment of the present specification, where the participants of the federal learning include a calculator and at least two data parties, a first tree of the model includes a first node, the apparatus is deployed in a device of a first data party of the at least two data parties, the device of the first data party has feature values of first features of respective samples, and stores corresponding relationships between the first features and predetermined first feature identifiers, and the apparatus includes:
a receiving unit 91 configured to receive, from the device of the computing side, an identifier of a first node, a plurality of sample identifiers and a first feature identifier, where the plurality of sample identifiers correspond to a plurality of samples, respectively;
a selecting unit 92 configured to randomly select one feature value from feature values of the first feature of each of the plurality of samples as a split value of the first node based on a correspondence between a locally stored first feature identifier and the first feature;
a recording unit 93 configured to record a correspondence relationship between the first node and the first feature and the split value;
a grouping unit 94 configured to group the plurality of samples based on the split values to construct two child nodes of the first node;
a determining unit 95 configured to determine whether the two child nodes are leaf nodes, respectively;
a sending unit 96 configured to send information corresponding to the two child nodes, respectively, to the device of the computation side based on the grouping and the result of the determination, thereby constructing an isolated forest model for business processing.
In an embodiment, the two child nodes include a second node, where the information corresponding to the second node includes that the second node is a leaf node, and the apparatus further includes a calculating unit 97 configured to calculate and store a node depth of the second node.
Fig. 10 shows an apparatus 1000 for predicting abnormality of an object through an isolated forest model based on federal learning according to an embodiment of the present specification, where the participants of the federal learning include a calculator and at least two data parties, a tree structure of a first tree in the model and the data parties corresponding to nodes in the first tree are stored in a device of the calculator, and the apparatus is deployed in the device of the calculator and includes:
a first acquisition unit 101 configured to acquire an object identification of a first object;
a first sending unit 102, configured to send the object identifier to each data party;
a first receiving unit 103, configured to receive, from each data side device, at least one division result of the first object, which is performed by the data side at least one corresponding non-leaf node of the data side, respectively;
a first determining unit 104 configured to determine a first leaf node into which the first object falls based on a tree structure of a first tree and a result of dividing the first object at each non-leaf node from the at least two data side devices;
a second sending unit 105, configured to send, based on the data parties corresponding to the leaf nodes in the first tree, an identifier of the first leaf node to the first data party corresponding to the first leaf node;
a second receiving unit 106 configured to receive the node depth of the first leaf node from the first data party;
a prediction unit 107 configured to predict an abnormality of the first object based on the node depth for performing a business process.
In an embodiment, the apparatus further comprises a second obtaining unit 108 configured to obtain training samples for training a supervised learning model based on the prediction result of the first subject.
In an embodiment, the apparatus further comprises a second determining unit 109 configured to determine features comprised by the samples of the isolated forest model based on parameters of the trained supervised learning model.
Fig. 11 illustrates an apparatus 1100 for predicting the abnormality of an object through an isolated forest model based on federal learning, the participants of the federal learning including a calculator and at least two data parties, a first data party of the at least two data parties having recorded in its device a first feature and a split value of a first node in a first tree of the model. The apparatus is deployed in the device of the first data party and includes:
a first receiving unit 111 configured to receive an object identification of a first object from the device of the computing side;
an obtaining unit 112 configured to obtain a feature value of a first feature of the first object from a local area based on a first feature of a first node stored locally;
a dividing unit 113 configured to divide the first object at a first node based on a locally stored feature value of a first feature of the first object and a split value of the first node;
a first sending unit 114 configured to send the result of the division to the device of the computing side, so as to be used for predicting the abnormality of the first object for business processing.
In one embodiment, the node depth of the second node in the first tree is recorded in the device of the first data side, the apparatus further includes a second receiving unit 115 configured to receive, from the device of the computing side, an identification of the second node into which the first object falls, and a second sending unit 116 configured to send the node depth of the second node to the device of the computing side.
Another aspect of the present specification provides a computer readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform any one of the above methods.
Another aspect of the present specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor implements any one of the above methods when executing the executable code.
According to the above scheme for constructing an isolated forest model based on federal learning and predicting abnormality using the model, an isolated forest model can be constructed jointly from the data of multiple data parties, and the abnormality of an object can be predicted jointly using the data of the multiple data parties and the data of the model, while the data of each data party is protected from leaking to the other parties. This expands the volume of data available for constructing the isolated forest model and increases the prediction accuracy of the model, while protecting the data security of each data party.
It is to be understood that the terms "first," "second," and the like, herein are used for descriptive purposes only and not for purposes of limitation, to distinguish between similar concepts.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (28)

1. A method for constructing an isolated forest model based on federated learning, wherein participants of the federated learning comprise a calculator and at least two data parties, the method is executed by equipment of the calculator, the at least two data parties comprise a first data party, the calculator is pre-stored with corresponding relations between m feature identifiers and the data parties, the m feature identifiers are respectively predetermined identifiers of m features, and the method comprises the following steps:
obtaining a plurality of sample identifications corresponding to a first node of a first tree in the model, wherein the plurality of sample identifications correspond to a plurality of samples respectively, and each sample comprises characteristic values of the m characteristics;
randomly selecting one feature identifier from the m feature identifiers;
in the case that the selected feature identifier is a first feature identifier, sending the identifier of the first node, the plurality of sample identifiers and the first feature identifier to a first data party based on a corresponding relationship between the locally stored first feature identifier and the first data party;
recording the corresponding relation between the first node and the first data party;
and receiving information respectively corresponding to the two child nodes of the first node from the first data side, so that the private data of each data side is protected and an isolated forest model is constructed for business processing.
2. The method of claim 1, the first node being a root node, wherein obtaining a plurality of sample identifications corresponding to the first node comprises obtaining N sample identifications, and randomly obtaining n sample identifications from the N sample identifications, where N > n.
3. The method according to claim 1, wherein the two child nodes include a second node, the information corresponding to the second node includes an indication that the second node is a leaf node, and the method further includes recording a correspondence between the identifier of the second node and the first data party.
4. The method of claim 3, wherein a third node is included in the two child nodes, and the information corresponding to the third node includes u sample identifiers assigned to the third node, wherein the u sample identifiers are a part of the plurality of sample identifiers.
5. The method of claim 1, wherein the at least one data party is at least one network platform, and the plurality of samples respectively correspond to a plurality of objects in the network platform.
6. The method of claim 5, wherein the object is any one of: consumer, transaction, merchant, commodity.
7. A method of constructing an isolated forest model based on federated learning, the participants of which include a calculator and at least two data parties, a first node being included in a first tree of the model, the method being performed by a device of a first data party of the at least two data parties, the device of the first data party storing feature values of a first feature of respective samples and a correspondence between the first feature and a predetermined first feature identifier, the method comprising:
receiving, from the device of the calculator, an identifier of the first node, a plurality of sample identifiers and the first feature identifier, wherein the plurality of sample identifiers correspond to a plurality of samples respectively;
randomly selecting one feature value from the feature values of the first feature of the plurality of samples as a split value of the first node, based on the locally stored correspondence between the first feature identifier and the first feature;
recording the correspondence between the first node, the first feature, and the split value;
grouping the plurality of samples based on the split value to construct two child nodes of the first node;
respectively determining whether the two child nodes are leaf nodes;
and sending, based on the grouping and the determined result, information respectively corresponding to the two child nodes to the device of the calculator, so that an isolated forest model is constructed for business processing while the private data of each data party is protected.
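A minimal sketch of the data party's split step in claim 7, with hypothetical names; the claim does not fix a leaf criterion, so a simple size threshold is assumed here for illustration:

```python
import random

# Illustrative sketch of the data party's split step (claim 7). The data
# party holds the raw feature values; per the claim it picks one of the
# samples' feature values at random as the split value and records it.
def split_node(node_id, sample_ids, values, split_record, max_leaf_size=1):
    """values: sample_id -> feature value, held only by this data party."""
    split = random.choice([values[s] for s in sample_ids])
    split_record[node_id] = split  # record node -> split value correspondence
    left = [s for s in sample_ids if values[s] < split]
    right = [s for s in sample_ids if values[s] >= split]
    # a child holding at most max_leaf_size samples is treated as a leaf
    return (left, len(left) <= max_leaf_size), (right, len(right) <= max_leaf_size)
```

The two returned pairs correspond to the per-child information sent back to the calculator: the assigned sample identifiers and whether the child is a leaf.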
8. The method of claim 7, wherein a second node is included in the two child nodes, wherein the information corresponding to the second node includes that the second node is a leaf node, the method further comprising calculating and storing a node depth of the second node.
9. A method for predicting abnormalities of objects through an isolated forest model based on federated learning, participants of the federated learning including a calculator and at least two data parties, wherein a tree structure of a first tree in the model and the data parties corresponding to respective nodes in the first tree are stored in a device of the calculator, the method being performed by the device of the calculator and comprising:
acquiring an object identifier of a first object;
sending the object identification to each data party;
receiving, from the device of each data party, at least one result of partitioning the first object at the at least one non-leaf node corresponding to that data party;
determining a first leaf node into which the first object falls based on the tree structure of the first tree and the results of partitioning the first object at the respective non-leaf nodes received from the devices of the at least two data parties;
based on the data parties corresponding to the leaf nodes in the first tree, sending the identification of the first leaf node to a first data party corresponding to the first leaf node;
receiving a node depth of the first leaf node from the first data party;
predicting an abnormality of the first object based on the node depth for business processing.
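Claims 9 and 13 return the leaf's node depth but leave the mapping from depth to anomaly unspecified. The standard isolation-forest score of Liu et al., in which a shallower average leaf depth across the trees means the object was easier to isolate and is therefore more anomalous, is one natural choice (an assumption, not part of the claims):

```python
import math

# Standard isolation forest scoring (assumed; the claims do not fix it).
# c(n) is the average path length of an unsuccessful search in a binary
# search tree over n samples; scores near 1 indicate anomalies.
def c(n):
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def anomaly_score(leaf_depths, n_samples):
    """leaf_depths: node depth of the leaf the object fell into, per tree."""
    mean_depth = sum(leaf_depths) / len(leaf_depths)
    return 2.0 ** (-mean_depth / c(n_samples))
```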
10. The method of claim 9, further comprising obtaining, based on the prediction result of the first object, training samples for training a supervised learning model.
11. The method of claim 10, further comprising optimizing sample features of the isolated forest model based on parameters of the trained supervised learning model.
12. A method of predicting abnormalities of an object through an isolated forest model based on federated learning, participants of the federated learning including a calculator and at least two data parties, a first data party of the at least two data parties having recorded in its device a first feature of a first node in a first tree of the model and a split value of the first node, the method being executed by the device of the first data party and comprising:
receiving an object identification of a first object from the device of the calculator;
acquiring, locally, a feature value of the first feature of the first object based on the locally stored first feature of the first node;
partitioning the first object at the first node based on the locally stored feature value of the first feature of the first object and the split value of the first node;
and sending the result of the partitioning to the device of the calculator for predicting an abnormality of the first object for business processing.
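The data party's prediction-time step in claim 12 reduces to a single comparison against its locally stored split value; only the resulting direction, never the private feature value, is returned. A sketch with hypothetical names:

```python
# Illustrative sketch of claim 12: the data party compares its private
# feature value for the object against the node's recorded split value
# and reveals only the resulting direction to the calculator.
def partition(node_id, object_id, feature_values, split_values):
    """feature_values: object_id -> value; split_values: node_id -> split."""
    goes_left = feature_values[object_id] < split_values[node_id]
    return (node_id, "left" if goes_left else "right")
```

The calculator collects these directions from all data parties and walks the stored tree structure to find the leaf, per claim 9.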
13. The method of claim 12, wherein a node depth of a second node in the first tree is recorded in the device of the first data party, the method further comprising receiving, from the device of the calculator, an identification of the second node into which the first object falls, and sending the node depth of the second node to the device of the calculator.
14. An apparatus for constructing an isolated forest model based on federated learning, wherein participants of the federated learning include a calculator and at least two data parties, the at least two data parties including a first data party, the apparatus is deployed in a device of the calculator, correspondences between m feature identifiers and the data parties are stored in advance in the device of the calculator, and the m feature identifiers are respectively predetermined identifiers of m features, the apparatus comprising:
an obtaining unit configured to obtain a plurality of sample identifiers corresponding to a first node of a first tree in the model, the plurality of sample identifiers corresponding to a plurality of samples, respectively, each sample including feature values of the m features;
a selecting unit configured to randomly select one feature identifier from the m feature identifiers;
a sending unit, configured to, in a case where the selected feature identifier is a first feature identifier, send the identifier of the first node, the plurality of sample identifiers, and the first feature identifier to a first data party based on a correspondence relationship between the locally stored first feature identifier and the first data party;
a first recording unit configured to record the correspondence between the first node and the first data party;
a receiving unit configured to receive information corresponding to the two child nodes of the first node from the first data party, respectively, so as to construct an isolated forest model for business processing.
15. The apparatus of claim 14, the first node being a root node, wherein the obtaining unit is further configured to obtain N sample identities and randomly select n sample identities from the N sample identities, where N > n.
16. The apparatus according to claim 14, wherein the two child nodes include a second node, the information corresponding to the second node includes an indication that the second node is a leaf node, and the apparatus further includes a second recording unit configured to record a correspondence between the identifier of the second node and the first data party.
17. The apparatus of claim 16, wherein a third node is included in the two child nodes, and the information corresponding to the third node includes u sample identifiers assigned to the third node, wherein the u sample identifiers are a part of the plurality of sample identifiers.
18. The apparatus of claim 14, wherein the at least one data party is at least one network platform, and the plurality of samples respectively correspond to a plurality of objects in the network platform.
19. The apparatus of claim 18, wherein the object is any one of: consumer, transaction, merchant, commodity.
20. An apparatus for constructing an isolated forest model based on federated learning, wherein the participants of the federated learning include a calculator and at least two data parties, a first tree of the model includes a first node, the apparatus is deployed in a device of a first data party of the at least two data parties, and the device of the first data party stores feature values of a first feature of respective samples and a correspondence between the first feature and a predetermined first feature identifier, the apparatus comprising:
a receiving unit configured to receive, from the device of the computing side, an identifier of a first node, a plurality of sample identifiers, and a first feature identifier, where the plurality of sample identifiers correspond to a plurality of samples, respectively;
a selecting unit configured to randomly select one feature value from feature values of first features of the plurality of samples as a split value of a first node based on a correspondence relationship between a locally stored first feature identifier and the first feature;
a recording unit configured to record a correspondence relationship between the first node and the first feature and the split value;
a grouping unit configured to group the plurality of samples based on the split values to construct two child nodes of the first node;
a determining unit configured to determine whether the two child nodes are leaf nodes, respectively;
a sending unit configured to send information corresponding to the two child nodes, respectively, to the device of the computation side based on the grouping and the result of the determination, thereby constructing an isolated forest model for business processing.
21. The apparatus according to claim 20, wherein the two child nodes include a second node, wherein the information corresponding to the second node includes that the second node is a leaf node, the apparatus further comprising a calculation unit configured to calculate and store a node depth of the second node.
22. An apparatus for predicting object abnormalities through an isolated forest model based on federated learning, wherein participants of the federated learning include a calculator and at least two data parties, a tree structure of a first tree in the model and the data parties corresponding to nodes in the first tree are stored in a device of the calculator, the apparatus is deployed in the device of the calculator, and comprises:
a first acquisition unit configured to acquire an object identification of a first object;
the first sending unit is configured to send the object identification to each data party;
a first receiving unit configured to receive, from the device of each data party, at least one result of partitioning the first object at the at least one non-leaf node corresponding to that data party;
a first determining unit configured to determine a first leaf node into which the first object falls, based on a tree structure of a first tree and a result of dividing the first object at each non-leaf node from the at least two data side devices;
a second sending unit, configured to send, based on the data parties corresponding to the leaf nodes in the first tree, the identifier of the first leaf node to the first data party corresponding to the first leaf node;
a second receiving unit configured to receive the node depth of the first leaf node from the first data side;
a prediction unit configured to predict an abnormality of the first object based on the node depth for business processing.
23. The apparatus of claim 22, further comprising a second obtaining unit configured to obtain, based on the prediction result of the first object, training samples for training a supervised learning model.
24. The apparatus of claim 23, further comprising a second determining unit configured to determine features comprised by the samples of the isolated forest model based on parameters of the trained supervised learning model.
25. An apparatus for predicting abnormalities of an object through an isolated forest model based on federated learning, the participants of the federated learning including a calculator and at least two data parties, a first data party of the at least two data parties having recorded in its device a first feature of a first node in a first tree of the model and a split value of the first node, the apparatus being deployed in the device of the first data party and comprising:
a first receiving unit configured to receive an object identification of a first object from the device of the computing side;
an acquisition unit configured to acquire, locally, a feature value of the first feature of the first object based on the locally stored first feature of the first node;
a partitioning unit configured to partition the first object at the first node based on the locally stored feature value of the first feature of the first object and the split value of the first node;
a first sending unit configured to send the result of the partitioning to the device of the calculator for predicting an abnormality of the first object for business processing.
26. The apparatus of claim 25, wherein a node depth of a second node in the first tree is recorded in the device of the first data party, the apparatus further comprising a second receiving unit configured to receive, from the device of the computing party, an identification of the second node into which the first object falls, and a second transmitting unit configured to transmit the node depth of the second node to the device of the computing party.
27. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-13.
28. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-13.
CN201911288850.5A 2019-12-12 2019-12-12 Isolated forest model construction and prediction method and device based on federal learning Active CN110991552B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202110462961.4A CN113065610B (en) 2019-12-12 2019-12-12 Isolated forest model construction and prediction method and device based on federal learning
CN201911288850.5A CN110991552B (en) 2019-12-12 2019-12-12 Isolated forest model construction and prediction method and device based on federal learning
TW109115727A TWI780433B (en) 2019-12-12 2020-05-12 A method and device for constructing and predicting an isolated forest model based on federated learning
PCT/CN2020/118009 WO2021114821A1 (en) 2019-12-12 2020-09-27 Isolation forest model construction and prediction method and device based on federated learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911288850.5A CN110991552B (en) 2019-12-12 2019-12-12 Isolated forest model construction and prediction method and device based on federal learning

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110462961.4A Division CN113065610B (en) 2019-12-12 2019-12-12 Isolated forest model construction and prediction method and device based on federal learning

Publications (2)

Publication Number Publication Date
CN110991552A CN110991552A (en) 2020-04-10
CN110991552B true CN110991552B (en) 2021-03-12

Family

ID=70093746

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110462961.4A Active CN113065610B (en) 2019-12-12 2019-12-12 Isolated forest model construction and prediction method and device based on federal learning
CN201911288850.5A Active CN110991552B (en) 2019-12-12 2019-12-12 Isolated forest model construction and prediction method and device based on federal learning

Country Status (3)

Country Link
CN (2) CN113065610B (en)
TW (1) TWI780433B (en)
WO (1) WO2021114821A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065610B (en) * 2019-12-12 2022-05-17 支付宝(杭州)信息技术有限公司 Isolated forest model construction and prediction method and device based on federal learning
CN111695675B (en) * 2020-05-14 2024-05-07 平安科技(深圳)有限公司 Federal learning model training method and related equipment
CN112231768B (en) * 2020-10-27 2021-06-18 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN112529102B (en) * 2020-12-24 2024-03-12 深圳前海微众银行股份有限公司 Feature expansion method, device, medium and computer program product
CN113807544B (en) * 2020-12-31 2023-09-26 京东科技控股股份有限公司 Training method and device of federal learning model and electronic equipment
CN112862057B (en) * 2021-04-07 2023-11-03 京东科技控股股份有限公司 Modeling method, modeling device, electronic equipment and readable medium
CN113420072B (en) * 2021-06-24 2024-04-05 深圳前海微众银行股份有限公司 Data processing method, device, equipment and storage medium
CN113537361B (en) * 2021-07-20 2024-04-02 同盾科技有限公司 Cross-sample feature selection method in federal learning system and federal learning system
CN113554182B (en) * 2021-07-27 2023-09-19 西安电子科技大学 Detection method and system for Bayesian court node in transverse federal learning system
CN113408668A (en) * 2021-07-30 2021-09-17 深圳前海微众银行股份有限公司 Decision tree construction method and device based on federated learning system and electronic equipment
CN113723477B (en) * 2021-08-16 2024-04-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113506163B (en) * 2021-09-07 2021-11-23 百融云创科技股份有限公司 Isolated forest training and predicting method and system based on longitudinal federation
CN114611616B (en) * 2022-03-16 2023-02-07 吕少岚 Unmanned aerial vehicle intelligent fault detection method and system based on integrated isolated forest
CN114785810B (en) * 2022-03-31 2023-05-16 海南师范大学 Tree-like broadcast data synchronization method suitable for federal learning
TWI812293B (en) * 2022-06-20 2023-08-11 英業達股份有限公司 Fedrated learning system and method using data digest
CN114996749B (en) * 2022-08-05 2022-11-25 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning
TWI807961B (en) * 2022-08-11 2023-07-01 財團法人亞洲大學 Multi-layer federated learning system and methodology based on distributed clustering
CN115907029B (en) * 2022-11-08 2023-07-21 北京交通大学 Method and system for defending against federal learning poisoning attack
CN115766282B (en) * 2022-12-12 2024-05-24 张家港金典软件有限公司 Data processing method and system for enterprise information security supervision
TWI829558B (en) * 2023-03-17 2024-01-11 英業達股份有限公司 Fedrated learning system and method protecting data digest
CN117077067B (en) * 2023-10-18 2023-12-22 北京亚康万玮信息技术股份有限公司 Information system automatic deployment planning method based on intelligent matching

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110191110A (en) * 2019-05-20 2019-08-30 山西大学 Social networks exception account detection method and system based on network representation study

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10346981B2 (en) * 2016-11-04 2019-07-09 Eric Kenneth Anderson System and method for non-invasive tissue characterization and classification
JP6782679B2 (en) * 2016-12-06 2020-11-11 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Information processing equipment, information processing methods and programs
CN112182578A (en) * 2017-10-24 2021-01-05 创新先进技术有限公司 Model training method, URL detection method and device
US10893466B2 (en) * 2017-10-27 2021-01-12 LGS Innovations LLC Rogue base station router detection with statistical algorithms
US11494667B2 (en) * 2018-01-18 2022-11-08 Google Llc Systems and methods for improved adversarial training of machine-learned models
JP6879239B2 (en) * 2018-03-14 2021-06-02 オムロン株式会社 Anomaly detection system, support device and model generation method
US10685159B2 (en) * 2018-06-27 2020-06-16 Intel Corporation Analog functional safety with anomaly detection
CN109002861B (en) * 2018-08-10 2021-11-09 深圳前海微众银行股份有限公司 Federal modeling method, device and storage medium
CN109299728B (en) * 2018-08-10 2023-06-27 深圳前海微众银行股份有限公司 Sample joint prediction method, system and medium based on construction of gradient tree model
CN109684311A (en) * 2018-12-06 2019-04-26 中科恒运股份有限公司 Abnormal deviation data examination method and device
CN109859029A (en) * 2019-01-04 2019-06-07 深圳壹账通智能科技有限公司 Abnormal application detection method, device, computer equipment and storage medium
CN109902721A (en) * 2019-01-28 2019-06-18 平安科技(深圳)有限公司 Outlier detection model verification method, device, computer equipment and storage medium
US10430727B1 (en) * 2019-04-03 2019-10-01 NFL Enterprises LLC Systems and methods for privacy-preserving generation of models for estimating consumer behavior
CN110084377B (en) * 2019-04-30 2023-09-29 京东城市(南京)科技有限公司 Method and device for constructing decision tree
CN110414555B (en) * 2019-06-20 2023-10-03 创新先进技术有限公司 Method and device for detecting abnormal sample
CN110309587B (en) * 2019-06-28 2024-01-16 京东城市(北京)数字科技有限公司 Decision model construction method, decision method and decision model
CN110363305B (en) * 2019-07-17 2023-09-26 深圳前海微众银行股份有限公司 Federal learning method, system, terminal device and storage medium
CN110517154A (en) * 2019-07-23 2019-11-29 平安科技(深圳)有限公司 Data model training method, system and computer equipment
CN113065610B (en) * 2019-12-12 2022-05-17 支付宝(杭州)信息技术有限公司 Isolated forest model construction and prediction method and device based on federal learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110191110A (en) * 2019-05-20 2019-08-30 山西大学 Social networks exception account detection method and system based on network representation study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"AI and Data Privacy Protection: Federated Learning as the Way to Crack the Problem" (AI与数据隐私保护：联邦学习的破解之道); Yang Qiang (杨强); Journal of Information Security Research (信息安全研究); Nov. 30, 2019; vol. 5, no. 11; pp. 961-965 *
"Analysis of Relationship between Personal Factors and Visiting Places using Random Forest Technique";Young Myung Kim等;《2019 Federated Conference on Computer Science and Information Systems (FedCSIS)》;20191007;第809-812页 *

Also Published As

Publication number Publication date
CN110991552A (en) 2020-04-10
TW202123050A (en) 2021-06-16
WO2021114821A1 (en) 2021-06-17
CN113065610A (en) 2021-07-02
CN113065610B (en) 2022-05-17
TWI780433B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN110991552B (en) Isolated forest model construction and prediction method and device based on federal learning
CN109922032B (en) Method, device, equipment and storage medium for determining risk of logging in account
CN108428132B (en) Fraud transaction identification method, device, server and storage medium
Da Rocha et al. Identifying bank frauds using CRISP-DM and decision trees
CN110992167A (en) Bank client business intention identification method and device
JP2018077597A (en) Security measure planning support system and method
CN110795603B (en) Prediction method and device based on tree model
CN111460312A (en) Method and device for identifying empty-shell enterprise and computer equipment
CN111506922B (en) Method and device for carrying out significance check on private data by multi-party union
CN112750030B (en) Risk pattern recognition method, apparatus, device and computer readable storage medium
CN114492605A (en) Federal learning feature selection method, device and system and electronic equipment
CN110874638B (en) Behavior analysis-oriented meta-knowledge federation method, device, electronic equipment and system
CN113723477A (en) Cross-feature federal abnormal data detection method based on isolated forest
CN109711849B (en) Ether house address portrait generation method and device, electronic equipment and storage medium
CN110213094B (en) Method and device for establishing threat activity topological graph and storage equipment
CN109993338B (en) Link prediction method and device
CN114417394A (en) Block chain-based data storage method, device, equipment and readable storage medium
CN112597379B (en) Data identification method and device, storage medium and electronic device
US20140297662A1 (en) Systems and methods for partial workflow matching
KR102271983B1 (en) Virtual currency trading platform server providing customized virtual currency recommendation through machine learning based on customer information and purchase history of virtual currency and operating method thereof
CN113592529A (en) Potential customer recommendation method and device for bond products
Manavalan et al. Visualizing the Impact of Cyberattacks on Web-Based Transactions on Large-Scale Data and Knowledge-Based Systems
CN113159926A (en) Loan transaction repayment date determination method and device
CN112308404B (en) Project risk management method and device, electronic equipment and storage medium
Sani et al. Alignment Approximator: A ProM Plug-In to Approximate Conformance Statistics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40026939

Country of ref document: HK

GR01 Patent grant