CN114418120A - Data processing method, apparatus, device and storage medium for a federated tree model - Google Patents

Data processing method, apparatus, device and storage medium for a federated tree model

Info

Publication number
CN114418120A
CN114418120A
Authority
CN
China
Prior art keywords
feature
anonymous
tree model
node
sample
Prior art date
Legal status
Pending
Application number
CN202210080616.9A
Other languages
Chinese (zh)
Inventor
陈伟敬
马国强
范涛
徐倩
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202210080616.9A
Publication of CN114418120A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/02 Banking, e.g. interest calculation or account maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Software Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Technology Law (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method, apparatus, and device for a federated tree model, applied to a first participant device. The method includes: obtaining a first node route of a target node in the federated tree model, where the target node corresponds to at least two first features provided by the first participant device; receiving at least one anonymous feature sent by a second participant device and a second node route corresponding to each anonymous feature; simulating the federated tree model based on the first node routes and the second node routes to obtain a pseudo-federated tree model corresponding to the federated tree model, and predicting, through the pseudo-federated tree model, the feature subsets included in the feature set of a training sample of the federated tree model to obtain corresponding predicted values; and determining, by combining the predicted values with the target prediction result, the contribution information of each feature in the feature set to the target prediction result. With the method and apparatus, the contribution information of each feature provided by each participant in a sample can be determined quickly and accurately.

Description

Data processing method, apparatus, device and storage medium for a federated tree model
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a data processing method and apparatus for a federated tree model, an electronic device, a computer-readable storage medium, and a computer program product.
Background
As data privacy protection is gradually strengthened across industries, federated learning, a technology that allows machine learning models to be built cooperatively over multi-party data while protecting data privacy, has become a focus of cooperation among enterprises and industries. The vertical (feature-partitioned) tree model is now widely used in vertical federated scenarios and has become one of the common and powerful algorithms in the fields of finance and risk control.
In finance and risk control, it is often necessary to determine the influence of each feature in a single sample on the output of the federated tree model. For example, for a specific sample (say, a defaulting customer), it is necessary to determine which features, and which feature values, had an important influence on the decision that the user is a defaulting user. In addition, there is a need to determine the impact of partner-provided features on the model output.
Related federated tree model interpretation schemes interpret the tree model as a whole by computing feature importance, and cannot specifically interpret the contribution of each feature in a single sample. Moreover, although feature importance reveals how many times a partner's features are used, it does not reveal whether the partner features' influence on the model output is positive or negative, and determining feature contribution information requires a large amount of model computation.
Disclosure of Invention
The embodiment of the application provides a data processing method and device of a federated tree model, an electronic device, a computer-readable storage medium and a computer program product, which can quickly and accurately determine contribution information of each feature provided by each participant in a sample.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a data processing method of a federated tree model, which is based on a federated learning system, wherein the federated learning system comprises a first participant device and at least one second participant device, and the method is applied to the first participant device and comprises the following steps:
obtaining a first node route of a target node in the federated tree model, wherein the target node corresponds to at least two first features provided by the first participant device;
receiving at least one anonymous feature sent by the second participant device and a second node route corresponding to each anonymous feature;
the anonymous feature corresponds to a second feature used for training the federated tree model, and the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
simulating the federated tree model based on the first node route and the second node route to obtain a pseudo-federated tree model corresponding to the federated tree model, and predicting, through the pseudo-federated tree model, feature subsets included in the feature set of a training sample of the federated tree model to obtain corresponding predicted values;
wherein the feature set includes: the at least two first features carrying a target prediction result, and the second features provided by the at least one second participant device;
and determining contribution information of each feature in the feature set corresponding to the target prediction result by combining the prediction value and the target prediction result.
An embodiment of the present application provides a data processing apparatus of a federated tree model, including:
an obtaining module, configured to obtain a first node route of a target node in the federated tree model, where the target node corresponds to at least two first features provided by the first participant device;
the receiving module is used for receiving at least one anonymous feature sent by the second participant device and a second node route corresponding to each anonymous feature; the anonymous feature corresponds to a second feature used for training the federated tree model, and the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
the simulation module is used for simulating the federated tree model based on the first node route and the second node route to obtain a pseudo-federated tree model corresponding to the federated tree model, and predicting, through the pseudo-federated tree model, feature subsets included in the feature set of a training sample of the federated tree model to obtain corresponding predicted values; wherein the feature set includes: the at least two first features carrying a target prediction result, and the second features provided by the at least one second participant device;
and the determining module is used for determining contribution information of each feature in the feature set corresponding to the target prediction result by combining the prediction value and the target prediction result.
In the above scheme, the receiving module is further configured to send an anonymous feature obtaining request carrying a sample identifier to the second participant device;
the anonymous feature obtaining request is used for the second participant device to determine, in response to the request, the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature, and the second node route corresponding to the anonymous feature;
and to receive the anonymous feature corresponding to the sample identifier and the second node route corresponding to the anonymous feature, both returned by the second participant device.
In the above scheme, the receiving module is further configured to create an anonymous relation record table locally, where the anonymous relation record table is used to record a sample identifier of a training sample used for training the federated tree model, an anonymous feature of the second participant device corresponding to the sample identifier, and a second node route corresponding to the anonymous feature;
and storing the received anonymous features corresponding to the sample identifier and the second node routes corresponding to the anonymous features in the anonymous relation record table.
In the foregoing solution, the determining module is further configured to select at least two features from the at least two first features provided by the first participant device and the at least one second feature provided by the second participant device;
constructing a feature interaction group comprising the at least two features, and determining at least one interaction marginal contribution value corresponding to the feature interaction group;
and determining the interaction contribution information of the feature interaction group corresponding to the target prediction result based on the at least one interaction marginal contribution value.
In the foregoing solution, the determining module is further configured to determine a first feature subset of the feature set, where the first feature subset includes at least one of the at least two features;
determining a second feature subset of the feature set, the second feature subset being in a complementary relationship with the feature interaction group;
and obtaining a predicted value corresponding to the first feature subset and a predicted value corresponding to the second feature subset, and determining an interaction marginal contribution value corresponding to the feature interaction group based on the predicted value corresponding to the first feature subset and the predicted value corresponding to the second feature subset.
In the foregoing solution, the determining module is further configured to, when there are multiple interaction marginal contribution values, sum the multiple interaction marginal contribution values to obtain the interaction contribution information of the feature interaction group corresponding to the target prediction result. A code sketch of this computation follows.
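As an illustrative sketch only: the subset enumeration below follows the scheme above (add the feature interaction group as a unit to each disjoint feature subset, i.e., the complementary second feature subset, and compare predicted values), while the weighting is left as a parameter because this excerpt does not fix it. The callable f_x stands in for the estimation function described later in the specification; all names are hypothetical.

```python
from itertools import combinations

def interaction_contribution(group, features, f_x, weight_fn):
    """Sum the interaction marginal contribution values of a feature
    interaction group: for each feature subset S disjoint from the group,
    compare the predicted value of S with and without the whole group."""
    others = [f for f in features if f not in group]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            marginal = f_x(s | frozenset(group)) - f_x(s)
            total += weight_fn(len(s), len(features)) * marginal
    return total
```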
The embodiment of the present application further provides a data processing method of a federated tree model, which is based on a federated learning system, where the federated learning system includes a first participant device and at least one second participant device, and the method is applied to the second participant device, and includes:
for each second feature used for training the federated tree model, generating an anonymous feature corresponding to the second feature, and obtaining the second node route corresponding to each anonymous feature;
the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
sending anonymous features corresponding to each of the second features and second node routes corresponding to each of the anonymous features to the first participant device;
the second node route is used for the first participant device to obtain a pseudo-federated tree model corresponding to the federated tree model based on the second node route, and to determine, through the pseudo-federated tree model, the contribution information of each feature in the feature set to the target prediction result;
wherein the feature set of a training sample of the federated tree model includes: at least two first features, carrying a target prediction result, provided by the first participant device, and the at least one second feature.
An embodiment of the present application further provides a data processing apparatus of a federated tree model, including:
a generating module, configured to generate, for each second feature used for training the federated tree model, an anonymous feature corresponding to the second feature, and to obtain the second node route corresponding to each anonymous feature; the second node route is used for indicating the child-node path of a split node when the anonymous feature serves as the split node of the federated tree model;
a sending module, configured to send the anonymous features corresponding to the second features, and the second node routes corresponding to the anonymous features, to the first participant device; the second node route is used for the first participant device to obtain a pseudo-federated tree model corresponding to the federated tree model based on the second node route, and to determine, through the pseudo-federated tree model, the contribution information of each feature in the feature set to the target prediction result; wherein the feature set of a training sample of the federated tree model includes: at least two first features, carrying a target prediction result, provided by the first participant device, and the at least one second feature.
In the above scheme, the sending module is further configured to receive an anonymous feature obtaining request carrying a sample identifier sent by the first participant device;
parsing the anonymous feature obtaining request to obtain the sample identifier;
determining the second feature corresponding to the sample identifier, an anonymous feature corresponding to the second feature, and a second node route corresponding to the anonymous feature;
sending the anonymous feature and the second node route to the first participant device.
In the above scheme, the sending module is further configured to search a local anonymous relation record table according to the sample identifier, so as to obtain the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature, and the second node route corresponding to the anonymous feature; the anonymous relation record table is used to record the sample identifier of a training sample used for training the federated tree model, the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature, and the second node route corresponding to the anonymous feature.
In the foregoing scheme, the generating module is further configured to perform hash processing on each second feature used for training the federated tree model, to obtain a hash value corresponding to each second feature, and use the hash value as an anonymous feature corresponding to the second feature.
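A minimal sketch of the hash-based anonymization described above; the salt parameter and the name prefix are assumptions for illustration, not part of the scheme.

```python
import hashlib

def make_anonymous_feature(second_feature_name: str, salt: str = "") -> str:
    """Derive an anonymous feature identifier from a Host-side second feature
    by one-way hashing, so the Guest party never sees the real feature name."""
    digest = hashlib.sha256((salt + second_feature_name).encode("utf-8")).hexdigest()
    return "anon_" + digest[:16]

# Example: the Host party anonymizes its second features before sending them.
anonymous = {name: make_anonymous_feature(name) for name in ("H_1", "H_2", "H_3")}
```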
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
a processor, configured to implement the data processing method of the federated tree model provided in the embodiments of the present application when executing the executable instructions stored in the memory.
The embodiment of the present application provides a computer-readable storage medium storing executable instructions for causing a processor to execute the data processing method of the federated tree model provided in the embodiments of the present application.
The embodiment of the present application provides a computer program product, which includes a computer program that, when executed by a processor, implements the data processing method of the federated tree model provided in the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
Compared with related techniques that interpret the tree model as a whole by computing feature importance, the first participant device in the embodiments of the present application simulates the federated tree model by combining the first features in the sample, the first node routes corresponding to the first features, the anonymous features corresponding to the second features obtained from the second participant device, and the second node routes corresponding to the anonymous features, to obtain a pseudo-federated tree model. The pseudo-federated tree model can thus be constructed directly and locally on the first participant device, reducing the number of communications among participant devices. Finally, the contribution information of each feature in the feature set to the target prediction result is determined by combining each predicted value with the target prediction result of the sample, so that the contribution information of the features provided by each participant in a single sample can be accurately measured.
Drawings
FIG. 1 is an architectural diagram of a data processing system of the federated tree model provided in an embodiment of the present application;
FIGS. 2A-2B are schematic structural diagrams of an electronic device provided in an embodiment of the present application;
FIG. 3 is a flow chart illustrating a data processing method of a federated tree model provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for obtaining an anonymous feature and a second node route according to an embodiment of the present application;
FIG. 5 is a diagram of a pseudo federated tree model provided by an embodiment of the present application;
FIG. 6 is a flowchart of a method for constructing a pseudo federated tree model provided in an embodiment of the present application;
FIG. 7 is a flowchart of a method for constructing a pseudo federated tree model provided in an embodiment of the present application;
FIG. 8 is a diagram illustrating examples of information about the contribution of features to a predicted result according to an embodiment of the present application;
FIG. 9 is a diagram of a tree model in a non-federated learning scenario as provided by an example of the present application;
FIG. 10 is a schematic diagram of a method for determining a predicted value by a pseudo federated tree model provided in an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating a method for determining a predicted value of a feature subset according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a specific method for determining a predicted value by the pseudo-federated tree model according to an embodiment of the present application;
FIG. 13 is a flowchart of a method for determining interaction contribution information according to an embodiment of the present application;
FIG. 14 is a flowchart of a method for determining interaction contribution information according to an embodiment of the present application;
FIG. 15 is a flowchart illustrating a data processing method of a federated tree model provided in an embodiment of the present application;
FIG. 16 is a schematic diagram of an anonymous feature sending method provided in an embodiment of the present application;
FIG. 17 is a code segment diagram of a tree model estimation function provided by an embodiment of the present application;
FIG. 18 is a flowchart of a data processing method of the federated tree model provided in an embodiment of the present application;
FIG. 19 is a schematic diagram of a pseudo tree model construction method provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Where the term "first/second/third" appears in the specification, it is used merely to distinguish similar items and does not indicate a particular ordering of the items. It is to be understood that "first/second/third" may be interchanged in a particular order or sequence where appropriate, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments of the present application are explained; the following interpretations apply to these terms.
1) Decision Tree: a machine learning method with a tree structure, in which each internal node represents a test on an attribute, each branch represents an output of the test result, and each leaf node represents a classification result.
2) Shapley value (SHAP, SHapley Additive exPlanations): a model-agnostic interpretable analysis method based on cooperative game theory. Each prediction record has a corresponding SHAP value, and each feature also has a corresponding SHAP value. A SHAP value greater than 0 indicates that the current feature in the current sample pushes the model prediction result in the positive direction; otherwise it pushes it in the negative direction.
The embodiment of the application provides a data processing method and device of a federated tree model, an electronic device, a computer readable storage medium and a computer program product, which can quickly and accurately determine contribution information of each feature in a sample.
Based on the above explanations of the terms involved in the embodiments of the present application, the data processing system of the federated tree model provided in the embodiments of the present application is described first. Referring to fig. 1, fig. 1 is a schematic structural diagram of the data processing system of the federated tree model provided in an embodiment of the present application. In the data processing system 100 of the federated tree model, a first participant device 400 and second participant devices 410 (two second participant devices are shown by way of example, denoted 410-1 and 410-2 for distinction) are connected to each other through a network 300, and may also be connected to a server device through the network 300. The network 300 may be a wide area network or a local area network, or a combination of the two, and data transmission may be implemented using a wireless link.
In some embodiments, the first participant device 400 and the second participant device 410 are interconnected via the network 300, while third party devices (collaborators, servers, etc.) that may be involved in the federated tree model may be connected via the network 300.
In some embodiments, the first participant device 400 and the second participant device 410 may be, but are not limited to, a laptop computer, a tablet computer, a desktop computer, a smart phone, a dedicated messaging device, a portable gaming device, a smart speaker, a smart watch, etc., and may also be client terminals of federal learning participants, such as participant devices storing user characteristic data at various banks or financial institutions, etc. The third-party device may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like, and is used to assist each participant device in federal learning to obtain a federal learning model. The network 300 may be a wide area network or a local area network, or a combination of both. The first participant device 400 and the second participant device 410 may be directly or indirectly connected through wired or wireless communication, and the embodiments of the present application are not limited thereto.
A first participant device 400, configured to obtain a first node route of a target node in the federated tree model, where the target node corresponds to at least two first features provided by the first participant device; then receive at least one anonymous feature sent by the second participant device and a second node route corresponding to each anonymous feature, where the anonymous feature corresponds to a second feature used for training the federated tree model, and the second node route is used for indicating the child-node path of a split node when the anonymous feature serves as the split node of the federated tree model; simulate the federated tree model based on the first node routes and the second node routes to obtain a pseudo-federated tree model corresponding to the federated tree model, and predict, through the pseudo-federated tree model, the feature subsets included in the feature set of a training sample to obtain corresponding predicted values; and finally determine, by combining the predicted values with the target prediction result, the contribution information of each feature in the feature set to the target prediction result.
The first participant device 400 is further configured to send an anonymous feature obtaining request to the second participant device, so as to obtain an anonymous feature corresponding to the sample identifier and a second node route corresponding to the anonymous feature, which are returned by the second participant device based on the anonymous feature obtaining request.
The second participant device 410 is configured to generate, for each second feature used for training the federated tree model, an anonymous feature corresponding to the second feature, and obtain the second node route corresponding to each anonymous feature; and to send the anonymous features corresponding to the second features, and the second node routes corresponding to the anonymous features, to the first participant device.
Referring to fig. 2A-2B, fig. 2A-2B are schematic structural diagrams of an electronic device provided in an embodiment of the present application. In practical applications, the electronic device 500 may be implemented as the first participant device 400 or the second participant device 410 in fig. 1, to implement the data processing method of the federated tree model of the embodiments of the present application. The electronic device 500 shown in fig. 2A-2B includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It will be appreciated that the bus system 540 is used to enable connection and communication among these components. In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 540 in figs. 2A-2B.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the data processing apparatus of the federated tree model provided in the embodiments of the present application may be implemented in software. Fig. 2A illustrates a schematic structural diagram of the electronic device acting as the first participant device 400, with a data processing apparatus 555 of the federated tree model stored in the memory 550, which may be software in the form of programs and plug-ins, and includes the following software modules: an obtaining module 5551, a receiving module 5552, a simulating module 5553, and a determining module 5554. These modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules are explained below.
In some embodiments, as shown in fig. 2B, fig. 2B is a schematic structural diagram of the electronic device acting as the second participant device 410 according to an embodiment of the present application. The software modules stored in the data processing apparatus 555 of the federated tree model in the memory 550 may include: a generating module 5555 and a sending module 5556. These modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules are explained below.
In other embodiments, the data processing apparatus of the federated tree model provided in the embodiments of the present application may be implemented in hardware. As an example, it may be a processor in the form of a hardware decoding processor programmed to execute the data processing method of the federated tree model provided in the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The data processing method of the federated tree model provided in the embodiment of the present application will be described in conjunction with exemplary application and implementation of the first participant device provided in the embodiment of the present application. The data processing method of the federal tree model provided in the embodiment of the application is based on a federal learning system, wherein the federal learning system comprises a first participant device and at least one second participant device. Referring to fig. 3, fig. 3 is a schematic flow chart of a data processing method of the federal tree model provided in the embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
In step 101, a first participant device obtains a first node route of a target node in a federated tree model, where the target node corresponds to at least two first features provided by the first participant device.
In practical implementation, a vertical federated tree model typically involves at least two parties: the first party holds both data and labels (target prediction results) and is also referred to as the Guest party or the active party; the second party holds only data and is also referred to as the Host party or the passive party. The method provided in the embodiments of the present application is applicable to a vertical federated tree model in which one Guest party and at least one Host party participate.
Illustratively, taking one Guest party and one Host party cooperatively training the federated tree model as an example, in a single training sample the Guest party and the Host party each hold part of the features: the Guest party provides n first features G_1, G_2, G_3, …, G_n, where n ≥ 1 and n is an integer; the Host party provides m second features H_1, H_2, H_3, …, H_m, where m ≥ 1 and m is an integer. If there are k Host parties (k ≥ 1, k an integer), the number of features held by the k-th Host party can be denoted m_k. The feature set of a single training sample for one Guest party and one Host party thus includes the n first features and the m second features, and can be represented as {G_1, G_2, G_3, …, G_n, H_1, H_2, H_3, …, H_m}.
In a training sample of the federated tree model, the at least two first features {G_1, G_2, G_3, …, G_n} provided by the first participant device are known locally to the first participant device, which can therefore directly obtain the target nodes corresponding one-to-one to the first features in the federated tree model, together with the routing information of those target nodes (the first node routes).
In step 102, at least one anonymous feature sent by the second participant device and a second node route corresponding to each anonymous feature are received, wherein the anonymous feature corresponds to a second feature used for training the current federated tree model, and the second node route is used for indicating a child node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model.
In actual implementation, in order to protect the privacy and security of each participant's data, the first participant device cannot directly obtain information about the at least one second feature {H_1, H_2, H_3, …, H_m} provided by the second participant device. To obtain the contribution information of each second feature to the label carried by the training sample, the first participant device can locally determine the contribution information of each feature in the sample by obtaining, from the second participant device, the anonymous feature corresponding to each second feature and the second node route corresponding to that anonymous feature. The anonymous feature can be used locally in place of the second feature provided by the Host party; the first participant device does not need to know the specifics of the second feature, and only needs the anonymous feature corresponding to the second feature and the routing information (second node route) corresponding to the anonymous feature. That is, when the first participant device determines the contribution information of a second feature in the sample to be interpreted, it first receives the anonymous feature corresponding to the second feature and the second node route corresponding to the anonymous feature, both sent by the second participant device.
Following the above example, by obtaining the anonymous features corresponding to the m second features of the Host party, the feature set on the first participant device becomes {G_1, G_2, G_3, …, G_n, A_1_i, A_2_i, A_3_i, …, A_m_i}, where A_m_i denotes the m-th anonymous feature of the i-th Host party; when i = 1, it refers to the first Host party.
In some embodiments, referring to fig. 4, fig. 4 is a flowchart of a method for obtaining an anonymous feature and a second node route provided in an embodiment of the present application; the first participant device may obtain an anonymous feature and the second node route corresponding to the anonymous feature through the steps shown in fig. 4, which are described below.
Step 1021, the first participant device sends an anonymous feature obtaining request carrying the sample identifier to the second participant device.
In actual implementation, when the first participant device performs sample interpretation and a target node corresponding to a second feature is traversed in the federated tree model, an anonymous feature obtaining request may be sent to the second participant device, the request carrying a sample identifier. Note that the request may carry one or more sample identifiers; when there are multiple sample identifiers, the first participant device can interpret at least two samples in a batch, which improves sample interpretation efficiency.
In step 1022, the first participant device receives the anonymous feature corresponding to the sample identifier returned by the second participant device and the second node route corresponding to the anonymous feature.
In actual implementation, after sending the anonymous feature obtaining request to the second participant device, the first participant device may receive the anonymous feature corresponding to the sample identifier and the second node route corresponding to the anonymous feature, both returned by the second participant device.
In some embodiments, the first participant device may store the received anonymous features and the second node routes corresponding to the anonymous features as follows: the first participant device locally creates an anonymous relation record table, and stores in it the received anonymous features corresponding to the sample identifier and the second node routes corresponding to the anonymous features. Note that the anonymous relation record table local to the first participant device is used to record the sample identifier of a training sample used for training the federated tree model, the anonymous features of the second participant device corresponding to the sample identifier, and the second node routes corresponding to the anonymous features.
In practical implementation, the first participant device may create an anonymous relation record table locally for recording the anonymous feature corresponding to the sample identifier and the second node route corresponding to the anonymous feature; when the sample corresponding to the sample identifier is reused, the first participant device can read directly from the local anonymous relation record table, thereby reducing the number of communications with the second participant device.
Illustratively, the anonymous relation record table is stored locally in a file format, where a recorded entry may be { "sample identifier": xxx, "anonymous feature": xxx, "second node route": "left" }, and may also include a device identifier of the second participant device, such as { "sample identifier": xxx, "anonymous feature": xxx, "second node route": "left", "device identifier": host_x }.
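A minimal sketch of such a record table as an in-memory structure keyed by sample identifier; the field names mirror the illustrative entry above, the function name and example values are hypothetical, and file persistence is omitted.

```python
from collections import defaultdict

# Hypothetical anonymous relation record table: sample_id -> records from Hosts.
anon_record_table = defaultdict(list)

def store_record(sample_id, anonymous_feature, second_node_route, device_id):
    """Cache an anonymous feature and its second node route so that later
    interpretations of the same sample need no further Host communication."""
    anon_record_table[sample_id].append({
        "anonymous feature": anonymous_feature,
        "second node route": second_node_route,  # e.g. "left" or "right"
        "device identifier": device_id,
    })

store_record("sample_001", "anon_3f9c2d71", "left", "host_1")
print(anon_record_table["sample_001"])
```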
It should be noted that there is no strict order of execution between step 101 and step 102.
In step 103, the federated tree model is simulated based on the first node routes and the second node routes, and a pseudo-federated tree model corresponding to the federated tree model is obtained.
In some embodiments, the first participant device obtains the node route corresponding to each first feature in the sample; because the first features are provided by the first participant device itself, each first node route is known to it and can be obtained directly.
In actual implementation, in a federated tree model scenario, the information about the second features provided by the second participant device is unknown to the first participant device. The first participant device may simulate the federated tree model according to the known first node routes and the second node routes requested from the second participant device, locally obtaining a tree model as in a non-federated learning scenario; the simulated tree model can be regarded as a pseudo-federated tree model, or pseudo tree model, of the federated tree model. Interpreting the training sample with the locally created pseudo tree model means obtaining, within the sample, the feature contribution degrees of the first features provided by the first participant device and of the anonymous features corresponding to the second features provided by the second participant device. That is, the first participant device locally simulates the federated tree model according to the locally provided first features, the first node routes, the anonymous features corresponding to the second features obtained from the second participant device, and the second node routes corresponding to the anonymous features, thereby obtaining a new tree model for calculating the contribution information corresponding to each first feature and each anonymous feature.
Illustratively, referring to fig. 5, fig. 5 is a schematic diagram of the pseudo-federated tree model provided in an embodiment of the present application. Suppose the features of a training sample of the federated tree model include {income, age, height}, where "height" is provided by the second participant device. The node labeled "A_k_1" (number 1 in the figure) represents the second feature "height" provided by the second participant device in the original federated tree model; because the height feature in the sample is provided by the second participant device, it is replaced with the corresponding anonymous feature A_k_1. When traversing the pseudo tree model shown in the figure and the current node is the node labeled 1, the routing information at that node can be determined according to the second node route returned by the second participant.
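A sketch of how a node of the pseudo tree might be represented, assuming a simple binary-tree structure; the class and field names are illustrative, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PseudoTreeNode:
    feature: Optional[str] = None      # Guest feature name, or anonymous feature id
    threshold: Optional[float] = None  # split threshold; None for anonymous nodes
    route: Optional[str] = None        # cached second node route for anonymous nodes
    left: Optional["PseudoTreeNode"] = None
    right: Optional["PseudoTreeNode"] = None
    leaf_value: Optional[float] = None # set only on leaf nodes

def route_at(node: PseudoTreeNode, sample: dict) -> "PseudoTreeNode":
    """Choose the child to follow. A Guest feature is compared locally; an
    anonymous feature follows the stored second node route, so the Host's
    raw feature value is never needed."""
    if node.route is not None:  # anonymous (Host-provided) split node
        return node.left if node.route == "left" else node.right
    return node.left if sample[node.feature] < node.threshold else node.right
```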
The construction process of the pseudo-federated tree model is explained next. In some embodiments, referring to fig. 6, fig. 6 is a flowchart of a method for constructing the pseudo-federated tree model provided in an embodiment of the present application; based on fig. 3, step 103 may be implemented by steps 1031A to 1034A.
Step 1031A, the first participant device creates a model copy of the federated tree model, and traverses each node of the model copy from the root node.
In actual implementation, the pseudo-federated tree model locally constructed by the first participant device may be regarded as a variant of the target federated tree model. To improve the efficiency of creating the pseudo tree model, the pseudo-federated tree model simulating the federated tree model can be obtained by modifying a model copy of the federated tree model accordingly.
Step 1032A, when the traversed feature corresponding to the current node is the second feature, obtaining an anonymous feature corresponding to the second feature, and replacing the feature corresponding to the current node with the anonymous feature.
In practical implementation, traversal is started from the root node of the model copy, and when the traversed feature corresponding to the current node is the second feature provided by the second participant device, the anonymous feature can be directly used to replace the second feature in the model copy.
Illustratively, in fig. 5, the anonymous feature "A_k_1" replaces the "height" feature provided by the second participant, and the second node route corresponding to "A_k_1" is obtained. That is, when traversing to the "A_k_1" node, since the specific value of "height" in the sample is held by the second participant device, the first participant does not need to obtain that value at the node for reasons of data security; it only needs to determine which branch to follow from the "height" node.
Step 1033A, determining a child node included in the current node according to the child node path indicated by the second node route.
In practical implementation, in the model copy, after the anonymous feature is used to replace the second feature provided by the corresponding second participant device, the child node of the node corresponding to the anonymous feature is determined through the second node routing.
Step 1034A: when the feature corresponding to the traversed current node is a first feature, continue the traversal over the other nodes until all nodes in the model copy have been traversed, and take the tree model formed in the model copy by the first nodes corresponding to the first features, the second nodes corresponding to the anonymous features, and the child nodes of the second nodes as the pseudo-federated tree model.
Steps 1031A to 1034A are repeated to obtain the pseudo-federated tree model used for simulating the federated tree model.
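A compressed sketch of steps 1031A to 1034A, reusing the hypothetical PseudoTreeNode above and assuming a lookup table that maps each Host feature id to its (anonymous feature, second node route) pair.

```python
import copy

def build_pseudo_tree(root: PseudoTreeNode, host_features: set,
                      anon_lookup: dict) -> PseudoTreeNode:
    """Clone the federated tree and replace every Host-owned split node with
    its anonymous feature and cached second node route."""
    pseudo_root = copy.deepcopy(root)          # step 1031A: create a model copy
    stack = [pseudo_root]
    while stack:                               # traverse from the root node
        node = stack.pop()
        if node.feature in host_features:      # step 1032A: second feature found
            anon_id, route = anon_lookup[node.feature]
            node.feature = anon_id             # replace with the anonymous feature
            node.threshold = None              # the Host's split value stays private
            node.route = route                 # step 1033A: child path from the route
        for child in (node.left, node.right):  # step 1034A: continue the traversal
            if child is not None:
                stack.append(child)
    return pseudo_root
```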
Continuing with the construction process of the pseudo-federated tree model: in some embodiments, referring to fig. 7, fig. 7 is a flowchart of a method for constructing the pseudo-federated tree model provided in an embodiment of the present application. The construction process is described with reference to the steps shown in fig. 7.
Step 1031B: the first participant device creates an initial tree model having nodes corresponding to the first features, where the nodes in the initial tree model correspond to the first node routes.
In actual implementation, because the first features in the sample to be interpreted (the training sample) are provided by the first participant device, the information related to the first features is known to the first participant device. To speed up construction of the pseudo-federated tree model, the first participant device may directly construct an initial tree model from the known first features, the initial tree model having nodes corresponding to the first features, where the nodes correspond to the first node routes.
Step 1032B: obtain the anonymous features in one-to-one correspondence with the second features of the second participant device, and the second node routes corresponding to the anonymous features.
In practical implementation, the first participant device reads, from the local anonymous relation record table, the anonymous features corresponding to the sample identifier of the sample to be interpreted, and the second node routes corresponding to the anonymous features.
Step 1033B: create, in the initial tree model and according to the second node routes, the nodes corresponding to the anonymous features and the child nodes of those nodes, so as to obtain the pseudo-federated tree model.
In actual implementation, in the initial tree model constructed in step 1031B, the first participant device directly creates the node corresponding to each anonymous feature and the child nodes of that node according to the second node route.
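Under the same assumptions as the earlier sketches, steps 1032B and 1033B can be illustrated as attaching anonymous-feature nodes to the initial tree built from the Guest's first features; the function name and parameters are hypothetical.

```python
def attach_anonymous_node(parent: PseudoTreeNode, side: str, anon_id: str,
                          second_node_route: str,
                          left_child: Optional[PseudoTreeNode] = None,
                          right_child: Optional[PseudoTreeNode] = None) -> PseudoTreeNode:
    """Create the node for an anonymous feature together with its child nodes,
    according to the second node route, and hang it on the initial tree."""
    node = PseudoTreeNode(feature=anon_id, route=second_node_route,
                          left=left_child, right=right_child)
    if side == "left":
        parent.left = node
    else:
        parent.right = node
    return node
```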
In step 104, feature subsets included in the feature set of a training sample of the federated tree model are predicted through the pseudo-federated tree model to obtain corresponding predicted values;
wherein the feature set includes: the at least two first features carrying a target prediction result, and the at least one second feature provided by the second participant device. In practical implementation, when the Guest party interprets a single training sample of the federated tree model, it actually determines the influence of the feature value of each feature in that single sample on the target prediction result, where the target prediction result is the prediction result obtained for the training sample through the federated tree model; the Guest party can replace each second feature provided by the Host party with the corresponding anonymous feature.
In actual implementation, the contribution information of each feature to the target prediction result means the following: when the feature is a first feature, the contribution information represents the contribution of the feature value corresponding to that first feature to the target prediction result; when the feature is an anonymous feature corresponding to the second participant (Host party), the contribution information represents the contribution of the corresponding second feature of the second participant (Host party) to the target prediction result. For example, determining the contribution information of feature A in sample D to the target prediction result R can be understood as determining the contribution information of the value of feature A in sample D to the target prediction result R.
In practical implementation, the contribution information of each feature in the feature set of a sample used for a machine learning model may be determined based on the Shapley value. The process is as follows: suppose there is a machine learning model f and an input sample x, where x contains features {1, 2, 3, …, t} indexed by i, with t features in total (t ≥ 1, t an integer); N denotes the full feature set, S denotes a feature subset not containing i, and f_x is an estimation function that returns the average output of model f over the feature subset S. The calculation formula is:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(f_x(S \cup \{i\}) - f_x(S)\bigr) \tag{1}$$
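A brute-force sketch of formula (1), assuming the estimation function f_x is available as a Python callable over sets of features; all names are illustrative.

```python
from itertools import combinations
from math import factorial

def shapley_value(i, features, f_x):
    """Exact Shapley value of feature i: enumerate every subset S that does
    not contain i and accumulate the weighted marginal contributions of
    formula (1)."""
    others = [f for f in features if f != i]
    n = len(features)
    phi = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            phi += weight * (f_x(s | {i}) - f_x(s))
    return phi
```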
As an example, referring to fig. 8, fig. 8 is an example diagram of the contribution information of features to a prediction result provided in an embodiment of the present application. Assume a sample x = {age=20, height=170, income=100} and a machine learning model f whose prediction result for sample x is 1.2.
1) Enumerate all feature subsets of the feature set of sample x: {age=20, height=170, income=100}, {age=20, height=170}, {age=20, income=100}, {height=170, income=100}, {age=20}, {height=170}, {income=100}, and the empty set {}.
2) Traverse each feature in sample x, add its feature value to each feature subset that does not contain it, and compute the marginal contribution of the feature; the prediction result (also called the estimated value) of each feature subset is computed with a preset estimation function f_x. Assume f_x yields the following prediction results:
f_x({age=20, height=170, income=100}) = 1.2;
f_x({age=20, height=170}) = 0.9;
f_x({age=20, income=100}) = 0.8;
f_x({height=170, income=100}) = 0.7;
f_x({income=100}) = 0;
f_x({age=20}) = 0;
f_x({height=170}) = 0;
f_x({}) = 0;
note that when the feature subset is the full set of features in the sample x, the prediction result calculated by the evaluation function f _ x coincides with the prediction result using the machine learning model f.
Following the above example, in order to calculate the contribution of the feature "{age=20}" to the prediction result "1.2" of sample x, "{age=20}" may be added in turn to each feature subset not containing it, and the marginal contribution values of "{age=20}" calculated according to the foregoing formula (1), as follows:
adding "{age=20}" to the feature subset {height=170, income=100}, the marginal contribution value is determined as: 2/6 × (1.2 − 0.7) = 1/6;
adding "{age=20}" to the feature subset {income=100}, the marginal contribution value is determined as: 1/6 × (0.8 − 0) = 0.8/6;
adding "{age=20}" to the feature subset {height=170}, the marginal contribution value is determined as: 1/6 × (0.9 − 0) = 0.9/6;
adding "{age=20}" to the empty set, the marginal contribution value is determined as: 2/6 × (0 − 0) = 0;
the marginal contribution values obtained in the above four steps are added, giving a total contribution of 2.7/6 = 0.45. Thus "{age=20}" contributes "+0.45" to the prediction result "1.2". Similarly, it can be found that "{income=100}" contributes "+0.35" and "{height=170}" contributes "+0.4" to the prediction result "1.2". The symbol "+" indicates that the influence of "{age=20}", "{income=100}" and "{height=170}" on the prediction result "1.2" is positive; a symbol "−" would indicate that the influence of the feature value on the target prediction result is negative.
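As a minimal illustration of formula (1), the following Python sketch (a sketch only, not part of the embodiment; the lookup table for f_x is an assumption copied from the worked example above) enumerates the feature subsets and reproduces the contributions "+0.45", "+0.35" and "+0.4":

from itertools import combinations
from math import factorial

features = ["age", "height", "income"]
# Assumed lookup table for the estimation function f_x, taken from the example above.
f_x = {
    frozenset(["age", "height", "income"]): 1.2,
    frozenset(["age", "height"]): 0.9,
    frozenset(["age", "income"]): 0.8,
    frozenset(["height", "income"]): 0.7,
    frozenset(["income"]): 0.0,
    frozenset(["age"]): 0.0,
    frozenset(["height"]): 0.0,
    frozenset([]): 0.0,
}

def shapley(i):
    # Brute-force Shapley value of feature i per formula (1).
    t = len(features)
    others = [f for f in features if f != i]
    total = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(t - len(S) - 1) / factorial(t)
            total += weight * (f_x[frozenset(S) | {i}] - f_x[frozenset(S)])
    return total

print(round(shapley("age"), 2), round(shapley("income"), 2), round(shapley("height"), 2))
# 0.45 0.35 0.4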
From the foregoing description, it can be determined that, when determining according to the Shapley value the contribution information of the feature value of each feature included in the feature set of a training sample used for training the federated tree model to the sample's prediction result, the first participant device participating in the training needs to determine each feature subset included in the feature set of the sample (the sample to be explained), as well as an estimation function for calculating the prediction result of each feature subset, where the implementation of the estimation function is determined by the machine learning model actually used.
A description will now be given of the manner of obtaining the feature subsets, which are obtained by combining the features in the sample feature set. Taking the federated tree model as an example, the features of a sample include the first features provided by the Guest party (G_1, G_2, G_3, ..., G_n, n features in total) and the second features provided by the Host party (H_1, H_2, H_3, ..., H_m, m features in total), that is, n + m features in one piece of sample data. Combining the n + m features yields the corresponding feature subsets, where the number of features in a subset ranges from 0 to n + m; when the number of features is 0 the subset is the empty set, and when the number of features is n + m the subset is the full set. The number N of feature subsets is calculated as follows:
N = C(n+m, 0) + C(n+m, 1) + … + C(n+m, n+m) = 2^(n+m)    (2)

wherein n + m is the number of features, C(n+m, 0) denotes the number of feature subsets containing 0 features (i.e., the single empty set), C(n+m, 1) denotes the number of feature subsets containing 1 feature, and C(n+m, n+m) denotes the number of feature subsets containing all n + m features (i.e., the single full set).
Illustratively, in the federated tree model, assume a Guest party and Host parties participate in the calculation, and each sample D contains n + m features, where n is the number of first features provided by the Guest party and m is the number of second features provided by the Host party. For sample D, in order to measure the contribution degree of the Host party's features while ensuring data security between all the participants, the m second features provided by the Host are replaced with m corresponding anonymous features. It should be noted that anonymous features correspond one-to-one with second features: for a sample to be explained, however many second features there are, there are that many anonymous features, and each anonymous feature has a unique identifier. An anonymous feature may be denoted a_m_k, where k indexes the Host party and m identifies the m-th anonymous feature provided by the k-th Host; in practical applications there may be multiple Host parties. Sample D {G_1, G_2, G_3, ..., G_n, H_1, H_2, H_3, ..., H_m} is thus changed into sample D' {G_1, G_2, G_3, ..., G_n, a_1_1, a_2_1, a_3_1, ..., a_m_1}. Combining the n + m features included in sample D' according to formula (2) yields the 2^(n+m) feature subsets corresponding to sample D.
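A minimal sketch of this replacement on the Guest side, assuming the Guest part of a sample is held as a Python dictionary and that, as in the construction described later, each anonymous feature is given the placeholder value 1 (the helper name and data layout are illustrative, not part of the embodiment):

def anonymize_sample(guest_part, m, host_index=1):
    # Extend the Guest-side view of a sample with m anonymous features a_1_k ... a_m_k.
    # The Guest never sees the Host's feature values; each anonymous feature gets value 1.
    extended = dict(guest_part)
    for j in range(1, m + 1):
        extended["a_%d_%d" % (j, host_index)] = 1
    return extended

print(anonymize_sample({"G_1": 20, "G_2": 170}, m=2))
# {'G_1': 20, 'G_2': 170, 'a_1_1': 1, 'a_2_1': 1}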
The estimation function applied to a tree model is now explained. In actual implementation, due to the structural particularity of the tree model, the tree model can estimate each feature subset included in the feature set of a sample without depending on any other sample; when a certain feature is missing, the tree model outputs a predicted value (also referred to as an average value) for the corresponding feature subset. The implementation principle of the tree model estimation function provided by the embodiment of the application is as follows:
Start { input a feature subset S, starting from the root node of the tree model:
traverse each node j, j = 0, 1, 2, ...
initialize the weight w = 1
execute:
if node j is a non-leaf node and its decision feature d_j is in the subset S,
pass w downward unchanged along the branch the feature value selects, and return the result of the downward traversal upward
if node j is a non-leaf node and its decision feature d_j is not in the subset S, take r_j, r_(a_j) and r_(b_j),
calculate the new left-branch weight w_l = w × r_(a_j)/r_j and pass it down the left branch, calculate the new right-branch weight w_r = w × r_(b_j)/r_j and pass it down the right branch, and return the sum of the left-branch and right-branch traversal results upward
if node j is a leaf node,
return upward the weight w multiplied by the leaf output value t_j
} End. After the recursion finishes, the final returned value is the estimated value of the subset S.
Explaining the implementation algorithm: in the algorithm shown above, S is a feature subset of the sample X to be explained. Suppose the tree model has n tree nodes in total; the vectors t, a, b, θ, d and r each consist of n elements: t holds the output values of the leaf nodes (only leaf nodes have a value); a and b are the index vectors of the left and right children of each node, through which a non-leaf node can find its corresponding child nodes; θ holds the split threshold corresponding to each non-leaf node; d holds the feature index corresponding to each non-leaf node (the element corresponding to a leaf node is null); and r holds the number of samples falling on each node.
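The recursion above can be sketched in Python as follows; the array-based tree encoding (lists t, a, b, theta, d, r indexed by node number) mirrors the vectors just described and is an assumed layout, not the embodiment's exact data structure:

def estimate_subset(tree, S, j=0, w=1.0):
    # Estimated value of feature subset S (a {feature: value} dict) on an array-encoded tree.
    # t[j]: leaf output; a[j]/b[j]: left/right child index; theta[j]: split threshold;
    # d[j]: split feature (None for leaves); r[j]: training samples falling on node j.
    if tree["d"][j] is None:                      # leaf node
        return w * tree["t"][j]
    feat = tree["d"][j]
    left, right = tree["a"][j], tree["b"][j]
    if feat in S:                                 # feature present: follow one branch
        child = left if S[feat] <= tree["theta"][j] else right
        return estimate_subset(tree, S, child, w)
    # feature missing: weight both branches by their training-sample proportions
    w_l = w * tree["r"][left] / tree["r"][j]
    w_r = w * tree["r"][right] / tree["r"][j]
    return estimate_subset(tree, S, left, w_l) + estimate_subset(tree, S, right, w_r)

Under these assumptions, calling estimate_subset on the tree of fig. 9 with S = {"age": 18, "income": 900} follows the path 0 → 1 → 3 and returns 0.1, matching the worked example below.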
Illustratively, referring to fig. 9, fig. 9 is a schematic diagram of a tree model in a non-federated learning scenario provided by an example of the present application. Assume there are 100 samples and a tree model T, each sample including 3 features {income, age, height} and a prediction result. The tree model in the figure includes 7 nodes, {0, 1, 2, 3, 4, 5, 6} are the indices of the nodes in the tree model, and {0.1, 0.2, 0.3, 0.4} are the leaf prediction results of the tree model; the vectors t, a, b, θ, d and r of the estimation function take their element values accordingly from the nodes of the figure.
wherein, the element "-1" in each vector indicates that there is no corresponding value, and "2000" in the figure is the cut (segmentation) threshold of "income" (i.e. if the income of the sample is less than or equal to 2000, the operation is performed along the left branch of the "income" node; if the income of the sample is greater than 2000, the operation is performed along the right branch of the "income" node), "20" is the cut (segmentation) threshold of "age", and "170 cm" is the cut (segmentation) threshold of "height".
Continuing the above example, in the figure "income" is a split node: the "income" value of 40 of the 100 samples satisfies "income < 2000" (proceeding along the left branch of the "income" node), and the remaining 60 samples satisfy "income ≥ 2000" (proceeding along the right branch); that is, "income = 2000" divides the 100 samples into two groups, one of 40 samples and one of 60 samples. "age" is a split node: 20 samples satisfy "age < 20" (going left) and 20 samples satisfy "age ≥ 20". "height" is a split point: 15 samples satisfy "height < 170" (going left) and 45 samples satisfy "height ≥ 170".
Following the above example, the estimated value (average value) corresponding to the feature subset A {age = 18, income = 900} is determined by the decision tree model T, i.e., the value of f({age = 18, income = 900}) is calculated; the feature missing from subset A is "height". The implementation process is: the predicted index path is 0 → 1 → 3, so T({age = 18, income = 900}) = 0.1; since this path never tests the missing feature "height", the predicted value of the feature subset coincides with the target predicted value of sample X.
Continuing the description, according to the above implementation algorithm the predicted value corresponding to the feature subset B {age = 18, height = 160} is determined by the decision tree model T, i.e., the value of f({age = 18, height = 160}) is calculated; the feature missing from subset B is "income", which is the root node, so the feature subset cannot be predicted using the index path alone. Of the 100 samples, when splitting on the "income" feature, 20 samples fall into the left branch and 80 samples into the right branch. With "income" missing from subset B, both branches of the "income" node are referenced: the left branch leads to the "age" node; since "age" is in subset B, by "age = 18" we go left at "age", giving a predicted value of 0.1. The right branch leads to the "height" node; since "height" is in subset B, by "height = 160" (160 < 170) we go left at "height", giving a predicted value of 0.3. Therefore f({age = 18, height = 160}) = 0.1 × (20/100) + 0.3 × (80/100) = 0.02 + 0.24 = 0.26, where 20/100 is the proportion of samples on the left branch of the missing feature "income", 0.1 is the predicted value via "age = 18", 80/100 is the proportion on the right branch, and 0.3 is the predicted value via "height = 160".
In actual implementation, the predicted value of each feature subset included in the feature set of a sample under the federated tree model is determined according to the estimation function of the tree model, and from these predicted values the contribution degree of each feature in the sample to the target prediction result of the federated tree model is determined. Since the anonymous features stand in for the second features provided by the second participant, the first participant in the federated tree model can locally determine, over the feature subsets formed from the first features and the anonymous features, both the contribution information of each first feature to the sample's target prediction result and the contribution information of each anonymous feature to it; the contribution information (or contribution degree) of an anonymous feature to the target prediction result thus reflects the contribution information of the corresponding second feature provided by the second participant.
In step 105, the predicted values and the target prediction results are combined to determine contribution information of each feature in the feature set corresponding to the target prediction results.
In actual implementation, the first participant device predicts the feature subsets included in the feature set in the sample to be explained through a locally constructed pseudo federal tree model, and obtains the predicted values corresponding to the feature subsets. It should be noted that the feature set here can be regarded as including the first feature and the anonymous feature corresponding to the second feature.
For example, the sample of the input pseudo-federated tree model may correspond to a feature set of { G _1, G _2, G _3, … …, G _ n, a _1_1, a _2_1, … …, a _ m _1}, where { a _1_1, a _2_1, … …, a _ m _1} is the anonymous feature corresponding to each second feature.
In some embodiments, referring to fig. 10, fig. 10 is a schematic diagram of a method for determining a predicted value by using a pseudo-federation tree model provided in this embodiment of the present application, based on fig. 3, in step 105, for each feature subset included in a feature set, the following steps 1051A to 1053A may be respectively performed through the pseudo-federation tree model, so as to determine a predicted value corresponding to each feature subset in the feature set.
Step 1051A, the first participant device traverses the nodes of the pseudo-federated tree model starting from the root node of the pseudo-federated tree model.
For example, referring to fig. 5, assume there is a sample X to be explained with sample identification ID1: {income = 1500, age = 17, height = 165cm}. In actual implementation, assume "height" is a feature provided by the second participant device, so the feature value of the "height" feature in sample X is unknown to the first participant and may be denoted by "?". In order to determine the predicted value corresponding to the feature subset S {age = 18, income = 900}, the first participant device starts traversing the nodes of the pseudo-federated model from the root node (i.e., the node corresponding to the feature "income") shown in fig. 7.
Step 1052A, when the traversed current node is a non-leaf node and the feature subset includes a feature corresponding to the current node, obtaining a node route corresponding to the current node.
Continuing the above example, for the feature subset S {age = 18, income = 900} of sample X, when traversal reaches the "income" node in the pseudo-federated tree model shown in fig. 7, since the "income" node is a non-leaf node, the node route of the "income" node can be determined from the value "900" of the "income" feature: computation proceeds along the left branch of the "income" node, whose corresponding child node is "age".
Step 1053A, determine the leaf node corresponding to the current node according to the node route, and take the value corresponding to that leaf node as the predicted value of the feature subset.
Continuing the above example, from the value "18" of the "age" feature in the feature subset S it can be determined that the corresponding leaf node is the node with value "0.1", so the predicted value of the feature subset S is 0.1.
In some embodiments, referring to fig. 11, fig. 11 is a schematic diagram of a method for determining a predicted value of a feature subset provided in an embodiment of the present application, and based on fig. 3, in step 105, for each feature subset included in a feature set, through a pseudo federation model, the following steps 1051B to 1055B may also be respectively performed, so as to determine a predicted value corresponding to each feature subset.
Step 1051B, the first participant device traverses the nodes of the pseudo-federated tree model starting from the root node of the pseudo-federated tree model.
Illustratively, referring to fig. 5, continue with the sample X to be explained, with sample identification ID1: {income = 1500, age = 17, height = 165cm}. For example, in order to determine the predicted value corresponding to the feature subset Z {age = 18, height = ?}, the first participant device starts traversing the nodes of the pseudo-federated model from the root node (i.e., the node corresponding to the feature "income") shown in fig. 5.
In step 1052B, when the traversed current node is a non-leaf node and the feature subset does not include the feature corresponding to the current node, a first number of training samples corresponding to the current node is obtained.
In practical implementation, the first number is the number of samples falling into the current node, whose parent node is a split node; when the current node is the root node, the first number may be a preset number of samples, which for convenience of calculation may be an integer greater than 2, and may also be the total number of samples to be explained when the first participant device performs one round of model interpretation.
Continuing the above example, assume the number of samples to be explained in one round of model interpretation is 100. For the feature subset Z {age = 18, height = ?}, when traversal reaches the "income" node in the pseudo-federated tree model shown in fig. 5, "income" is a missing feature; the node is, however, a non-leaf node (the root node), so the first number of samples corresponding to the "income" node, "100", can be obtained.
Step 1053B, obtaining a second number of training samples corresponding to each child node of the current node.
In practical implementation, the second number is the number of samples falling into each child node of the node, which is obtained by grouping the training samples falling into the node with the current node as a split node.
Continuing the above example, with reference to fig. 5, assume it is known that of the 100 samples, 40 samples satisfy "income < 2000" and 60 samples satisfy "income ≥ 2000"; of the 40 samples, 20 satisfy "age < 20" and 20 satisfy "age ≥ 20"; of the 60 samples, 15 satisfy "height < 170cm" and 45 satisfy "height ≥ 170cm".
Step 1054B, obtain the predicted values corresponding to the child nodes.
Continuing the above example, referring to fig. 5, for the feature subset Z {age = 18, height = ?}, the predicted value corresponding to "age = 18" is "0.1", and the predicted value corresponding to "height = ?" is 0.3 (since the Host side's "height" value for this sample is "165cm", which routes to that leaf).
Step 1055B, determine the predicted value corresponding to the feature subset according to the first number, each second number, and the predicted value corresponding to each child node.
In some embodiments, referring to fig. 12, fig. 12 is a schematic diagram of a specific method for determining a predicted value by using a pseudo federal tree model provided in this embodiment, and based on fig. 3, step 1055B may be implemented by steps 201 to 203.
Step 201, the first participant device determines the weight corresponding to each sub-node according to the first quantity and each second quantity, and performs weighting processing on the predicted value corresponding to each sub-node based on the weight corresponding to each sub-node to obtain the predicted value corresponding to the current node.
Step 202, if the current node is the root node, the predicted value corresponding to the current node is used as the predicted value corresponding to the feature subset.
Step 203, if the current node is not the root node, obtain the weight of each node in the node layer where the current node is located and the predicted value corresponding to each node in that layer, and weight the predicted values of the nodes in the layer to obtain the predicted value corresponding to the parent node of the current node. If the parent node is the root node, take the predicted value corresponding to the parent node as the predicted value corresponding to the feature subset; if the parent node is not the root node, iterate the above processing until the predicted value corresponding to the root node is obtained and taken as the predicted value corresponding to the feature subset.
Continuing the above example, referring to fig. 5, for the feature subset Z {age = 18, height = ?}, the predicted value corresponding to "age = 18" is "0.1" and the predicted value corresponding to "height = ?" is 0.3 (since the Host side's "height" value for this sample is "165cm"). The specific calculation process is as follows: because the feature "age" exists in the feature subset Z, by "age = 18" we go left at "age", giving a predicted value of 0.1; the right branch leads to the "height" feature, which appears in subset Z as an anonymous feature, and since the route received for this anonymous feature goes left, the predicted value there is 0.3. Therefore the predicted value of the feature subset Z {age = 18, height = ?} is 0.1 × (20/100) + 0.3 × (80/100) = 0.02 + 0.24 = 0.26, where 20/100 is the sample proportion on the left branch of the missing feature "income", 0.1 is the predicted value via "age = 18", 80/100 is the sample proportion on the right branch, and 0.3 is the predicted value via the anonymous feature.
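A minimal sketch of the child weighting in steps 201 to 203, assuming each child is summarized by its training-sample count (the second number) and its already-computed predicted value (names are illustrative):

def weighted_node_value(first_number, children):
    # children: list of (second_number, predicted_value) pairs;
    # each child's value is weighted by second_number / first_number.
    return sum((count / first_number) * value for count, value in children)

# Feature subset Z {age=18, height=?} at the missing root "income":
# the left branch (20 samples) resolves to 0.1, the right branch (80 samples) to 0.3.
print(weighted_node_value(100, [(20, 0.1), (80, 0.3)]))  # 0.26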
In some embodiments, when the first participant determines the contribution information of each feature in the feature set to the target prediction result, the following operations may be performed for each feature in the feature set: the first participant adds the feature value of the current feature into each feature subset that does not contain it to obtain a corresponding first predicted value, and, based on the second predicted value of the feature subset not containing the current feature, determines at least one marginal contribution value of the current feature according to the marginal contribution calculation of formula (1). When there is one marginal contribution value, it serves as the contribution information of the current feature to the target prediction result; when there are at least two marginal contribution values, their accumulated result serves as the contribution information of the current feature to the target prediction result.
In some embodiments, referring to fig. 13, fig. 13 is a flowchart of an interaction contribution information determining method provided in an embodiment of the present application, and based on fig. 3, after the first participant device performs step 105, the first participant device may perform the steps illustrated in fig. 13, and determine interaction contribution information for a target prediction result when any at least two features interact.
In step 106, the first participant device selects at least two features from the at least two first features provided locally and the at least one second feature provided by the second participant device.
In practical implementation, the interaction contribution information of any at least two features may be determined; the at least two features may be any two or more of the at least two first features and the at least one second feature.
In step 107, a feature interaction group including at least two features is constructed, and at least one interaction marginal contribution value corresponding to the feature interaction group is determined.
In practical implementation, at least two features determined in step 106 are determined as a feature interaction group, and then all interaction margin contribution values of the feature interaction group are determined.
The determination of the interaction marginal contribution value of a feature interaction group including two features is explained. In some embodiments, for a feature interaction group <i, j>, that is, a feature interaction group including the two features i and j, the interaction marginal contribution information of the feature interaction group to the target prediction result corresponding to the sample is determined according to the following principle:
Φ_{i,j} = Σ_{S ⊆ N\{i,j}} ( |S|! × (t − |S| − 2)! / (2 × (t − 1)!) ) × δ_{i,j}(S)    (3)

δ_{i,j}(S) = f_x(S ∪ {i,j}) − f_x(S ∪ {i}) − f_x(S ∪ {j}) + f_x(S)

In the above formula (3), S represents any feature subset not containing i or j, S ∪ {i,j} represents that subset augmented with features i and j, S ∪ {i} the subset augmented with feature i, and S ∪ {j} the subset augmented with feature j; f_x is the estimation function corresponding to the federated tree model, used to determine the predicted value of each feature subset; and δ_{i,j} is the interaction marginal contribution of the feature interaction group, whose calculation parallels the marginal contribution calculation of formula (1). In step 108, the interaction contribution information of the feature interaction group corresponding to the target prediction result is determined based on the at least one interaction marginal contribution value.
In practical implementation, when the number of the interaction marginal contribution values is one, the interaction marginal contribution values are directly used as the interaction contribution information of the feature interaction group corresponding to the target predicted value.
In some embodiments, referring to fig. 14, fig. 14 is a flowchart of an interaction contribution information determining method provided in the embodiments of the present application, and based on step 108 shown in fig. 13 and fig. 14, the method can be implemented by step 1081 to step 1083.
At step 1081, the first participant device determines a first subset of features of the set of features, wherein the first subset of features includes at least one of the at least two features.
In practical implementation, when determining the interaction contribution information of the feature interaction group according to the above formula (3), the interaction marginal contribution δ_{i,j} of the feature interaction group needs to be determined first; according to the definition of δ_{i,j}, the first participant device needs to obtain a first feature subset of the feature set.
Describing the first feature subsets, taking the feature interaction group <i, j> as an example based on formula (3): the first feature subsets are the feature subsets including <i, j>, including only i, and including only j. Taking the feature interaction group <i, j, v> as an example, the first feature subsets are the feature subsets including <i, j, v>, including only i and j, including only i and v, including only j and v, including only i, including only j, and including only v; and so on. It can be seen that the greater the number of features in the feature interaction group, the higher the computational requirement on the first participant device.
Step 1082, the first participant device determines a second subset of features of the set of features, wherein the second subset of features is in a complementary relationship with the interactive set of features.
As can be seen from equation (3), when calculating the interaction margin contribution value, the first participant device needs to obtain a feature subset (which may be referred to as a second feature subset) that does not include each feature in the feature interaction group. Taking the feature interaction group as < i, j > for example, the second feature subset is a feature subset not including < i, j >; taking the feature interaction group as < i, j, v >, the second feature subset is the feature subset that does not include < i, j, v >.
Step 1083, obtaining a predicted value corresponding to the first feature subset and a predicted value corresponding to the second feature subset, and determining an interaction marginal contribution value corresponding to the feature interaction group based on the predicted values corresponding to the first feature subset and the predicted values corresponding to the second feature subset.
In actual implementation, the predicted value corresponding to the first feature subset and the predicted value corresponding to the second feature subset are determined through the pseudo-federation tree model, and the interaction marginal contribution value corresponding to the feature interaction group is determined according to the predicted value corresponding to the first feature subset and the predicted value corresponding to the second feature subset.
Continuing the above example, to determine the interaction marginal contribution value of the feature interaction group <i, j>, the pseudo-federated tree model's estimation function f_x is used to determine the predicted values f_x(S∪{i,j}), f_x(S∪{i}) and f_x(S∪{j}) corresponding to the first feature subsets S∪{i,j}, S∪{i} and S∪{j}, together with f_x(S) for the second feature subset S; then, according to formula (3), the interaction marginal contribution value δ_{i,j} corresponding to the feature interaction group is determined.
In some embodiments, when the number of interaction marginal contribution values is at least two, the interaction contribution information corresponding to the feature interaction group may further be determined as follows: the first participant device sums the interaction marginal contribution values to obtain the interaction contribution information of the feature interaction group corresponding to the target prediction result. Continuing the above example, based on formula (3), when there are at least two interaction marginal contribution values for the feature interaction group <i, j>, their accumulated result may be used as the interaction contribution information of the feature interaction group to the target prediction result.
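A minimal sketch of δ_{i,j} and of the weighted sum in formula (3), under the assumption that f_x is available as a callable over frozensets of features (the function names are illustrative, not part of the embodiment):

from itertools import combinations
from math import factorial

def interaction_delta(f_x, S, i, j):
    # delta_{i,j}(S) = f_x(S+{i,j}) - f_x(S+{i}) - f_x(S+{j}) + f_x(S)
    return f_x(S | {i, j}) - f_x(S | {i}) - f_x(S | {j}) + f_x(S)

def interaction_value(f_x, features, i, j):
    # Weighted sum of deltas over all subsets S not containing i or j, per formula (3).
    t = len(features)
    others = [f for f in features if f not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            weight = factorial(size) * factorial(t - size - 2) / (2 * factorial(t - 1))
            total += weight * interaction_delta(f_x, frozenset(S), i, j)
    return total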
In the embodiment of the application, by sending an anonymous feature acquisition request, a first participant can obtain from the second participants the anonymous features corresponding to the samples to be explained and the node routes corresponding to those anonymous features, and locally construct a pseudo tree model corresponding to the federated tree model; the prediction of all feature subsets can thus be finished locally and in parallel without a complex federated learning prediction process, greatly reducing communication traffic. This brings the tree-based explanation of samples and the contribution-measuring function to the federated scenario, provides explanation of single samples, and fills a gap in the federated setting; meanwhile, the contribution information of each feature of the Host party can be measured, the contribution of each feature in a sample to be explained can be explained while protecting the data privacy of each participant, and a feature interaction analysis function can be provided to modeling personnel.
Next, a data processing method of the federal tree model provided in the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the second party device provided in the embodiment of the present application. Referring to fig. 15, fig. 15 is a schematic flowchart of a data processing method of the federal tree model provided in the embodiment of the present application, and will be described with reference to the steps shown in fig. 15.
In step 301, for each second feature used for training the federate tree model, the second participant device generates an anonymous feature corresponding to each second feature, and acquires a second node route corresponding to each anonymous feature.
It should be noted that the second node route is used to indicate a child node path corresponding to a split node when the anonymous feature is used as the split node of the federal tree model.
In step 302, the second participant device sends the anonymous features corresponding to the second features, and the second node routes corresponding to the anonymous features, to the first participant device.
It should be noted that the second node routes are used by the first participant device to construct a pseudo-federated tree model corresponding to the federated tree model, through which the contribution information of each feature in the feature set to the target prediction result is determined. The feature set of a training sample of the federated tree model includes: at least two first features carrying the target prediction result provided by the first participant device, and at least one second feature.
In some embodiments, referring to fig. 16, fig. 16 is a schematic diagram of an anonymous feature sending method provided in an embodiment of the present application, and based on fig. 16, step 302 may be implemented by steps 3021 to 3024.
Step 3021, the second party device receives an anonymous feature obtaining request carrying the sample identifier sent by the first party device.
And step 3022, analyzing the anonymous characteristic obtaining request to obtain a sample identifier.
Step 3023, determining a second feature corresponding to the sample identifier, an anonymous feature corresponding to the second feature, and a second node route corresponding to the anonymous feature.
Step 3024, sending the anonymous feature and the second node route to the first participant device.
In some embodiments, the second participant device may determine the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature, and the second node route corresponding to the anonymous feature in the following way: the second participant device searches a local anonymous relation record table according to the sample identifier to obtain the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature, and the second node route corresponding to the anonymous feature. It should be noted that the anonymous relation record table records the sample identifiers of training samples used for training the federated tree model, the second features corresponding to each sample identifier, the anonymous features corresponding to the second features, and the second node routes corresponding to the anonymous features.
In practical implementation, the second participant device may store, in the local anonymous relation record table, the second features in each sample to be explained, the anonymous features corresponding to the second features, the routing relationship (second node route) corresponding to each anonymous feature, and the sample identifier of the sample to be explained. The anonymous relation record table may be a two-dimensional relation table stored in a database, or a file (e.g., a json-format file) stored locally.
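Under the json-file option, one record of the table might look as follows; the field names and the per-tree, per-node route encoding are illustrative assumptions only (shown as a Python sketch):

import json

# Hypothetical anonymous relation record table, keyed by sample identifier: each
# entry maps a second feature to its anonymous name, and records for each tree
# number and node number the route ("left"/"right") of that anonymous feature.
anonymous_table = {
    "ID1": {
        "features": {"height": "a_1_1"},
        "routes": {"0": {"2": "left"}},
    },
}
print(json.dumps(anonymous_table, indent=2))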
In some embodiments, the second participant device may generate the anonymous characteristic to which the second feature corresponds by: and the second participant equipment respectively carries out hash processing on each second feature used for training the federated tree model to obtain a hash value corresponding to each second feature, and the hash value is used as an anonymous feature corresponding to the second feature.
In practical implementation, the second participant device may set the anonymous features of the locally provided second features in each sample to be explained according to a format agreed among the participants of the federated tree model training. The second participant may also generate the anonymous features by assigning non-repeating random numbers, generating for each second feature an anonymous feature which may be denoted a_j_0, a_j_1, ..., a_j_m, where a_j_m represents the m-th feature of the j-th host. The second participant may also calculate a hash value for a second feature through a common hash algorithm, and record the obtained hash value as the anonymous feature corresponding to that second feature.
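A minimal sketch of the hash-based option, assuming SHA-256 over the feature name plus a per-party salt (the salt is an illustrative addition that keeps anonymous names from being guessed by dictionary lookup; it is not specified by the embodiment):

import hashlib

def anonymous_name(feature_name, party_salt):
    # Derive a stable anonymous identifier for a second feature.
    digest = hashlib.sha256((party_salt + feature_name).encode("utf-8")).hexdigest()
    return digest[:16]  # a truncated hash suffices as an opaque label

print(anonymous_name("height", "host_1"))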
In the embodiment of the application, the second participant establishes the anonymous relation record table locally and stores the second features, the anonymous features corresponding to the second features, and the second node routes corresponding to the anonymous features, so that when an anonymous feature acquisition request is received, the anonymous features corresponding to the sample identifier and the second node routes corresponding to those anonymous features can be looked up directly in the anonymous relation record table, improving query efficiency and reducing the number of communications with other participant devices.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The feature importance scheme provided by the related longitudinal federated tree model can meet part of the requirements. Taking a longitudinal federated tree scenario in which two parties participate (a Guest party and a Host party) as an example, the feature importance scheme provided by the longitudinal federated tree model is briefly described as follows:
(1) the Guest party initializes a table locally, the table containing all the Guest's local features and the anonymous feature numbers sent by the Host, with the count value of each feature set to 0;
(2) decision tree building starts; for the split feature used by each decision tree node, the corresponding count value in the table is increased by 1, or the split gain value (gain) is added;
(3) the feature importance table is output after the decision tree is built. The feature importance table can be used to interpret the model: a feature with a high count value played a large role in the modeling process.
However, the feature importance scheme provided by the longitudinal federated tree model can satisfy requirement (1), but cannot satisfy requirements (2) and (3).
First, for requirement (2), the user cannot use feature importance to specifically interpret a single sample. For example, there is a sample S {age: 28, marital status: married, income: 10000, education: bachelor's degree, occupation: IT, birthplace: ……} of a defaulting customer. The sample is predicted through a longitudinal federated learning model and the obtained score is 0.1; business users hope to know how much each specific feature value (such as age = 28) in sample S contributes to the final 0.1 prediction score, and to judge whether the influence of a feature on the prediction score is positive or negative, so that business insights can be obtained by combining the model with real-life experience.
The influence of a specific feature value on the model's output, and whether that influence is positive or negative, cannot be judged from feature importance alone, because feature importance only reflects a feature's global usage and offers no way to analyze a particular sample.
For requirement (3) above, although feature importance shows how many times a partner's features are used, the positive and negative effects of the partner's features on a single sample likewise remain unknown. If the partner's features provide a large influence for many samples, that influence can be used to measure the value of the partner's features.
Since, in the federated learning scenario, the computation of the Shapley value (SHAP value) requires enumerating feature subsets, predictions under federated conditions would generate very frequent communication. Based on this, the embodiment of the application provides a data processing method for the federated tree model: in a longitudinal federated learning scenario, a method that combines the SHAP value to explain the prediction results of the federated tree model and measure the overall value of the partner's features, so that a batch of samples can be explained through only one round of communication.
From the description of the Shapley value in step 104, it can be determined that, when interpreting a machine learning model based on the Shapley value, the information to be determined at least includes: the predicted value of the machine learning model for the sample to be explained, the feature subsets included in the feature set of the sample to be explained, and an estimation function that determines the estimated value of the machine learning model for each feature subset (i.e., when a certain feature is missing from a feature subset, an average value determined based on the machine learning model).
In practical implementation, each feature subset included in the sample feature set to be interpreted may be obtained according to the foregoing formula (2).
For example, assuming the sample to be explained is x = {age = 20, height = 170, income = 100}, enumerate all feature subsets: {age=20, height=170, income=100}, {age=20, height=170}, {age=20, income=100}, {height=170, income=100}, {age=20}, {height=170}, {income=100}, {} (the empty set).
Continuing the description of the estimation function f_x: in actual implementation, when evaluating the SHAP value of each feature value in a sample, the corresponding estimation function f_x may be determined according to the actual situation of the machine learning model; f_x reflects the predicted value that the machine learning model yields for a feature subset of the sample.
In the following, a detailed description is given taking as an example the estimation function corresponding to a tree model, when the machine learning model is a tree model in a non-federated learning scenario. Due to the structural particularity of the tree model, each feature subset included in the feature set of a sample can be estimated without depending on any other sample; when a certain feature is missing, the average value of the tree model's output is obtained. Referring to fig. 17, fig. 17 is a code segment diagram of a tree model estimation function provided in an embodiment of the present application. In the algorithm shown in the figure, S is a feature subset of the sample x to be explained; assuming a total of n tree nodes, the vectors t, a, b, θ, d and r each consist of n elements: t holds the output values of the leaf nodes (only leaf nodes have a value); a and b are the index vectors of the left and right children of each node, through which a non-leaf node can find its corresponding child nodes; θ holds the split threshold corresponding to each non-leaf node; d holds the feature index corresponding to each non-leaf node (the element corresponding to a leaf node is null); and r holds the number of samples falling on each node during training. Based on this, when a certain feature is missing and the tree algorithm must estimate the average output value, the following recursive procedure is used:
Input a feature subset S, starting from the root node:
Start { traverse each node j, j = 0, 1, 2, ...
initialize the weight w = 1
execute:
if node j is a non-leaf node and its feature d_j is in the subset S,
pass w downward unchanged along the branch the feature value selects, and return the result of the downward traversal upward
if node j is a non-leaf node and its feature d_j is not in the subset S,
take r_j, r_(a_j) and r_(b_j),
calculate the new left-branch weight w_l = w × r_(a_j)/r_j and pass it down the left branch,
calculate the new right-branch weight w_r = w × r_(b_j)/r_j and pass it down the right branch,
return the sum of the left-branch and right-branch traversal results upward
if the node is a leaf,
return upward the weight w multiplied by the leaf output value t_j
} End. After the recursion finishes, the final returned value is the estimated value of the subset S.
Illustratively, referring to fig. 5, assume there are 100 samples (samples to be explained), each sample including 3 features {income, age, height} and a predicted value. The tree model in the figure includes 7 nodes, {0, 1, 2, 3, 4, 5, 6} are the node indices in the tree model, and {0.1, 0.2, 0.3, 0.4} are the leaf predicted values of the tree model; the vectors t, a, b, θ, d and r of the estimation function take their element values accordingly from the nodes of the figure.
wherein, the element "-1" in each vector indicates that there is no corresponding value, and "2000" in the figure is the segmentation threshold for "income", 20 "is the segmentation threshold for" age ", and" 170cm "is the segmentation threshold for" height ".
Continuing the above example, in the figure "income" is a split node: the "income" value of 40 of the 100 samples satisfies "income < 2000" (going left), and the remaining 60 samples satisfy "income ≥ 2000"; that is, "income = 2000" divides the 100 samples into two groups, one of 40 samples and one of 60 samples. "age" is a split node: 20 samples satisfy "age < 20" (going left) and 20 samples satisfy "age ≥ 20". "height" is a split point: 15 samples satisfy "height < 170" (going left) and 45 samples satisfy "height ≥ 170".
Taking the sample X to be explained as an example, the node index path of sample X in the decision tree model f is 0 → 1 → 3, that is, f({income = 1500, age = 17, height = 165cm}) = 0.1;
Continuing the above example, enumerate all the feature subsets of the feature set {age, height, income}; for the feature values evaluated in the subsets below these are: {age=18, height=160, income=900}, {age=18, height=160}, {age=18, income=900}, {height=160, income=900}, {age=18}, {height=160}, {income=900}, {} (the empty set).
Continuing the above example, using the decision tree model f, the predicted value corresponding to the feature subset A {age = 18, income = 900} is determined, i.e., the value of f({age = 18, income = 900}) is calculated; the feature missing from subset A is "height", and the implementation is as follows: the predicted index path is 0 → 1 → 3, so f({age = 18, income = 900}) = 0.1, unaffected by the missing feature "height";
as an example, using the decision tree model f, the estimated value (average value) corresponding to the feature subset B {age = 18, height = 160} is determined, i.e., the value of f({age = 18, height = 160}) is calculated; the feature missing from subset B is "income". Because the "income" feature is the root node, the index path cannot be followed further, and in this case the training-sample proportions of each node can be used for prediction.
bearing the above example, when the "income" feature is used for the division (splitting) of 100 samples, 20 samples are divided into the left branch and 80 samples are divided into the right branch. In the case of missing feature "income" in the feature subset B, reference is made to both the left and right branches corresponding to "income": the left branch is to the feature of "age", because there is feature "age" in the feature subset B, according to "age being 18", go left here in "age", the predicted value is 0.1; the right branch is to the feature of 'height', the feature subset B has 'height', the feature subset B goes to the right according to the 'height equal to 160', and the predicted value is 0.3. Therefore, it can be determined that f ({ age ═ 18, height ═ 160}) -0.1 × (20/100) +0.3 × (80/100) ═ 0.02+0.24 ═ 0.26; here, 20/100 is the sample ratio of missing features "income" (left branch), 0.1 is the predicted value of "age 18", 80/100 is the sample ratio of missing features "income" (right branch), and 0.3 is the predicted value of "height 160".
Continuing the above example, the predicted value corresponding to the feature subset C {height = 160} is determined by the decision tree model f, i.e., the value of f({height = 160}) is calculated; the features missing from subset C are "income" and "age". The estimation proceeds as follows: at the (missing) "age" node, both branches are considered, i.e., 0.1 × (20/40) + 0.2 × (20/40) = 0.15, so the "age" branch returns the value 0.15 upward (to the "income" node); at the "height" node the value is obtained normally, going left by "height = 160", i.e., the "height" branch returns the value 0.3 upward (to the "income" node); at the "income" node both branches are considered, i.e., 0.15 × (20/100) + 0.3 × (80/100) = 0.03 + 0.24 = 0.27. Therefore f({height = 160}) = 0.27.
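A quick arithmetic check of the three estimates above, as plain Python mirroring the weighted sums just derived:

# f({age=18, income=900}): the full path 0 -> 1 -> 3 is determined, no weighting needed
f_A = 0.1

# f({age=18, height=160}): the root feature "income" is missing
f_B = 0.1 * (20 / 100) + 0.3 * (80 / 100)

# f({height=160}): "income" and "age" are both missing
age_branch = 0.1 * (20 / 40) + 0.2 * (20 / 40)
f_C = age_branch * (20 / 100) + 0.3 * (80 / 100)

print(f_A, round(f_B, 2), round(f_C, 2))  # 0.1 0.26 0.27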
In order to apply the estimation function of the tree model to the federated tree model scenario, the features contained in the samples of the federated tree model may be adjusted, a pseudo tree model constructed from the adjusted features, and the estimation function for tree models invoked to determine the estimated value of each feature subset, thereby determining the contribution of each feature to the sample's prediction result.
Next, referring to fig. 18, fig. 18 is a flowchart of a data processing method of the federal tree model provided in the embodiment of the present application, and the data processing method of the federal tree model provided in the embodiment of the present application is explained with reference to fig. 18.
In step 401, the first party counts the number of first features in the sample to be interpreted and the number of second parties participating in modeling of the federated tree model.
In practical implementation, for convenience of distinction, the features provided by the Guest party may be referred to as first features and the features provided by the Host parties as second features. Suppose there are k (k ≥ 1, k an integer) Host parties participating in the modeling of the federated tree model. The Guest party treats all features of the Host parties (H_0 ... H_m) as anonymous features, i.e., the Guest party does not know the specific names of the Host parties' features. Anonymous (abbreviated a) may be used to denote the anonymous features corresponding to the Host parties' features; for example, a_0_0, a_0_1 and a_0_2 denote the first feature of Host_0, the first feature of Host_1 and the first feature of Host_2, respectively. The Guest side thus holds n + m_0 + m_1 + ... + m_k features: G_0, G_1 ... G_n, a_0_0, a_0_1 ... a_k_(m_k).
In step 402, the first party sends an anonymous feature acquisition request to the second party, so that the second party returns anonymous features corresponding to each sample to be explained and routing information corresponding to the anonymous features according to the anonymous feature acquisition request.
In actual implementation, the first participant sends an anonymous feature acquisition request to each second participant, and receives anonymous features returned by the second participant and routing information corresponding to the anonymous features.
In actual implementation, after each second participant receives the anonymous feature acquisition request sent by the first participant, the second participant obtains the routing information corresponding to each second feature in the samples to be explained, which may be denoted: X0_route = [result_0, result_1, ..., result_k-1], X1_route = [result_0, result_1, ..., result_k-1], ..., where X0_route represents the routing information returned by each Host party for sample X0. The specific implementation process by which the second participant obtains the routing information is as follows:
Start
Input: the features of samples X0, X1 ... Xn held on the Host side
For sample X_i in [X0, X1 ... Xn]:
initialize the routing result record result = {} (empty)
initialize the anonymous feature relation record table anonymous = {} (empty)
execute: traverse each tree in the tree model, the current tree being numbered t
initialize result[t] = {}
traverse each node in the tree, the current node being numbered n
determine whether the node belongs to the current host;
if the node belongs to the current host,
further determine whether sample X_i goes left or right at the node:
if it goes left, record result[t][n] = left
otherwise, record result[t][n] = right
record the anonymous name of the node's split feature:
anonymous[t][n] = anonymous_name
End. Route extraction is finished; each host sends its result and its anonymous feature relation record table anonymous to the Guest party.
For example, in anonymous[t][n] = anonymous_name, if the node uses the feature H_0, corresponding to the anonymous name a_j_0, then anonymous[t][n] = a_j_0.
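A minimal Python sketch of this Host-side extraction, under an assumed node layout in which each node records whether it belongs to the current host, its split feature and threshold, and its anonymous name (all field names are illustrative assumptions):

def extract_routes(trees, sample):
    # For each tree t and each host-owned node n, record whether `sample` goes
    # left or right there, plus the anonymous name of the node's split feature.
    result, anonymous = {}, {}
    for t, tree in enumerate(trees):
        result[t], anonymous[t] = {}, {}
        for n, node in enumerate(tree["nodes"]):
            if not node.get("owned_by_host"):      # assumption: ownership flag per node
                continue
            direction = "left" if sample[node["feature"]] <= node["threshold"] else "right"
            result[t][n] = direction
            anonymous[t][n] = node["anonymous_name"]   # e.g. "a_j_0"
    return result, anonymous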
In step 403, the first participant locally constructs a pseudo tree model corresponding to the sample according to the first features, the anonymous features provided by the second participants, and the routing information corresponding to the anonymous features included in the sample.
In practical implementation, for a single training sample, the Guest known information includes n first features, k anonymous features, and routing information corresponding to each anonymous feature. The Guest party can construct a pseudo tree model through the known information of the samples, and the pseudo tree model is used for determining the estimation value of each feature subset included in the feature set of the samples.
In practical implementation, under a longitudinal (vertical) federated tree model, each participant holds the complete tree structure, but the information inside a tree node (the split feature and split value) exists only at the party that owns that feature. Consequently, on the Guest side the content of nodes belonging to a Host party is empty, and the Guest party only has the information of its own split nodes and of the leaf nodes. In order to use the estimation function of the tree model, the Guest party can construct a pseudo tree model for the sample using the routing information provided by the Host parties, so that the estimation function of the tree model can run normally.
In practical implementation, assume the inputs are a sample X_0, the Guest party's original tree model, the Host-side routes X0_route = [result_0, result_1, …, result_k], and the anonymous relation record table. First, the sample X_0 only has Guest features, so X_0 is extended: there are m_0 + m_1 + … + m_k anonymous features in total, and X_0 is augmented with the anonymous features a_0_0 … a_k_(m_k) according to the anonymous naming rule, with the values of all anonymous features set to 1.
The pseudocode by which the first participant constructs the pseudo tree model is as follows:
Begin
  For each tree of the Guest party:
    Traverse each node:
      If the node belongs to the Guest party, skip it
      If the node belongs to a Host party and splits on that Host's feature m_k:
        Replace the node's split feature with the anonymous feature a_k_(m_k)
        Query the routing table result_i for the routing direction corresponding to a_k_(m_k)
        If the routing direction is left, the split value of the node may be replaced with 1.5
        If the routing direction is right, the split value of the node may be replaced with 0.5
End
The first participant completes the construction of the pseudo tree model according to the above process and returns the extended sample X_0 together with the pseudo tree model, where the extended sample is understood as the sample whose features comprise both the first features and the anonymous features.
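The Guest-side construction can be sketched in Python as follows, using the same assumed node format and the per-sample route/anonymous tables from the Host-side snippet above; 1.5 and 0.5 are the synthetic split values described in the pseudocode, working against anonymous feature values fixed to 1:

import copy

def build_pseudo_tree(trees, routes, anon_tables, guest_id="guest"):
    # routes[host][t][n] is "left"/"right" for the current sample;
    # anon_tables[host][t][n] is the anonymous name that Host recorded
    # for the split feature of tree t, node n.
    pseudo = copy.deepcopy(trees)
    for t, tree in enumerate(pseudo):
        for n, node in tree.items():
            if node.get("is_leaf") or node["owner"] == guest_id:
                continue                              # keep Guest nodes and leaves
            host = node["owner"]
            node["feature"] = anon_tables[host][t][n]           # e.g. "a_k_j"
            direction = routes[host][t][n]
            # anonymous feature value is 1: 1 < 1.5 goes left, 1 >= 0.5 goes right
            node["threshold"] = 1.5 if direction == "left" else 0.5
    return pseudo

# The extended sample then simply sets every anonymous feature to 1, e.g.
# x_ext = {**x_guest, **{a: 1 for a in all_anonymous_names}}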
Illustratively, fig. 19 is a schematic diagram of a pseudo tree model construction method provided in an embodiment of the present application. Referring to fig. 19, consider a sample to be explained H = {Guest: age: 10, Host1: height: 160, Host2: weight: 60}, where age is a feature provided by the Guest party, and height and weight are features provided by Host parties. In the figure, the "height" feature provided by Host1 is replaced by a_0_0, and the "weight" feature provided by Host2 is replaced by a_0_1; the value of the anonymous feature corresponding to each Host party is set to 1, e.g., the anonymous feature a_0_0 corresponding to Host1 is set to 1. Meanwhile, the routing information received by the Guest party from the first Host party for the anonymous feature a_0_0 is result_0[1] = {0: left}, i.e., the traversal should go left at the a_0_0 node. So that the traversal does go left at the a_0_0 node when traversing the tree model (assuming the left-going condition is that the value of a_0_0 is smaller than the split value at the a_0_0 node), the split value at the a_0_0 node may be set to 1.5 (any value larger than 1 works). Similarly, the anonymous feature a_0_1 is set to 1, and the routing information received by the Guest party from the second Host party for a_0_1 is result_1[1] = {1: right}, i.e., the traversal should go right at the a_0_1 node; so that the traversal goes right at a_0_1 = 1 (assuming the right-going condition is that the value of a_0_1 is greater than the split value at the a_0_1 node), the split value at the a_0_1 node may be set to 0.5. In this way, when prediction is performed on the pseudo tree model using the anonymous features, the federated prediction behavior can be simulated.
In step 404, the first participant combines the first features and the anonymous features included in the sample to obtain a plurality of feature subsets, and determines the predicted values corresponding to the feature subsets through the pseudo tree model corresponding to the sample.
In actual implementation, for each of the samples X0, X1 … Xn to be explained, the features provided by the Guest party and the anonymous features corresponding to each Host party form the extended samples X0', X1', …, Xn' corresponding to the samples to be explained. All feature subsets of each extended sample are enumerated, the pseudo tree model corresponding to the sample is created through step 403, and the predicted value corresponding to each feature subset is determined. The process of obtaining the predicted value of a feature subset through the pseudo tree model is as follows: the features provided by the Guest party for a single sample to be explained, the pseudo tree model corresponding to the sample, and all enumerated feature subsets are received as input; then, for each feature subset S and for each tree in the pseudo tree model, the features provided by the Guest party, the corresponding tree, and the feature subset S are input into the tree model estimation function, the pseudo tree containing the vectors required to run the tree model estimation function,
and the results of the individual tree evaluations are summed to obtain the score of the feature subset.
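As a sketch of the subset scoring, the following recursive estimation function follows a common tree-SHAP convention: features in the subset S follow their actual split direction, while features outside S are marginalized by cover-weighted averaging over both children. This marginalization rule and the node format are assumptions for illustration; the embodiment only states that the pseudo tree carries the vectors the estimation function needs.

def tree_expected_value(tree, nid, x, subset):
    # tree: dict node id -> node; node 0 is assumed to be the root.
    node = tree[nid]
    if node.get("is_leaf"):
        return node["value"]
    if node["feature"] in subset:
        go_left = x[node["feature"]] < node["threshold"]
        return tree_expected_value(tree, node["left" if go_left else "right"], x, subset)
    # feature not in S: average both branches, weighted by training cover
    w_left = tree[node["left"]]["cover"] / node["cover"]
    w_right = tree[node["right"]]["cover"] / node["cover"]
    return (w_left * tree_expected_value(tree, node["left"], x, subset)
            + w_right * tree_expected_value(tree, node["right"], x, subset))

def score_subset(pseudo_trees, x_ext, subset):
    # Sum over all trees to obtain the score f_x(S) of the feature subset S.
    return sum(tree_expected_value(tree, 0, x_ext, subset) for tree in pseudo_trees)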
In step 405, for each feature, the first participant determines the marginal contribution values of the feature over the feature subsets that do not include the feature, adds the marginal contribution values corresponding to the feature to obtain the total contribution value of the feature, and uses the total contribution value as the contribution information of the feature to the sample prediction result.
In practical implementation, for the sample to be interpreted, after each feature subset has been scored through step 404 (i.e., the predicted value corresponding to each feature subset has been obtained), according to the foregoing formula (1):
φ_i = Σ_{S ⊆ N\{i}} [ |S|! (|N| − |S| − 1)! / |N|! ] · ( f_x(S ∪ {i}) − f_x(S) )    (1)
the contribution information of each feature i to the prediction result of the sample is determined, where N denotes the complete feature set of the sample and S ranges over the feature subsets not containing i. When a feature is provided by the Guest party, the obtained contribution information represents the contribution of that feature to the sample prediction result; when the feature is an anonymous feature corresponding to a Host party, the obtained contribution information represents the contribution of that anonymous feature to the sample prediction result.
In practical implementation, through step 405 the Guest party obtains, for each sample, the SHAP value of each feature provided by the Guest party and the SHAP value corresponding to each anonymous feature of the Host parties. The Host parties do not obtain any results.
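For illustration, formula (1) can be evaluated by brute force from the subset scores, as in the following sketch; it is exponential in the number of features, so it only spells out the definition, and score_subset is the assumed helper from the previous snippet:

from itertools import combinations
from math import factorial

def shap_value(features, i, score):
    # Exact Shapley value of feature i, where score(S) returns f_x(S).
    others = [f for f in features if f != i]
    n = len(features)
    phi = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            S = set(S)
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (score(S | {i}) - score(S))
    return phi

# e.g. shap_value(all_features, "age",
#                 lambda S: score_subset(pseudo_trees, x_ext, S))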
In step 406, the first participant combines the first features and the anonymous features in the sample to be interpreted to obtain feature interaction groups, determines at least one interaction marginal contribution value corresponding to each feature interaction group, and adds the interaction marginal contribution values to obtain the interaction contribution corresponding to the feature interaction group.
In practical implementation, after the first participant calculates the contribution information of each first feature and of each anonymous feature in the sample to be interpreted through the above steps 401 to 405, the first participant may determine, according to the foregoing formula (3):
Φ_{i,j} = Σ_{S ⊆ N\{i,j}} [ |S|! (|N| − |S| − 2)! / (2 (|N| − 1)!) ] · δ_{i,j}(S)    (3)
δ_{i,j}(S) = f_x(S ∪ {i, j}) − f_x(S ∪ {i}) − f_x(S ∪ {j}) + f_x(S)
the interaction contribution value between any two features i and j in the sample to be interpreted. Here, S denotes any feature subset containing neither i nor j; S ∪ {i, j} denotes the feature subset S extended with both features i and j; S ∪ {i} denotes the subset extended with feature i; S ∪ {j} denotes the subset extended with feature j; and f_x is the estimation function corresponding to the federated tree model, used to determine the predicted value of each feature subset.
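A matching brute-force sketch for formula (3), under the same assumptions and imports as the previous snippet:

def shap_interaction(features, i, j, score):
    # Exact Shapley interaction value between features i and j.
    others = [f for f in features if f not in (i, j)]
    n = len(features)
    phi = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            S = set(S)
            weight = (factorial(len(S)) * factorial(n - len(S) - 2)
                      / (2 * factorial(n - 1)))
            delta = (score(S | {i, j}) - score(S | {i})
                     - score(S | {j}) + score(S))
            phi += weight * delta
    return phi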
In the embodiment of the present application, the Guest party acquires the routing information of all samples to be explained from the Host parties through a single round of communication, and the pseudo tree model corresponding to the federated tree model is constructed locally on the Guest side, so that prediction for all feature subsets can be finished locally in parallel without a complex federated-learning prediction process, greatly reducing communication traffic. The tree algorithm can thus explain multiple samples and measure contribution functions in a federated scenario, provides explanations for single samples, and remedies the shortcomings of the federated scenario. Meanwhile, the contribution information of each feature of the Host parties can be measured; moreover, the contribution of each feature in the sample to be explained can be provided while protecting the data privacy of each participant, and a feature-interaction analysis function can be provided to modeling personnel.
Continuing with the exemplary structure in which the data processing apparatus 555 of the federated tree model provided in the embodiment of the present application is implemented as software modules: in some embodiments, as shown in fig. 2A, which is a schematic structural diagram of the first participant device 400 provided in the embodiment of the present application, the software modules stored in the data processing apparatus 555 of the federated tree model in the memory 540 may include:
an obtaining module 5551, configured to obtain a first node route of a target node in the federated tree model, where the target node corresponds to at least two first features provided by the first participant device;
a receiving module 5552, configured to receive at least one anonymous feature sent by the second participant device, and a second node route corresponding to each anonymous feature; the anonymous feature corresponds to a second feature used for training the federated tree model, and the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
a simulation module 5553, configured to simulate the federated tree model based on the first node route and the second node route to obtain a pseudo-federated tree model corresponding to the federated tree model, and predict, through the pseudo-federated tree model, each feature subset included in a feature set used as a training sample of the federated tree model to obtain a corresponding predicted value; wherein the feature set comprises: the at least two first features carrying a target prediction result and the second features provided by the at least one second participant device;
a determining module 5554, configured to determine, by combining the predicted value and the target prediction result, contribution information of each feature in the feature set corresponding to the target prediction result.
In some embodiments, the receiving module is further configured to send an anonymous feature obtaining request carrying a sample identifier to the second participant device; the anonymous feature obtaining request is used for responding to the anonymous feature obtaining request by the second participant device, and determining the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature and a second node route corresponding to the anonymous feature; and receiving an anonymous feature corresponding to the sample identification and a second node route corresponding to the anonymous feature, which are returned by the second participant device.
In some embodiments, the receiving module is further configured to create an anonymous relation record table locally, where the anonymous relation record table is used to record a sample identifier of a training sample used for training the federated tree model, an anonymous feature of the second participant device corresponding to the sample identifier, and a second node route corresponding to the anonymous feature; and storing the received anonymous characteristics corresponding to the sample identification and the second node route corresponding to the anonymous characteristics in the anonymous relation record table.
In some embodiments, the determining module is further configured to select at least two features from at least two first features provided by the first participant device and at least one second feature provided by the second participant device; constructing a feature interaction group comprising the at least two features, and determining at least one interaction marginal contribution value corresponding to the feature interaction group; and determining the interaction contribution information of the feature interaction group corresponding to the target prediction result based on the at least one interaction marginal contribution value.
In some embodiments, the determining module is further configured to determine a first feature subset of the set of features, the first feature subset including at least one of the at least two features; determining a second feature subset of the feature set, the second feature subset being in a complementary relationship with the feature interaction group; and obtaining a predicted value corresponding to the first feature subset and a predicted value corresponding to the second feature subset, and determining an interaction marginal contribution value corresponding to the feature interaction group based on the predicted value corresponding to the first feature subset and the predicted value corresponding to the second feature subset.
In some embodiments, the determining module is further configured to, when the number of the interaction marginal contribution values is multiple, sum the multiple interaction marginal contribution values to obtain the interaction contribution information of the feature interaction group corresponding to the target prediction result.
In some embodiments, as shown in fig. 2B, fig. 2B is a schematic structural diagram of the second participant device 410 provided in this embodiment, and the software modules stored in the data processing apparatus 555 of the federal tree model in the memory 540 may include:
a generating module 5555, configured to generate, for each second feature used for training the federated tree model, an anonymous feature corresponding to each second feature, and obtain a second node route corresponding to each anonymous feature; the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
a sending module 5556, configured to send anonymous features corresponding to the second features and second node routes corresponding to the anonymous features to the first participant device; the second node route is used for the first participant device to obtain, based on the second node route, a pseudo-federated tree model corresponding to the federated tree model, and to determine, through the pseudo-federated tree model, contribution information of each feature in the feature set corresponding to the target prediction result; wherein the feature set, used as a training sample of the federated tree model, comprises: at least two first features carrying a target prediction result provided by the first participant device, and the at least one second feature.
In some embodiments, the sending module is further configured to receive an anonymous feature obtaining request carrying a sample identifier sent by the first party device; analyzing the anonymous characteristic acquisition request to obtain the sample identifier; determining the second feature corresponding to the sample identifier, an anonymous feature corresponding to the second feature, and a second node route corresponding to the anonymous feature; sending the anonymous feature and the second node route to the first participant device.
In some embodiments, the sending module is further configured to search a local anonymous relation record table according to the sample identifier, so as to obtain the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature, and the second node route corresponding to the anonymous feature; the anonymous relation record table is used for recording a sample identifier of a training sample used for training the federated tree model, a second feature corresponding to the sample identifier, an anonymous feature corresponding to the second feature, and a second node route corresponding to the anonymous feature.
In some embodiments, the generating module is further configured to perform hash processing on each second feature used for training the federated tree model to obtain a hash value corresponding to each second feature, and use the hash value as an anonymous feature corresponding to the second feature.
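As an illustrative sketch of this hashing step (hashlib, SHA-256 and the truncation length are assumptions; the embodiment does not fix a particular hash function):

import hashlib

def anonymize(feature_name: str, salt: str = "") -> str:
    # Map a real feature name to an anonymous identifier via hashing.
    digest = hashlib.sha256((salt + feature_name).encode("utf-8")).hexdigest()
    return "a_" + digest[:12]          # truncated digest as the anonymous name

# e.g. {"height": anonymize("height"), "weight": anonymize("weight")}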
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated.
An embodiment of the present application provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the data processing method of the federated tree model provided in the embodiments of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the method provided by the embodiments of the present application, for example, the data processing method of the federated tree model as shown in fig. 3.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but do not necessarily, correspond to a file in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiments of the present application, the predicted value corresponding to each feature subset of a sample can be determined through the pseudo federated tree model constructed by the first participant, so that the contribution information of each feature to the target prediction result is determined. The first participant can obtain the node routes corresponding to the anonymous features of all samples to be explained from each Host party through a single route acquisition request, and locally construct the pseudo tree model corresponding to the federated tree model; prediction for all feature subsets can then be finished locally in parallel without a complex federated-learning prediction process, which greatly reduces communication traffic. The tree algorithm can thereby explain multiple samples and measure contribution functions in a federated scenario, provide single-sample explanations, and remedy the shortcomings of the federated scenario; meanwhile, the feature contribution information of each Host party can be measured, and the global feature importance can also be measured.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (14)

1. A data processing method of a federated tree model, based on a federated learning system, wherein the federated learning system comprises a first participant device and at least one second participant device, and the method is applied to the first participant device, the method comprising:
obtaining a first node route of a target node in the federated tree model, wherein the target node corresponds to at least two first features provided by the first participant device;
receiving at least one anonymous feature sent by the second participant device and a second node route corresponding to each anonymous feature;
the anonymous feature corresponds to a second feature used for training the federated tree model, and the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
simulating the federal tree model based on the first node route and the second node route to obtain a pseudo-federal tree model corresponding to the federal tree model, and predicting a feature subset included in a feature set used as a training sample of the federal tree model through the pseudo-federal tree model to obtain a corresponding predicted value;
wherein the feature set comprises: the at least two first features carrying a target prediction result and the second feature provided by the at least one second participant device;
and determining contribution information of each feature in the feature set corresponding to the target prediction result by combining the prediction value and the target prediction result.
2. The method of claim 1, wherein receiving at least one anonymous feature sent by the second participant device and a second node route corresponding to each anonymous feature comprises:
sending an anonymous feature acquisition request carrying a sample identifier to the second participant device;
the anonymous feature obtaining request is used for responding to the anonymous feature obtaining request by the second participant device, and determining the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature and a second node route corresponding to the anonymous feature;
and receiving an anonymous feature corresponding to the sample identification and a second node route corresponding to the anonymous feature, which are returned by the second participant device.
3. The method of claim 2, further comprising:
creating an anonymous relation record table locally, wherein the anonymous relation record table is used for recording a sample identifier of a training sample used for training the federated tree model, an anonymous feature of the second participant device corresponding to the sample identifier, and a second node route corresponding to the anonymous feature;
and storing the received anonymous characteristics corresponding to the sample identification and the second node route corresponding to the anonymous characteristics in the anonymous relation record table.
4. The method of claim 1, wherein after the determining contribution information of each feature in the feature set corresponding to the target prediction result, the method further comprises:
selecting at least two features from at least two first features provided by the first participant device and at least one second feature provided by the second participant device;
constructing a feature interaction group comprising the at least two features, and determining at least one interaction marginal contribution value corresponding to the feature interaction group;
and determining the interaction contribution information of the feature interaction group corresponding to the target prediction result based on the at least one interaction marginal contribution value.
5. The method of claim 4, wherein the determining at least one interaction marginal contribution value corresponding to the feature interaction group comprises:
determining a first feature subset of the set of features, the first feature subset comprising at least one of the at least two features;
determining a second feature subset of the feature set, the second feature subset being in a complementary relationship with the feature interaction group;
and obtaining a predicted value corresponding to the first feature subset and a predicted value corresponding to the second feature subset, and determining an interaction marginal contribution value corresponding to the feature interaction group based on the predicted value corresponding to the first feature subset and the predicted value corresponding to the second feature subset.
6. The method of claim 4, wherein the determining the interaction contribution information of the feature interaction group corresponding to the target prediction result based on the at least one interaction marginal contribution value comprises:
and when the number of the interaction marginal contribution values is multiple, summing the multiple interaction marginal contribution values to obtain the interaction contribution information of the feature interaction group corresponding to the target prediction result.
7. A data processing method of a federated tree model, based on a federated learning system, wherein the federated learning system comprises a first participant device and at least one second participant device, and the method is applied to the second participant device, the method comprising:
respectively generating anonymous features corresponding to each second feature aiming at each second feature used for training the federated tree model, and acquiring second node routes corresponding to each anonymous feature;
the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
sending anonymous features corresponding to each of the second features and second node routes corresponding to each of the anonymous features to the first participant device;
the second node route is used for the first participant device to obtain, based on the second node route, a pseudo-federated tree model corresponding to the federated tree model, and to determine, through the pseudo-federated tree model, contribution information of each feature in the feature set corresponding to the target prediction result;
wherein the feature set, used as a training sample of the federated tree model, comprises: at least two first features carrying a target prediction result provided by the first participant device, and the at least one second feature.
8. The method of claim 7, wherein the sending anonymous features corresponding to each of the second features and second node routes corresponding to each of the anonymous features to the first participant device comprises:
receiving an anonymous feature acquisition request carrying a sample identifier sent by the first participant device;
analyzing the anonymous characteristic acquisition request to obtain the sample identifier;
determining the second feature corresponding to the sample identifier, an anonymous feature corresponding to the second feature, and a second node route corresponding to the anonymous feature;
sending the anonymous feature and the second node route to the first participant device.
9. The method of claim 8, wherein the determining the second feature to which the sample identity corresponds, the anonymous feature to which the second feature corresponds, and a second node route to which the anonymous feature corresponds comprises:
searching a local anonymous relation record table according to the sample identifier to obtain the second feature corresponding to the sample identifier, the anonymous feature corresponding to the second feature, and the second node route corresponding to the anonymous feature;
the anonymous relation record table is used for recording a sample identifier of a training sample used for training the federated tree model, a second feature corresponding to the sample identifier, an anonymous feature corresponding to the second feature, and a second node route corresponding to the anonymous feature.
10. The method of claim 7, wherein the generating, for each second feature used for training the federated tree model, an anonymous feature corresponding to each second feature comprises:
and respectively carrying out hash processing on each second feature used for training the federated tree model to obtain a hash value corresponding to each second feature, and taking the hash value as an anonymous feature corresponding to the second feature.
11. A data processing apparatus of a federated tree model, comprising:
an obtaining module, configured to obtain a first node route of a target node in the federated tree model, where the target node corresponds to at least two first features provided by the first participant device;
the receiving module is used for receiving at least one anonymous feature sent by the second participant device and a second node route corresponding to each anonymous feature; the anonymous feature corresponds to a second feature used for training the federated tree model, and the second node route is used for indicating a sub-node path corresponding to a split node when the anonymous feature is used as the split node of the federated tree model;
the simulation module is used for simulating the federal tree model based on the first node route and the second node route to obtain a pseudo-federal tree model corresponding to the federal tree model, and predicting a feature subset included in a feature set of a training sample used as the federal tree model through the pseudo-federal tree model to obtain a corresponding predicted value; wherein the feature set comprises: the at least two first characteristics carrying a target prediction result and the second characteristics provided by at least one second participant device;
and the determining module is used for determining contribution information of each feature in the feature set corresponding to the target prediction result by combining the prediction value and the target prediction result.
12. An electronic device, comprising:
a memory for storing executable instructions;
a processor configured to implement the data processing method of the federated tree model of any one of claims 1 to 10 when executing the executable instructions stored in the memory.
13. A computer-readable storage medium storing executable instructions for implementing the data processing method of the federal tree model as claimed in any one of claims 1 to 10 when executed by a processor.
14. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a data processing method of the federated tree model of any one of claims 1 to 10.
CN202210080616.9A 2022-01-24 2022-01-24 Data processing method, device, equipment and storage medium of federal tree model Pending CN114418120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210080616.9A CN114418120A (en) 2022-01-24 2022-01-24 Data processing method, device, equipment and storage medium of federal tree model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210080616.9A CN114418120A (en) 2022-01-24 2022-01-24 Data processing method, device, equipment and storage medium of federal tree model

Publications (1)

Publication Number Publication Date
CN114418120A true CN114418120A (en) 2022-04-29

Family

ID=81278135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210080616.9A Pending CN114418120A (en) 2022-01-24 2022-01-24 Data processing method, device, equipment and storage medium of federal tree model

Country Status (1)

Country Link
CN (1) CN114418120A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051118A (en) * 2023-03-01 2023-05-02 支付宝(杭州)信息技术有限公司 Analysis method and device of behavior time sequence model
CN116051118B (en) * 2023-03-01 2023-06-16 支付宝(杭州)信息技术有限公司 Analysis method and device of behavior time sequence model

Similar Documents

Publication Publication Date Title
Qi et al. Finding all you need: web APIs recommendation in web of things through keywords search
US10958748B2 (en) Resource push method and apparatus
CN110837550A (en) Knowledge graph-based question and answer method and device, electronic equipment and storage medium
CN108108821A (en) Model training method and device
CN112347754A (en) Building a Joint learning framework
CN112667877A (en) Scenic spot recommendation method and equipment based on tourist knowledge map
CN110889450B (en) Super-parameter tuning and model construction method and device
CN112035549B (en) Data mining method, device, computer equipment and storage medium
JP7103496B2 (en) Related score calculation system, method and program
CN108549909B (en) Object classification method and object classification system based on crowdsourcing
CN111400452A (en) Text information classification processing method, electronic device and computer readable storage medium
CN112115313B (en) Regular expression generation and data extraction methods, devices, equipment and media
CN109858024B (en) Word2 vec-based room source word vector training method and device
CN111159563A (en) Method, device and equipment for determining user interest point information and storage medium
CN108509545B (en) Method and system for processing comments of article
CN112785005A (en) Multi-target task assistant decision-making method and device, computer equipment and medium
CN110110218A (en) A kind of Identity Association method and terminal
CN111444438A (en) Method, device, equipment and storage medium for determining recall permission rate of recall strategy
CN115203550A (en) Social recommendation method and system for enhancing neighbor relation
CN114418120A (en) Data processing method, device, equipment and storage medium of federal tree model
KR102119518B1 (en) Method and system for recommending product based style space created using artificial intelligence
CN114330704A (en) Statement generation model updating method and device, computer equipment and storage medium
CN113821657A (en) Artificial intelligence-based image processing model training method and image processing method
CN110909072B (en) Data table establishment method, device and equipment
CN112818241B (en) Content promotion method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination