CN117236465A - Information entropy-based federal decision tree information measurement method - Google Patents

Information entropy-based federal decision tree information measurement method

Info

Publication number
CN117236465A
CN117236465A CN202311107162.0A
Authority
CN
China
Prior art keywords
participant
decision tree
node
tree
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311107162.0A
Other languages
Chinese (zh)
Inventor
陈爱国
罗光春
朱大勇
李家豪
陈嘉庚
蔡政澳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202311107162.0A priority Critical patent/CN117236465A/en
Publication of CN117236465A publication Critical patent/CN117236465A/en
Pending legal-status Critical Current

Links

Abstract

The invention relates to federated learning and machine-learning tree-model technology, and in particular to an information-entropy-based method for measuring the information content of a federated decision tree, comprising a training process and a measurement process. In the training process, a federated learning environment is first configured, sample data are acquired under the configured environment and preprocessed; CART trees are then constructed jointly by the participants and a parameter server; finally, several CART trees are assembled into a federated decision-tree forest. During the joint construction of a CART tree, the computation rate is improved by introducing the screening of inferior participants. In the measurement process, for a single decision tree, the nodes of the tree are first partitioned according to participant numbers; the probabilities of all flow-direction routes are then calculated and the entropy value of the tree is computed; the information content of the tree is obtained as the difference between the entropy value of the empty tree and the entropy value of the tree. By quantifying the information content of the decision trees in the federated decision-tree forest, the method achieves the purpose of optimizing the benefit distribution among federated decision-tree participants.

Description

Information entropy-based federal decision tree information measurement method
Technical Field
The invention relates to federated learning and machine-learning tree-model technology, and in particular to an information-entropy-based method for measuring the information content of a federated decision tree.
Background
With the continuous development of the internet and the rise of e-commerce, short video, and similar services, research on recommendation algorithms has grown deeper and their applications wider. Because decision-tree algorithms have good interpretability, they are often used to implement recommendation algorithms. Conventional decision-tree algorithms usually require user data to be collected and processed centrally and a global model to be trained on those data, but this approach risks exposing private data.
A federated decision tree enables cooperative sharing among platforms while protecting data privacy, improving the accuracy and robustness of the recommendation model. Federated Forest, proposed by Liu et al. in 2019, is one implementation of the federated decision-tree approach: each participant, for example a different platform, retains its own data, a model is trained on each participant with a federated learning method, and the models are finally integrated into a global model. Because each participant keeps only its own data, the problem of data-privacy leakage does not arise. A federated forest can train a more accurate recommendation model by pooling the user data and behaviour held by the individual platforms. Taking an e-commerce platform as an example, a federated forest can use a user's search records, browsing records, and similar data on different platforms to recommend more suitable goods.
Because a federated decision tree requires every platform to cooperate when it is applied, and the information contributed by each platform is embedded in the decision tree, an effective allocation method is lacking for the final benefit distribution. How to provide an effective method for measuring the information in a decision tree is therefore a key difficulty in applying federated decision trees.
Disclosure of Invention
The object of the invention is to provide an information-entropy-based method for measuring the information content of a federated decision tree, so as to solve the prior-art problem that the benefit distribution among federated decision-tree participants cannot be quantified.
In order to achieve the above object, the present invention adopts the following technical scheme:
The information-entropy-based method for measuring the information content of a federated decision tree comprises a training process and a measurement process.
The training process comprises:
s1, configuring environment and data preparation of federal learning
The federal learning environment configuration comprises configuring a parameter server, and accessing N participants and the parameter server into a private network;
the data preparation comprises sample data acquisition and preprocessing, wherein the sample data acquisition and preprocessing processes are as follows:
each participant obtains respective sample dataAnd data label F i And sample data->And data label F i Uploading to a parameter server; sample data->Comprises sample id, commodity id, category, browsing time, browsing duration, etc., wherein +.>Sample data representing the ith participant, F i A data tag representing the ith participant. In this embodiment, sample data is obtained by collecting user commodity browsing behavior data and adding shopping cart commodity tags>And data label F i
The parameter server combines the sample ids uploaded by all the participants to obtain a sample id set D, and combines the labels F uploaded by all the participants to obtain a label set F;
s2, constructing CART tree by combining participants and parameter server together
S2.1, a parameter server creates a decision tree T, and creates an empty root node on the T, so that the initialization of the decision tree construction is realized;
s2.2, the parameter server distributes the structure of the decision tree T to each participant and marks the current newly added node;
s2.3, after the participant updates the local decision tree structure according to the decision tree structure provided by the S2.2, calculating the segmentation parameters of the current node, and feeding back the calculation result to the parameter server; the node segmentation parameters include: local optimum Gini index Gini i Optimal split tag j i Optimal segmentation value s i Probability of flow p i
S2.4, the parameter server calculates global node segmentation parameters to determine alternative participants of the communication round, and poor-quality participants are screened out to terminate the subsequent training process;
s2.5, selecting to delete or store node information by the participant according to the alternative participant of the communication round determined by the S2.4 parameter server, and deleting or storing the node information and sending the node information to the parameter server;
s2.6, repeating the steps S2.2 to S2.6, and updating the decision tree structure T until the construction of the federal decision tree is completed;
s3, integrating a plurality of CART trees to form a federal decision tree forest
S3, adding the federal decision tree into a federal decision tree forest, determining the size of the forest according to the requirement, and repeatedly executing S2 if the forest is not large enough;
the measurement process comprises the following steps:
S4, calculating the information content of each tree; taking participant i as an example, the information content of a tree is calculated as follows:
S4.1, partitioning the nodes of the tree according to participant numbers, with the following rule: the node information of the communication rounds whose candidate participant number is the same as that of participant i is retained, and the remaining nodes become empty nodes;
S4.2, finding the combinations of node flow-direction routes, which comprises two parts: finding all node flow-direction routes and calculating the flow-direction probabilities; the flow-direction routes are found using the decision-tree sample-prediction principle, and the flow-direction route probabilities are obtained with a preset calculation rule;
S4.3, summarizing all flow-direction route combinations and constructing a flow-direction route probability distribution table;
S4.4, calculating the entropy value CH of each flow-direction route with formula (1), where formula (1) is as follows:
CH = -∑_j p(x_j) * log2(p(x_j))   (1)
where p(x_j) denotes the probability of the j-th leaf;
S4.5, calculating the flow-direction probability of each flow-direction route;
S4.6, calculating the entropy value TH of the decision tree T with formula (2), where formula (2) is as follows:
TH = ∑_k p_k * CH_k   (2)
where p_k denotes the flow-direction probability of the k-th flow-direction route and CH_k is the entropy value of that route;
S4.7, setting all nodes as empty nodes and calculating the entropy value TH' of the empty tree of the decision tree T;
S4.8, according to the entropy value TH of the decision tree T and the entropy value TH' of the empty tree, calculating the information content T_info of the decision tree T with formula (3), where formula (3) is as follows:
T_info = TH' - TH.   (3)
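As a minimal sketch of how formulas (1) to (3) combine, assuming the flow-direction routes and their leaf probability distributions have already been obtained from S4.1 to S4.5 (the function names and the route representation below are illustrative, not part of the claimed method):

import math

def route_entropy(leaf_dist):
    # Formula (1): CH = -sum_j p(x_j) * log2(p(x_j)) over the leaves of one route.
    return -sum(p * math.log2(p) for p in leaf_dist.values() if p > 0)

def tree_entropy(routes):
    # Formula (2): TH = sum_k p_k * CH_k, where routes is a list of
    # (route probability p_k, {leaf id: p(x_j)}) pairs.
    return sum(p_k * route_entropy(dist) for p_k, dist in routes)

def tree_information(routes, empty_tree_routes):
    # Formula (3): T_info = TH' - TH, with TH' computed on the all-empty tree.
    return tree_entropy(empty_tree_routes) - tree_entropy(routes)

For example, a route that reaches one leaf with certainty contributes CH = 0, while a route spread evenly over two leaves contributes CH = 1 bit, so the more node information participant i has retained, the smaller TH and the larger T_info.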
Further, the specific procedure by which the parameter server in S2.2 distributes the structure of the decision tree T to each participant is as follows:
S2.2.1, the parameter server randomly samples from D with probability 80% to obtain a global sample id sampling subset D', and randomly samples from F with probability 80% to obtain a global data-label sampling subset F';
S2.2.2, the following operations are performed for every participant i ∈ {1, ..., N}:
intersect D' with the sample id set D_i of participant i to obtain the sample id sampling subset D'_i of participant i, and intersect F' with D_i to obtain the data-label sampling subset F'_i of participant i; then distribute D'_i and F'_i to the i-th participant.
Further, the specific procedure for calculating the current node split parameters in S2.3 is as follows:
S2.3.1, receive D'_i and F'_i, and intersect D'_i with the sample data X_i by sample id to obtain the data sample sampling subset X'_i of participant i;
S2.3.3, based on the CART-tree construction principle, calculate the local optimal Gini index Gini_i, the optimal split label j_i, and the optimal split value s_i of X'_i;
calculate the flow probabilities p_i = (p_i^left, p_i^right), which respectively denote the proportions of the samples flowing to the left and to the right child node after X'_i is split by the optimal split label and the optimal split value.
Further, the specific procedure of step S2.4 is as follows:
the parameter server gathers the local optimal Gini indices, optimal split labels, optimal split values, and flow probabilities p_i of all participants, and selects the participant with the highest local optimal Gini index as the candidate participant of this communication round;
if the candidate participant of this round is the same as that of the previous round, the participant with the second-highest local optimal Gini index is selected as this round's candidate participant instead; the candidate participant of this round is notified to save the node information, i.e., sign_i = save is sent to it, and the other participants are notified to delete the node information, i.e., sign_i = delete is sent to them;
if a participant has not been selected as a candidate participant within N·|F|/2 communication rounds, it is regarded as an inferior participant and its subsequent training process is terminated. It should be noted that an inferior participant may keep the tree model generated so far, but cannot take part in any subsequent process.
Further, the specific procedure of S2.5 is as follows:
S2.5.1, receive from the parameter server the signal sign_i indicating whether to save or delete the node information, and perform the following operations:
if sign_i = delete, delete Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i; no inferior-participant screening is performed at the same time;
if sign_i = save, store Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i in the current empty node;
S2.5.2, according to the optimal split label j_i and the optimal split value s_i, split X'_i to obtain the data sample subset X'_{i,left} flowing to the left child node under participant i and the data sample subset X'_{i,right} flowing to the right child node under participant i;
S2.5.3, extract the sample ids of X'_{i,left} and X'_{i,right} to obtain the sample id subset D'_{i,left} flowing to the left child node under participant i and the sample id subset D'_{i,right} flowing to the right child node under participant i;
S2.5.4, send D'_{i,left} and D'_{i,right} to the parameter server.
Further, the specific procedure of S2.6 is as follows:
S2.6.1, store in the currently added node the number of this round's candidate participant together with that participant's local optimal Gini index, optimal split label, optimal split value, and flow probability; receive D'_{i,left} and D'_{i,right} from the candidate participant of this communication round and set them as D'_left and D'_right; if the length of D'_left is less than 5% of that of D, or the length of D'_right is less than 5% of that of D, stop this communication round and do not execute the remaining steps;
S2.6.2, create the left and right child nodes of the current node;
S2.6.3, for the left child node, create a new communication round, set the left child node as the current node, set D'_left as D', and recursively perform S2.3 to S2.6; for the right child node, create a new communication round, set the right child node as the current node, set D'_right as D', and recursively perform S2.3 to S2.6.
With the above technical scheme, the invention has the following beneficial effects:
(1) In the training process, the computation rate is improved by introducing the screening of inferior participants. In addition, the method requires each participant to upload only one additional parameter, so the participants' privacy is protected to the greatest extent.
(2) By decomposing the flow probabilities of the decision tree, the scheme quantifies the information content of each decision tree in the federated decision-tree forest, thereby achieving the purpose of optimizing the benefit distribution among federated decision-tree participants.
Drawings
FIG. 1 is a flow chart of a training process in an embodiment of the invention;
FIG. 2 is a flow chart of the measurement process in an embodiment of the present invention;
fig. 3 is an exemplary diagram of a flow direction route probability distribution table in an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and embodiments.
This embodiment takes e-commerce commodity recommendation as the application background and provides an information-entropy-based method for measuring the information content of a federated decision tree, comprising two parts, a training process and a measurement process. In the training process, the federated learning environment is configured first; this includes configuring an additional parameter server and connecting the participants, where a participant is a server on which an e-commerce platform collects shopping-cart information, and the participants are generally not in the same business division. The participants then acquire and preprocess their sample data. Next, the participants and the parameter server jointly construct CART trees for predicting users' commodity preferences. Finally, several CART trees are assembled into a federated decision-tree forest. In the measurement process, for a single decision tree, the nodes of the tree are first partitioned according to participant numbers; the probabilities of all flow-direction routes are then calculated and the entropy value of the tree is computed; finally, the information content of the tree is obtained as the difference between the entropy value of the empty tree and the entropy value of the tree. The invention quantifies the contribution of each participant to prediction and can provide the e-commerce platform with a basis for settling the profits of different institutions. The details of each part are as follows:
Referring to FIG. 1, the training process includes the following steps:
S1, configuring the federated learning environment and preparing data, which includes the following:
the federated learning environment configuration comprises configuring a parameter server and connecting the N participants and the parameter server to a private network;
the data preparation comprises sample data acquisition and preprocessing, which proceed as follows:
the participant of each business division acquires its own commodity sample data X_i and commodity data label F_i and uploads the sample data X_i and data label F_i to the parameter server; the sample data X_i comprise the sample id, commodity id, category, browsing time, browsing duration, order-placement status, and so on, where X_i denotes the sample data of the i-th participant and F_i denotes the data label of the i-th participant. In this embodiment, the sample data X_i are obtained by collecting the users' commodity-browsing behaviour, and the data labels F_i are the add-to-shopping-cart commodity tags;
the parameter server merges the sample ids uploaded by all participants to obtain a sample id set D, and merges the labels F_i uploaded by all participants to obtain a label set F;
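The merging step on the parameter-server side can be illustrated with a short sketch; it assumes each participant uploads its sample id set and a mapping from sample id to the add-to-shopping-cart label (the function and variable names are illustrative):

def merge_uploads(participant_ids, participant_labels):
    # participant_ids: one set of sample ids per participant.
    # participant_labels: one dict {sample id: add-to-cart label} per participant.
    D = set().union(*participant_ids)   # global sample id set D
    F = {}                              # global label set F, keyed by sample id
    for labels in participant_labels:
        F.update(labels)
    return D, F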
S2, jointly constructing a CART tree by the participants and the parameter server
S2.1, the parameter server creates a decision tree T and creates an empty root node on T, thereby initializing the construction of the decision tree;
S2.2, the parameter server distributes the structure of the decision tree T to each participant and marks the currently added node; the structure of the decision tree T is distributed as follows:
S2.2.1, the parameter server randomly samples from D with probability 80% to obtain a global sample id sampling subset D', and randomly samples from F with probability 80% to obtain a global data-label sampling subset F';
S2.2.2, the following operations are performed for every participant i ∈ {1, ..., N}:
intersect D' with the sample id set D_i of participant i to obtain D'_i, and intersect F' with D_i to obtain F'_i; then distribute D'_i and F'_i to the i-th participant.
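A sketch of S2.2.1 and S2.2.2 on the parameter-server side, under the assumption that D is a set of sample ids, F maps sample ids to labels, and participant_ids maps each participant number i to its own sample id set D_i (names are illustrative):

import random

def distribute_structure(D, F, participant_ids, rate=0.8):
    # S2.2.1: sample each global id and label with probability 80%.
    D_prime = {sid for sid in D if random.random() < rate}
    F_prime = {sid: lab for sid, lab in F.items() if random.random() < rate}
    # S2.2.2: intersect with each participant's own id set and distribute.
    out = {}
    for i, D_i in participant_ids.items():
        D_prime_i = D_prime & D_i
        F_prime_i = {sid: lab for sid, lab in F_prime.items() if sid in D_i}
        out[i] = (D_prime_i, F_prime_i)   # D'_i and F'_i for participant i
    return out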
S2.3, after updating its local decision-tree structure according to the structure provided in S2.2, each participant calculates the node split parameters of the current node and feeds the results back to the parameter server; the node split parameters include the local optimal Gini index Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i.
The node split parameters are calculated as follows:
S2.3.1, receive D'_i and F'_i, and intersect D'_i with the sample data X_i by sample id to obtain the data sample sampling subset X'_i of participant i;
S2.3.3, based on the CART-tree construction principle, calculate the local optimal Gini index Gini_i, the optimal split label j_i, and the optimal split value s_i of X'_i;
calculate the flow probabilities p_i = (p_i^left, p_i^right), which respectively denote the proportions of the samples flowing to the left and to the right child node after X'_i is split by the optimal split label and the optimal split value.
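The local split search of S2.3 can be sketched against standard CART practice; the weighted Gini impurity is minimised here, and the exact candidate-threshold enumeration and tie-breaking of the patented method may differ. X is the (n, d) feature array of the sampled subset X'_i and y the corresponding labels:

import numpy as np

def gini(y):
    # Gini impurity of a label vector.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_local_split(X, y):
    # Returns (Gini_i, j_i, s_i, p_i): best weighted impurity, split feature,
    # split value, and the left/right flow probabilities.
    n, d = X.shape
    best = (float("inf"), None, None, (0.0, 0.0))
    for j in range(d):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            n_left = int(left.sum())
            if n_left == 0 or n_left == n:
                continue
            g = (n_left * gini(y[left]) + (n - n_left) * gini(y[~left])) / n
            if g < best[0]:
                best = (g, j, s, (n_left / n, (n - n_left) / n))
    return best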
S2.4, the parameter server calculates the global node split parameters, determines the candidate participant of this communication round, screens out inferior participants, and terminates their subsequent training process;
the parameter server gathers the local optimal Gini indices, optimal split labels, optimal split values, and flow probabilities p_i of all participants, and selects the participant with the highest local optimal Gini index as the candidate participant of this communication round;
if the candidate participant of this round is the same as that of the previous round, the participant with the second-highest local optimal Gini index is selected as this round's candidate participant instead; the candidate participant of this round is notified to save the node information, i.e., sign_i = save is sent to it, and the other participants are notified to delete the node information, i.e., sign_i = delete is sent to them.
If a participant has not been selected as a candidate participant within N·|F|/2 communication rounds, it is regarded as an inferior participant and its subsequent training process is terminated. It should be noted that an inferior participant may keep the tree model generated so far, but cannot take part in any subsequent process.
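A sketch of the server-side selection and screening of S2.4, following the selection rule exactly as stated above (highest reported index first, second highest when the previous round's candidate repeats); the bookkeeping of unselected rounds and the handling of the N·|F|/2 threshold are illustrative assumptions:

def select_candidate(reports, prev_candidate, rounds_unselected, n_participants, n_labels):
    # reports: {participant number: (Gini_i, j_i, s_i, p_i)} for this communication round.
    ranked = sorted(reports, key=lambda i: reports[i][0], reverse=True)
    candidate = ranked[0]
    if candidate == prev_candidate and len(ranked) > 1:
        candidate = ranked[1]
    # Notify: the candidate saves the node information, the others delete it.
    signs = {i: ("save" if i == candidate else "delete") for i in reports}
    # Screen inferior participants: never selected within N * |F| / 2 rounds.
    threshold = n_participants * n_labels / 2
    inferior = {i for i, r in rounds_unselected.items()
                if i != candidate and r + 1 >= threshold}
    return candidate, signs, inferior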
S2.5, according to the candidate participant of this communication round determined by the parameter server in S2.4, each participant deletes or retains the node information, completing the preparation for constructing the child nodes. The detailed process is as follows:
S2.5.1, receive from the parameter server the signal sign_i indicating whether to save or delete the node information, and perform the following operations:
if sign_i = delete, delete Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i; no inferior-participant screening is performed at the same time;
if sign_i = save, store Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i in the current empty node;
S2.5.2, according to the optimal split label j_i and the optimal split value s_i, split X'_i to obtain the data sample subset X'_{i,left} flowing to the left child node under participant i and the data sample subset X'_{i,right} flowing to the right child node under participant i;
S2.5.3, extract the sample ids of X'_{i,left} and X'_{i,right} to obtain the sample id subset D'_{i,left} flowing to the left child node under participant i and the sample id subset D'_{i,right} flowing to the right child node under participant i;
S2.5.4, send D'_{i,left} and D'_{i,right} to the parameter server.
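The participant side of S2.5 reduces to a small amount of bookkeeping; the sketch below assumes the sampled subset X'_i is held as a mapping from sample id to feature vector, and that only the id subsets, never the feature values, are reported back:

def apply_sign(sign_i, X_prime_i, j_i, s_i):
    # S2.5.1: a participant told to delete simply discards the split parameters.
    if sign_i == "delete":
        return None
    # S2.5.2: split X'_i by the optimal split label j_i and split value s_i.
    left_ids = {sid for sid, x in X_prime_i.items() if x[j_i] <= s_i}
    right_ids = {sid for sid, x in X_prime_i.items() if x[j_i] > s_i}
    # S2.5.3-S2.5.4: the sample id subsets D'_{i,left} and D'_{i,right} go to the server.
    return left_ids, right_ids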
S2.6, the parameter server updates the tree model;
S2.6.1, store in the currently added node the number of this round's candidate participant together with that participant's local optimal Gini index, optimal split label, optimal split value, and flow probability; receive D'_{i,left} and D'_{i,right} from the candidate participant of this communication round and set them as D'_left and D'_right; if the length of D'_left is less than 5% of that of D, or the length of D'_right is less than 5% of that of D, stop this communication round and do not execute the remaining steps;
S2.6.2, create the left and right child nodes of the current node;
S2.6.3, for the left child node, create a new communication round, set the left child node as the current node, set D'_left as D', and recursively perform S2.3 to S2.6; for the right child node, create a new communication round, set the right child node as the current node, set D'_right as D', and recursively perform S2.3 to S2.6;
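A sketch of the server-side recursion of S2.6; run_round is a hypothetical helper standing for one pass of S2.2 to S2.5 that stores the candidate's split in the current node and returns the reported id subsets, and the 5% stopping rule follows S2.6.1:

class Node:
    def __init__(self):
        self.split = None   # (candidate number, Gini_i, j_i, s_i, p_i)
        self.left = None
        self.right = None

def grow(node, D_prime, total, run_round, min_frac=0.05):
    # S2.6.1: one communication round; run_round fills node.split and
    # returns the id subsets D'_left and D'_right.
    D_left, D_right = run_round(node, D_prime)
    if len(D_left) < min_frac * total or len(D_right) < min_frac * total:
        return                                   # 5% rule: stop this round
    node.left, node.right = Node(), Node()       # S2.6.2: create the child nodes
    grow(node.left, D_left, total, run_round, min_frac)    # S2.6.3: recurse left
    grow(node.right, D_right, total, run_round, min_frac)  # S2.6.3: recurse right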
S3, adding the federated decision tree T to the federated decision-tree forest, determining the size of the forest as required, and repeating S2 if the forest is not yet large enough;
Referring to FIG. 2, the measurement process includes the following steps:
S4, calculating the information content of each tree; participant i is taken as an example to describe the calculation process:
S4.1, partitioning the nodes of the tree according to participant numbers, with the following rule: the node information of the communication rounds whose candidate participant number is the same as that of participant i is retained, and the remaining nodes become empty nodes;
S4.2, finding the combinations of node flow-direction routes, which comprises two parts: finding all node flow-direction routes and calculating the flow-direction route probabilities.
In this embodiment, all flow-direction routes through the retained node information are found using the decision-tree sample-prediction principle. Specifically, samples flow from the root node towards the leaf nodes; whenever a retained non-leaf node is encountered, the samples arriving at that node are divided according to its split information, and the divided samples flow to the left and right child nodes respectively. Whenever an empty non-leaf node is encountered, the sample flows to the left and right child nodes simultaneously, i.e., an empty non-leaf node creates a new flow-direction route;
in this embodiment, the flow-direction route probability is calculated according to a preset rule. The flow-direction route probability represents the probability that a hypothetical sample flows to a given leaf node. The calculation rule is as follows: if, within a flow-direction route, the sample can flow only to one determined leaf node, that leaf node has probability 1 in the probability distribution and the remaining leaf nodes have probability 0; each time the hypothetical sample passes through an empty non-leaf node on the route, the probability that it flows to a given leaf node is halved.
S4.3, summarizing all flow-direction route combinations and constructing a flow-direction route probability distribution table, an example of which is shown in FIG. 3;
S4.4, calculating the entropy value CH of each flow-direction route, see formula (1), where p(x_j) denotes the probability of the j-th leaf:
CH = -∑_j p(x_j) * log2(p(x_j))   (1)
S4.5, calculating the flow-direction probability of each flow-direction route;
S4.6, calculating the entropy value TH of the decision tree T, see formula (2), where p_k denotes the flow-direction probability of the k-th flow-direction route and CH_k is the entropy value of that route:
TH = ∑_k p_k * CH_k   (2)
S4.7, setting all nodes as empty nodes and calculating the entropy value TH' of the empty tree of the decision tree T;
S4.8, according to the entropy value TH of the decision tree T and the entropy value TH' of the empty tree, calculating the information content T_info of the decision tree T with formula (3), where formula (3) is as follows:
T_info = TH' - TH   (3)
S5, obtaining the information content of all the federated decision trees in the forest from the per-tree information content calculated in S4. From the information content of the federated decision trees, the e-commerce platform can determine the contribution of the different business divisions to the commodity-preference recommendation task and settle different benefits accordingly.
In summary, the information-entropy-based method for measuring the information content of a federated decision tree solves the prior-art problem that the benefit distribution among federated decision-tree participants cannot be quantified. Because a step for screening inferior participants is introduced, the training process computes on a smaller amount of data than the conventional federated-forest algorithm, and the impact on the participants' computation load is small.

Claims (6)

1. An information-entropy-based federated decision tree information content measurement method, comprising a training process and a measurement process, characterized in that:
the training process comprises the following steps:
S1, configuring the federated learning environment and preparing data
The federated learning environment configuration comprises configuring a parameter server and connecting the N participants and the parameter server to a private network;
the data preparation comprises sample data acquisition and preprocessing, which proceed as follows:
each participant obtains its own sample data X_i and data label F_i and uploads the sample data X_i and data label F_i to the parameter server, where X_i denotes the sample data of the i-th participant and F_i denotes the data label of the i-th participant; the sample data X_i comprise the sample id, commodity id, category, browsing time, browsing duration, and so on; the sample data X_i are obtained by collecting the users' commodity-browsing behaviour, and the data labels F_i are the add-to-shopping-cart commodity tags;
the parameter server merges the sample ids uploaded by the participants to obtain a sample id set D, and merges the labels F_i uploaded by each participant to obtain a label set F;
S2, jointly constructing a CART tree by the participants and the parameter server
S2.1, the parameter server creates a decision tree T and creates an empty root node on T, thereby initializing the construction of the decision tree;
S2.2, the parameter server distributes the structure of the decision tree T to each participant and marks the currently added node;
S2.3, after updating its local decision-tree structure according to the structure provided in S2.2, each participant calculates the split parameters of the current node and feeds the results back to the parameter server; the node split parameters include the local optimal Gini index Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i;
S2.4, the parameter server calculates the global node split parameters, determines the candidate participant of this communication round, and screens out inferior participants, terminating their subsequent training;
S2.5, according to the candidate participant of this communication round determined by the parameter server in S2.4, each participant deletes or saves the node information, and the resulting sample id subsets are sent to the parameter server;
S2.6, repeating S2.2 to S2.6 to update the decision-tree structure T until the construction of the federated decision tree is completed;
S3, assembling several CART trees into a federated decision-tree forest
S3, adding the federated decision tree to the federated decision-tree forest, determining the size of the forest as required, and repeating S2 if the forest is not yet large enough;
the measurement process comprises the following steps:
S4, calculating the information content of each tree; taking participant i as an example, the information content of a tree is calculated as follows:
S4.1, partitioning the nodes of the tree according to participant numbers, with the following rule: the node information of the communication rounds whose candidate participant number is the same as that of participant i is retained, and the remaining nodes become empty nodes;
S4.2, finding the combinations of node flow-direction routes, which comprises two parts: finding all node flow-direction routes and calculating the flow-direction probabilities; the flow-direction routes are found using the decision-tree sample-prediction principle, and the flow-direction route probabilities are obtained with a preset calculation rule;
S4.3, summarizing all flow-direction route combinations and constructing a flow-direction route probability distribution table;
S4.4, calculating the entropy value CH of each flow-direction route with formula (1), where formula (1) is as follows:
CH = -∑_j p(x_j) * log2(p(x_j))   (1)
where p(x_j) denotes the probability of the j-th leaf;
S4.5, calculating the flow-direction probability of each flow-direction route;
S4.6, calculating the entropy value TH of the decision tree T with formula (2), where formula (2) is as follows:
TH = ∑_k p_k * CH_k   (2)
where p_k denotes the flow-direction probability of the k-th flow-direction route and CH_k is the entropy value of that route;
S4.7, setting all nodes as empty nodes and calculating the entropy value TH' of the empty tree of the decision tree T;
S4.8, according to the entropy value TH of the decision tree T and the entropy value TH' of the empty tree, calculating the information content T_info of the decision tree T with formula (3), where formula (3) is as follows:
T_info = TH' - TH.   (3)
2. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the parameter server in S2.2 distributes the structure of the decision tree T to each participant as follows:
S2.2.1, the parameter server randomly samples from D with probability 80% to obtain a global sample id sampling subset D', and randomly samples from F with probability 80% to obtain a global data-label sampling subset F';
S2.2.2, the following operations are performed for every participant i ∈ {1, ..., N}:
intersect D' with the sample id set D_i of participant i to obtain the sample id sampling subset D'_i of participant i, and intersect F' with D_i to obtain the data-label sampling subset F'_i of participant i; then distribute D'_i and F'_i to the i-th participant.
3. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the current node split parameters in S2.3 are calculated as follows:
S2.3.1, receive D'_i and F'_i, and intersect D'_i with the sample data X_i by sample id to obtain the data sample sampling subset X'_i of participant i;
S2.3.3, based on the CART-tree construction principle, calculate the local optimal Gini index Gini_i, the optimal split label j_i, and the optimal split value s_i of X'_i;
calculate the flow probabilities p_i = (p_i^left, p_i^right), which respectively denote the proportions of the samples flowing to the left and to the right child node after X'_i is split by the optimal split label and the optimal split value.
4. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the detailed procedure of step S2.4 is as follows:
the parameter server gathers the local optimal Gini indices, optimal split labels, optimal split values, and flow probabilities p_i of all participants, and selects the participant with the highest local optimal Gini index as the candidate participant of this communication round;
if the candidate participant of this round is the same as that of the previous round, the participant with the second-highest local optimal Gini index is selected as this round's candidate participant instead; the candidate participant of this round is notified to save the node information, i.e., sign_i = save is sent to it, and the other participants are notified to delete the node information, i.e., sign_i = delete is sent to them;
if a participant has not been selected as a candidate participant within N·|F|/2 communication rounds, it is regarded as an inferior participant and its subsequent training process is terminated; an inferior participant may keep the tree model generated so far, but cannot take part in any subsequent process.
5. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the detailed procedure of S2.5 is as follows:
S2.5.1, receive from the parameter server the signal sign_i indicating whether to save or delete the node information, and perform the following operations:
if sign_i = delete, delete Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i; no inferior-participant screening is performed at the same time;
if sign_i = save, store Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i in the current empty node;
S2.5.2, according to the optimal split label j_i and the optimal split value s_i, split X'_i to obtain the data sample subset X'_{i,left} flowing to the left child node under participant i and the data sample subset X'_{i,right} flowing to the right child node under participant i;
S2.5.3, extract the sample ids of X'_{i,left} and X'_{i,right} to obtain the sample id subset D'_{i,left} flowing to the left child node under participant i and the sample id subset D'_{i,right} flowing to the right child node under participant i;
S2.5.4, send D'_{i,left} and D'_{i,right} to the parameter server.
6. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the detailed procedure of S2.6 is as follows:
S2.6.1, store in the currently added node the number of this round's candidate participant together with that participant's local optimal Gini index, optimal split label, optimal split value, and flow probability; receive D'_{i,left} and D'_{i,right} from the candidate participant of this communication round and set them as D'_left and D'_right; if the length of D'_left is less than 5% of that of D, or the length of D'_right is less than 5% of that of D, stop this communication round and do not execute the remaining steps;
S2.6.2, create the left and right child nodes of the current node;
S2.6.3, for the left child node, create a new communication round, set the left child node as the current node, set D'_left as D', and recursively perform S2.3 to S2.6; for the right child node, create a new communication round, set the right child node as the current node, set D'_right as D', and recursively perform S2.3 to S2.6.
CN202311107162.0A 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method Pending CN117236465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311107162.0A CN117236465A (en) 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311107162.0A CN117236465A (en) 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method

Publications (1)

Publication Number Publication Date
CN117236465A true CN117236465A (en) 2023-12-15

Family

ID=89097638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311107162.0A Pending CN117236465A (en) 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method

Country Status (1)

Country Link
CN (1) CN117236465A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455340A (en) * 2023-12-23 2024-01-26 翌飞锐特电子商务(北京)有限公司 Logistics freight transportation information sharing and pushing method based on one record supply chain order
CN117455340B (en) * 2023-12-23 2024-03-08 翌飞锐特电子商务(北京)有限公司 Logistics freight transportation information sharing and pushing method based on one record supply chain order

Similar Documents

Publication Publication Date Title
Buntain et al. Identifying social roles in reddit using network structure
CN102591917B (en) Data processing method and system and related device
CN109643329A (en) Chart is generated from the data in tables of data
CN106844407B (en) Tag network generation method and system based on data set correlation
CN110221965A (en) Test cases technology, test method, device, equipment and system
CN112308157A (en) Decision tree-oriented transverse federated learning method
CN107194672B (en) Review distribution method integrating academic expertise and social network
CN112364908A (en) Decision tree-oriented longitudinal federal learning method
CN117236465A (en) Information entropy-based federal decision tree information measurement method
CN114332984B (en) Training data processing method, device and storage medium
CN111885399A (en) Content distribution method, content distribution device, electronic equipment and storage medium
CN111222847B (en) Open source community developer recommendation method based on deep learning and unsupervised clustering
CN106951471A (en) A kind of construction method of the label prediction of the development trend model based on SVM
CN111382181A (en) Designated enterprise family affiliation analysis method and system based on stock right penetration
CN113902534A (en) Interactive risk group identification method based on stock community relation map
CN109800354A (en) A kind of resume modification intension recognizing method and system based on the storage of block chain
CN106600213A (en) Intelligent resume management system and method
CN109284500A (en) Information transmission system and method based on merchants inviting work process and reading preference
CN112925899B (en) Ordering model establishment method, case clue recommendation method, device and medium
US20210240701A1 (en) Information processing apparatus, determination method, non-transitory computer readable medium storing program, and information processing system
CN107767155A A kind of method and system for assessing user's representation data
CN111078859B (en) Author recommendation method based on reference times
CN108717445A (en) A kind of online social platform user interest recommendation method based on historical data
CN116150470A (en) Content recommendation method, device, apparatus, storage medium and program product
CN111460300A (en) Network content pushing method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination