CN117236465A - Information entropy-based federal decision tree information measurement method - Google Patents

Information entropy-based federal decision tree information measurement method

Info

Publication number
CN117236465A
CN117236465A CN202311107162.0A
Authority
CN
China
Prior art keywords
participant
decision tree
node
tree
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311107162.0A
Other languages
Chinese (zh)
Inventor
陈爱国
罗光春
朱大勇
李家豪
陈嘉庚
蔡政澳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202311107162.0A priority Critical patent/CN117236465A/en
Publication of CN117236465A publication Critical patent/CN117236465A/en
Pending legal-status Critical Current

Links

Abstract

The invention relates to federated learning and machine-learning tree-model technology, and in particular to an information-entropy-based method for measuring the information content of a federated decision tree, comprising a training process and a measurement process. In the training process, a federated learning environment is first configured, sample data are acquired under the configured environment and preprocessed; CART trees are then constructed jointly by the participants and a parameter server; finally, several CART trees are assembled into a federated decision-tree forest. During the joint construction of a CART tree, the computation rate is improved by introducing the screening of inferior participants. In the measurement process, for a single decision tree, the nodes of the tree are first partitioned according to participant numbers; the probabilities of all flow-direction routes are then calculated and the entropy value of the tree is computed; the information content of the tree is obtained as the difference between the entropy value of the empty tree and the entropy value of the tree. By quantifying the information content of the decision trees in the federated decision-tree forest, the method achieves the purpose of optimizing the benefit distribution among federated decision-tree participants.

Description

Information entropy-based federal decision tree information measurement method
Technical Field
The invention relates to federated learning and machine-learning tree-model technology, and in particular to an information-entropy-based method for measuring the information content of a federated decision tree.
Background
With the continuous development of the internet and the rise of e-commerce, short video, and similar services, research on recommendation algorithms has grown deeper and their applications wider. Because decision-tree algorithms have good interpretability, they are often used to implement recommendation algorithms. Conventional decision-tree algorithms usually require user data to be collected and processed centrally and a global model to be trained on those data, but this approach risks exposing private data.
A federated decision tree enables cooperative sharing among platforms while protecting data privacy, improving the accuracy and robustness of the recommendation model. Federated Forest, proposed by Liu et al. in 2019, is one implementation of the federated decision-tree approach: each participant, for example a different platform, retains its own data, a model is trained on each participant with a federated learning method, and the models are finally integrated into a global model. Because each participant keeps only its own data, the problem of data-privacy leakage does not arise. A federated forest can train a more accurate recommendation model by pooling the user data and behaviour held by the individual platforms. Taking an e-commerce platform as an example, a federated forest can use a user's search records, browsing records, and similar data on different platforms to recommend more suitable goods.
Because a federated decision tree requires every platform to cooperate when it is applied, and the information contributed by each platform is embedded in the decision tree, an effective allocation method is lacking for the final benefit distribution. How to provide an effective method for measuring the information in a decision tree is therefore a key difficulty in applying federated decision trees.
Disclosure of Invention
The object of the invention is to provide an information-entropy-based method for measuring the information content of a federated decision tree, so as to solve the prior-art problem that the benefit distribution among federated decision-tree participants cannot be quantified.
In order to achieve the above object, the present invention adopts the following technical scheme:
The information-entropy-based method for measuring the information content of a federated decision tree comprises a training process and a measurement process.
The training process comprises:
s1, configuring environment and data preparation of federal learning
The federal learning environment configuration comprises configuring a parameter server, and accessing N participants and the parameter server into a private network;
the data preparation comprises sample data acquisition and preprocessing, wherein the sample data acquisition and preprocessing processes are as follows:
each participant obtains respective sample dataAnd data label F i And sample data->And data label F i Uploading to a parameter server; sample data->Comprises sample id, commodity id, category, browsing time, browsing duration, etc., wherein +.>Sample data representing the ith participant, F i A data tag representing the ith participant. In this embodiment, sample data is obtained by collecting user commodity browsing behavior data and adding shopping cart commodity tags>And data label F i
The parameter server combines the sample ids uploaded by all the participants to obtain a sample id set D, and combines the labels F uploaded by all the participants to obtain a label set F;
s2, constructing CART tree by combining participants and parameter server together
S2.1, a parameter server creates a decision tree T, and creates an empty root node on the T, so that the initialization of the decision tree construction is realized;
s2.2, the parameter server distributes the structure of the decision tree T to each participant and marks the current newly added node;
s2.3, after the participant updates the local decision tree structure according to the decision tree structure provided by the S2.2, calculating the segmentation parameters of the current node, and feeding back the calculation result to the parameter server; the node segmentation parameters include: local optimum Gini index Gini i Optimal split tag j i Optimal segmentation value s i Probability of flow p i
S2.4, the parameter server calculates global node segmentation parameters to determine alternative participants of the communication round, and poor-quality participants are screened out to terminate the subsequent training process;
s2.5, selecting to delete or store node information by the participant according to the alternative participant of the communication round determined by the S2.4 parameter server, and deleting or storing the node information and sending the node information to the parameter server;
s2.6, repeating the steps S2.2 to S2.6, and updating the decision tree structure T until the construction of the federal decision tree is completed;
s3, integrating a plurality of CART trees to form a federal decision tree forest
S3, adding the federal decision tree into a federal decision tree forest, determining the size of the forest according to the requirement, and repeatedly executing S2 if the forest is not large enough;
the measurement process comprises the following steps:
S4, calculating the information content of each tree; taking participant i as an example, the information content of a tree is calculated as follows:
S4.1, partitioning the nodes of the tree according to participant numbers, with the following rule: the node information of the communication rounds whose candidate participant number is the same as that of participant i is retained, and the remaining nodes become empty nodes;
S4.2, finding the combinations of node flow-direction routes, which comprises two parts: finding all node flow-direction routes and calculating the flow-direction probabilities; the flow-direction routes are found using the decision-tree sample-prediction principle, and the flow-direction route probabilities are obtained with a preset calculation rule;
S4.3, summarizing all flow-direction route combinations and constructing a flow-direction route probability distribution table;
S4.4, calculating the entropy value CH of each flow-direction route with formula (1), where formula (1) is as follows:
CH = -∑_j p(x_j) * log2(p(x_j))   (1)
where p(x_j) denotes the probability of the j-th leaf;
S4.5, calculating the flow-direction probability of each flow-direction route;
S4.6, calculating the entropy value TH of the decision tree T with formula (2), where formula (2) is as follows:
TH = ∑_k p_k * CH_k   (2)
where p_k denotes the flow-direction probability of the k-th flow-direction route and CH_k is the entropy value of that route;
S4.7, setting all nodes as empty nodes and calculating the entropy value TH' of the empty tree of the decision tree T;
S4.8, according to the entropy value TH of the decision tree T and the entropy value TH' of the empty tree, calculating the information content T_info of the decision tree T with formula (3), where formula (3) is as follows:
T_info = TH' - TH.   (3)
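As a minimal sketch of how formulas (1) to (3) combine, assuming the flow-direction routes and their leaf probability distributions have already been obtained from S4.1 to S4.5 (the function names and the route representation below are illustrative, not part of the claimed method):

import math

def route_entropy(leaf_dist):
    # Formula (1): CH = -sum_j p(x_j) * log2(p(x_j)) over the leaves of one route.
    return -sum(p * math.log2(p) for p in leaf_dist.values() if p > 0)

def tree_entropy(routes):
    # Formula (2): TH = sum_k p_k * CH_k, where routes is a list of
    # (route probability p_k, {leaf id: p(x_j)}) pairs.
    return sum(p_k * route_entropy(dist) for p_k, dist in routes)

def tree_information(routes, empty_tree_routes):
    # Formula (3): T_info = TH' - TH, with TH' computed on the all-empty tree.
    return tree_entropy(empty_tree_routes) - tree_entropy(routes)

For example, a route that reaches one leaf with certainty contributes CH = 0, while a route spread evenly over two leaves contributes CH = 1 bit, so the more node information participant i has retained, the smaller TH and the larger T_info.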
Further, the specific procedure by which the parameter server in S2.2 distributes the structure of the decision tree T to each participant is as follows:
S2.2.1, the parameter server randomly samples from D with probability 80% to obtain a global sample id sampling subset D', and randomly samples from F with probability 80% to obtain a global data-label sampling subset F';
S2.2.2, the following operations are performed for every participant i ∈ {1, ..., N}:
intersect D' with the sample id set D_i of participant i to obtain the sample id sampling subset D'_i of participant i, and intersect F' with D_i to obtain the data-label sampling subset F'_i of participant i; then distribute D'_i and F'_i to the i-th participant.
Further, the specific procedure for calculating the current node split parameters in S2.3 is as follows:
S2.3.1, receive D'_i and F'_i, and intersect D'_i with the sample data X_i by sample id to obtain the data sample sampling subset X'_i of participant i;
S2.3.3, based on the CART-tree construction principle, calculate the local optimal Gini index Gini_i, the optimal split label j_i, and the optimal split value s_i of X'_i;
calculate the flow probabilities p_i = (p_i^left, p_i^right), which respectively denote the proportions of the samples flowing to the left and to the right child node after X'_i is split by the optimal split label and the optimal split value.
Further, the specific procedure of step S2.4 is as follows:
the parameter server gathers the local optimal Gini indices, optimal split labels, optimal split values, and flow probabilities p_i of all participants, and selects the participant with the highest local optimal Gini index as the candidate participant of this communication round;
if the candidate participant of this round is the same as that of the previous round, the participant with the second-highest local optimal Gini index is selected as this round's candidate participant instead; the candidate participant of this round is notified to save the node information, i.e., sign_i = save is sent to it, and the other participants are notified to delete the node information, i.e., sign_i = delete is sent to them;
if a participant has not been selected as a candidate participant within N·|F|/2 communication rounds, it is regarded as an inferior participant and its subsequent training process is terminated. It should be noted that an inferior participant may keep the tree model generated so far, but cannot take part in any subsequent process.
Further, the specific procedure of S2.5 is as follows:
S2.5.1, receive from the parameter server the signal sign_i indicating whether to save or delete the node information, and perform the following operations:
if sign_i = delete, delete Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i; no inferior-participant screening is performed at the same time;
if sign_i = save, store Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i in the current empty node;
S2.5.2, according to the optimal split label j_i and the optimal split value s_i, split X'_i to obtain the data sample subset X'_{i,left} flowing to the left child node under participant i and the data sample subset X'_{i,right} flowing to the right child node under participant i;
S2.5.3, extract the sample ids of X'_{i,left} and X'_{i,right} to obtain the sample id subset D'_{i,left} flowing to the left child node under participant i and the sample id subset D'_{i,right} flowing to the right child node under participant i;
S2.5.4, send D'_{i,left} and D'_{i,right} to the parameter server.
Further, the specific procedure of S2.6 is as follows:
S2.6.1, store in the currently added node the number of this round's candidate participant together with that participant's local optimal Gini index, optimal split label, optimal split value, and flow probability; receive D'_{i,left} and D'_{i,right} from the candidate participant of this communication round and set them as D'_left and D'_right; if the length of D'_left is less than 5% of that of D, or the length of D'_right is less than 5% of that of D, stop this communication round and do not execute the remaining steps;
S2.6.2, create the left and right child nodes of the current node;
S2.6.3, for the left child node, create a new communication round, set the left child node as the current node, set D'_left as D', and recursively perform S2.3 to S2.6; for the right child node, create a new communication round, set the right child node as the current node, set D'_right as D', and recursively perform S2.3 to S2.6.
With the above technical scheme, the invention has the following beneficial effects:
(1) In the training process, the computation rate is improved by introducing the screening of inferior participants. In addition, the method requires each participant to upload only one additional parameter, so the participants' privacy is protected to the greatest extent.
(2) By decomposing the flow probabilities of the decision tree, the scheme quantifies the information content of each decision tree in the federated decision-tree forest, thereby achieving the purpose of optimizing the benefit distribution among federated decision-tree participants.
Drawings
FIG. 1 is a flow chart of a training process in an embodiment of the invention;
FIG. 2 is a flow chart of the measurement process in an embodiment of the present invention;
fig. 3 is an exemplary diagram of a flow direction route probability distribution table in an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and embodiments.
This embodiment takes e-commerce commodity recommendation as the application background and provides an information-entropy-based method for measuring the information content of a federated decision tree, comprising two parts, a training process and a measurement process. In the training process, the federated learning environment is configured first; this includes configuring an additional parameter server and connecting the participants, where a participant is a server on which an e-commerce platform collects shopping-cart information, and the participants are generally not in the same business division. The participants then acquire and preprocess their sample data. Next, the participants and the parameter server jointly construct CART trees for predicting users' commodity preferences. Finally, several CART trees are assembled into a federated decision-tree forest. In the measurement process, for a single decision tree, the nodes of the tree are first partitioned according to participant numbers; the probabilities of all flow-direction routes are then calculated and the entropy value of the tree is computed; finally, the information content of the tree is obtained as the difference between the entropy value of the empty tree and the entropy value of the tree. The invention quantifies the contribution of each participant to prediction and can provide the e-commerce platform with a basis for settling the profits of different institutions. The details of each part are as follows:
Referring to FIG. 1, the training process includes the following steps:
S1, configuring the federated learning environment and preparing data, which includes the following:
the federated learning environment configuration comprises configuring a parameter server and connecting the N participants and the parameter server to a private network;
the data preparation comprises sample data acquisition and preprocessing, which proceed as follows:
the participant of each business division acquires its own commodity sample data X_i and commodity data label F_i and uploads the sample data X_i and data label F_i to the parameter server; the sample data X_i comprise the sample id, commodity id, category, browsing time, browsing duration, order-placement status, and so on, where X_i denotes the sample data of the i-th participant and F_i denotes the data label of the i-th participant. In this embodiment, the sample data X_i are obtained by collecting the users' commodity-browsing behaviour, and the data labels F_i are the add-to-shopping-cart commodity tags;
the parameter server merges the sample ids uploaded by all participants to obtain a sample id set D, and merges the labels F_i uploaded by all participants to obtain a label set F;
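The merging step on the parameter-server side can be illustrated with a short sketch; it assumes each participant uploads its sample id set and a mapping from sample id to the add-to-shopping-cart label (the function and variable names are illustrative):

def merge_uploads(participant_ids, participant_labels):
    # participant_ids: one set of sample ids per participant.
    # participant_labels: one dict {sample id: add-to-cart label} per participant.
    D = set().union(*participant_ids)   # global sample id set D
    F = {}                              # global label set F, keyed by sample id
    for labels in participant_labels:
        F.update(labels)
    return D, F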
S2, jointly constructing a CART tree by the participants and the parameter server
S2.1, the parameter server creates a decision tree T and creates an empty root node on T, thereby initializing the construction of the decision tree;
S2.2, the parameter server distributes the structure of the decision tree T to each participant and marks the currently added node; the structure of the decision tree T is distributed as follows:
S2.2.1, the parameter server randomly samples from D with probability 80% to obtain a global sample id sampling subset D', and randomly samples from F with probability 80% to obtain a global data-label sampling subset F';
S2.2.2, the following operations are performed for every participant i ∈ {1, ..., N}:
intersect D' with the sample id set D_i of participant i to obtain D'_i, and intersect F' with D_i to obtain F'_i; then distribute D'_i and F'_i to the i-th participant.
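A sketch of S2.2.1 and S2.2.2 on the parameter-server side, under the assumption that D is a set of sample ids, F maps sample ids to labels, and participant_ids maps each participant number i to its own sample id set D_i (names are illustrative):

import random

def distribute_structure(D, F, participant_ids, rate=0.8):
    # S2.2.1: sample each global id and label with probability 80%.
    D_prime = {sid for sid in D if random.random() < rate}
    F_prime = {sid: lab for sid, lab in F.items() if random.random() < rate}
    # S2.2.2: intersect with each participant's own id set and distribute.
    out = {}
    for i, D_i in participant_ids.items():
        D_prime_i = D_prime & D_i
        F_prime_i = {sid: lab for sid, lab in F_prime.items() if sid in D_i}
        out[i] = (D_prime_i, F_prime_i)   # D'_i and F'_i for participant i
    return out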
S2.3, after updating its local decision-tree structure according to the structure provided in S2.2, each participant calculates the node split parameters of the current node and feeds the results back to the parameter server; the node split parameters include the local optimal Gini index Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i.
The node split parameters are calculated as follows:
S2.3.1, receive D'_i and F'_i, and intersect D'_i with the sample data X_i by sample id to obtain the data sample sampling subset X'_i of participant i;
S2.3.3, based on the CART-tree construction principle, calculate the local optimal Gini index Gini_i, the optimal split label j_i, and the optimal split value s_i of X'_i;
calculate the flow probabilities p_i = (p_i^left, p_i^right), which respectively denote the proportions of the samples flowing to the left and to the right child node after X'_i is split by the optimal split label and the optimal split value.
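The local split search of S2.3 can be sketched against standard CART practice; the weighted Gini impurity is minimised here, and the exact candidate-threshold enumeration and tie-breaking of the patented method may differ. X is the (n, d) feature array of the sampled subset X'_i and y the corresponding labels:

import numpy as np

def gini(y):
    # Gini impurity of a label vector.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_local_split(X, y):
    # Returns (Gini_i, j_i, s_i, p_i): best weighted impurity, split feature,
    # split value, and the left/right flow probabilities.
    n, d = X.shape
    best = (float("inf"), None, None, (0.0, 0.0))
    for j in range(d):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            n_left = int(left.sum())
            if n_left == 0 or n_left == n:
                continue
            g = (n_left * gini(y[left]) + (n - n_left) * gini(y[~left])) / n
            if g < best[0]:
                best = (g, j, s, (n_left / n, (n - n_left) / n))
    return best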
S2.4, the parameter server calculates the global node split parameters, determines the candidate participant of this communication round, screens out inferior participants, and terminates their subsequent training process;
the parameter server gathers the local optimal Gini indices, optimal split labels, optimal split values, and flow probabilities p_i of all participants, and selects the participant with the highest local optimal Gini index as the candidate participant of this communication round;
if the candidate participant of this round is the same as that of the previous round, the participant with the second-highest local optimal Gini index is selected as this round's candidate participant instead; the candidate participant of this round is notified to save the node information, i.e., sign_i = save is sent to it, and the other participants are notified to delete the node information, i.e., sign_i = delete is sent to them.
If a participant has not been selected as a candidate participant within N·|F|/2 communication rounds, it is regarded as an inferior participant and its subsequent training process is terminated. It should be noted that an inferior participant may keep the tree model generated so far, but cannot take part in any subsequent process.
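A sketch of the server-side selection and screening of S2.4, following the selection rule exactly as stated above (highest reported index first, second highest when the previous round's candidate repeats); the bookkeeping of unselected rounds and the handling of the N·|F|/2 threshold are illustrative assumptions:

def select_candidate(reports, prev_candidate, rounds_unselected, n_participants, n_labels):
    # reports: {participant number: (Gini_i, j_i, s_i, p_i)} for this communication round.
    ranked = sorted(reports, key=lambda i: reports[i][0], reverse=True)
    candidate = ranked[0]
    if candidate == prev_candidate and len(ranked) > 1:
        candidate = ranked[1]
    # Notify: the candidate saves the node information, the others delete it.
    signs = {i: ("save" if i == candidate else "delete") for i in reports}
    # Screen inferior participants: never selected within N * |F| / 2 rounds.
    threshold = n_participants * n_labels / 2
    inferior = {i for i, r in rounds_unselected.items()
                if i != candidate and r + 1 >= threshold}
    return candidate, signs, inferior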
S2.5, according to the candidate participant of this communication round determined by the parameter server in S2.4, each participant deletes or retains the node information, completing the preparation for constructing the child nodes. The detailed process is as follows:
S2.5.1, receive from the parameter server the signal sign_i indicating whether to save or delete the node information, and perform the following operations:
if sign_i = delete, delete Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i; no inferior-participant screening is performed at the same time;
if sign_i = save, store Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i in the current empty node;
S2.5.2, according to the optimal split label j_i and the optimal split value s_i, split X'_i to obtain the data sample subset X'_{i,left} flowing to the left child node under participant i and the data sample subset X'_{i,right} flowing to the right child node under participant i;
S2.5.3, extract the sample ids of X'_{i,left} and X'_{i,right} to obtain the sample id subset D'_{i,left} flowing to the left child node under participant i and the sample id subset D'_{i,right} flowing to the right child node under participant i;
S2.5.4, send D'_{i,left} and D'_{i,right} to the parameter server.
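The participant side of S2.5 reduces to a small amount of bookkeeping; the sketch below assumes the sampled subset X'_i is held as a mapping from sample id to feature vector, and that only the id subsets, never the feature values, are reported back:

def apply_sign(sign_i, X_prime_i, j_i, s_i):
    # S2.5.1: a participant told to delete simply discards the split parameters.
    if sign_i == "delete":
        return None
    # S2.5.2: split X'_i by the optimal split label j_i and split value s_i.
    left_ids = {sid for sid, x in X_prime_i.items() if x[j_i] <= s_i}
    right_ids = {sid for sid, x in X_prime_i.items() if x[j_i] > s_i}
    # S2.5.3-S2.5.4: the sample id subsets D'_{i,left} and D'_{i,right} go to the server.
    return left_ids, right_ids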
S2.6, the parameter server updates the tree model;
S2.6.1, store in the currently added node the number of this round's candidate participant together with that participant's local optimal Gini index, optimal split label, optimal split value, and flow probability; receive D'_{i,left} and D'_{i,right} from the candidate participant of this communication round and set them as D'_left and D'_right; if the length of D'_left is less than 5% of that of D, or the length of D'_right is less than 5% of that of D, stop this communication round and do not execute the remaining steps;
S2.6.2, create the left and right child nodes of the current node;
S2.6.3, for the left child node, create a new communication round, set the left child node as the current node, set D'_left as D', and recursively perform S2.3 to S2.6; for the right child node, create a new communication round, set the right child node as the current node, set D'_right as D', and recursively perform S2.3 to S2.6;
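A sketch of the server-side recursion of S2.6; run_round is a hypothetical helper standing for one pass of S2.2 to S2.5 that stores the candidate's split in the current node and returns the reported id subsets, and the 5% stopping rule follows S2.6.1:

class Node:
    def __init__(self):
        self.split = None   # (candidate number, Gini_i, j_i, s_i, p_i)
        self.left = None
        self.right = None

def grow(node, D_prime, total, run_round, min_frac=0.05):
    # S2.6.1: one communication round; run_round fills node.split and
    # returns the id subsets D'_left and D'_right.
    D_left, D_right = run_round(node, D_prime)
    if len(D_left) < min_frac * total or len(D_right) < min_frac * total:
        return                                   # 5% rule: stop this round
    node.left, node.right = Node(), Node()       # S2.6.2: create the child nodes
    grow(node.left, D_left, total, run_round, min_frac)    # S2.6.3: recurse left
    grow(node.right, D_right, total, run_round, min_frac)  # S2.6.3: recurse right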
S3, adding the federated decision tree T to the federated decision-tree forest, determining the size of the forest as required, and repeating S2 if the forest is not yet large enough;
Referring to FIG. 2, the measurement process includes the following steps:
S4, calculating the information content of each tree; participant i is taken as an example to describe the calculation process:
S4.1, partitioning the nodes of the tree according to participant numbers, with the following rule: the node information of the communication rounds whose candidate participant number is the same as that of participant i is retained, and the remaining nodes become empty nodes;
S4.2, finding the combinations of node flow-direction routes, which comprises two parts: finding all node flow-direction routes and calculating the flow-direction route probabilities.
In this embodiment, all flow-direction routes through the retained node information are found using the decision-tree sample-prediction principle. Specifically, samples flow from the root node towards the leaf nodes; whenever a retained non-leaf node is encountered, the samples arriving at that node are divided according to its split information, and the divided samples flow to the left and right child nodes respectively. Whenever an empty non-leaf node is encountered, the sample flows to the left and right child nodes simultaneously, i.e., an empty non-leaf node creates a new flow-direction route;
in this embodiment, the flow-direction route probability is calculated according to a preset rule. The flow-direction route probability represents the probability that a hypothetical sample flows to a given leaf node. The calculation rule is as follows: if, within a flow-direction route, the sample can flow only to one determined leaf node, that leaf node has probability 1 in the probability distribution and the remaining leaf nodes have probability 0; each time the hypothetical sample passes through an empty non-leaf node on the route, the probability that it flows to a given leaf node is halved.
S4.3, summarizing all flow-direction route combinations and constructing a flow-direction route probability distribution table, an example of which is shown in FIG. 3;
S4.4, calculating the entropy value CH of each flow-direction route, see formula (1), where p(x_j) denotes the probability of the j-th leaf:
CH = -∑_j p(x_j) * log2(p(x_j))   (1)
S4.5, calculating the flow-direction probability of each flow-direction route;
S4.6, calculating the entropy value TH of the decision tree T, see formula (2), where p_k denotes the flow-direction probability of the k-th flow-direction route and CH_k is the entropy value of that route:
TH = ∑_k p_k * CH_k   (2)
S4.7, setting all nodes as empty nodes and calculating the entropy value TH' of the empty tree of the decision tree T;
S4.8, according to the entropy value TH of the decision tree T and the entropy value TH' of the empty tree, calculating the information content T_info of the decision tree T with formula (3), where formula (3) is as follows:
T_info = TH' - TH   (3)
S5, obtaining the information content of all the federated decision trees in the forest from the per-tree information content calculated in S4. From the information content of the federated decision trees, the e-commerce platform can determine the contribution of the different business divisions to the commodity-preference recommendation task and settle different benefits accordingly.
In summary, the information-entropy-based method for measuring the information content of a federated decision tree solves the prior-art problem that the benefit distribution among federated decision-tree participants cannot be quantified. Because a step for screening inferior participants is introduced, the training process computes on a smaller amount of data than the conventional federated-forest algorithm, and the impact on the participants' computation load is small.

Claims (6)

1. An information-entropy-based federated decision tree information content measurement method, comprising a training process and a measurement process, characterized in that:
the training process comprises the following steps:
S1, configuring the federated learning environment and preparing data
The federated learning environment configuration comprises configuring a parameter server and connecting the N participants and the parameter server to a private network;
the data preparation comprises sample data acquisition and preprocessing, which proceed as follows:
each participant obtains its own sample data X_i and data label F_i and uploads the sample data X_i and data label F_i to the parameter server, where X_i denotes the sample data of the i-th participant and F_i denotes the data label of the i-th participant; the sample data X_i comprise the sample id, commodity id, category, browsing time, browsing duration, and so on; the sample data X_i are obtained by collecting the users' commodity-browsing behaviour, and the data labels F_i are the add-to-shopping-cart commodity tags;
the parameter server merges the sample ids uploaded by the participants to obtain a sample id set D, and merges the labels F_i uploaded by each participant to obtain a label set F;
S2, jointly constructing a CART tree by the participants and the parameter server
S2.1, the parameter server creates a decision tree T and creates an empty root node on T, thereby initializing the construction of the decision tree;
S2.2, the parameter server distributes the structure of the decision tree T to each participant and marks the currently added node;
S2.3, after updating its local decision-tree structure according to the structure provided in S2.2, each participant calculates the split parameters of the current node and feeds the results back to the parameter server; the node split parameters include the local optimal Gini index Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i;
S2.4, the parameter server calculates the global node split parameters, determines the candidate participant of this communication round, and screens out inferior participants, terminating their subsequent training;
S2.5, according to the candidate participant of this communication round determined by the parameter server in S2.4, each participant deletes or saves the node information, and the resulting sample id subsets are sent to the parameter server;
S2.6, repeating S2.2 to S2.6 to update the decision-tree structure T until the construction of the federated decision tree is completed;
S3, assembling several CART trees into a federated decision-tree forest
S3, adding the federated decision tree to the federated decision-tree forest, determining the size of the forest as required, and repeating S2 if the forest is not yet large enough;
the measurement process comprises the following steps:
S4, calculating the information content of each tree; taking participant i as an example, the information content of a tree is calculated as follows:
S4.1, partitioning the nodes of the tree according to participant numbers, with the following rule: the node information of the communication rounds whose candidate participant number is the same as that of participant i is retained, and the remaining nodes become empty nodes;
S4.2, finding the combinations of node flow-direction routes, which comprises two parts: finding all node flow-direction routes and calculating the flow-direction probabilities; the flow-direction routes are found using the decision-tree sample-prediction principle, and the flow-direction route probabilities are obtained with a preset calculation rule;
S4.3, summarizing all flow-direction route combinations and constructing a flow-direction route probability distribution table;
S4.4, calculating the entropy value CH of each flow-direction route with formula (1), where formula (1) is as follows:
CH = -∑_j p(x_j) * log2(p(x_j))   (1)
where p(x_j) denotes the probability of the j-th leaf;
S4.5, calculating the flow-direction probability of each flow-direction route;
S4.6, calculating the entropy value TH of the decision tree T with formula (2), where formula (2) is as follows:
TH = ∑_k p_k * CH_k   (2)
where p_k denotes the flow-direction probability of the k-th flow-direction route and CH_k is the entropy value of that route;
S4.7, setting all nodes as empty nodes and calculating the entropy value TH' of the empty tree of the decision tree T;
S4.8, according to the entropy value TH of the decision tree T and the entropy value TH' of the empty tree, calculating the information content T_info of the decision tree T with formula (3), where formula (3) is as follows:
T_info = TH' - TH.   (3)
2. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the parameter server in S2.2 distributes the structure of the decision tree T to each participant as follows:
S2.2.1, the parameter server randomly samples from D with probability 80% to obtain a global sample id sampling subset D', and randomly samples from F with probability 80% to obtain a global data-label sampling subset F';
S2.2.2, the following operations are performed for every participant i ∈ {1, ..., N}:
intersect D' with the sample id set D_i of participant i to obtain the sample id sampling subset D'_i of participant i, and intersect F' with D_i to obtain the data-label sampling subset F'_i of participant i; then distribute D'_i and F'_i to the i-th participant.
3. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the current node split parameters in S2.3 are calculated as follows:
S2.3.1, receive D'_i and F'_i, and intersect D'_i with the sample data X_i by sample id to obtain the data sample sampling subset X'_i of participant i;
S2.3.3, based on the CART-tree construction principle, calculate the local optimal Gini index Gini_i, the optimal split label j_i, and the optimal split value s_i of X'_i;
calculate the flow probabilities p_i = (p_i^left, p_i^right), which respectively denote the proportions of the samples flowing to the left and to the right child node after X'_i is split by the optimal split label and the optimal split value.
4. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the detailed procedure of step S2.4 is as follows:
the parameter server gathers the local optimal Gini indices, optimal split labels, optimal split values, and flow probabilities p_i of all participants, and selects the participant with the highest local optimal Gini index as the candidate participant of this communication round;
if the candidate participant of this round is the same as that of the previous round, the participant with the second-highest local optimal Gini index is selected as this round's candidate participant instead; the candidate participant of this round is notified to save the node information, i.e., sign_i = save is sent to it, and the other participants are notified to delete the node information, i.e., sign_i = delete is sent to them;
if a participant has not been selected as a candidate participant within N·|F|/2 communication rounds, it is regarded as an inferior participant and its subsequent training process is terminated; an inferior participant may keep the tree model generated so far, but cannot take part in any subsequent process.
5. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the detailed procedure of S2.5 is as follows:
S2.5.1, receive from the parameter server the signal sign_i indicating whether to save or delete the node information, and perform the following operations:
if sign_i = delete, delete Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i; no inferior-participant screening is performed at the same time;
if sign_i = save, store Gini_i, the optimal split label j_i, the optimal split value s_i, and the flow probability p_i in the current empty node;
S2.5.2, according to the optimal split label j_i and the optimal split value s_i, split X'_i to obtain the data sample subset X'_{i,left} flowing to the left child node under participant i and the data sample subset X'_{i,right} flowing to the right child node under participant i;
S2.5.3, extract the sample ids of X'_{i,left} and X'_{i,right} to obtain the sample id subset D'_{i,left} flowing to the left child node under participant i and the sample id subset D'_{i,right} flowing to the right child node under participant i;
S2.5.4, send D'_{i,left} and D'_{i,right} to the parameter server.
6. The information-entropy-based federated decision tree information content measurement method according to claim 1, wherein the detailed procedure of S2.6 is as follows:
S2.6.1, store in the currently added node the number of this round's candidate participant together with that participant's local optimal Gini index, optimal split label, optimal split value, and flow probability; receive D'_{i,left} and D'_{i,right} from the candidate participant of this communication round and set them as D'_left and D'_right; if the length of D'_left is less than 5% of that of D, or the length of D'_right is less than 5% of that of D, stop this communication round and do not execute the remaining steps;
S2.6.2, create the left and right child nodes of the current node;
S2.6.3, for the left child node, create a new communication round, set the left child node as the current node, set D'_left as D', and recursively perform S2.3 to S2.6; for the right child node, create a new communication round, set the right child node as the current node, set D'_right as D', and recursively perform S2.3 to S2.6.
CN202311107162.0A 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method Pending CN117236465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311107162.0A CN117236465A (en) 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311107162.0A CN117236465A (en) 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method

Publications (1)

Publication Number Publication Date
CN117236465A true CN117236465A (en) 2023-12-15

Family

ID=89097638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311107162.0A Pending CN117236465A (en) 2023-08-30 2023-08-30 Information entropy-based federal decision tree information measurement method

Country Status (1)

Country Link
CN (1) CN117236465A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455340A (en) * 2023-12-23 2024-01-26 翌飞锐特电子商务(北京)有限公司 Logistics freight transportation information sharing and pushing method based on one record supply chain order
CN117455340B (en) * 2023-12-23 2024-03-08 翌飞锐特电子商务(北京)有限公司 Logistics freight transportation information sharing and pushing method based on one record supply chain order

Similar Documents

Publication Publication Date Title
Buntain et al. Identifying social roles in reddit using network structure
CN102591917B (en) Data processing method and system and related device
CN109643329A (en) Chart is generated from the data in tables of data
CN106844407B (en) Tag network generation method and system based on data set correlation
CN110221965A (en) Test cases technology, test method, device, equipment and system
CN112308157A (en) Decision tree-oriented transverse federated learning method
CN107194672B (en) Review distribution method integrating academic expertise and social network
CN112364908A (en) Decision tree-oriented longitudinal federal learning method
CN117236465A (en) Information entropy-based federal decision tree information measurement method
CN114332984B (en) Training data processing method, device and storage medium
CN111885399A (en) Content distribution method, content distribution device, electronic equipment and storage medium
CN111222847B (en) Open source community developer recommendation method based on deep learning and unsupervised clustering
CN106951471A (en) A kind of construction method of the label prediction of the development trend model based on SVM
CN111382181A (en) Designated enterprise family affiliation analysis method and system based on stock right penetration
CN113902534A (en) Interactive risk group identification method based on stock community relation map
CN109800354A (en) A kind of resume modification intension recognizing method and system based on the storage of block chain
CN106600213A (en) Intelligent resume management system and method
CN109284500A (en) Information transmission system and method based on merchants inviting work process and reading preference
CN112925899B (en) Ordering model establishment method, case clue recommendation method, device and medium
US20210240701A1 (en) Information processing apparatus, determination method, non-transitory computer readable medium storing program, and information processing system
CN107767155A A kind of method and system for assessing user's representation data
CN111078859B (en) Author recommendation method based on reference times
CN108717445A (en) A kind of online social platform user interest recommendation method based on historical data
CN116150470A (en) Content recommendation method, device, apparatus, storage medium and program product
CN111460300A (en) Network content pushing method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination