CN112699402B - Wearable device activity prediction method based on federal personalized random forest - Google Patents

Wearable device activity prediction method based on federal personalized random forest

Info

Publication number
CN112699402B
CN112699402B (application CN202011577207.7A)
Authority
CN
China
Prior art keywords
attribute
current
party
participant
random
Prior art date
Legal status
Active
Application number
CN202011577207.7A
Other languages
Chinese (zh)
Other versions
CN112699402A (en)
Inventor
王金艳
刘松逢
刘静
颜奇
李先贤
Current Assignee
Dragon Totem Technology Hefei Co ltd
Original Assignee
Guangxi Normal University
Priority date
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202011577207.7A
Publication of CN112699402A
Application granted
Publication of CN112699402B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Abstract

The invention discloses a wearable device activity prediction method based on federal personalized random forests. Each user generates its own hash tables with a locality-sensitive hash function and finds its similar users by comparing the number of similar sample data across all hash tables. Any user can actively initiate training and selects only users similar to itself to jointly train a random decision tree that is then owned by all users participating in that training. During node splitting of the random decision tree, user privacy is protected with a differential privacy mechanism: each party perturbs its computed information gain, and the perturbed information gains of all parties are aggregated at one party to obtain the optimal split over the candidate attribute set. From the generated random decision trees, each user obtains its final random forest model through incremental selection; the random forest model of every user is therefore different, and the personalized model is better suited to prediction on the user's own data.

Description

Wearable device activity prediction method based on federal personalized random forest
Technical Field
The invention relates to the technical field of federal learning, in particular to a wearable device activity prediction method based on federal personalized random forests.
Background
Activities of daily life are closely related to people's health. In recent years, with the development of wearable technology, people track their activities with wearable devices such as smartphones, bracelets and smart glasses in order to understand their own health condition. Artificial intelligence is also being integrated into wearable devices, enabling smarter and more context-aware applications, a trend that first appeared at the CES 2014 consumer electronics show. For example, from the activity data generated by a user and the surrounding-environment data collected by the wearable device, a trained machine learning model can accurately recognize the activity the user is currently performing, so that the smart wearable device can provide dedicated, personalized services for each user. Because machine learning needs large amounts of data to achieve good performance, applying artificial intelligence to wearable devices still faces two major challenges: first, in a scenario with a large number of personal wearable device users whose models must be personalized, the algorithm design is different; second, the data exists in isolated islands, and if user data were collected for centralized training, personal information could be inferred by analyzing the data, leaking sensitive personal information.
Disclosure of Invention
The invention aims to solve the problem that personal sensitive information is easily leaked when artificial intelligence is applied to conventional wearable devices, and provides a wearable device activity prediction method based on federal personalized random forests.
In order to solve the problems, the invention is realized by the following technical scheme:
the wearable device activity prediction method based on the federal personalized random forest comprises the following steps:
step 1, calculating hash values of all sample data of the wearable device of each participant by using locality-sensitive hashing, each participant sending a hash table formed by the hash values of its sample data to the other participants, and the participants finding similar sample data among different participants according to the resulting global hash table;
step 2, determining a similar participant set of each current participant, namely: the current participant respectively counts the number of similar sample data of each other participant and the current participant, calculates the ratio of this number to the total number of all sample data of that other participant, and takes the other participants whose ratio ranks in the top k to form the similar participant set of the current participant; wherein k is a set value;
step 3, randomly selecting part of similar participants from the similar participant set of each current participant, wherein the current participant serves as a current active party, and the selected similar participants serve as current passive parties;
step 4, under the coordination of the current active party, using sample data of the current active party and each current passive party to jointly train the optimal attribute and the attribute division value of each node of the random decision tree, and after the training is finished, the current active party and each current passive party respectively obtain a same random decision tree;
step 5, the current active party and each current passive party add the random decision tree obtained in the step 4 into respective federal random forest models in an increment selection mode;
step 6, repeating the steps 3-5 until the training times of each current participant reach the preset training times, and finally, enabling each participant to have different federal random forest models respectively;
and 7, inputting the measured data of the wearable equipment of the participant into the federal random forest model obtained in the step 6, so as to complete the activity prediction of the participant, wherein the prediction result is an activity label of the human activity recognition task.
In the step 5, the process of jointly training the optimal attribute and the attribute division value of each node of the random decision tree by using the sample data of the current active party and each current passive party is as follows:
step 5.1, the current active party randomly selects a part of attributes from the attribute set to form a candidate attribute set, and sends the candidate attribute set to each current passive party;
step 5.2, the current active party and each current passive party respectively randomly select a value from the range of the local sample data attribute value of each attribute of the candidate attribute set as an attribute random value; each current passive party sends the attribute random values of all the attributes of the candidate attribute set to the current active party;
step 5.3, the current active side randomly selects a value from the range of the attribute random value of each attribute of the candidate attribute set of the current active side and each current passive side as an attribute division value, and sends the attribute division values of all the attributes of the candidate attribute set to each current passive side;
step 5.4, the current active side and each current passive side pre-divide the local sample data thereof by utilizing the candidate attribute set and the corresponding attribute dividing value respectively, and calculate the information gain of each attribute in the pre-divided candidate attribute set;
step 5.5, adding Laplace noise into the information gain of each attribute in the candidate attribute set by each current passive party for disturbance to obtain the disturbance information gain of each attribute in the candidate attribute set, and sending the disturbance information gain of all attributes in the candidate attribute set to the current active party;
step 5.6, the active party carries out weighted summation on the information gain of the current active party and the disturbance information gain of each current passive party of each attribute of the candidate attribute set to obtain the comprehensive score of each attribute of the candidate attribute set, and selects the attribute with the maximum comprehensive score as the optimal attribute;
and 5.7, the current active party sends the optimal attribute and the attribute division value corresponding to the optimal attribute to each current passive party, wherein the optimal attribute and the attribute division value corresponding to the optimal attribute are the optimal attribute and the attribute division value of the current node of the trained random decision tree.
In the step 1, each participant sends the hash table formed by the hash values of the sample data to each participant through the distributed communication framework.
Compared with the prior art, the invention has the following characteristics:
1. A locality-sensitive hash function is applied in the federal random forest to compute similar sample data among users, so that each user can find its similar users. Each user performs joint modeling only with similar users, which overcomes the limitation of random forests in federal scenarios with a large number of users that require personalized models.
2. During node splitting, in order to find the optimal attribute split within the candidate attribute set, the users participating in training perturb the information gains of the candidate attribute splits with a Laplace mechanism, so that the privacy of user data is protected when the node split is decided.
3. In federal learning, the incremental-selection idea from ensemble learning is used: each user screens the random decision trees of its local model, which improves model accuracy, simplifies the model, and reduces storage and prediction overhead.
Drawings
FIG. 1 is a flow diagram of the data processing phase of the differentially private federal random forest method.
FIG. 2 is a flow diagram of the training phase of the differentially private federal random forest method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific examples.
To accommodate the scenario of a large number of users whose models require personalization, each user is treated as a compute node and is assumed to be honest but curious. Each user generates its own hash tables with a locality-sensitive hash function and finds its similar users by comparing the number of similar sample data across all hash tables. Any user can actively initiate training and selects only users similar to itself to jointly train a random decision tree that is then owned by all users participating in that training. During node splitting of the random decision tree, user privacy is protected with a differential privacy mechanism: each party perturbs its computed information gain, and the perturbed information gains of all parties are aggregated at one party to obtain the optimal split over the candidate attribute set. From the generated random decision trees, each user obtains its final random forest model through incremental selection; the random forest model of every user is therefore different, and the personalized model is better suited to prediction on that user's own data. The user tracks his or her own activities with wearable devices such as smartphones, bracelets and smart glasses, and thereby learns about his or her own health condition.
The wearable device activity prediction method based on the federal personalized random forest specifically comprises the following steps:
First, the data processing stage (as shown in FIG. 1).
Step 1: compute, with Locality-Sensitive Hashing (LSH), the hash values of all sample data of each participant (i.e., each wearable-device user) p_o ∈ P, where P is the set of all users in the federal environment. The sample-data hash values of all parties are assembled into global hash tables through the Allreduce operation of a distributed communication framework and sent to every participant, and each participant finds similar sample data among different participants according to the received hash tables.
To obtain the similarity of any two samples of the combined data without exposing the raw data to other parties, the widely used p-stable LSH family is adopted. Under LSH, two similar samples are hashed to the same value with high probability, so by applying multiple LSH functions each sample is mapped to multiple hash values. A hash function F_{a,b} is expressed as
F_{a,b}(v) = ⌊(a·v + b) / r⌋,
where v is a d-dimensional sample data vector, a is a d-dimensional random vector, b ∈ (0, r) is a random number, and r is the segment length: a line in the space is divided into equal segments of length r. Points in the space are mapped onto this line by the mapping function, and points that fall into the same segment receive the same hash value. Intuitively, the closer two points are in the space, the higher the probability that they are mapped to the same segment; the members of the hash-function family are obtained by varying a and b.
Given L randomly generated p-stable hash functions, each party first computes the hash values of its sample data. The Allreduce operation of the distributed communication framework is used to build L global hash tables (H_1, H_2, ..., H_L); the input to AllReduce is, for each participant, a hash table formed by its instance ids and their hash values. The aggregated hash tables are propagated back to every party to compute similarity information.
Each party thus holds the global hash tables. For a participant p_o ∈ P, the sample data similar to p_o's is first located at every other participant p_g ∈ P: each sample of p_o is compared with each sample of p_g, and if the number of hash values the two samples share is greater than the threshold t, the two samples are considered similar.
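As a concrete illustration of this hashing step, the following Python sketch (not part of the patent) applies L randomly drawn 2-stable hash functions F_{a,b}(v) = ⌊(a·v + b)/r⌋ to the samples of two parties and marks a pair of samples as similar when they share more than t hash values; the Allreduce exchange of hash tables is replaced here by direct in-process comparison, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_lsh_functions(d, L, r):
    """L hash functions F_{a,b}(v) = floor((a.v + b) / r), with a ~ N(0, I) (2-stable)."""
    return [(rng.normal(size=d), rng.uniform(0.0, r), r) for _ in range(L)]

def hash_sample(v, funcs):
    """Map one d-dimensional sample to its L hash values."""
    return tuple(int(np.floor((a @ v + b) / r)) for a, b, r in funcs)

def similar_pairs(samples_o, samples_g, funcs, t):
    """Pairs (i, j) of samples from party o and party g sharing more than t hash values."""
    h_o = [hash_sample(v, funcs) for v in samples_o]
    h_g = [hash_sample(v, funcs) for v in samples_g]
    pairs = []
    for i, ho in enumerate(h_o):
        for j, hg in enumerate(h_g):
            if sum(x == y for x, y in zip(ho, hg)) > t:
                pairs.append((i, j))
    return pairs

# Two parties with 3-dimensional samples, L = 10 hash functions, threshold t = 8.
funcs = make_lsh_functions(d=3, L=10, r=4.0)
party_o = rng.normal(size=(50, 3))
party_g = np.vstack([party_o[:10] + 0.01, rng.normal(size=(40, 3))])  # 10 near-duplicates
print(len(similar_pairs(party_o, party_g, funcs, t=8)))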
Step 2: each participant p_o determines its similar-participant set P_s.
For a participant p_o ∈ P: first, the sample data of each other participant p_g that is similar to p_o's data is found and merged, and the number of such similar samples of p_g is counted. Then the ratio of this number to the total number of p_g's samples (i.e., the proportion of similar samples among all of p_g's sample data) is computed. Finally, the top-k other participants p_g by this similar-sample ratio (i.e., the participants whose proportion of similar samples ranks in the top k) form the similar-participant set P_s of the current participant p_o.
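The top-k selection of step 2 can be sketched as follows; the per-party similar-sample counts are given directly as toy numbers here, whereas in the method they would come from the LSH comparison above, and the function name is illustrative.

```python
def similar_participant_set(similar_counts, total_samples, k):
    """similar_counts[g]: number of party g's samples similar to the current party's data.
    total_samples[g]:  total number of samples held by party g.
    Returns the k parties with the highest similar-sample ratio."""
    ratios = {g: similar_counts[g] / total_samples[g] for g in similar_counts}
    return sorted(ratios, key=ratios.get, reverse=True)[:k]

counts = {"p1": 30, "p2": 5, "p3": 42, "p4": 18}      # toy similar-sample counts
totals = {"p1": 100, "p2": 90, "p3": 60, "p4": 120}   # toy data-set sizes
print(similar_participant_set(counts, totals, k=2))    # ['p3', 'p1']
```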
In practice, steps 1-2 are a preprocessing procedure that can be done once and reused across multiple trainings; the preprocessing needs to be performed again only when the training data is updated.
Second, the joint training stage (as shown in FIG. 2).
Step 3: for each training round, a participant p_m ∈ P randomly selects a subset of participants P* from its similar-participant set P_s to jointly train one random decision tree. In this round p_m is regarded as the active party of the training, and its random forest model is in the active training state during the training of the current random decision tree; the selected similar participants P* are regarded as the passive parties, and their random forest models are in the passive training state during the training of the current random decision tree.
Step 4: the active party p_m coordinates the passive parties P*, and the sample data of p_m and of each passive party in P* is used to jointly train the optimal attribute and attribute partition value of each node of the random decision tree, completing the construction of the tree together. At this point the active party p_m and all passive parties P* hold an identical random decision tree.
Decision trees are among the most effective and widely used techniques in data mining. A decision tree consists of a number of decision nodes, and a sample is classified by testing, at each level, the node's attribute against its partition value. The partition attribute and value at each node are selected with one goal: the "purity" of the data in each resulting child node should be as high as possible, i.e., the samples falling into a child node produced by the chosen attribute and value should belong to the same class as far as possible. A random decision tree adds randomness on top of the decision tree to improve its generalization ability and training speed, and a random forest model composed of many random decision trees can tolerate the errors of individual random trees and thus achieve a better prediction effect.
The optimal attribute and attribute partition value of each node of the random decision tree are determined in a top-down recursive manner: starting from the root node, the information gain of each attribute in the candidate attribute set is computed for the current node, and the attribute and attribute value with the largest information gain are selected as the optimal split of that node. Child nodes are then constructed according to this optimal attribute and value, and the procedure is called recursively on the child nodes to build the random decision tree. The optimal attribute and attribute partition value of each node are computed as follows:
step 4.1, initiative side pmRandomly selecting partial attributes from a known attribute set F
Figure BDA0002864268280000051
Forming a candidate attribute set, and setting a candidate attribute set F*Is sent toEach passive side P*
Step 4.2, Passive side P*First determining the property fjMaximum value in its local sample data
Figure BDA0002864268280000052
And minimum value
Figure BDA0002864268280000053
Then from
Figure BDA0002864268280000054
Randomly selects a value in the range of
Figure BDA0002864268280000055
As attribute fj∈F*A random value of (a); equally active side pmFirst determining the property fjMaximum value in its local sample data
Figure BDA0002864268280000056
And minimum value
Figure BDA0002864268280000057
Then from
Figure BDA0002864268280000058
Randomly selects a value vm,j
Figure BDA0002864268280000059
As attribute fj∈F*Is calculated. All passive parties P*Set F of candidate attributes*Each attribute f inj∈F*Corresponding random value
Figure BDA00028642682800000510
Sent to the master pm
Step 4.3, initiative side pmReceiving a passive party P*Transmitted random value
Figure BDA00028642682800000511
And from the random value vm,jAnd all random values
Figure BDA00028642682800000512
Maximum value of
Figure BDA00028642682800000513
Minimum value of
Figure BDA00028642682800000514
Randomly selecting a value within the range
Figure BDA00028642682800000515
As attribute fj∈F*The value of (a) is divided. Initiative side pmSet F of candidate attributes*Each attribute f in (1)j∈F*Corresponding division value vj *To the passive party P*
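A minimal single-process sketch of the randomized split negotiation in steps 4.1-4.3 follows; the network exchanges between the active and passive parties are simulated by ordinary function calls, and all names (including the attribute names) are illustrative rather than taken from the patent.

```python
import random

def choose_candidate_attributes(all_attributes, n_candidates, rng):
    """Step 4.1: the active party draws the candidate attribute set F*."""
    return rng.sample(all_attributes, n_candidates)

def local_random_value(samples, attr, rng):
    """Step 4.2: one party draws a random value inside its local range of attr."""
    values = [s[attr] for s in samples]
    return rng.uniform(min(values), max(values))

def partition_values(candidates, own_samples, passive_randoms, rng):
    """Step 4.3: the active party draws, per attribute, a partition value inside the
    range spanned by its own random value and all random values it received."""
    splits = {}
    for attr in candidates:
        v_m = local_random_value(own_samples, attr, rng)
        pool = [v_m] + [r[attr] for r in passive_randoms]
        splits[attr] = rng.uniform(min(pool), max(pool))
    return splits

rng = random.Random(7)
attrs = ["acc_x", "acc_y", "acc_z", "gyro_x", "gyro_y", "heart_rate"]
active_data = [{a: rng.gauss(0, 1) for a in attrs} for _ in range(40)]
passive_data = [[{a: rng.gauss(0, 1) for a in attrs} for _ in range(40)] for _ in range(2)]

F_star = choose_candidate_attributes(attrs, 3, rng)
passive_randoms = [{a: local_random_value(d, a, rng) for a in F_star} for d in passive_data]
print(partition_values(F_star, active_data, passive_randoms, rng))
```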
Step 4.4: the active party p_m and each passive party in P* pre-partition their local sample data using the candidate attribute set and the corresponding attribute partition values, and compute the information gain q_IG of each attribute of the pre-partitioned candidate attribute set.
The active party p_m and each passive party p_i ∈ P* partition the sample data of the current node according to attribute f_j and its partition value v_j*, and compute the information gain of the partition:
q_IG(T_i, f_j) = H(T_i) − Σ_{a∈A} (|T_i^a| / |T_i|) · H(T_i^a),   where   H(T_i^a) = − Σ_{c∈C} (|T_i^{a,c}| / |T_i^a|) · log_2(|T_i^{a,c}| / |T_i^a|),
where T_i denotes the sample data of the current node at participant p_i, A denotes the set of child nodes produced by the partition, and C is the set of sample-data label classes; |T_i^a| denotes the number of samples of the child node a obtained by partitioning on attribute f_j, and |T_i^{a,c}| denotes the number of samples in child node a whose label is c.
Step 4.5: each passive party in P* adds Laplace noise to the information gain of each attribute in the pre-partitioned candidate attribute set to perturb it, obtaining the perturbed information gain of each attribute, and sends the perturbed information gains of all attributes in the candidate attribute set to the active party p_m.
The passive parties P* add Laplace noise to the information gain because the information gain is computed directly on the data, and exactly this kind of potential leakage is what a differentially private decision tree algorithm must prevent. As the depth of the tree increases, a larger privacy budget is allocated to each layer, which preserves the utility of the model while protecting privacy. The sensitivity of the information-gain function is
S(q_IG) = log(N + 1) + 1/ln 2,
where N is the size of the data set and ε is the privacy budget allocated by a passive party to each tree; ε controls the degree of privacy protection, and the smaller the privacy budget, the higher the degree of protection. Different budgets are allocated to different depths of the tree: when the tree depth is large, the true counts are more easily overwhelmed by noise because of the further partitioning, so deeper nodes are allocated a larger budget to preserve the real information. With k initialized to the height of the tree and S_t denoting the total number of shares of the privacy budget over all levels, ε' denotes the privacy budget allocated to each level of the random tree.
Laplace noise Lap(S(q_IG) / (ε'/|F*|)) is added to the information gain of each attribute partition in the candidate attribute set, and the information gain after the noise perturbation is expressed as
q̃_IG = q_IG + Lap(S(q_IG) · |F*| / ε').
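The Laplace perturbation of step 4.5 can be sketched as follows. The sensitivity follows the formula above (the logarithm is taken in base 2 here, matching the base-2 entropy), and deriving the per-level budget ε' from the per-tree budget ε by equal shares per level is an assumption of this sketch, since the exact allocation formula is not reproduced in this text; all names are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def ig_sensitivity(n_samples):
    """S(q_IG) = log2(N + 1) + 1/ln 2 (base 2 chosen to match the base-2 entropy)."""
    return math.log2(n_samples + 1) + 1.0 / math.log(2)

def perturb_gains(gains, n_samples, eps_level):
    """Add Lap(S(q_IG) / (eps_level / |F*|)) noise to one party's gain of every
    candidate attribute; `gains` maps attribute -> information gain at this node."""
    scale = ig_sensitivity(n_samples) / (eps_level / len(gains))
    return {attr: g + rng.laplace(0.0, scale) for attr, g in gains.items()}

gains = {"acc_x": 0.83, "gyro_y": 0.41, "heart_rate": 0.12}
eps_tree, tree_height = 1.0, 5
eps_level = eps_tree / tree_height            # assumed equal share per level (see lead-in)
print(perturb_gains(gains, n_samples=400, eps_level=eps_level))
```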
step 4.6, for each attribute f in the candidate attribute setj∈F*Master side pmInformation gain to the master respectively
Figure BDA0002864268280000064
And each passive side P*Gain of disturbance information
Figure BDA0002864268280000065
Carrying out weighted summation to obtain an attribute fjIs given a composite score
Figure BDA0002864268280000066
Figure BDA0002864268280000067
Wherein N isiIs the number of sample data of the passive side, NmIs the sample data number of the master.
Step 4.7: the active party p_m selects the attribute with the largest composite score as the optimal attribute and sends the optimal attribute together with its corresponding attribute partition value (obtained in step 4.3) to each passive party in P*. This optimal attribute and its partition value are the optimal attribute and partition value of the current node of the random decision tree being trained.
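Steps 4.6-4.7 reduce to a sample-count-weighted average followed by an argmax, as in this sketch; normalizing by the total sample count is this sketch's reading of the weighting, and all names and numbers are illustrative.

```python
def composite_score(own_gain, n_own, perturbed_gains, passive_sizes):
    """Sample-count-weighted average of one attribute's gain across all parties."""
    total = n_own * own_gain + sum(n * g for g, n in zip(perturbed_gains, passive_sizes))
    return total / (n_own + sum(passive_sizes))

own = {"acc_x": 0.83, "gyro_y": 0.41}                 # active party's own gains
perturbed = [{"acc_x": 0.70, "gyro_y": 0.55},          # perturbed gains received from
             {"acc_x": 0.91, "gyro_y": 0.30}]          # two passive parties
sizes = [250, 310]                                     # passive parties' sample counts
best = max(own, key=lambda a: composite_score(own[a], 400, [p[a] for p in perturbed], sizes))
print(best)   # 'acc_x' -> chosen as the optimal attribute of the current node (step 4.7)
```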
Step 5: the active party p_m and each passive party in P* add the random decision tree x constructed in step 4 to their respective random forest models by way of incremental selection.
When selecting trees incrementally, the active party p_m and each passive party in P* add the random decision tree x to their existing random forest model only if it contributes to the accuracy of that model. A tree contributes to the accuracy of the random forest model if it satisfies E(X(T)) ≤ E((X + x)(T)), where X denotes the user's existing random forest model, T denotes the user's test data, and E denotes the test accuracy of the random forest; that is, the test accuracy does not decrease after the new random tree x is added.
Step 6: repeat steps 3-5; in each repetition the active party p_m and every passive party in P* update their respective random forest models once (whether or not the random decision tree x is actually added counts as one model update). During the training phase, the random forest model of each participant can be in one of two training states, active and passive: when the participant acts as an active party its model is in the active training state, and when it acts as a passive party its model is in the passive training state. After every participant has acted as an active party and completed the preset number of training rounds and model updates, the federal learning is finished and training ends. At this point every user p_o ∈ P owns a different federal random forest model X_o. Because each user trains jointly only with similar users, and the incremental selection process is added, the user's model is personalized and better suited to the user's local data.
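Putting steps 3-6 together, the following self-contained sketch runs the training loop on synthetic data in a single process. The joint, differentially private node splitting of step 4 is stood in for by an ordinary scikit-learn decision tree trained on the pooled samples of the selected parties (the per-step sketches above show the actual split protocol); the incremental-selection rule E(X(T)) ≤ E((X + x)(T)) of step 5 is applied per party on its own local data. All names are illustrative, not the patent's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N_PARTIES, ROUNDS, K_SIMILAR = 5, 3, 2

# Synthetic local data (features, activity labels) standing in for each party's wearable samples.
data = [(rng.normal(size=(80, 6)), rng.integers(0, 3, size=80)) for _ in range(N_PARTIES)]
# Stand-in for the similar-participant sets of step 2.
similar = {p: [q for q in range(N_PARTIES) if q != p][:K_SIMILAR] for p in range(N_PARTIES)}
forests = {p: [] for p in range(N_PARTIES)}

def accuracy(forest, X, y):
    """Majority-vote accuracy of a forest; an empty forest scores 0."""
    if not forest:
        return 0.0
    votes = np.stack([t.predict(X) for t in forest])
    majority = np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
    return float((majority == y).mean())

for _ in range(ROUNDS):
    for active in range(N_PARTIES):
        # Step 3: the active party picks a random subset of its similar participants.
        passives = [int(q) for q in rng.choice(similar[active], size=1, replace=False)]
        group = [active] + passives
        # Step 4 (simplified stand-in): one tree trained on the group's pooled samples.
        X = np.vstack([data[p][0] for p in group])
        y = np.concatenate([data[p][1] for p in group])
        tree = DecisionTreeClassifier(max_depth=4, max_features="sqrt").fit(X, y)
        # Step 5: each involved party keeps the tree only if its own accuracy does not drop.
        for p in group:
            Xp, yp = data[p]
            if accuracy(forests[p] + [tree], Xp, yp) >= accuracy(forests[p], Xp, yp):
                forests[p].append(tree)

print({p: len(forests[p]) for p in forests})   # number of trees each party kept
```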
Third, the local actual-measurement stage.
Step 7: in the personal wearable device scenario, each user continuously collects the data D_o generated by that individual with his or her own device, and uses the trained federal random forest model X_o to perform activity prediction on the newly collected data. To make a decision, the user obtains the prediction result of every individual decision tree and then tallies these results as votes for each activity class; the class with the most votes wins and is the activity label recognized by the personalized random forest model for the human activity recognition task. The user tracks his or her own activities with wearable devices such as smartphones, bracelets and smart glasses, and thereby learns about his or her own health condition.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims (3)

1. The wearable equipment activity prediction method based on the federal personalized random forest is characterized by comprising the following steps of:
step 1, calculating hash values of all sample data of the wearable device of each participant by using locality-sensitive hashing, wherein each participant sends a hash table formed by the hash values of its sample data to other participants, and the other participants find similar sample data among different participants according to a global hash table;
step 2, determining a similar participant set of each current participant, namely: the current participant respectively counts the number of similar sample data of each other participant and the current participant, calculates the ratio of this number to the total number of all sample data of that other participant, and takes the other participants whose ratio ranks in the top k to form the similar participant set of the current participant; wherein k is a set value;
step 3, each current participant randomly selects part of similar participants from the similar participant set, wherein the current participant serves as a current active party, and the selected similar participants serve as current passive parties;
step 4, under the coordination of the current active party, using sample data of the current active party and each current passive party to jointly train the optimal attribute and the attribute division value of each node of the random decision tree, and after the training is finished, the current active party and each current passive party respectively obtain a same random decision tree;
step 5, the current active party and each current passive party add the random decision tree obtained in the step 4 into respective federal random forest models in an increment selection mode;
step 6, repeating the steps 3-5 until the training times of each current participant reach the preset training times, and finally enabling each participant to have different federal random forest models respectively;
and 7, inputting the measured data of the wearable equipment of the participant into the federal random forest model obtained in the step 6, so as to complete the activity prediction of the participant, wherein the prediction result is an activity label of the human activity recognition task.
2. The method for predicting the activity of wearable equipment based on the federal personalized random forest as claimed in claim 1, wherein in the step 5, the optimal attribute and the attribute dividing value of each node of the random decision tree are jointly trained by using sample data of the current active party and each current passive party by the following process:
step 5.1, the current active party randomly selects a part of attributes from the attribute set to form a candidate attribute set, and sends the candidate attribute set to each current passive party;
step 5.2, the current active party and each current passive party respectively randomly select a value from the range of the local sample data attribute value of each attribute of the candidate attribute set as an attribute random value; each current passive party sends the attribute random values of all the attributes of the candidate attribute set to the current active party;
step 5.3, the current active party randomly selects a value from the range of the attribute random value of each attribute of the candidate attribute set of the current active party and each current passive party as an attribute division value, and sends the attribute division values of all the attributes of the candidate attribute set to each current passive party;
step 5.4, the current active party and each current passive party pre-divide the local sample data thereof by using the candidate attribute set and the corresponding attribute dividing value respectively, and calculate the information gain of each attribute in the pre-divided candidate attribute set;
step 5.5, adding Laplace noise into the information gain of each attribute in the candidate attribute set by each current passive party for disturbance to obtain the disturbance information gain of each attribute in the candidate attribute set, and sending the disturbance information gain of all attributes in the candidate attribute set to the current active party;
step 5.6, the active party carries out weighted summation on the information gain of the current active party and the disturbance information gain of each current passive party of each attribute of the candidate attribute set to obtain the comprehensive score of each attribute of the candidate attribute set, and selects the attribute with the maximum comprehensive score as the optimal attribute;
and 5.7, the current active party sends the optimal attribute and the attribute division value corresponding to the optimal attribute to each current passive party, wherein the optimal attribute and the attribute division value corresponding to the optimal attribute are the optimal attribute and the attribute division value of the current node of the trained random decision tree.
3. The method of claim 1, wherein in step 1, each participant sends a hash table comprising hash values of sample data to each participant through a distributed communication framework.
CN202011577207.7A 2020-12-28 2020-12-28 Wearable device activity prediction method based on federal personalized random forest Active CN112699402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011577207.7A CN112699402B (en) 2020-12-28 2020-12-28 Wearable device activity prediction method based on federal personalized random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011577207.7A CN112699402B (en) 2020-12-28 2020-12-28 Wearable device activity prediction method based on federal personalized random forest

Publications (2)

Publication Number Publication Date
CN112699402A CN112699402A (en) 2021-04-23
CN112699402B true CN112699402B (en) 2022-06-17

Family

ID=75512510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011577207.7A Active CN112699402B (en) 2020-12-28 2020-12-28 Wearable device activity prediction method based on federal personalized random forest

Country Status (1)

Country Link
CN (1) CN112699402B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118601B (en) * 2021-12-02 2024-02-13 安徽大学 Random forest traffic prediction method based on differential privacy protection
CN114530028B (en) * 2022-02-14 2023-04-25 大连理工大学 Campus student intelligent bracelet monitoring system and method based on LoRa communication and federal learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447525A (en) * 2015-12-15 2016-03-30 中国科学院软件研究所 Data prediction classification method and device
CN108108657A (en) * 2017-11-16 2018-06-01 浙江工业大学 A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning
CN109284626A (en) * 2018-09-07 2019-01-29 中南大学 Random forests algorithm towards difference secret protection
CN111598186A (en) * 2020-06-05 2020-08-28 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning
CN111742313A (en) * 2018-02-14 2020-10-02 万思伴股份有限公司 System, apparatus and method for privacy preserving context authentication

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10095552B2 (en) * 2016-02-05 2018-10-09 Sas Institute Inc. Automated transfer of objects among federated areas
WO2019075338A1 (en) * 2017-10-12 2019-04-18 Charles River Analytics, Inc. Cyber vaccine and predictive-malware-defense methods and systems

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447525A (en) * 2015-12-15 2016-03-30 中国科学院软件研究所 Data prediction classification method and device
CN108108657A (en) * 2017-11-16 2018-06-01 浙江工业大学 A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning
CN111742313A (en) * 2018-02-14 2020-10-02 万思伴股份有限公司 System, apparatus and method for privacy preserving context authentication
CN109284626A (en) * 2018-09-07 2019-01-29 中南大学 Random forests algorithm towards difference secret protection
CN111598186A (en) * 2020-06-05 2020-08-28 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Differential Privacy Preserving Decision Trees for Incomplete Data Sets; Shen Siqian et al.; Computer Science; 2017-06-30; Vol. 44, No. 6; pp. 139-143 *

Also Published As

Publication number Publication date
CN112699402A (en) 2021-04-23

Similar Documents

Publication Publication Date Title
Nettleton Data mining of social networks represented as graphs
Taha et al. SIIMCO: A forensic investigation tool for identifying the influential members of a criminal organization
CN111932386B (en) User account determining method and device, information pushing method and device, and electronic equipment
CN112699402B (en) Wearable device activity prediction method based on federal personalized random forest
Tekin et al. Adaptive ensemble learning with confidence bounds
CN109344326A (en) A kind of method for digging and device of social circle
WO2021203854A1 (en) User classification method and apparatus, computer device and storage medium
CN109766454A (en) A kind of investor's classification method, device, equipment and medium
Avrachenkov et al. Network partitioning algorithms as cooperative games
CN109960755B (en) User privacy protection method based on dynamic iteration fast gradient
CN110134883B (en) Heterogeneous social network location entity anchor link identification method
US11397757B2 (en) Relevance estimation and actions based thereon
Lingras et al. Iterative meta-clustering through granular hierarchy of supermarket customers and products
CN112418525A (en) Method and device for predicting social topic group behaviors and computer storage medium
Chen et al. A temporal recommendation mechanism based on signed network of user interest changes
Gokulkumari et al. Analyze the political preference of a common man by using data mining and machine learning
Sun et al. Overlapping community detection based on information dynamics
Wan et al. Addressing location uncertainties in GPS‐based activity monitoring: A methodological framework
Taha et al. A system for analyzing criminal social networks
Aylani et al. Community detection in social network based on useras social activities
Domagala Internet of Things and Big Data technologises as an opportunity for organizations based on Knowledge Management
Nguyen et al. Intelligent collective: some issues with collective cardinality
Zhu et al. Context-aware restaurant recommendation for group of people
Mukhina et al. Forecasting of the Urban Area State Using Convolutional Neural Networks
CN116502132A (en) Account set identification method, device, equipment, medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240126

Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Dragon totem Technology (Hefei) Co.,Ltd.

Country or region after: China

Address before: 541004 No. 15 Yucai Road, Qixing District, Guilin, the Guangxi Zhuang Autonomous Region

Patentee before: Guangxi Normal University

Country or region before: China