CN114692894A - Implementation method of machine learning model supporting dynamic addition and deletion of user data - Google Patents


Info

Publication number
CN114692894A
CN114692894A
Authority
CN
China
Prior art keywords
data
user
model
server
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210353332.2A
Other languages
Chinese (zh)
Inventor
毛云龙 (Mao Yunlong)
李成成 (Li Chengcheng)
仲盛 (Zhong Sheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202210353332.2A
Publication of CN114692894A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/22 Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
    • G06F 7/36 Combined merging and sorting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G06N 5/041 Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising the following steps: step 1, collecting data at the server side; step 2, partitioning the user data set into a series of mutually independent data subsets, where each piece of user data is stored in exactly one subset; step 3, training a submodel on each data subset obtained from the partition, and performing joint inference over the trained submodels with an integrated model; step 4, responding to a user's data update request to realize incremental learning of newly added user data; and step 5, responding to a user's data deletion request to realize forgetting of that user's data. The invention can perform machine learning on any amount of user data and produce accurate model inference results.

Description

Implementation method of machine learning model supporting dynamic addition and deletion of user data
Technical Field
The invention relates to a method for model training and updating in online machine learning services, and in particular to a method for implementing a machine learning model that supports dynamic addition and deletion of user data.
Background
With the rapid development of machine learning, especially deep learning, many enterprises have begun to train deep neural network models and offer them as a class of user services, i.e., machine learning services such as image classification, face recognition, and speech translation, which bring many conveniences to people's lives through online machine learning. At the same time, however, to improve service quality, service providers need to use real data as the training set of their models, thereby improving model usability. Some machine learning services therefore require the user to upload part of their private data, on which an accurate and efficient machine learning model is trained to better satisfy the user's model inference requests.
However, this approach also means that user data is stored on the server, and even if the user stops using the service, previously uploaded data still exists in the online service model, increasing the risk of leakage of the user's private data. To solve this problem, the server must not only respond to users' data update requests but also satisfy their data deletion requests, permanently deleting the relevant user data from the server. In the machine learning scenario, the user's data information resides not only in the model's training set; the machine learning model can also memorize the data it was trained on, so the model itself contains the user's information and a machine unlearning operation must be executed.
However, the inventors of the present application have found that the above techniques have at least the following technical problem: the model structure of existing online machine learning services does not support machine unlearning of partial data; it can only be achieved by completely retraining the model, so data deletion requires a large amount of computation and a long retraining time.
Therefore, for the online machine learning service scenario, a flexible machine learning model needs to be constructed that supports dynamic addition and deletion of user data while maintaining the usability and update efficiency of the model.
Disclosure of Invention
The technical problem the invention aims to solve is to provide, in view of the defects of the prior art, a method for implementing a machine learning model that supports dynamic addition and deletion of user data.
The application provides a method for implementing a machine learning model supporting dynamic addition and deletion of user data, characterized by comprising the following steps:
step 1, data collection at the server side; the collected data comprises an auxiliary data set, consisting of related public data sets and data collected by the server in advance, and user data uploaded by users;
step 2, the server partitions the user data set into a series of mutually independent data subsets, with each piece of user data stored in exactly one subset;
step 3, the server trains a submodel on each data subset obtained from the partition, and the trained submodels perform joint inference through an integrated model;
step 4, the server responds to a user's data update request to realize incremental learning of the newly added user data; a data update request consists of the user interacting with the server, uploading their private data, and asking the server to learn the newly uploaded data;
step 5, the server responds to a user's data deletion request to realize machine unlearning of existing user data; a data deletion request consists of the user interacting with the server, uploading their identity, and asking the server to delete all data under that identity and to remove the training contribution of that data from the machine learning model.
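The five steps above can be sketched as a minimal server-side class. This is an illustrative Python sketch only: the class and method names are assumptions, not from the patent, and the actual training algorithms are abstracted into callbacks.

```python
class UnlearnableMLService:
    """Hypothetical skeleton of the server in steps 1-5: partitioned user
    data, one submodel per subset, and an integrated (ensemble) model."""

    def __init__(self, aux_data, m, train_submodel, train_ensemble):
        self.aux = list(aux_data)            # auxiliary data set (step 1)
        self.m = m                           # number of data subsets
        self.train_submodel = train_submodel
        self.train_ensemble = train_ensemble
        self.user_data = []                  # full user data set D_u
        self.subsets = [[] for _ in range(m)]
        self.submodels = [None] * m
        self.ensemble = None

    def handle_update(self, records, j):
        """Step 4: incremental learning of newly added data. The subset
        index j would normally be chosen by average available time
        (step 2); here it is passed in explicitly."""
        self.user_data.extend(records)
        self.subsets[j].extend(records)
        self.submodels[j] = self.train_submodel(self.subsets[j] + self.aux)
        self.ensemble = self.train_ensemble(self.submodels, self.user_data)

    def handle_delete(self, u):
        """Step 5: unlearn all records whose identity (r[2]) equals u by
        retraining only the affected submodel and the integrated model."""
        self.user_data = [r for r in self.user_data if r[2] != u]
        for j, s in enumerate(self.subsets):
            if any(r[2] == u for r in s):
                self.subsets[j] = [r for r in s if r[2] != u]
                self.submodels[j] = self.train_submodel(self.subsets[j] + self.aux)
        self.ensemble = self.train_ensemble(self.submodels, self.user_data)
```

Only the touched submodel and the small integrated model are retrained on deletion, which is the source of the scheme's efficiency claim.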
The technical scheme of the invention is further defined as follows: in step 1, each piece of user data comprises a raw data sample, the sample's data label, the user identity of the sample, and the sample's available time;
raw data samples: each user has several data samples, and a single sample is denoted x;
data labels: each sample has a corresponding label indicating a specific data category, denoted y;
user identities: data from the same user share the same identity, denoted u;
available time: the time the data remains usable, specified by the user, denoted t.
Preferably, in step 2, the user data set consists of multiple pieces of user data; a user data set containing n pieces of data is represented as D_u = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ n}.
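For illustration, the user-data tuple defined above can be sketched as follows; the field types are assumptions, not prescribed by the patent.

```python
from dataclasses import dataclass

# Minimal sketch of the user-data tuple (x, y, u, t) described above.
@dataclass(frozen=True)
class UserRecord:
    x: tuple   # raw data sample, e.g. flattened image pixels
    y: int     # data label: index of the data category
    u: str     # user identity, shared by all of one user's records
    t: float   # available time of the sample, specified by the user

# D_u = {(x_i, y_i, u_i, t_i) | 1 <= i <= n} as a plain collection
D_u = [
    UserRecord(x=(0.1, 0.9), y=3, u="user-1", t=30.0),
    UserRecord(x=(0.4, 0.2), y=7, u="user-2", t=90.0),
]
```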
Preferably, the data partitioning process in step 2 comprises splitting the user data set by the user identity attribute and computing the average available time of each user's data; then sorting the single-user data sets by this average and dividing the sorted data, in order, into a series of mutually independent data subsets. Each data subset contains data from multiple users, and different data subsets are independent of each other.
Preferably, step 2 comprises the following specific steps:
Step 201: the user data set D_u is split according to the identity attribute of the data into a number of single-user data sets, where each single-user data set contains exactly the data from one specific user; the data set of the jth user can be represented as D_u^j = {(x_i, y_i, u_i, t_i) ∈ D_u | u_i = u_j}.
Step 202: for each single-user data set, the average available time of the data in it is calculated as T_j = (1/|D_u^j|) Σ_{(x_i, y_i, u_i, t_i) ∈ D_u^j} t_i.
Step 203: the single-user data sets are sorted by average available time with a sorting algorithm, yielding an ordered sequence of single-user data sets such that for any two sets in the sequence, T_i ≤ T_j whenever i < j.
Step 204: following the same order, the user data set is divided into m subsets, each containing the data of ⌈s/m⌉ users (s being the number of users); the jth subset, denoted D^j with 1 ≤ j ≤ m, is the union of the single-user data sets assigned to it. This completes the partition of the user data set.
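Steps 201 to 204 can be sketched as follows. This is hypothetical Python: records are plain (x, y, u, t) tuples and all helper names are illustrative.

```python
from collections import defaultdict

def partition_user_data(records, m):
    """Sketch of steps 201-204: split D_u by user identity, sort the
    single-user sets by average available time, then cut the ordered
    sequence into m mutually independent subsets."""
    # Step 201: one data set per user identity
    per_user = defaultdict(list)
    for x, y, u, t in records:
        per_user[u].append((x, y, u, t))
    # Step 202: average available time T_j of each single-user set
    avg_time = {u: sum(r[3] for r in rs) / len(rs) for u, rs in per_user.items()}
    # Step 203: order users by T_j
    ordered_users = sorted(per_user, key=avg_time.get)
    # Step 204: cut the ordered users into m contiguous groups of ceil(s/m)
    group_size = -(-len(ordered_users) // m)  # ceiling division
    subsets = []
    for j in range(m):
        users = ordered_users[j * group_size:(j + 1) * group_size]
        subsets.append([r for u in users for r in per_user[u]])
    return subsets
```

Because the cut is made along user boundaries, each user's data lands in exactly one subset, which is what later makes single-subset unlearning possible.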
Preferably, the model training process of step 3 comprises training of the submodels and training of the integrated model. Training a submodel consists of running a machine learning algorithm on one data subset together with the auxiliary data set; the training algorithm is executed on each data subset to obtain the corresponding submodel. Training the integrated model consists of running a machine learning algorithm over all submodels and all data sets to obtain the integrated model.
Preferably, step 3 specifically comprises the following steps:
Step 301: initialize a submodel list for storing the parameters of each submodel.
Step 302: set the basic structure and hyperparameters of submodel training, including the number of training rounds N, the learning rate lr, and the loss function L of the model.
Step 303: for the jth model, initialize an individual set of model parameters, denoted W_j.
Step 304: use the jth data subset D^j obtained from the partition, together with the auxiliary data set D_o held by the server, as the training set of the current model.
Step 305: according to the model's hyperparameters, train the model for N rounds at learning rate lr, updating the parameters with the loss function in each round, i.e. W_j ← W_j - lr · ∂L(W_j(x), y)/∂W_j.
Step 306: for j = 1 to j = m, repeat steps 303 to 305 and save each resulting W_j to the submodel list, obtaining the submodels corresponding to all subsets.
Step 307: set the basic structure and hyperparameters of integrated-model training, again including the number of training rounds N, the learning rate lr, and the loss function L.
Step 308: initialize the parameters of the integrated model, denoted E.
Step 309: use the full user data set D_u together with the auxiliary data set D_o as the original training set of the integrated model.
Step 310: according to the integrated model's hyperparameters, train it for N rounds at learning rate lr, updating the parameters with the loss function in each round, i.e. E ← E - lr · ∂L(E(W_1(x), …, W_m(x)), y)/∂E, where W_1(x), …, W_m(x) denote the submodels' outputs on input x.
Step 311: save the parameters of the integrated model; the model training process ends.
Step 312: use the submodels and the integrated model to perform inference on unknown data. The inference process can be written as E(W_1(x), …, W_m(x)): all submodels first process the input, then all their outputs are fed into the integrated model for aggregation, and the integrated model outputs the final inference result.
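The data flow of steps 301 to 312 can be sketched with toy stand-ins. The patent trains neural networks by gradient descent; in this illustrative sketch a "submodel" is a nearest-class-mean classifier and the integrated model is replaced by a majority vote, purely to show the structure of E(W_1(x), …, W_m(x)).

```python
def train_submodel(subset, aux):
    """Toy stand-in for steps 303-305: memorise the mean feature per
    label. (The patent updates W_j by gradient steps instead.)"""
    by_label = {}
    for x, y, *_ in list(subset) + list(aux):
        by_label.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_label.items()}

def submodel_predict(W, x):
    # W_j(x): predict the label whose stored mean is nearest to x
    return min(W, key=lambda y: abs(W[y] - x))

def train_ensemble(subsets, aux):
    """Steps 301-311: one submodel per subset, plus an aggregator E
    (majority vote here, a trained model in the patent)."""
    submodels = [train_submodel(s, aux) for s in subsets]
    def E(votes):
        return max(set(votes), key=votes.count)
    return submodels, E

def infer(submodels, E, x):
    """Step 312: E(W_1(x), ..., W_m(x))."""
    return E([submodel_predict(W, x) for W in submodels])
```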
Preferably, the model update process in step 4 comprises: the server responds to a user's data update request and receives the data samples uploaded by the user; the server then uses a machine learning algorithm to update the submodel and the integrated model, completing the learning of the newly added data. A data update request consists of the user interacting with the server and uploading k pieces of their private data, the uploaded data set being D_update = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}. The specific processing steps are as follows:
Step 401: after receiving the data set, the server merges it into the existing user data set, i.e. D_u ← D_u ∪ D_update, and calculates the average available time of the newly added data, T_update = (1/k) Σ_{i=1}^{k} t_i.
Step 402: according to the average available time T_update, the new data is added to the jth data subset, i.e. D^j ← D^j ∪ D_update, with j chosen such that T_{j-1} < T_update ≤ T_j; that is, the receiving subset's average available time is as close as possible to that of the newly added data.
Step 403: repeat step 305 with the updated data subset D^j, training the submodel so that it learns the newly added data, and then save the parameters of the newly trained submodel.
Step 404: repeat step 310 with the updated data set D_u and the updated submodel, training the integrated model so that it learns the new data, and then save the parameters of the newly trained integrated model.
Step 405: the data update process ends; the model's inference process is the same as step 312 above.
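The subset-selection rule of step 402 can be sketched as follows (illustrative Python; the chosen subset's average available time should be as close as possible to that of the new data, and helper names are assumptions):

```python
def average_time(records):
    # records are (x, y, u, t) tuples; t is the available time
    return sum(r[3] for r in records) / len(records)

def handle_update(subsets, new_records):
    """Sketch of steps 401-402: merge the uploaded records into the
    subset whose average available time is closest to theirs; only
    that submodel (plus the integrated model) then needs retraining."""
    t_update = average_time(new_records)
    j = min(range(len(subsets)),
            key=lambda i: abs(average_time(subsets[i]) - t_update))
    subsets[j].extend(new_records)
    return j  # index of the single submodel to retrain
```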
Preferably, the model update process in step 5 comprises: the server responds to a user's data deletion request and receives the identity uploaded by the user; the server then retrains the parameters of the affected submodel and the integrated model with a machine unlearning algorithm, completing the forgetting of the existing data. The specific steps are as follows:
Step 501: after receiving the identity, the server filters out of the user data set all user data with that identity, denoted D_remove = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k, u_i = u}, where u is the user's identity.
Step 502: the server searches all data subsets for the data related to D_remove. Because the partition places data with the same user identity in the same data subset, the server can determine the unique data subset containing that user's data, denoted D^j.
Step 503: the server removes this data from D^j, i.e. D^j ← D^j \ D_remove, and then discards the original submodel.
Step 504: repeat step 305 with the reduced data subset D^j, retraining the submodel so that it no longer contains information about the removed data, and then save the parameters of the newly trained submodel.
Step 505: the server removes D_remove from the full user data set, i.e. D_u ← D_u \ D_remove, and then discards the original integrated model.
Step 506: repeat step 310 with the reduced user data set D_u, retraining the integrated model so as to remove the information related to that user's data, and then save the parameters of the integrated model.
Step 507: the data deletion process ends; the model's inference process is the same as step 312 above.
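Steps 501 to 505 can be sketched as follows (illustrative Python; records are (x, y, u, t) tuples and the retraining itself is omitted):

```python
def handle_deletion(user_data, subsets, u):
    """Sketch of steps 501-505: remove all records of identity u from
    the full user data set and from the subset holding them; the
    returned indices mark the submodels that must be retrained."""
    d_remove = [r for r in user_data if r[2] == u]      # step 501
    user_data[:] = [r for r in user_data if r[2] != u]  # step 505
    touched = []
    for j, subset in enumerate(subsets):                # step 502
        if any(r[2] == u for r in subset):
            subset[:] = [r for r in subset if r[2] != u]  # step 503
            touched.append(j)
    # by construction of the partition, one user's data lives in
    # exactly one subset, so touched has at most one element
    return d_remove, touched
```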
The technical scheme provided in the embodiments of the application has at least the following technical effects or advantages: 1) the invention provides, for the online service scenario, a method for implementing a machine learning model that supports dynamic addition and deletion of user data.
2) The invention supports processing both data update requests and data deletion requests from users, and achieves efficient processing and complete data forgetting by adjusting and updating only part of the network parameters in the model.
3) Compared with a scheme that completely retrains the machine learning model, the method greatly reduces the computation required for data updates and data forgetting while keeping model accuracy unchanged, so it can adapt to frequent data changes and respond to user requests more quickly.
4) The method can be applied to most online machine learning services, especially fields closely tied to user privacy, and has broad application scenarios.
Drawings
Fig. 1 is a schematic diagram of a network structure of a server and multiple users according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a server data set and a machine learning model structure according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of the division of a user data set into data subsets according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a training process of a machine learning model according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a data updating process of the online machine learning service according to the embodiment of the present invention.
FIG. 6 is a diagram illustrating model structure update of an online machine learning service according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a data deletion process of the online machine learning service according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating model structure updating of an online machine learning service according to an embodiment of the present invention.
Detailed Description
The invention provides a machine learning model implementation method supporting continuous data updates and deletions in an online service scenario. In this method, the online machine learning service uses a centralized machine learning model: the server collects and collates data and trains a machine learning model for users to use, while users upload part of their data and use the model for inference tasks. According to changes in a user's wishes, the user can dynamically upload new data to the server while the service is running, or require the server to delete data they previously uploaded.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and specific examples.
Examples
Fig. 1 is a schematic diagram of the network structure of the user-server model of the present invention. In the online machine learning service, a central node acts as the server; the server side stores the data uploaded by users and trains a machine learning model on it. The remaining user nodes only exchange data with the server. Taking an image classification task as an example, the data interaction mainly comprises three kinds of request. The first is a user's model inference request: the user uploads some unlabeled image samples to the server and asks the server to run inference with the machine learning model; the server returns the classification result to the user and immediately deletes the uploaded images. The second is a user's data update request: the user uploads the raw image data x, image category y, user identity u, and data available time t together as four-element data tuples, and a single request can contain several tuples at once, i.e. D_update = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}; after receiving the request, the server stores the data in its local image data set and updates the machine learning model, thereby gaining classification inference capability on the newly uploaded images. The third is a user's data deletion request: the user uploads their personal identity u and asks the server to delete all data related to that identity; after receiving the request, the server must delete the image data with that identity from its locally stored data set and update the machine learning model to remove the related information from the model, thereby realizing complete machine unlearning.
FIG. 2 is a diagram of the server data set and the machine learning model structure of the present invention. The server-side data set mainly comprises an auxiliary data set and a user data set. The auxiliary data set mainly consists of public data and some data owned by the server; it contains only data samples and data labels, so user data updates and deletions need not be considered for it. The user data set mainly comprises data uploaded by multiple users, with each piece of user data containing a data sample, a data label, a user identity, and a data available time. The machine learning model shown in the figure likewise consists of two parts. The first is a set of submodels that share the same parameter structure but are trained separately on different user data: in Fig. 2, for example, there are 6 users divided into 3 data subsets, trained by 3 submodels respectively; to improve training quality, the auxiliary data set is also used when training the submodels. After submodel training is complete, the invention uses a single integrated model to aggregate the submodels' outputs and obtain the final inference result.
Fig. 3 is a schematic flow chart of partitioning the user data set into data subsets according to the present invention. For example, the server provides an image classification service for the CIFAR series, where CIFAR100 serves as the auxiliary data set and CIFAR10 serves as the data uploaded by 100 users; each user randomly owns part of CIFAR10, and each piece of data is augmented with the user's identity and available time. CIFAR10 contains 50000 data samples in total. The data partitioning method used by the server is as follows:
Step 201: the server divides the data set into a series of single-user data sets according to user identity. In this example each single-user data set represents the data owned by one user, so the partition yields 100 single-user data sets, each containing 500 data samples; the jth user's data set is D_u^j = {(x_i, y_i, u_i, t_i) ∈ D_u | u_i = u_j}, where 1 ≤ j ≤ 100.
Step 202: for each single-user data set, calculate the average available time of the data in it, T_j = (1/|D_u^j|) Σ t_i. In this example, by aggregating the available times in the 100 single-user data sets, the average available time of each set, denoted T_j with 1 ≤ j ≤ 100, is obtained.
Step 203: sort the single-user data sets by average available time with a sorting algorithm, yielding an ordered sequence of single-user data sets such that for any two sets in the sequence, T_i ≤ T_j whenever i < j.
Step 204: divide the ordered single-user data sets sequentially into m subsets. In this example the data is divided into 10 subsets in total, each containing the data of 10 users; the jth data subset can be expressed as D^j = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}, where 1 ≤ j ≤ 10 and k is the number of data items in each subset. This group of data subsets is the partition result, completing the data set partitioning step.
FIG. 4 is a schematic diagram of the training process of the machine learning model according to the present invention. In this example, the server trains one submodel on each of the 10 divided data subsets, then aggregates the submodels' results with the integrated model to obtain the prediction output of the overall model. The specific implementation steps are as follows:
Step 301: the server initializes an empty submodel list for storing the training parameters of each submodel.
Step 302: the server selects a suitable model structure for each data subset and sets the corresponding training hyperparameters. In this example, the server may use a ResNet network as the submodel structure for training on the CIFAR10 data set; the related hyperparameters may be set as training rounds N = 30 and learning rate lr = 0.1, with the cross-entropy loss L = -Σ_c y_c log(p_c), where y_c is 1 only in the cth dimension of the one-hot label vector and p_c is the model's predicted probability for class c.
Step 303: initialize a separate submodel W_j for each data subset. In this example the server randomly initializes a new ResNet model.
Step 304: the server combines the jth data subset D^j and the auxiliary data set D_o into the training data set of the current model. In this example each data subset D^j contains 1/10 of the original data set, i.e. 5000 user data samples, plus the complete auxiliary data set.
Step 305: using a machine learning algorithm with the model's hyperparameters, train the model for N rounds at learning rate lr, updating the parameters with the loss function in each round, i.e. W_j ← W_j - lr · ∂L(W_j(x), y)/∂W_j. In this embodiment the server trains the initialized ResNet model for 30 rounds, and after training the submodel's parameters are saved to the list.
Step 306: repeat steps 303 to 305 with the different data subsets, for j = 1 to j = m, completing the submodel training process. In this example, after training the server obtains 10 different ResNet models; they are mutually independent, have learned different user information, and can each perform data inference independently.
Step 307: the server selects a suitable integrated model and sets the related hyperparameters. In this example the server may use a fully-connected network as the integrated model structure; since a fully-connected network contains few parameters, its training time is relatively short. Here the training rounds are N = 10, the learning rate is lr = 0.01, and the loss function is again cross-entropy.
Step 308: the server randomly initializes the parameters E of the integrated model.
Step 309: the server trains the integrated model on the full user data set. In this example the server trains the integrated model for 10 rounds at a learning rate of 0.01, with all 50000 pieces of user data participating in each round; the model update is E ← E - lr · ∂L(E(W_1(x), …, W_m(x)), y)/∂E, where W_1(x), …, W_m(x) denote the submodels' outputs on input x.
Step 310: after training, save the parameters of the integrated model; the model training process ends.
Step 311: with the above submodels and integrated model, inference on unknown data can be performed. The inference process can be written as E(W_1(x), …, W_m(x)): all submodels first process the input, then all their outputs are fed into the integrated model for aggregation, and the integrated model outputs the final inference result.
Fig. 5 and fig. 6 are schematic diagrams of a data updating process and a model structure updating of the online machine learning service according to the present invention. The user data updating request comprises that a user interacts with the server and uploads own k pieces of private data. At this time, the uploaded data set is Dupdate={(xi,yi,ui,ti) I is more than or equal to 1 and less than or equal to k. After the server receives the request, the specific processing procedure is as follows:
the server first saves this part of the data to a local user data set, step 401, i.e. it saves this part of the data to a local user data set
Figure BDA0003579973160000093
Taking fig. 6 as an example, on the basis of fig. 2, the server receives the update data of the user 7 at this time;
in step 402, the server calculates the average available time of the newly added data
Figure BDA0003579973160000101
In step 403, in all the data subsets, the server searches the data subset closest to the average available time, and adds the new data to the data subset. Namely that make
Figure BDA0003579973160000102
And Tj-1<Tupdate≤Tj. As shown in fig. 6, the server finds that the subset 3 is closest to the available time of the new data, and adds the subset 3 with the available time;
in step 404, the server updates the submodel corresponding to that data subset. As shown in fig. 6, since the data of subset 3 has been updated, the server updates the submodel corresponding to subset 3. For a data update, the server can proceed in an incremental-learning manner: starting from the previously trained submodel parameters, it continues the machine learning training process on the updated data subset D_j^s, enabling the submodel to learn the distribution of the update data. After the training finishes, the server saves the parameters of the trained submodel and replaces the original submodel;
in step 405, the server updates the final integration model. Since the submodel parameters have been updated and its outputs change accordingly, the server must further update the parameters of the integration model: the updated data set D_u and the updated submodels are used to train the integration model so that it can learn the new data, and the parameters of the newly trained integration model are then saved and replace the original integration model;
at step 406, the data update process ends. As shown in fig. 6, the parts of the model that need to change comprise one submodel and the integration model; compared with updating the whole model, the data update cost required by the machine learning scheme of the present invention is therefore only about 1/m of the cost of updating the whole model, where m denotes the number of submodels.
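Steps 401–403 can be sketched as follows. The record layout (x, y, u, t) follows the notation above; the `route_update` helper and the representation of shard boundaries as a sorted list of per-subset maximum average available times T_j are assumptions of this sketch:

```python
from bisect import bisect_left

# Sketch of the update path: merge k new records, compute their average
# available time T_update, and route them to the data subset whose
# boundary satisfies T_{j-1} < T_update <= T_j. Only that subset's
# submodel (plus the integration model) then needs retraining.

def route_update(new_records, shard_boundaries):
    """Return (index j of the subset that absorbs the new data, T_update)."""
    t_update = sum(t for (_x, _y, _u, t) in new_records) / len(new_records)
    # first boundary T_j with T_update <= T_j; clamp to the last subset
    j = bisect_left(shard_boundaries, t_update)
    return min(j, len(shard_boundaries) - 1), t_update

# user 7 uploads three records with available times 30, 34, 38
records = [(0.1, 1, 7, 30), (0.2, 0, 7, 34), (0.3, 1, 7, 38)]
j, t = route_update(records, shard_boundaries=[10, 20, 40, 60])
print(j, t)  # subset index 2 (T_2 = 40), T_update = 34.0
```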
Fig. 7 and fig. 8 are schematic diagrams of the data deletion process and the model structure update of the online machine learning service according to the present invention. A data deletion request mainly involves interaction between a user and the server: the user uploads his identity identifier u, requests the server to delete all data under that identifier, and requests that the training contribution of this data be removed from the machine learning model. After the server receives the request, the specific processing procedure is as follows:
step 501, after receiving the identity identifier, the server screens out from the user data set all user data with that identifier, recorded as D_remove = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k, u_i = u}. As in fig. 8, user 4 requests deletion of his data, so the server searches for all data related to user 4;
step 502, the server locates the data subset containing this data. Since data with the same user identity identifier was placed into the same data subset during partitioning, the server can determine the unique data subset containing the user's data, recorded as D_j^s. In the figure, subset 2 contains the relevant data;
step 503, the server removes this data from D_j^s, i.e. D_j^s ← D_j^s \ D_remove, and likewise from the overall user data set, i.e. D_u ← D_u \ D_remove; the original submodel and integration model are then discarded. As in fig. 8, the relevant data of user 4 is removed from subset 2 as well as from the overall user data set, so that the training data of the model no longer contains the deleted records;
step 504, the server retrains the submodel using the reduced data subset D_j^s, and then saves and updates the submodel parameters. For a data deletion request, the server retrains the relevant model from scratch, so as to guarantee that the retrained model contains no information related to the deleted data, thereby achieving complete machine forgetting;
in step 505, the server retrains the integration model using the reduced user data set D_u, and then saves and updates the parameters of the integration model;
step 506, the data deletion process ends. As shown in fig. 8, the models to be retrained comprise one submodel and the integration model, and compared with retraining all models, the data deletion cost required by the machine learning scheme of the present invention is only about 1/m of the cost of retraining all models, where m denotes the number of submodels.
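The deletion path of steps 501–506 rests on the invariant that each user's records live in exactly one data subset, so only one submodel must be retrained. A minimal sketch (the record layout and the `delete_user` helper are illustrative):

```python
# Sketch of the deletion path: remove every record of one user and report
# which data subsets changed. Under the patent's partitioning invariant,
# at most one subset index is returned, so only that submodel (plus the
# integration model) is retrained instead of all m submodels.

def delete_user(user_id, shards):
    """Remove every record of `user_id`; return indices of subsets to retrain."""
    dirty = []
    for j, shard in enumerate(shards):
        kept = [r for r in shard if r[2] != user_id]   # r = (x, y, u, t)
        if len(kept) != len(shard):
            shards[j] = kept
            dirty.append(j)
    return dirty

shards = [
    [(0.1, 1, 1, 5), (0.2, 0, 2, 8)],                     # subset 0
    [(0.3, 1, 4, 12), (0.4, 0, 4, 15), (0.5, 1, 3, 14)],  # subset 1: user 4
    [(0.6, 0, 5, 22)],                                    # subset 2
]
dirty = delete_user(4, shards)
print(dirty)  # [1] -- only subset 1 needs retraining
```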
Furthermore, because the data are sorted by average available time during partitioning, user data that are more likely to be forgotten are gathered together, and deletions of data with adjacent times can be executed jointly. This reduces the number of model retrainings required and thus further lowers the retraining overhead of the model.
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make modifications without departing from the principle of the invention, and such modifications should also be considered to fall within the protection scope of the invention.

Claims (9)

1. A method for implementing a machine learning model supporting dynamic addition and deletion of user data, characterized by comprising the following steps:
step 1, collecting data at the server side; the collected data comprises an auxiliary data set, consisting of related public data sets and data collected by the server in advance, together with the user data uploaded by users;
step 2, the server divides the user data set to obtain a series of mutually independent data subsets, each piece of user data being stored in exactly one data subset;
step 3, the server trains a submodel on each data subset obtained by the division, and the trained submodels perform joint inference through an integration model;
step 4, the server responds to a data update request of a user to realize incremental learning of the newly added user data; the data update request comprises the user interacting with the server, uploading his private data, and requesting the server to learn the newly uploaded data;
step 5, the server responds to a data deletion request of a user to realize machine forgetting of existing user data; the data deletion request comprises the user interacting with the server, uploading his identity identifier, requesting the server to delete all data under that identifier, and requesting that the training contribution of this data be removed from the machine learning model.
2. The method of claim 1 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that: in step 1, the user data comprises the users' original data samples, the data label of each sample, the user identity identifier of each sample, and the available time of each sample;
the original data samples of the users: each user has a plurality of data samples, and a single data sample is denoted x;
the data labels of the samples: each sample has a corresponding data label indicating its specific data category, denoted y;
the identity identifiers of the users: data from the same user share the same identity identifier, denoted u;
the available time of the data: designated by the user, denoted t.
3. The method of claim 1 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that: in step 2, the user data set consists of multiple pieces of user data; a user data set containing n pieces of data is expressed as D_u = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ n}.
4. The method of claim 3 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that: the data partitioning process of step 2 comprises splitting the user data set according to the users' identity attributes and calculating the average available time of each user's data; the single-user data sets are then sorted and divided in order into a series of mutually independent data subsets; each data subset contains data from multiple users, and different data subsets are independent of each other.
5. The method of claim 4 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that step 2 specifically comprises the following steps:
step 201, the user data set D_u is divided into a plurality of single-user data sets according to the identity attribute of the data, wherein each single-user data set contains exactly the data from one specific user; the data set of the jth user can be expressed as D_j^u = {(x_i, y_i, u_i, t_i) ∈ D_u | u_i = u_j};
step 202, for each single-user data set, the average available time of the data therein is calculated, T_j = (1/|D_j^u|) Σ t_i, the sum running over the samples in D_j^u;
step 203, the single-user data sets are sorted by their average available times using a sorting algorithm, giving an ordered sequence of single-user data sets D_(1)^u, …, D_(p)^u, where p is the number of users, such that T_(i) ≤ T_(j) for any two sets with i ≤ j;
step 204, the user data set is divided in the same order into m subsets, each subset containing the data of ⌈p/m⌉ users, denoted D_j^s = D_((j−1)⌈p/m⌉+1)^u ∪ … ∪ D_(j⌈p/m⌉)^u, wherein 1 ≤ j ≤ m; this completes the division of the user data set.
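Steps 201–204 above can be sketched as follows. The record layout (x, y, u, t) follows the claims; the `partition` helper and the toy data are illustrative:

```python
from collections import defaultdict
from math import ceil

# Sketch of the partitioning: split records by user id (step 201), compute
# each user's average available time (step 202), sort users by it
# (step 203), and cut the ordered users into m subsets of ceil(p/m)
# users each (step 204).

def partition(records, m):
    """records: list of (x, y, u, t) tuples; returns m subsets of records."""
    per_user = defaultdict(list)                      # step 201
    for r in records:
        per_user[r[2]].append(r)
    def avg_time(user):                               # step 202
        data = per_user[user]
        return sum(r[3] for r in data) / len(data)
    users = sorted(per_user, key=avg_time)            # step 203
    per_shard = ceil(len(users) / m)                  # step 204
    return [
        [r for u in users[i * per_shard:(i + 1) * per_shard] for r in per_user[u]]
        for i in range(m)
    ]

# user 1 has avg time 32, user 2 has 5, user 3 has 18
records = [(0.1, 1, 1, 30), (0.2, 0, 1, 34), (0.3, 1, 2, 5), (0.4, 0, 3, 18)]
shards = partition(records, m=2)
print([len(s) for s in shards])  # users 2 and 3 land together in subset 0
```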
6. The method of claim 1 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that: the model training process of step 3 comprises the training of the submodels and the training of the integration model; the training of a submodel comprises running a machine learning training algorithm on a data subset together with the auxiliary data set, the algorithm being executed on each data subset to obtain the corresponding submodel; the training of the integration model comprises using a machine learning algorithm, together with all submodels and all data sets, to train the integration model.
7. The method of claim 6 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that step 3 specifically comprises the following steps:
step 301, a submodel list is initialized for storing the parameters of each submodel;
step 302, the basic structure of submodel training and the corresponding hyper-parameters are set, including the number of training rounds N, the learning rate lr of the model, and the loss function L of the model;
step 303, for the jth model, a single set of model parameters is initialized, denoted W_j;
step 304, the jth data subset D_j^s obtained by the division and the auxiliary data set D_o saved by the server are used as the training set of the current model;
step 305, according to the hyper-parameters of the model, the model is trained for N rounds with the corresponding learning rate lr, and the parameters of the model are updated with the loss function in each training round, i.e. W_j ← W_j − lr·∇L(W_j(x), y);
step 306, steps 303 to 305 are repeated for j = 1 to j = m, and each corresponding W_j is saved into the submodel list, yielding the submodels corresponding to all subsets;
step 307, the basic structure of integration-model training and the corresponding hyper-parameters are set, including the number of training rounds N, the learning rate lr of the model, and the loss function L of the model;
step 308, the parameters of the integration model are initialized, denoted E;
step 309, the full user data set D_u together with the auxiliary data set D_o is used as the original training set of the integration model;
step 310, according to the hyper-parameters of the integration model, the integration model is trained for N rounds with the corresponding learning rate lr, and the parameters of the model are updated with the loss function in each training round, i.e. E ← E − lr·∇L(E(W_1(x), …, W_m(x)), y), where W_1(x), …, W_m(x) denote the outputs of the submodels on the input data x;
step 311, the parameters of the integration model are saved, and the training process of the model ends;
step 312, the submodels and the integration model are used to perform inference on unknown data; the inference process of the model can be expressed as E(W_1(x), …, W_m(x)): all submodels evaluate the input data, all of their outputs are fed into the integration model for aggregation, and the integration model produces the final inference result.
8. The method of claim 7 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that: the model updating process of step 4 comprises the server responding to a user's data update request and receiving the data samples uploaded by the user; the server then updates the parameters of the relevant submodel and of the integration model with a machine learning algorithm to complete the learning of the newly added data; the data update request comprises the user interacting with the server and uploading his k pieces of private data, so that the uploaded data set is D_update = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}; the specific processing steps are as follows:
step 401, after receiving the data set, the server merges it into the existing user data set, i.e. D_u ← D_u ∪ D_update, and calculates the average available time of the newly added data, T_update = (1/k) Σ t_i, the sum running over the k new samples;
step 402, according to the average available time T_update, the new data is added to the jth data subset, i.e. D_j^s ← D_j^s ∪ D_update with T_{j−1} < T_update ≤ T_j, so that the average available time of the chosen data subset is as close as possible to that of the newly added data;
step 403, step 305 is repeated with the updated data subset D_j^s to train the submodel so that it learns the newly added data, after which the parameters of the newly trained submodel are saved;
step 404, step 310 is repeated with the updated data set D_u and the updated submodels to train the integration model so that it learns the new data, after which the parameters of the newly trained integration model are saved;
step 405, the data update process ends; the inference process of the model is consistent with the aforementioned step 312.
9. The method of claim 7 for implementing a machine learning model that supports dynamic addition and deletion of user data, characterized in that: the model updating process of step 5 comprises the server responding to a user's data deletion request and receiving the identity identifier uploaded by the user; the server then retrains the parameters of the relevant submodel and of the integration model with a machine forgetting algorithm to complete the forgetting of the existing data; the specific steps are as follows:
step 501, after receiving the identity identifier, the server screens out from the user data set all user data with that identifier, recorded as D_remove = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k, u_i = u}, where u is the identity identifier of the user;
step 502, the server searches all data subsets for the data related to D_remove; since data with the same user identity identifier was placed into the same data subset during the partitioning of the model, the server determines the unique data subset containing the user's data, recorded as D_j^s;
step 503, the server removes this data from D_j^s, i.e. D_j^s ← D_j^s \ D_remove, and then discards the original submodel;
step 504, step 305 is repeated with the reduced data subset D_j^s to retrain the submodel so that it no longer contains any information about the removed data, after which the parameters of the newly trained submodel are saved;
step 505, the server removes D_remove from the full user data set, i.e. D_u ← D_u \ D_remove, and then discards the original integration model;
step 506, step 310 is repeated with the reduced user data set D_u to retrain the integration model so as to remove the information related to the user's data from it, after which the parameters of the integration model are saved;
step 507, the data deletion process ends; the inference process of the model is consistent with step 312.
CN202210353332.2A 2022-04-02 2022-04-02 Implementation method of machine learning model supporting dynamic addition and deletion of user data Pending CN114692894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210353332.2A CN114692894A (en) 2022-04-02 2022-04-02 Implementation method of machine learning model supporting dynamic addition and deletion of user data


Publications (1)

Publication Number Publication Date
CN114692894A true CN114692894A (en) 2022-07-01

Family

ID=82143670


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522007A (en) * 2023-07-05 2023-08-01 中国科学技术大学 Recommendation system model-oriented data forgetting learning method, device and medium
CN116522007B (en) * 2023-07-05 2023-10-20 中国科学技术大学 Recommendation system model-oriented data forgetting learning method, device and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination