CN114780997A - Data processing method, device, equipment and medium - Google Patents

Data processing method, device, equipment and medium

Info

Publication number
CN114780997A
Authority
CN
China
Prior art keywords
data
forgetting
forgetting probability
training
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210462556.7A
Other languages
Chinese (zh)
Inventor
江伊雯
刘圣龙
张舸
赵涛
吕艳丽
周鑫
夏雨潇
王衡
王迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Center Of State Grid Corp Of China
Original Assignee
Big Data Center Of State Grid Corp Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Center Of State Grid Corp Of China
Priority to CN202210462556.7A
Publication of CN114780997A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data processing method, a data processing device, data processing equipment and a data processing medium. The method comprises the following steps: in response to a data processing request instruction, determining the data segment where the target data is located; deleting the data segment where the target data is located, and searching for the last data segment before the data segment where the target data is located; and retraining a pre-created target training model starting from that last data segment to obtain a new target training model, wherein the pre-created target training model is obtained by training an original training model on the low forgetting probability data segments and high forgetting probability data segments obtained by dividing the original data set according to the forgetting probability. The embodiment solves the problem that the SISA model in the prior art cannot make full use of forgetting probability information, improves the prediction accuracy of the aggregated target training model, accelerates the retraining of the model, and improves the usability of the model.

Description

Data processing method, device, equipment and medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and medium.
Background
In the training process of a machine learning model, a large amount of data is used. The data may include a large amount of users' private data; for example, the data set for training a medical diagnosis model may contain the personal information of many past patients, and the data set of a recommendation system contains users' usage data on the Internet. The use of large amounts of data improves the training effect of machine learning models, but it also brings many security and privacy problems.
A data poisoning attack (data poisoning attack) feeds the model, during training, data carefully designed by an attacker, disturbing the real distribution of the data so that the model produces wrong outputs. Once a model suffers a poisoning attack, its accuracy decreases, and its outputs may even be biased toward the direction expected by the attacker. Internet companies both at home and abroad have suffered various data poisoning attacks aimed at machine learning models.
A membership inference attack (membership inference attack) refers to an attacker inferring the training set of a model by analyzing the published machine learning model. For example, an attacker may obtain private information about a target by determining whether the target is in the data set of a certain disease diagnosis model. Since the attack does not require the specific model structure and only needs to call a machine learning model interface provided by large Internet companies on the network, it poses great harm to privacy, and many users hope that the model trainer will delete their data from the data set and eliminate its influence on the trained model.
Against data poisoning attacks, many studies propose to improve the robustness of the model to resist poisoned training data. However, for users' requests to delete their data and remove the influence it has had, the previously used mechanisms, including differential privacy (differential privacy), have no complete implementation. Researchers have therefore begun to study machine forgetting learning to satisfy users' right to be forgotten (right to be forgotten).
The most straightforward way to implement machine forgetting is to remove the samples to be destroyed from the original training data set and then retrain the model from scratch. However, retraining from scratch creates a very high computational overhead when the data set is large and deletion requests occur frequently. According to the concept proposed by Cao et al., machine forgetting learning needs to meet two requirements: 1) complete forgetting: for a data destruction request sent by a user, the existing model forgets the data to obtain a new model, and this model should have the same prediction behavior as a model retrained from scratch; 2) timeliness: compared with retraining, forgetting learning needs to reduce the computational overhead, achieve redeployment of the system more quickly, and obtain higher availability.
Cao et al. propose to convert the learning algorithm into a summation form based on statistical query learning, decomposing the dependency between training data. To delete a data instance, the model owner simply removes the term corresponding to this instance from the summations that depend on it. However, the algorithm of Cao and Yang is not suitable for learning algorithms that cannot be converted into a summation form, such as neural networks.
Ginart et al. studied a machine forgetting technique for the k-means clustering algorithm; the core of the study lies in providing a learning algorithm with high data deletion efficiency. Specifically, the algorithm formulates data deletion as an online problem and gives a time-efficiency analysis of the optimal deletion efficiency. A data deletion operation for a learning algorithm A can be defined as a map R_A(D, A(D), i) that, from the data set D, the machine learning model A(D) and an index i ∈ {1, …, n}, derives a model in the hypothesis space. A data deletion operation is exact if, for all D and i, the random variables A(D_{-i}) and R_A(D, A(D), i) are equal in distribution, i.e. A(D_{-i}) =_d R_A(D, A(D), i), where D_{-i} denotes the data set with the i-th instance removed. However, this technique cannot be generalized to other machine learning models.
Therefore, Bourtoule et al. propose a more general algorithm for deep learning, SISA (Sharded, Isolated, Sliced and Aggregated). The main idea of SISA is to divide the training data into several disjoint shards, each of which trains a sub-model. To delete a particular instance, the model owner need only retrain the sub-model whose shard contains that instance. To further speed up the forgetting process, the authors propose to divide each shard into several slices and to store the intermediate model parameters as each slice updates the model. However, the SISA model cannot make good use of forgetting probability information when applied to a scenario in which the data forgetting probability is known, and the special sharding scheme that Bourtoule et al. propose for such a scenario reduces the prediction accuracy and the usability of the model.
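For intuition about the SISA layout just described, the following Python sketch shows how training data could be split into disjoint shards and ordered slices, and how deleting one example only determines which shard's sub-model must be retrained and from which slice onward. It is a minimal illustration based on the published description of SISA; the function names and the random sharding are placeholders, not part of the present filing.

import random

def split_even(seq, parts):
    # Split seq into `parts` contiguous pieces whose sizes differ by at most one.
    k, m = divmod(len(seq), parts)
    return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(parts)]

def shard_and_slice(dataset, num_shards, num_slices):
    # Disjoint shards, each further cut into ordered slices (SISA-style layout).
    items = list(dataset)
    random.shuffle(items)
    return [split_even(shard, num_slices) for shard in split_even(items, num_shards)]

def retrain_scope(layout, target):
    # When `target` is deleted, only the sub-model of its shard is retrained,
    # restarting from the slice that contained the deleted point.
    for s, shard in enumerate(layout):
        for r, piece in enumerate(shard):
            if target in piece:
                return s, r
    raise ValueError("target not found")

data = list(range(1000))                      # toy stand-in for training samples
layout = shard_and_slice(data, num_shards=5, num_slices=4)
shard_idx, slice_idx = retrain_scope(layout, target=42)
print(f"retrain shard {shard_idx} from slice {slice_idx} onward")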
Disclosure of Invention
The invention provides a data processing method, a data processing device, data processing equipment and a data processing medium, which are used for solving the problem that a SISA model in the prior art cannot fully utilize forgetting probability information, improving the prediction accuracy of the model and improving the usability of the model.
According to an aspect of the present invention, there is provided a data processing method including:
responding to a data processing request instruction, and determining a data segment where target data are located, wherein the target data are data to be subjected to data deleting operation in an original data set;
deleting the data segment where the target data is located, and searching the last data segment before the data segment where the target data is located;
and retraining a pre-established target training model from the last data segment to obtain a new target training model, wherein the pre-established target training model is obtained by training an original training model on low forgetting probability data segments and high forgetting probability data segments obtained by dividing the original data set according to forgetting probability, and the number of low forgetting probability data segments is equal to the number of high forgetting probability data segments.
According to another aspect of the present invention, there is provided a data processing apparatus comprising:
the determining module is used for responding to a data processing request instruction and determining a data segment where target data are located, wherein the target data are data to be subjected to data deleting operation in an original data set;
the processing module is used for executing deletion operation on the data segment where the target data is located and searching the last data segment before the data segment where the target data is located;
and the retraining module is used for retraining a pre-established target training model from the last data segment to obtain a new target training model, wherein the pre-established target training model is obtained by training an original training model according to a low forgetting probability data segment and a high forgetting probability data segment which are obtained by dividing according to forgetting probability, and the number of the low forgetting probability data segment and the number of the high forgetting probability data segment are equal.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the data processing method according to any of the embodiments of the invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a data processing method according to any one of the embodiments of the present invention when the computer instructions are executed.
In the technical solution of the embodiment of the invention, the original data are divided into low forgetting probability data segments and high forgetting probability data segments according to forgetting probability, and low forgetting probability data segments and high forgetting probability data segments with approximately the same total data quantity are used to train the original training model so as to obtain a target training model with higher prediction accuracy. In this way, when a data processing request instruction is received, the data segment where the target data is located is directly deleted, and the pre-established target training model is retrained starting from the last data segment before the data segment where the target data is located. This solves the problem that the SISA model in the prior art cannot fully utilize forgetting probability information, improves the prediction accuracy of the aggregated target training model, accelerates the retraining of the model, and improves the usability of the model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention;
fig. 3 is a flowchart of a data processing method according to a third embodiment of the present invention;
fig. 4 is a flowchart of a study on fine-grained data destruction by using a machine learning model according to an embodiment of the present invention;
fig. 5 is a flowchart of a fine-grained data processing algorithm according to an embodiment of the present invention;
fig. 6 shows the prediction accuracy of two models corresponding to the Purchase data set under different slicing conditions according to the embodiment of the present invention;
fig. 7 shows the prediction accuracy of two models corresponding to SVHN datasets under different slicing conditions according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating the number of data points affected by a data delete stage according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a data processing apparatus according to a fourth embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "original", "intermediate", "object", and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the relevant concepts involved in the embodiments of the present invention are first explained, as follows:
Randomness in machine learning: the goal of a machine learning algorithm is to learn a mapping from the sample space X to the label space Y. According to PAC learning theory, if a mapping c satisfies c(x) = y for any sample (x, y), then c is called the target concept. The learning algorithm A does not know the target concept, but describes the data distribution by accessing a known data set D. The set of all possible concepts considered by the learning algorithm A is called the hypothesis space (hypothesis space) H, and any element h ∈ H of the hypothesis space is called a hypothesis (hypothesis). For a given data set D, the learning algorithm A obtains, by solving an objective function, a hypothesis h that is as close as possible to the target concept c.
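For reference, the "solving the objective function" step above is commonly written as empirical risk minimization. The filing does not state the objective explicitly, so the following is a standard textbook formulation rather than the patent's own notation: given a data set D = {(x_i, y_i)}_{i=1}^n and a loss function ℓ, the learning algorithm A selects

h^{*} \in \operatorname*{arg\,min}_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(h(x_i), y_i\bigr),

optionally with a regularization term added to the sum; the randomness discussed below concerns how this minimization is carried out in practice.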
In the solution process of the learning algorithm, the randomness of the result mainly comes from two aspects: the randomness of the training process and the randomness of the learning algorithm.
Randomness of the training process: on a given data set D, it is often necessary first to randomly draw small batches of data from the data set, and the order in which data are drawn differs between training runs. Furthermore, training is typically parallel and without explicit synchronization, which means that the random data-access order of the parallel training processes makes training non-deterministic.
Randomness of the learning algorithm: intuitively, the goal of the learning algorithm is to find the optimal hypothesis h within the hypothesis space H. This hypothesis is typically defined by a set of fixed parameter weights of the learning model. PAC learning theory holds that the hypothesis h that is as close as possible to the target concept c is one of many hypotheses that minimize the empirical risk. However, a commonly used optimization method such as stochastic gradient descent can, for a non-convex loss function, only converge to one of many local minima. Together with the randomness involved in training, it is very challenging to obtain the same final hypothesis h on the same data set D with the same learning algorithm.
Due to the randomness in machine learning, it is difficult to quantify the effect of certain data points on the model, and it is also difficult to remove their influence from the final model.
Target of forgetting learning: although the target of forgetting learning is very clear, namely removing the influence of certain data in the data set from the existing model, the randomness of machine learning algorithms and other factors make the forgetting task far from simple to implement. The existing forgetting learning methods all have various problems; an ideal forgetting learning method should meet the following requirements:
understandability: since the underlying retraining method is well understood and implemented, any forgetting learning algorithm needs to be easily understood and should be simple to apply and correct for non-experts.
Availability: if a large amount of data or relatively representative data points are deleted, the accuracy of the model will understandably decrease; even retraining the model from scratch leads to reduced accuracy when data are destroyed. A good forgetting learning method should be able to keep this drop at a level similar to that of the retrained model.
Provable destruction: just like retraining, forgetting learning should be able to prove that the deleted data points no longer have an effect on the model parameters, and such a proof should be concise and should not require expert assistance.
Applicability: for an excellent forgetting learning method, any machine learning model should be available regardless of the complexity or other nature of the model.
Reducing forgetting time: forgetting learning methods should be faster than retraining under any circumstances.
No additional overhead is introduced: any usable forgetting learning method should not introduce additional computational overhead into the already complex training process of the original model.
The invention provides a data processing method based on a novel machine forgetting learning approach, so as to improve the prediction accuracy of the model and achieve a trade-off between forgetting speed and model accuracy in the forgetting process. By analyzing the known forgetting probabilities of the data, the original data set is divided into data segments with low forgetting probability and data segments with high forgetting probability, and a learning model is established for each of the two types of data. Since existing research shows that transfer learning can reduce the accuracy loss caused by machine forgetting algorithms, the method first trains a SISA-style model on the low forgetting probability data, then adjusts it on the basis of the existing model using the high forgetting probability data in a manner similar to transfer learning, and finally obtains the target training model by aggregation.
Example one
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention, where the embodiment is applicable to a case where data deletion is performed when a probability of a data deletion request is known, and the method may be executed by a data processing apparatus, where the data processing apparatus may be implemented in a form of hardware and/or software, and the data processing apparatus may be configured in a terminal device. Illustratively, the terminal device may be a computer, an iPad or other terminal with a data processing function. As shown in fig. 1, the method includes: S110-S130.
And S110, responding to the data processing request instruction, and determining the data segment where the target data is located.
The target data is data to be subjected to data deleting operation in the original data set. The data processing request instruction refers to an instruction for deleting part of data in the original data set. The instruction for deleting part of the data in the original data set can be understood as an instruction for forgetting or destroying part of the data in the original data set. In an embodiment, in the case where an instruction for a data deletion operation is received, a data segment in which data to be subjected to the data deletion operation is located is determined. The data segment refers to a segment divided in advance according to the forgetting probability of the data. In an embodiment, the data fragment comprises: a low forgetting probability data segment and a high forgetting probability data segment. It can be understood that the data segment in which the target data is located may be a data segment with a low forgetting probability or a data segment with a high forgetting probability. Of course, to facilitate deletion of the target data and subsequent retraining of the model, the target data can only belong to one data segment.
And S120, deleting the data segment where the target data is located, and searching the last data segment before the data segment where the target data is located.
In the embodiment, after the data segment where the target data is located is determined, the data segment where the target data is located is directly deleted from the original data set, and the last data segment before it is searched for. Illustratively, assuming that the original data set is divided into 10 data segments, i.e., data segment 1, data segment 2, data segment 3, …, data segment 9 and data segment 10, and the data segment in which the target data is located is data segment 3, then the last data segment before the data segment in which the target data is located is data segment 2.
And S130, retraining the pre-established target training model from the last data segment to obtain a new target training model.
The pre-created target training model is obtained by training an original training model on low forgetting probability data segments and high forgetting probability data segments obtained by dividing the original data set according to the forgetting probability, wherein the number of low forgetting probability data segments is equal to the number of high forgetting probability data segments.
In the embodiment, the process of retraining the pre-created target training model may refer to a process of training an original training model to obtain a target training model. It should be noted that the process of training the original training model to obtain the target training model includes: dividing an original data set according to forgetting probability to obtain a plurality of corresponding low forgetting probability data fragments and a plurality of corresponding high forgetting probability data fragments, then training an original training model by sequentially adopting one low forgetting probability data fragment and one high forgetting probability data fragment to obtain a corresponding intermediate isolation training model, and then performing aggregation processing on all intermediate isolation training models to obtain a corresponding target training model.
In an embodiment, after the data segment in which the target data is located is deleted from the original data set, the pre-created target training model is retrained from the last data segment to obtain a new target training model. Using the example in S120, the retraining process is as follows: the pre-created target training model is retrained from data segment 2. Specifically, a model is trained using data segment 1, data segment 2 and data segment 4, so as to obtain a corresponding intermediate isolation training model (denoted as model 1); then data segment 1, data segment 2, data segment 4 and data segment 5 are used to train model 1 to obtain a corresponding intermediate isolation training model (denoted as model 2), and so on until the incremental training of the model is completed and a plurality of corresponding intermediate isolation training models are obtained; the plurality of intermediate isolation training models are then aggregated to obtain a new target training model.
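A compact sketch of this delete-then-retrain flow is given below. It is illustrative Python only: the segment list, the checkpoint store and the train_increment routine are hypothetical stand-ins, since the filing does not prescribe a concrete implementation.

def delete_and_retrain(segments, checkpoints, target, train_increment):
    # segments: ordered list of data segments (each a list of samples).
    # checkpoints: checkpoints[i] holds the model saved after training step i.
    # train_increment(model, data): returns the model after incremental training on data.
    idx = next(i for i, seg in enumerate(segments) if target in seg)   # locate target
    del segments[idx]                                                  # drop its whole segment
    # Resume from the checkpoint of the last segment before the deleted one
    # (or from scratch if the deleted segment was the first).
    model = checkpoints[idx - 1] if idx > 0 else None
    for j in range(idx, len(segments)):
        model = train_increment(model, segments[j])
        checkpoints[j] = model                                         # refresh stored parameters
    del checkpoints[len(segments):]                                    # discard stale checkpoints
    return model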
In the technical solution of this embodiment, the original data are divided into low forgetting probability data segments and high forgetting probability data segments according to forgetting probability, and low forgetting probability data segments and high forgetting probability data segments with approximately the same total data quantity are used to train the original training model so as to obtain a target training model with higher prediction accuracy. Thus, when a data processing request instruction is received, the data segment where the target data is located can be directly deleted and the pre-established target training model can be retrained starting from the last data segment before the data segment where the target data is located, which solves the problem that the SISA model in the prior art cannot make full use of forgetting probability information, improves the prediction accuracy of the aggregated target training model, accelerates the retraining of the model, and improves the usability of the model.
Example two
Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention, where this embodiment is based on the above-mentioned second embodiment, and further refines the partitioning process of the original data set, the retraining process of the original training model, and the aggregation processing process. As shown in fig. 2, the method includes:
s210, dividing the original data set to obtain corresponding data fragments with low forgetting probability and data fragments with high forgetting probability.
Wherein the number of the low forgetting probability data segment and the high forgetting probability data segment is equal. It can be understood that the number of low forgetting probability data fragments and the number of high forgetting probability data fragments contained in the original data set are the same, and the intersection of the low forgetting probability data fragments and the high forgetting probability data fragments is empty, and the union of the low forgetting probability data fragments and the high forgetting probability data fragments is the original data set. It should be noted that the data amount contained between each low forgetting probability data segment is the same, and the data amount contained between each high forgetting probability data segment is the same, but the data amounts contained between the low forgetting probability data segment and the high forgetting probability data segment are not equivalent, and generally, the data amount contained in each low forgetting probability data segment is larger than the data amount contained in the high forgetting probability data segment.
In the embodiment, the stored data information is used to locate the data segment where the data needing to be forgotten resides, the last data segment before that segment is searched for, and retraining of the model starts from that data segment.
S220, training the original training model by sequentially adopting the low forgetting probability data fragment and the high forgetting probability data fragment to obtain a corresponding intermediate isolation training model.
The total number of the middle isolation training models is equal to the number of the low forgetting probability data fragments and the number of the high forgetting probability data fragments, and the middle isolation training models, the low forgetting probability data fragments and the high forgetting probability data fragments are in one-to-one correspondence.
In an embodiment, to ensure isolation between models, a single model is trained on only one piece of data. It can be understood that a low forgetting probability data segment and a high forgetting probability data segment are respectively adopted to retrain the original training model, so as to obtain a corresponding intermediate isolation training model. Illustratively, the original training model is pre-trained by using a first low forgetting probability data segment to obtain a corresponding pre-processing isolation training model (marked as a first pre-processing isolation training model), and then the first pre-processing isolation training model is finely adjusted based on the first high forgetting probability data segment to obtain a corresponding intermediate isolation training model (marked as a first intermediate isolation training model). Similarly, on the basis of the first preprocessing isolation training model, the first preprocessing isolation training model is trained by adopting a union set of the first low forgetting probability data segment and the second low forgetting probability data segment to obtain a corresponding preprocessing isolation training model (marked as a second preprocessing isolation training model), and then the second preprocessing isolation training model is finely adjusted on the basis of the second high forgetting probability data segment to obtain a corresponding intermediate isolation training model (marked as a second intermediate isolation training model). And repeating the steps until the training of the original training model by adopting all the data fragments with the low forgetting probability and the data fragments with the high forgetting probability is completed.
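The alternating pre-train/fine-tune loop just described can be outlined as follows. This is an illustrative Python sketch; pretrain_increment and finetune are hypothetical stand-ins for the incremental training and fine-tuning procedures detailed in the later embodiments.

def train_hmu(low_segments, high_segments, pretrain_increment, finetune):
    # low_segments[i] / high_segments[i]: the i-th low / high forgetting probability segment.
    # Returns the mutually isolated intermediate isolation training models.
    assert len(low_segments) == len(high_segments)
    intermediate_models, seen_low, pre_model = [], [], None
    for low_seg, high_seg in zip(low_segments, high_segments):
        seen_low = seen_low + low_seg
        # i-th pre-processing isolation training model: continue training the previous
        # one on the union of the low-probability segments seen so far.
        pre_model = pretrain_increment(pre_model, seen_low)
        # i-th intermediate isolation training model: fine-tune it on the i-th
        # high-probability segment (conceptually on a copy, so the pre-trained chain
        # carried into the next iteration is not disturbed).
        intermediate_models.append(finetune(pre_model, high_seg))
    return intermediate_models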
And S230, performing aggregation processing on all the intermediate isolation training models by adopting a preset aggregation algorithm to obtain corresponding target training models, and predicting target data by adopting the target training models.
The preset aggregation algorithm refers to an algorithm for aggregating all the intermediate isolation training models to obtain a final target training model. Illustratively, the preset aggregation algorithm may include: majority voting; simple averaging method. In the embodiment, a preset aggregation algorithm is adopted to aggregate a plurality of intermediate isolation training models to obtain a corresponding target training model.
And S240, responding to the data processing request instruction, and determining the data segment where the target data is located.
The target data is data to be subjected to data deleting operation.
And S250, deleting the data segment where the target data is located, and searching the last data segment before the data segment where the target data is located.
And S260, retraining the pre-created target training model from the last data segment to obtain a new target training model.
The target training model is obtained by training an original training model on low forgetting probability data segments and high forgetting probability data segments obtained by dividing the original data set according to forgetting probability, wherein the number of low forgetting probability data segments is equal to the number of high forgetting probability data segments.
In the technical solution of this embodiment, on the basis of the foregoing embodiment, the original data set is analyzed by data source to obtain the forgetting probability corresponding to each item of original data, and the original data are divided into low forgetting probability data segments and high forgetting probability data segments according to the forgetting probability. The original model is trained with the low forgetting probability data segments and the high forgetting probability data segments to obtain a plurality of intermediate isolation training models with similar learning ability, and these intermediate isolation training models are aggregated with the preset aggregation algorithm to obtain the corresponding target training model.
EXAMPLE III
Fig. 3 is a flowchart of a data processing method according to a third embodiment of the present invention, and this embodiment further refines the partitioning process of the original data set, the retraining process of the original training model, and the aggregation processing process based on the foregoing embodiments. As shown in fig. 3, the method includes:
s310, dividing the original data set into a low forgetting probability data set and a high forgetting probability data set according to the forgetting probability of the original data in the original data set.
The union of the low forgetting probability data set and the high forgetting probability data set is the original data set. Exemplarily, assume that the original data set is denoted as D. The original data set D may be partitioned into a low forgetting probability data set D_L and a high forgetting probability data set D_H, such that D_L ∪ D_H = D and, at the same time, D_L ∩ D_H = ∅.
In one embodiment, S310 includes S3101-S3102:
s3101, sequencing all original data in the original data set in an ascending order according to the forgetting probability of the original data.
In an embodiment, assuming that the forgetting probabilities of all the original data in the original data set are known, all the original data are sorted from small to large according to the forgetting probabilities to obtain all the data from small to large according to the forgetting probabilities.
S3102, dividing the original data according to the data proportion or the forgetting probability sum proportion to obtain a low forgetting probability data set and a high forgetting probability data set.
The data proportion refers to the proportion between the amount of data belonging to the low forgetting probability data and that belonging to the high forgetting probability data in the original data set. Illustratively, assuming a data ratio of 4:1, the total amount of low forgetting probability data accounts for 80% of all the original data in the original data set, while the total amount of high forgetting probability data accounts for 20%. The forgetting probability sum ratio refers to the ratio between the sum of the forgetting probabilities of all data in the low forgetting probability data set and the sum of the forgetting probabilities of all data in the high forgetting probability data set. Illustratively, assuming that the sum of the forgetting probabilities of all data in the low forgetting probability data set is 0.8, the sum of the forgetting probabilities of all data in the high forgetting probability data set is 0.6, and the amount of data contained in the high forgetting probability data set is half of the amount of data contained in the low forgetting probability data set, the ratio between the two is 8:3.
In the embodiment, the original data set is divided according to the data proportion or the forgetting probability sum proportion to obtain a low forgetting probability data set and a high forgetting probability data set.
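As an illustrative sketch of this layering step (hypothetical Python; the 0.8 fraction below merely mirrors the 4:1 data-proportion example above), the original data can be sorted by forgetting probability and cut at either a data-count ratio or a probability-sum ratio:

def split_by_forgetting_probability(samples, probs, low_fraction=0.8, by="count"):
    # Sort samples by ascending forgetting probability, then split them into a
    # low-probability layer and a high-probability layer.
    #   by="count":    the low layer holds `low_fraction` of the samples.
    #   by="prob_sum": the low layer holds `low_fraction` of the total probability mass.
    order = sorted(range(len(samples)), key=lambda i: probs[i])
    if by == "count":
        cut = int(len(order) * low_fraction)
    else:
        total, running, cut = sum(probs), 0.0, len(order)
        for k, i in enumerate(order):
            running += probs[i]
            if running >= low_fraction * total:
                cut = k + 1
                break
    low = [samples[i] for i in order[:cut]]
    high = [samples[i] for i in order[cut:]]
    return low, high

# Toy usage: ten samples with assumed deletion-request probabilities.
samples = list("abcdefghij")
probs = [0.01, 0.02, 0.02, 0.03, 0.05, 0.05, 0.10, 0.20, 0.25, 0.27]
low_set, high_set = split_by_forgetting_probability(samples, probs, 0.8, by="count")
print(low_set, high_set)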
And S320, respectively carrying out equal-slice division on the low forgetting probability data set and the high forgetting probability data set to obtain corresponding low forgetting probability data slices and high forgetting probability data slices.
Each low forgetting probability data segment contains an equal amount of data, and each high forgetting probability data segment contains an equal amount of data. It is understood that all low forgetting probability data segments contain the same amount of data, and all high forgetting probability data segments contain the same amount of data. In an embodiment, in order to limit the influence of a data deletion to a small range, the divided data sets are further sliced. Illustratively, the sorted low forgetting probability data and high forgetting probability data are each divided into S data segments of the same size. The low forgetting probability data set D_L is divided into segments D_L1, D_L2, …, D_LS, where for any two segments D_Li and D_Lj (i ≠ j), D_Li ∩ D_Lj = ∅ and D_L1 ∪ D_L2 ∪ … ∪ D_LS = D_L. Similarly, the same division is made for the high forgetting probability data set D_H. By dividing the original data into a plurality of disjoint data segments and training multiple groups of data segments in parallel during training, the training time can be reduced.
S330, pre-training the original training model based on the low forgetting probability data fragments by adopting an incremental training mode to obtain a pre-processing isolation training model corresponding to each low forgetting probability data fragment.
Note that a data set D can be uniformly divided into R disjoint data segments D_1, D_2, …, D_R satisfying D_1 ∪ D_2 ∪ … ∪ D_R = D and D_i ∩ D_j = ∅ for i ≠ j. The incremental training process on D comprises the following steps: first, the target model is imported and its parameters are randomly initialized, and the first pre-divided data segment D_1 is used to train the model, obtaining a model M_1 whose parameters are stored; then, on the basis of M_1, D_1 ∪ D_2 is used to train model M_1, obtaining a model M_2 whose parameters are stored; and so on, until in the R-th step D_1 ∪ D_2 ∪ … ∪ D_R is used to train model M_{R-1}, obtaining a model M_R, which is stored as the output model of this stage.
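A compact sketch of this incremental training procedure with parameter storage might look as follows (illustrative Python; train_on is a hypothetical training routine, and the checkpoint list simply stands for the stored parameters mentioned above):

def incremental_train(segments, train_on, init_model=None):
    # segments: the R disjoint data segments D_1, ..., D_R (each a list of samples).
    # train_on(model, data): trains `model` on `data` and returns the updated model.
    # Returns (final_model, checkpoints); checkpoints[i-1] corresponds to M_i above,
    # so a later retraining can resume from any stored step.
    checkpoints, model, seen = [], init_model, []
    for segment in segments:
        seen = seen + segment            # D_1 ∪ ... ∪ D_i, as in the step description
        model = train_on(model, seen)    # obtain M_i on the accumulated data
        checkpoints.append(model)        # store the parameters of M_i
    return model, checkpoints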
In the embodiment, for any low forgetting probability data segment D_Li, the corresponding pre-processing isolation training model M_Li can be obtained through incremental training, and the parameters of M_Li are saved and used for further fine tuning of the model.
S340, fixing parameters of partial layers in the preprocessing isolation training model according to the size relation of data volumes contained in the low forgetting probability data set and the high forgetting probability data set.
In an embodiment, parameters of partial layers in the preprocessed isolation training model can be fixed according to the magnitude relation of data volumes contained in the low forgetting probability data set and the high forgetting probability data set, so that the retraining time of the model is reduced, namely the retraining speed is increased.
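For instance, if the model were implemented in PyTorch (the filing does not name a framework, so the setup below is only an assumption), freezing the lower layers before fine-tuning on the high forgetting probability data could look like this:

import torch.nn as nn

# Hypothetical two-part network: model[0] plays the role of feature layers learned
# mainly from the large low-forgetting-probability layer, model[1] the head that is
# adjusted on the much smaller high-forgetting-probability layer.
model = nn.Sequential(
    nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU()),
    nn.Linear(128, 10),
)

# Freeze the feature layers so fine-tuning only updates the head; how many layers to
# freeze would depend on the relative sizes of the two data sets (an illustration,
# not a rule stated in the filing).
for param in model[0].parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters remain")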
S350, fine adjustment is carried out on the preprocessing isolation training model based on the high forgetting probability data fragments by adopting an increment training mode, and a middle isolation training model corresponding to each high forgetting probability data fragment is obtained.
The specific implementation of the incremental training may refer to the description in S330 and is not repeated here. In an embodiment, incremental training is used to adjust the model M_Li on the high forgetting probability data segment D_Hi; the adjusted model is denoted M_i, and the parameters are saved after each model modification.
In an embodiment, the raw data set is layered and then sliced in each layer, so that the total amount of data contained in each data segment is approximately the same, so that each data segment can obtain a learner with similar learning ability (i.e., an intermediate isolation training model).
And S360, performing aggregation processing on all the intermediate isolation training models by adopting a majority voting method or a simple averaging method to obtain corresponding target training models.
In an embodiment, R mutually isolated intermediate isolation training models can be obtained by performing fine tuning on the pre-processing isolation training model by using the high forgetting probability data segment. Because the data volumes contained in the data sets for training the R models are the same, the models have similar performance, and the combination of a plurality of learners can reduce the insufficient generalization performance of a single learner caused by misselection.
Common combination strategies include averaging, voting, and learning. In machine forgetting learning, the combination strategy needs to have the following characteristic: the combination strategy should not involve training data, otherwise in some cases the combination mechanism itself would have to be forgotten. Therefore, in this embodiment, a plurality of intermediate isolation training models may be combined by using a majority voting method (majority voting) or a simple averaging method (simple averaging) to obtain a corresponding target training model.
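Both combination strategies can be sketched without touching any training data, which is the property required above. The following Python fragment is illustrative only; representing each sub-model as a callable that returns per-class scores is an assumption, not the filing's notation.

from collections import Counter

def majority_vote(models, x):
    # Each intermediate isolation model casts one vote with its predicted class index.
    votes = Counter()
    for m in models:
        scores = m(x)
        votes[max(range(len(scores)), key=scores.__getitem__)] += 1
    return votes.most_common(1)[0][0]

def simple_average(models, x):
    # Average the per-class scores of all sub-models, then pick the best class.
    score_lists = [m(x) for m in models]
    avg = [sum(s) / len(models) for s in zip(*score_lists)]
    return max(range(len(avg)), key=avg.__getitem__)

# Toy usage: three 'models' that return fixed 3-class score vectors for any input.
models = [lambda x: [0.2, 0.5, 0.3], lambda x: [0.1, 0.7, 0.2], lambda x: [0.6, 0.3, 0.1]]
print(majority_vote(models, x=None), simple_average(models, x=None))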
And S370, responding to the data processing request instruction, and determining the data segment where the target data is located.
The target data is data to be subjected to data deleting operation in the original data set.
And S380, deleting the data segment where the target data is located, and searching the last data segment before the data segment where the target data is located.
And S390, retraining the pre-established target training model from the last data segment to obtain a new target training model.
The pre-established target training model is obtained by training an original training model on low forgetting probability data segments and high forgetting probability data segments obtained by dividing the original data set according to forgetting probability, wherein the number of low forgetting probability data segments is equal to the number of high forgetting probability data segments.
On the basis of the foregoing embodiments, the technical solution of this embodiment adjusts, within each data segment, the model trained on low forgetting probability data by using the high forgetting probability data, and, because the data segments are mutually isolated, this does not cause a great reduction in prediction accuracy. By adjusting the learning rate for the two layers of data, the model can be prevented from forgetting the features of the low forgetting probability data; by freezing certain parameters of the first-layer model, the retraining speed can be increased. In other words, the freezing technique is used to balance retraining time against model prediction accuracy, so that fine-grained and efficient data deletion can be performed in a scenario where the data forgetting probability is known.
In one embodiment, the analysis process of the retraining time is specifically as follows: the number of samples affected in the retraining process is in direct proportion to the retraining time, so that the influence of each parameter on the retraining time can be analyzed through the number of samples affected by retraining.
In the training process, the original data set is divided into the low forgetting probability data set D_L and the high forgetting probability data set D_H; accordingly, the retraining time overhead consists of two parts, corresponding to the forgotten data lying in D_L and to the forgotten data lying in D_H.
Assume there are K data processing requests (i.e., forgetting requests) in total. The probability that the i-th forgetting request falls in D_L is denoted P_L, and the probability that it falls in D_H is denoted P_H; according to the data set partitioning rule (partitioning by forgetting probability), P_L << P_H.
If the i-th forgetting request falls in D_L, an expectation can be written for the number of data point samples that need to be retrained; if the i-th forgetting request falls in D_H, an expected upper bound on the number of data point samples that need to be retrained can be written likewise (both expressions are given as equation images in the original filing).
Combining the two expressions yields an expected upper bound on the number of samples affected by each forgetting request for the overall model, which can be compared with the corresponding expected upper bound for the existing SISA model.
Owing to these relations, and as existing research further indicates, the hierarchical machine forgetting learning algorithm achieves a large improvement in data forgetting time compared with the SISA model.
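Because the conditional expectations above survive only as images in the source, the following is a hedged reconstruction of the skeleton of the argument rather than the filing's own formulas. With P_L and P_H as defined above, the per-request expectation decomposes as

E[\text{samples per request}]
  \;=\; P_L \cdot E[\text{samples} \mid \text{request falls in } D_L]
  \;+\; P_H \cdot E[\text{samples} \mid \text{request falls in } D_H],

where each conditional expectation corresponds to one of the expressions given as images in the filing. Since P_L \ll P_H while the high forgetting probability layer D_H is much smaller than D_L, the dominant second term involves retraining only within the small D_H layer, which is the qualitative source of the claimed speed-up over a SISA layout that ignores the forgetting probabilities.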
In one implementation, fig. 4 is a flowchart of a study on fine-grained data destruction with a machine learning model according to an embodiment of the present invention. As shown in fig. 4, the hierarchical forgetting learning algorithm comprises three parts: data grouping training and its optimization, data destruction improvement with distribution differentiation, and model isolation training and aggregation. The data grouping training and its optimization include a sensitivity measurement model and a data access pattern model, with two different models used for data grouping; the model isolation training and aggregation include the model training process and the model aggregation process (including the design of the voting mechanism and the setting of model weights).
Fig. 5 is a flowchart of a fine-grained data processing algorithm according to an embodiment of the present invention. As shown in fig. 5, assume the original data set is denoted as D. The original data set D may be partitioned into a low forgetting probability data set D_L and a high forgetting probability data set D_H. The low forgetting probability data set D_L is divided into segments D_L1, D_L2, …, D_LS, where any two segments D_Li and D_Lj (i ≠ j) satisfy D_Li ∩ D_Lj = ∅; the same division is made for the high forgetting probability data set D_H. For any low forgetting probability data segment D_Li, the corresponding model M_Li is obtained through incremental training and its parameters are stored; then, for the corresponding high forgetting probability data segment D_Hi, the model M_i is obtained through incremental training (fine-tuning M_Li) and its parameters are stored. S intermediate isolation training models are thus obtained, and these S intermediate isolation training models are aggregated to obtain and output the corresponding target training model.
In one embodiment, assuming that the information of the original data set is used as shown in table 1, in the Purchase data set, 600 items with the largest Purchase amount are selected as the category attributes.
Table 1: Data set information (the table contents are provided as an image in the original filing)
In the experiment, the same model structure as the SISA model was selected, and specific model information is shown in table 2, including: various deep neural networks having different numbers of hidden layers and different layer sizes.
Table 2: DNN model structures adopted (the table contents are provided as an image in the original filing)
For the three different data sets, two data forgetting probability distributions are assumed: one is an exponential distribution and the other is a Pareto distribution. The effect of the method of the invention was tested experimentally.
Fig. 6 shows the prediction accuracy of the two models corresponding to the Purchase data set under different slicing conditions according to an embodiment of the present invention; fig. 7 shows the prediction accuracy of the two models corresponding to the SVHN data set under different slicing conditions according to an embodiment of the present invention. As shown in fig. 6 and fig. 7, the hierarchical machine forgetting learning method provided in the embodiments of the present invention (i.e., the HMU training method) and the SISA training method are compared in terms of prediction accuracy on the Purchase data and the SVHN data under different numbers of data segments.
As shown in fig. 6 and 7, the comparison of prediction accuracy shows that when the number of data segments is 1, i.e., when the data are not segmented, the prediction accuracy of the SISA training method is higher, because the hierarchical machine forgetting learning method (i.e., the HMU training method) still layers the data according to forgetting probability even when the number of data segments is 1, which reduces the fitting ability of the model. When the number of data segments is greater than 1, the prediction accuracy of the SISA training method drops markedly in both groups of experiments, because the SISA training method trains some of the isolated models with a small amount of high forgetting probability data and thereby produces some weak learners; model aggregation then degrades the prediction accuracy of the overall model, and the drop becomes more obvious as the number of segments increases. In the model of this embodiment, the data sets of the isolated models have the same size and similar learning ability, so the accuracy drop caused by unbalanced learning ability is avoided.
Fig. 8 is a schematic diagram of the number of data points affected in the data deletion stage according to an embodiment of the present invention. As shown in fig. 8, since the experiment simulates the practical situation in which the amount of high forgetting probability data is small, the experiment is performed under a correspondingly small proportion of high forgetting probability data (the specific setting is given as an expression image in the original filing). Under this condition, compared with the SISA training method, the HMU training method provided by the embodiment of the invention affects fewer data points in the data deletion stage.
Example four
Fig. 9 is a schematic structural diagram of a data processing apparatus according to a fourth embodiment of the present invention. As shown in fig. 9, the apparatus includes: a determination module 910, a processing module 920, and a retraining module 930.
A determining module 910, configured to determine, in response to a data processing request instruction, a data segment in which target data is located, where the target data is data to be subjected to a data deletion operation in an original data set;
the processing module 920 is configured to execute a deletion operation on the data segment where the target data is located, and search for a last data segment before the data segment where the target data is located;
a retraining module 930, configured to retrain a pre-created target training model from the last data segment to obtain a new target training model, where the pre-created target training model is a model obtained by training an original training model with a low forgetting probability data segment and a high forgetting probability data segment that are obtained by dividing according to forgetting probability, and the number of the low forgetting probability data segment and the number of the high forgetting probability data segment are equal.
In an embodiment, before determining, in response to the data processing request instruction, the data segment in which the target data is located, the data processing apparatus further includes:
the dividing module is used for responding to a data processing request instruction and dividing an original data set to obtain a corresponding data segment with low forgetting probability and a data segment with high forgetting probability;
the training module is used for training the original training model by sequentially adopting the low forgetting probability data fragment and the high forgetting probability data fragment to obtain a corresponding intermediate isolation training model; the total number of the intermediate isolation training models is equal to the number of the low forgetting probability data fragments and the number of the high forgetting probability data fragments, and the intermediate isolation training models, the low forgetting probability data fragments and the high forgetting probability data fragments are in one-to-one correspondence;
and the aggregation processing module is used for aggregating all the intermediate isolation training models by adopting a preset aggregation algorithm to obtain corresponding target training models so as to predict target data by adopting the target training models.
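As a hedged overview only, the dividing, training and aggregation modules could be strung together roughly as below. The sketch assumes one plausible reading in which each intermediate isolation training model is pre-trained on one low forgetting probability data fragment and fine-tuned on the matching high forgetting probability data fragment; `make_base_model`, `train_on_fragment` and `aggregate` are hypothetical callables, not names defined by this embodiment.

```python
# Hypothetical end-to-end sketch, not the patented algorithm itself.
from typing import Any, Callable, List, Tuple

def build_target_model(low_fragments: List[Any],
                       high_fragments: List[Any],
                       make_base_model: Callable[[], Any],
                       train_on_fragment: Callable[[Any, Any], Any],
                       aggregate: Callable[[List[Any]], Any]) -> Tuple[Any, List[Any]]:
    """Pre-train each isolation model on a low forgetting probability fragment,
    fine-tune it on the matching high forgetting probability fragment, then aggregate."""
    assert len(low_fragments) == len(high_fragments)
    isolation_models = []
    for low, high in zip(low_fragments, high_fragments):
        model = train_on_fragment(make_base_model(), low)  # pre-training stage
        model = train_on_fragment(model, high)             # fine-tuning stage
        isolation_models.append(model)
    return aggregate(isolation_models), isolation_models   # e.g. voting or averaging
```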
In one embodiment, the partitioning module includes:
the first dividing unit is used for dividing the original data set into a low forgetting probability data set and a high forgetting probability data set according to the forgetting probability of the original data in the original data set;
the second dividing unit is used for performing equal division on the low forgetting probability data set and the high forgetting probability data set respectively to obtain corresponding low forgetting probability data fragments and high forgetting probability data fragments; wherein each low forgetting probability data fragment contains the same amount of data, and each high forgetting probability data fragment contains the same amount of data.
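For illustration, and assuming the two data sets are held as NumPy arrays, the equal division performed by the second dividing unit could be sketched as follows:

```python
import numpy as np

def equal_fragments(low_set: np.ndarray, high_set: np.ndarray, n: int):
    """Cut each set into n (nearly) equal-sized fragments so that every isolation
    model sees a comparable amount of data."""
    return np.array_split(low_set, n), np.array_split(high_set, n)

# Example: 4 low fragments of 25 samples each and 4 high fragments of 15 samples each.
low_frags, high_frags = equal_fragments(np.arange(100), np.arange(100, 160), n=4)
```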
In one embodiment, the training module includes:
the pre-training unit is used for pre-training the original training model based on the low forgetting probability data fragments by adopting an incremental training mode to obtain a pre-processing isolation training model corresponding to each low forgetting probability data fragment;
and the fine tuning unit is used for fine tuning the preprocessing isolation training model based on the high forgetting probability data fragments by adopting an incremental training mode to obtain an intermediate isolation training model corresponding to each high forgetting probability data fragment.
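A minimal PyTorch-flavoured sketch of the incremental pre-training and fine-tuning performed by these two units is given below; the loss function, the optimizer and the idea of keeping a checkpoint after the pre-training stage are assumptions made for illustration, not the implementation of this embodiment.

```python
import copy
from torch import nn, optim
from torch.utils.data import DataLoader

def train_on_fragment(model: nn.Module, loader: DataLoader,
                      epochs: int = 1, lr: float = 1e-3) -> nn.Module:
    """One incremental step: continue training the given model on a single data fragment."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model

def pretrain_then_finetune(base: nn.Module, low_loader: DataLoader, high_loader: DataLoader):
    """Pre-train on a low forgetting probability fragment (pre-processing isolation model),
    keep a checkpoint, then fine-tune on a high forgetting probability fragment
    (intermediate isolation model)."""
    pre_model = train_on_fragment(base, low_loader)
    checkpoint = copy.deepcopy(pre_model.state_dict())  # reusable if only high-probability data is later deleted
    mid_model = train_on_fragment(pre_model, high_loader)
    return mid_model, checkpoint
```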
In an embodiment, the aggregation processing module is specifically configured to: and performing aggregation processing on all the intermediate isolation training models by adopting a majority voting method or a simple averaging method to obtain corresponding target training models.
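As a sketch of the two aggregation options, assuming each intermediate isolation model exposes an sklearn-style `predict_proba` (an assumption, not a requirement of this embodiment):

```python
import numpy as np

def aggregate_predictions(models, x, mode: str = "vote") -> np.ndarray:
    """Combine the intermediate isolation models at prediction time:
    mode='vote'    -> majority voting over predicted class labels;
    mode='average' -> simple averaging of per-class probabilities."""
    probs = np.stack([m.predict_proba(x) for m in models])  # (n_models, n_samples, n_classes)
    if mode == "average":
        return probs.mean(axis=0).argmax(axis=1)
    labels = probs.argmax(axis=2)                            # (n_models, n_samples)
    n_classes = probs.shape[2]
    return np.array([np.bincount(col, minlength=n_classes).argmax() for col in labels.T])
```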
In an embodiment, the first division unit includes:
the sequencing subunit is used for sequencing all the original data in the original data set in an ascending order according to the forgetting probability of the original data;
and the dividing subunit is used for dividing the original data according to the data proportion or the forgetting probability sum proportion to obtain a low forgetting probability data set and a high forgetting probability data set.
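An illustrative sketch of the sorting subunit and the dividing subunit, assuming NumPy arrays and a hypothetical `ratio` threshold controlling where the cut is made:

```python
import numpy as np

def split_low_high(data: np.ndarray, forget_prob: np.ndarray,
                   ratio: float = 0.5, by: str = "count"):
    """Sort samples by forgetting probability in ascending order, then cut them into
    a low forgetting probability set and a high forgetting probability set, either by
    a sample-count proportion or by a proportion of the summed forgetting probabilities."""
    order = np.argsort(forget_prob)               # ascending forgetting probability
    data, forget_prob = data[order], forget_prob[order]
    if by == "count":
        cut = int(len(data) * ratio)
    else:                                         # by == "prob_sum"
        cum = np.cumsum(forget_prob)
        cut = int(np.searchsorted(cum, ratio * cum[-1]))
    return data[:cut], data[cut:]
```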
In an embodiment, before the preprocessing isolation training model is fine-tuned based on the high forgetting probability data fragments by adopting an incremental training mode to obtain the intermediate isolation training model corresponding to each high forgetting probability data fragment, the training module further includes:
and the fixing unit is used for fixing the parameters of the partial layers in the preprocessing isolation training model according to the size relation of the data volume contained in the low forgetting probability data set and the high forgetting probability data set.
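Since the embodiment only states that the frozen share depends on the size relation of the two data sets, the proportional rule in the sketch below is an assumption; it simply freezes the earlier layers of a PyTorch module before fine-tuning:

```python
from torch import nn

def freeze_early_layers(model: nn.Module, n_low: int, n_high: int) -> nn.Module:
    """Freeze a share of the earlier layers that grows with the size of the low
    forgetting probability data set relative to the whole data set (assumed rule)."""
    layers = list(model.children())
    n_frozen = int(len(layers) * n_low / (n_low + n_high))
    for layer in layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False  # these parameters are not updated during fine-tuning
    return model
```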
The data processing device provided by the embodiment of the invention can execute the data processing method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 10 shows a schematic structural diagram of an electronic device 10 that may be used to implement an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 10, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, where the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various suitable actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from a storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data necessary for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the data processing method: responding to a data processing request instruction, and determining a data segment where target data are located, wherein the target data are data to be subjected to data deleting operation in an original data set; deleting the data segment where the target data is located, and searching the last data segment before the data segment where the target data is located; retraining a pre-established target training model from the last data segment to obtain a new target training model, wherein the target training model is obtained by training an original training model according to a low forgetting probability data segment and a high forgetting probability data segment which are obtained by dividing according to forgetting probability, and the number of the low forgetting probability data segment and the number of the high forgetting probability data segment are equal.
In some embodiments, the data processing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in conventional physical hosts and VPS (Virtual Private Server) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A data processing method, comprising:
responding to a data processing request instruction, and determining a data segment where target data are located, wherein the target data are data to be subjected to data deleting operation in an original data set;
deleting the data segment where the target data is located, and searching the last data segment before the data segment where the target data is located;
and retraining a pre-established target training model from the last data segment to obtain a new target training model, wherein the pre-established target training model is obtained by training an original training model according to a low forgetting probability data segment and a high forgetting probability data segment which are obtained by dividing according to forgetting probability, and the number of the low forgetting probability data segment and the number of the high forgetting probability data segment are equal.
2. The method of claim 1, before determining the data segment in which the target data is located in response to the data processing request instruction, further comprising:
dividing an original data set to obtain corresponding data fragments with low forgetting probability and data fragments with high forgetting probability;
sequentially adopting the low forgetting probability data fragment and the high forgetting probability data fragment to train an original training model to obtain a corresponding middle isolation training model; the total number of the intermediate isolation training models is equal to the number of the low forgetting probability data fragments and the number of the high forgetting probability data fragments, and the intermediate isolation training models, the low forgetting probability data fragments and the high forgetting probability data fragments are in one-to-one correspondence;
and adopting a preset aggregation algorithm to aggregate all the intermediate isolation training models to obtain corresponding target training models.
3. The method according to claim 2, wherein the dividing the original data set to obtain corresponding low forgetting probability data fragments and high forgetting probability data fragments comprises:
dividing an original data set into a low forgetting probability data set and a high forgetting probability data set according to the forgetting probability of the original data in the original data set;
respectively performing equal division on the low forgetting probability data set and the high forgetting probability data set to obtain corresponding low forgetting probability data fragments and high forgetting probability data fragments; wherein each of the low forgetting probability data fragments contains the same amount of data, and each of the high forgetting probability data fragments contains the same amount of data.
4. The method according to claim 2, wherein the training of the original training model by sequentially adopting the low forgetting probability data segment and the high forgetting probability data segment to obtain a corresponding intermediate isolation training model comprises:
pre-training an original training model by adopting an incremental training mode based on the low forgetting probability data fragments to obtain a pre-processing isolation training model corresponding to each low forgetting probability data fragment;
and fine-tuning the preprocessing isolation training model based on the high forgetting probability data fragments by adopting an incremental training mode to obtain an intermediate isolation training model corresponding to each high forgetting probability data fragment.
5. The method according to claim 1 or 2, wherein the aggregating all the intermediate isolation training models by using a preset aggregation algorithm to obtain corresponding target training models comprises:
and performing aggregation processing on all the intermediate isolation training models by adopting a majority voting method or a simple averaging method to obtain corresponding target training models.
6. The method of claim 3, wherein the dividing the original data set into a low forgetting probability data set and a high forgetting probability data set according to forgetting probabilities of original data in the original data set comprises:
sequencing all original data in the original data set in an ascending order according to the forgetting probability of the original data;
and dividing the original data according to the data proportion or the forgetting probability sum proportion to obtain a low forgetting probability data set and a high forgetting probability data set.
7. The method according to claim 4, wherein before the fine tuning of the pre-processing isolation training model is performed by using an incremental training mode based on the high forgetting probability data fragments to obtain an intermediate isolation training model corresponding to each high forgetting probability data fragment, the method further comprises:
and fixing parameters of a partial layer in the preprocessing isolation training model according to the size relation of data volumes contained in the low forgetting probability data set and the high forgetting probability data set.
8. A data processing apparatus, characterized by comprising:
the determining module is used for responding to a data processing request instruction and determining a data segment where target data are located, wherein the target data are data to be subjected to data deleting operation in an original data set;
the processing module is used for deleting the data segment where the target data is located and searching the last data segment before the data segment where the target data is located;
and the retraining module is used for retraining a pre-established target training model from the last data segment to obtain a new target training model, wherein the pre-established target training model is obtained by training an original training model according to a low forgetting probability data segment and a high forgetting probability data segment which are obtained by dividing the pre-established target training model according to forgetting probability, and the number of the low forgetting probability data segment and the number of the high forgetting probability data segment are equal.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that it stores computer instructions for causing a processor to implement the data processing method of any of claims 1-7 when executed.
CN202210462556.7A 2022-04-28 2022-04-28 Data processing method, device, equipment and medium Pending CN114780997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210462556.7A CN114780997A (en) 2022-04-28 2022-04-28 Data processing method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN114780997A true CN114780997A (en) 2022-07-22

Family

ID=82435422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210462556.7A Pending CN114780997A (en) 2022-04-28 2022-04-28 Data processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114780997A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination