CN113283578A - Data denoising method based on marking risk control - Google Patents
- Publication number
- CN113283578A (application CN202110399544.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- risk
- neural networks
- training
- networks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a data denoising method based on marking risk control. The success of deep learning usually depends on a large amount of accurately labeled data, which is often difficult to collect in real scenarios. To reduce the influence of label noise on network performance, the method maintains two neural networks that mutually select small-loss data as low-risk data to update their peer; each network then filters out the high-risk data in its selection and retrains on the rest. Because the two networks grow increasingly similar during training, which degrades learning performance, they stop selecting data for each other once their disagreement stabilizes, and each is then updated on the obtained low-risk data until convergence. Compared with the prior art, the resulting deep neural networks are more robust.
Description
Technical Field
The invention relates to a data denoising method based on marking risk control, which screens out high-risk labeled data to improve robustness, and belongs to the technical field of computer artificial-intelligence data analysis.
Background
In recent years, deep learning has been highly successful in fields such as face recognition, autonomous driving, and machine translation. However, its performance depends on large amounts of accurately labeled data, which are often difficult to collect in practical applications because accurate labeling requires substantial manpower and resources. To address this problem, crowdsourcing is often used to distribute large amounts of unlabeled data to volunteer users for labeling; since the users' skill levels vary, the resulting labels tend to be noisy. How to learn from data with label noise has therefore become a concern of many researchers.
Disclosure of Invention
The purpose of the invention is as follows: in practical applications, the training data of a deep neural network are often noisy, and training on data with label noise impairs the network's classification performance. To solve this problem, the invention provides a data denoising method based on marking risk control. Two neural networks each select small-loss data for the other to learn from; each network then discovers and removes the high-risk data in its selection and retrains on the remaining data. The difference between the two networks is monitored during training to keep learning performance from degrading, thereby improving robustness.
The technical scheme is as follows: a data denoising method based on marking risk control comprises the following steps:
First, a data set D is prepared, in which the data contain label noise. Two peer deep neural networks are randomly initialized and each trained on D for T rounds, yielding two neural networks f and g.
For the data set D, the cross-entropy loss of each sample is computed with f and with g; the samples are sorted by loss, and the small-loss subsets D_f (chosen by f) and D_g (chosen by g) are selected respectively. Network f is trained on D_g for K rounds to obtain f'; f' predicts the samples in D_g, and those whose prediction disagrees with the original label are treated as high-risk data H_f. The high-risk data are removed and f is retrained on the remaining set D_g \ H_f (where \ denotes the set difference: D_g is the previously selected small-loss data, H_f the high-risk data within it, so D_g \ H_f is the data remaining after removing H_f from D_g). In the same way, g is trained on D_f for K rounds to obtain g'; g' predicts the samples in D_f, those whose prediction disagrees with the original label are treated as high-risk samples H_g, the high-risk data are removed, and g is retrained on D_f \ H_g.
In each training round, the disagreement of f and g is computed on the data they have not seen in the current round. If the disagreement tends to be stable, or the number of training rounds reaches a preset maximum N, the learning process above is stopped. With the low-risk data obtained in the last round, D_g \ H_f and D_f \ H_g, f and g are trained respectively until the networks converge, finally yielding two trained neural networks.
In the prediction stage, the user inputs the feature vectors of the data to be tested into the two trained neural networks respectively; the two networks each return their prediction of the data to the user, and the prediction with the higher confidence is selected from the two and output as the final label of the data.
The data set D may be an image data set whose images contain label noise. In the prediction stage, the user inputs the feature vectors of the image data to be tested into the two trained neural networks respectively; the two networks each return their prediction of the image to the user, and the prediction with the higher confidence is selected from the two and output as the final label of the image data.
Beneficial effects: compared with the prior art, the data denoising method based on marking risk control maintains two neural networks; each network selects small-loss data, by cross-entropy loss, as low-risk data to update its peer, then discovers and deletes the data that remain high-risk and trains again, while the difference between the two networks is monitored during training to keep learning performance from deteriorating. In an image classification task, noise in the image data harms the subsequent neural-network classification model, and the more severe the noise, the more the network's performance suffers. Noise in real images is the complex accumulation of many noise components, which makes image denoising very difficult. Compared with the prior art, the method removes noise more effectively and obtains training data of higher purity, so the neural networks are more robust.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, in the data denoising method based on marking risk control, two neural networks are maintained; based on the small-loss criterion, they mutually select small-loss data as low-risk data to update their peer. Each network finds and removes the high-risk data in its selection and retrains on the remaining data. The disagreement of the two networks is monitored during training; if it tends to be stable, or the number of learning rounds reaches a preset maximum, the mutual selection of data stops, and the two networks are each trained on the low-risk data obtained in the last round until convergence, thereby controlling risk and improving robustness.
The image data denoising method based on the marking risk control comprises the following steps:
step 100 of preparing an image datasetWherein the images should have the same dimensions with the presence of marking noise.
Step 101: determine a network architecture, such as VGG, ResNet, or EfficientNet, according to the requirements of the image classification task; randomly initialize two peer deep neural networks, and train each on all image data of D for T rounds by gradient descent to obtain two neural networks f and g.
Step 102: for each image in the data set D, compute the cross-entropy loss with f and with g respectively. Sort the images by loss and select the small-loss samples of the top R(t) fraction to construct the image data sets D_f and D_g. R(t) is a parameter controlling the proportion of images selected; l denotes the loss function used to train the networks, typically the cross-entropy loss.
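The small-loss selection of step 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the use of NumPy, and the fixed selection rate are assumptions.

```python
import numpy as np

def select_small_loss(losses, rate):
    """Return indices of the `rate` fraction of samples with smallest loss.

    `losses` is a 1-D array of per-sample cross-entropy losses computed
    by one network; the selected indices form the low-risk subset handed
    to the peer network.
    """
    n_keep = int(len(losses) * rate)
    order = np.argsort(losses)  # ascending: small loss first
    return order[:n_keep]

# Example: one network's losses on six samples, keep the top 50%.
losses_f = np.array([0.2, 1.7, 0.1, 2.3, 0.4, 0.9])
idx = select_small_loss(losses_f, 0.5)
# idx holds the three smallest-loss samples: indices 2, 0, 4
```

In practice the losses would come from a forward pass of f or g over D; R(t) typically starts near 1 and is decayed as training progresses, but the patent only specifies it as a controllable proportion.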
Step 103: each of the two networks screens out the high-risk images in the small-loss data selected by its peer and retrains on the remaining images, in the following specific steps:
step 1031, f in the image datasetTraining a K wheel on all image data by using a gradient descent method to obtain f'; g in the image datasetTraining K rounds using gradient descent method on all image data to obtain g'.
Step 1032: input each image of D_g into f' to obtain a predicted label; images whose predicted label disagrees with the original label are treated as high-risk image data H_f. Input each image of D_f into g' to obtain a predicted label; images whose predicted label disagrees with the original label are treated as high-risk image data H_g. A training sample is written (x, ỹ), where x denotes the image features and ỹ the collected label; because of label noise, ỹ is not necessarily the true label y. y_f' and y_g' denote the labels predicted by f' and g', respectively, for image x.
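The high-risk filtering of step 1032 amounts to comparing predicted labels against the collected (possibly noisy) labels. A minimal sketch, with illustrative names:

```python
import numpy as np

def filter_high_risk(pred_labels, noisy_labels):
    """Split sample indices into high-risk (prediction != collected label)
    and retained low-risk samples, mirroring step 1032."""
    pred_labels = np.asarray(pred_labels)
    noisy_labels = np.asarray(noisy_labels)
    disagree = pred_labels != noisy_labels
    high_risk = np.nonzero(disagree)[0]   # candidates for removal
    keep = np.nonzero(~disagree)[0]       # the set difference, used to retrain
    return high_risk, keep

# Collected labels ỹ vs. predictions of the retrained network f'
y_noisy = [0, 1, 2, 1, 0]
y_fprime = [0, 2, 2, 1, 1]
hi, keep = filter_high_risk(y_fprime, y_noisy)
# hi == [1, 4]; keep == [0, 2, 3]
```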
Step 1033: retrain f on the remaining image data set D_g \ H_f; retrain g on the remaining image data set D_f \ H_g.
Step 104: compute the disagreement of the two neural networks on the image data that were not used to update them. If the disagreement has stabilized, or the number of training rounds reaches the preset maximum N, stop the mutual learning; otherwise return to step 102 and continue training.
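The stopping criterion of step 104 can be sketched as follows. The patent only requires that the disagreement "tends to be stable"; the moving-window test, its size, and the tolerance below are illustrative assumptions.

```python
import numpy as np

def disagreement_rate(preds_f, preds_g):
    """Fraction of held-out samples on which f and g predict differently."""
    preds_f = np.asarray(preds_f)
    preds_g = np.asarray(preds_g)
    return float(np.mean(preds_f != preds_g))

def has_stabilized(history, window=3, tol=1e-3):
    """Crude stability test: the last `window` disagreement values
    vary by less than `tol`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol

# Disagreement per round: drops, then flattens out
rates = [0.30, 0.21, 0.1501, 0.1502, 0.1500]
# the last three values span only 0.0002 < tol, so mutual learning stops
```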
Step 105: respectively train the two neural networks on the low-risk image data obtained in the last round until convergence, so as to obtain the two neural networks f and g.
Step 106: input the image data to be tested into f and g respectively to obtain predicted labels, and output the predicted label with the higher confidence as the final label.
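Step 106's confidence-based selection can be sketched as follows, assuming each network outputs a class-probability vector (e.g. after softmax); the function name and the use of the maximum probability as "confidence" are illustrative.

```python
import numpy as np

def predict_final(probs_f, probs_g):
    """Output the label of whichever network is more confident,
    where confidence is taken as the maximum class probability."""
    probs_f = np.asarray(probs_f)
    probs_g = np.asarray(probs_g)
    if probs_f.max() >= probs_g.max():
        return int(np.argmax(probs_f))
    return int(np.argmax(probs_g))

# f is 60% sure of class index 1; g is 90% sure of class index 2,
# so g's more confident prediction (2) is output.
label = predict_final([0.2, 0.6, 0.2], [0.05, 0.05, 0.9])
```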
Claims (7)
1. A data denoising method based on marking risk control, characterized in that two neural networks are maintained; based on the small-loss criterion, they mutually select small-loss data as low-risk data to update their peer; each network finds and removes the high-risk data in its selection and retrains on the remaining data; the disagreement of the two networks is monitored during training; if the disagreement tends to be stable or the number of learning rounds reaches a preset maximum, the mutual selection of samples is stopped and the two networks are each trained on the low-risk samples obtained in the last round until convergence, yielding two neural networks which are used to denoise the data to be processed.
2. The method of claim 1, characterized in that a data set D is first prepared, in which the data contain label noise; two peer deep neural networks are randomly initialized and each trained on D for T rounds to obtain two neural networks f and g;
for the data set D, the cross-entropy loss of each sample is computed with f and with g; the samples are sorted by loss and the small-loss subsets D_f and D_g are selected respectively; the neural network f is trained on D_g for K rounds to obtain f'; f' predicts the samples in D_g, and those whose prediction disagrees with the original label are treated as high-risk data H_f; the high-risk data are removed and f is trained again on the remaining set D_g \ H_f; in the same way, g is trained on D_f for K rounds to obtain g'; g' predicts the samples in D_f, those whose prediction disagrees with the original label are treated as high-risk samples H_g; the high-risk data are removed and g is retrained on D_f \ H_g;
in each training round, the disagreement of f and g is computed on the data they have not seen in the current round; if the disagreement tends to be stable, or the number of training rounds reaches a preset maximum N, the learning process above is stopped; with the low-risk data D_g \ H_f and D_f \ H_g obtained in the last round, f and g are trained respectively until the networks converge, finally yielding two trained neural networks.
3. The data denoising method based on marking risk control according to claim 1, characterized in that, in the prediction stage, the user inputs the feature vector of the data to be tested into the two trained neural networks respectively; the two networks each return their prediction of the data to the user, and the prediction with the higher confidence is selected from the two and output as the final label of the data.
5. The method of claim 4, characterized in that, in the prediction stage, the user inputs the feature vectors of the image data to be tested, in the two modalities, into the two trained neural networks respectively; the two networks each return their prediction of the image to the user, and the prediction with the higher confidence is selected from the two and output as the final label of the image data.
6. The data denoising method based on marking risk control according to claim 1, characterized in that, for the data set D, the cross-entropy loss of each sample is computed with f and with g respectively; the samples are sorted by loss, and the small-loss samples of the top R(t) fraction are selected respectively to construct the data sets D_f and D_g; R(t) is a parameter controlling the proportion of data selected.
7. The data denoising method based on marking risk control according to claim 2, wherein each network finds and removes the high-risk data in its selection and retrains on the remaining data, in the following specific steps:
step 1031, f in data setTraining a K round by using a gradient descent method to obtain f'; g in the data setTraining a K round by using a gradient descent method to obtain g';
step 1032, willEach data input to f' gets a predictive flag; image data in which the prediction flag does not coincide with the original flag is regarded as high-risk dataWill be provided withEach data input to g' gets a predictive flag; data in which the predicted tag is inconsistent with the original tag is considered high risk data
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110399544.XA CN113283578A (en) | 2021-04-14 | 2021-04-14 | Data denoising method based on marking risk control |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110399544.XA CN113283578A (en) | 2021-04-14 | 2021-04-14 | Data denoising method based on marking risk control |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113283578A true CN113283578A (en) | 2021-08-20 |
Family
ID=77276660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110399544.XA Pending CN113283578A (en) | 2021-04-14 | 2021-04-14 | Data denoising method based on marking risk control |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113283578A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114330439A (en) * | 2021-12-28 | 2022-04-12 | 盐城工学院 | Bearing diagnosis method based on convolutional neural network |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109740057A (en) * | 2018-12-28 | 2019-05-10 | 武汉大学 | A kind of strength neural network and information recommendation method of knowledge based extraction |
CN110110780A (en) * | 2019-04-30 | 2019-08-09 | 南开大学 | A kind of picture classification method based on confrontation neural network and magnanimity noise data |
CN110310199A (en) * | 2019-06-27 | 2019-10-08 | 上海上湖信息技术有限公司 | Borrow or lend money construction method, system and the debt-credit Risk Forecast Method of risk forecast model |
EP3582142A1 (en) * | 2018-06-15 | 2019-12-18 | Université de Liège | Image classification using neural networks |
US20200034693A1 (en) * | 2018-07-27 | 2020-01-30 | Samsung Electronics Co., Ltd. | Method for detecting defects in semiconductor device |
CN111160474A (en) * | 2019-12-30 | 2020-05-15 | 合肥工业大学 | Image identification method based on deep course learning |
CN111339934A (en) * | 2020-02-25 | 2020-06-26 | 河海大学常州校区 | Human head detection method integrating image preprocessing and deep learning target detection |
CN111861909A (en) * | 2020-06-29 | 2020-10-30 | 南京理工大学 | Network fine-grained image denoising and classifying method |
CN111931637A (en) * | 2020-08-07 | 2020-11-13 | 华南理工大学 | Cross-modal pedestrian re-identification method and system based on double-current convolutional neural network |
CN111985520A (en) * | 2020-05-15 | 2020-11-24 | 南京智谷人工智能研究院有限公司 | Multi-mode classification method based on graph convolution neural network |
CN112101328A (en) * | 2020-11-19 | 2020-12-18 | 四川新网银行股份有限公司 | Method for identifying and processing label noise in deep learning |
CN112529024A (en) * | 2019-09-17 | 2021-03-19 | 株式会社理光 | Sample data generation method and device and computer readable storage medium |
WO2021064856A1 (en) * | 2019-10-01 | 2021-04-08 | 日本電気株式会社 | Robust learning device, robust learning method, program, and storage device |
WO2021064787A1 (en) * | 2019-09-30 | 2021-04-08 | 日本電気株式会社 | Learning system, learning device, and learning method |
- 2021-04-14: CN202110399544.XA filed (CN); patent CN113283578A, status Pending
Non-Patent Citations (6)
Title |
---|
J. ZHANG 等: "Improving Crowdsourced Label Quality Using Noise Correction", 《 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS》, vol. 29, no. 5, pages 1675 - 1688, XP011681456, DOI: 10.1109/TNNLS.2017.2677468 * |
WANG WEI 等: "Learnability of Multi-Instance Multi-Label Learning", 《CHINESE SCIENCE BULLETIN》, vol. 57, no. 19, pages 2488, XP035075248, DOI: 10.1007/s11434-012-5133-z * |
WANG, WEI 等: "Adaptive Switching Anisotropic Diffusion Model for Universal Noise Removal", 《 PROCEEDINGS OF THE 10TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION (WCICA 2012)》, pages 4803 - 4808 * |
ZHENGWEN ZHANG 等: "Making Deep Neural Networks Robust to Label Noise: A Reweighting Loss and Data Filtration", 《 2019 IEEE 4TH INTERNATIONAL CONFERENCE ON SIGNAL AND IMAGE PROCESSING (ICSIP)》, pages 289 - 293 * |
周杭驰 等: "基于深度学习的图像分类标注研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 2021 * |
郭翔宇 等: "一种改进的协同训练算法:Compatible Co-training", 《南京大学学报(自然科学)》, vol. 52, no. 04, pages 662 - 671 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829826B (en) | Image retrieval method based on deep learning and semantic segmentation | |
CN109299274B (en) | Natural scene text detection method based on full convolution neural network | |
CN108288051B (en) | Pedestrian re-recognition model training method and device, electronic equipment and storage medium | |
CN108335303B (en) | Multi-scale palm skeleton segmentation method applied to palm X-ray film | |
CN111079847B (en) | Remote sensing image automatic labeling method based on deep learning | |
JP2018097807A (en) | Learning device | |
CN107480723B (en) | Texture Recognition based on partial binary threshold learning network | |
CN110728694B (en) | Long-time visual target tracking method based on continuous learning | |
CN111126470B (en) | Image data iterative cluster analysis method based on depth measurement learning | |
CN108734200B (en) | Human target visual detection method and device based on BING (building information network) features | |
CN111239137B (en) | Grain quality detection method based on transfer learning and adaptive deep convolution neural network | |
CN112101364B (en) | Semantic segmentation method based on parameter importance increment learning | |
CN114048822A (en) | Attention mechanism feature fusion segmentation method for image | |
CN109872326B (en) | Contour detection method based on deep reinforced network jump connection | |
KR20220116270A (en) | Learning processing apparatus and method | |
CN112581483B (en) | Self-learning-based plant leaf vein segmentation method and device | |
CN113283578A (en) | Data denoising method based on marking risk control | |
CN117115614B (en) | Object identification method, device, equipment and storage medium for outdoor image | |
US11776292B2 (en) | Object identification device and object identification method | |
CN108428234B (en) | Interactive segmentation performance optimization method based on image segmentation result evaluation | |
CN109255794B (en) | Standard part depth full convolution characteristic edge detection method | |
CN117011515A (en) | Interactive image segmentation model based on attention mechanism and segmentation method thereof | |
CN114708307B (en) | Target tracking method, system, storage medium and device based on correlation filter | |
CN116310466A (en) | Small sample image classification method based on local irrelevant area screening graph neural network | |
CN114758135A (en) | Unsupervised image semantic segmentation method based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |