CN115953609B - Data set screening method and system - Google Patents

Data set screening method and system

Info

Publication number
CN115953609B
CN115953609B (application number CN202210942382.4A)
Authority
CN
China
Prior art keywords
data
data set
classification
value
image classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210942382.4A
Other languages
Chinese (zh)
Other versions
CN115953609A (en)
Inventor
王纵驰
王建兴
付利红
孙天姿
王诗慧
张朗
刘翔宇
史淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Shenzhou Wisdom System Technology Co ltd
China Aviation Oil Group Co ltd
Original Assignee
Aerospace Shenzhou Wisdom System Technology Co ltd
China Aviation Oil Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Shenzhou Wisdom System Technology Co ltd, China Aviation Oil Group Co ltd filed Critical Aerospace Shenzhou Wisdom System Technology Co ltd
Priority to CN202210942382.4A priority Critical patent/CN115953609B/en
Publication of CN115953609A publication Critical patent/CN115953609A/en
Application granted granted Critical
Publication of CN115953609B publication Critical patent/CN115953609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a data set screening method and system, belonging to the field of data processing. The method comprises the following steps: generating m different initial data sets from the original data set through different sampling processes; respectively inputting the m different initial data sets into mp classifiers to obtain mp different classifier output results; inputting the mp different classifier output results into a voting network to obtain a data set with mean and divergence values; and iteratively screening the data set with mean and divergence values, ending the iteration when a stop condition is met, and outputting a core data set. The system comprises: an initial data set generation module, a classifier module, a voting module and a screening module. The method improves the accuracy of the screening model, and the screened core set generalizes across deep learning algorithms of the same type.

Description

Data set screening method and system
Technical Field
The application belongs to the field of data processing, and particularly relates to a data set screening method and system.
Background
Deep convolutional neural networks (CNNs) have shown their potential in many research areas of computer vision (e.g., image classification, object detection and scene segmentation) and are applied by training convolutional neural network models on large supervised data sets. In practice, however, this approach has a serious limitation: collecting a large number of labeled images is expensive, and storing large volumes of data and training models on them is laborious. The problems exposed in these applications raise a key question: "what method should be used to select the data so as to achieve the highest accuracy with a fixed data size?" The core dataset screening algorithm is one of the common approaches to this problem.
The core dataset screening problem considers a fully labeled dataset and uses an algorithm to screen out a subset such that a model trained on the selected subset performs as closely as possible to a model trained on the entire dataset.
However, conventional core data set selection algorithms designed for machine learning methods such as support vector machines are ineffective when applied to CNNs. The main factor behind this inefficiency is the inter-batch correlation caused by CNN mini-batch training. In a classical setting, an active learning algorithm typically selects one point per iteration; for CNNs this is not feasible. Because of stochastic gradient descent, a single added point may not have a statistically significant effect on CNN accuracy, and single-point label queries are expensive because each iteration requires extensive training until convergence. It is therefore necessary to select data batch by batch in each iteration, which produces the inter-batch correlation effect.
Disclosure of Invention
In order to solve the defects in the prior art, the application provides a data set screening method and a data set screening system.
In a first aspect, the present application provides a data set screening method, including the steps of:
generating m different initial data sets from the original data set through different sampling processes;
respectively inputting m different initial data sets into mp classifiers to obtain mp different classifier output results;
inputting the mp different classifier output results into a voting network to obtain a data set with mean and divergence values;
and iteratively screening the data set with mean and divergence values; when a stop condition is met, the iteration ends and a core data set is output.
The method for generating the m different initial data sets by sampling the original data set comprises the following steps:
performing data feature extraction on the original data set by using the front 174 layers of the ImageNet pre-trained ResNet50 model as a feature extraction network;
calculating the mutual distance $dist_{ij}$ between every two samples in the original data set according to the extracted data features and obtaining a distance set, the calculation formula being:

$$dist_{ij} = \sqrt{\sum_{u=1}^{n} \left( F_{iu} - F_{ju} \right)^2}$$

where i and j denote different samples in the original data set, u denotes a dimension of the sample's data feature tensor, n is the total number of data feature tensor dimensions, and $F_{iu}$, $F_{ju}$ are the data feature tensor values of samples i and j in dimension u.
Calculating the distribution density of the data according to the distance set, and obtaining a density set;
and sorting the density set in descending order of distribution density, taking the samples corresponding to the first mk data as an initial data set.
The distribution density is calculated as follows: the distance set is sorted in ascending order of the pairwise mutual distances, the mutual distances $dist_{im'}$ of the eight nearest data samples are taken, and the distribution density of data sample $x_i$ is calculated by the formula:

$$density(x_i) = \left( \frac{1}{8} \sum_{m'=1}^{8} dist_{im'} \right)^{-1}$$

where m' indexes one of the eight nearest samples and $density(x_i)$ is the distribution density of data sample $x_i$.
The classifier is a ResNet50 network classification model initialized with different initialization parameters.
The loss function and the classification regression function of the ResNet50 network classification model are described by the following formulas:

$$loss = \frac{1}{|R|} \sum_{(x_i, y_i) \in R} l(x_i, y_i; w), \qquad \eta_c(x) = p(y = c \mid x)$$

where loss is the loss function, $l(\cdot)$ represents the loss-function solving process, R is the set of data points, and the parameter w is the weight trained by the deep learning model algorithm in each iteration; throughout the solving of the trained loss function, the class-loss regression solving process $\eta_c(x)$ for any class c of the classification problem satisfies the $\lambda^{\eta}$-Lipschitz continuity condition;

under the constraint of these conditions, a C-class image classification deep learning dataset is expressed as a series of data points $\{x_i, y_i\}_{i \in [N]} \sim p_Z$ collected from the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ by a manual screening process, where $[N] = \{1, \dots, N\}$ indexes the N data items in the dataset; following the deep learning data-partition rule, the dataset is divided into two parts: a training dataset T of size n, $\{x_i, y_i\}_{i \in [n]}$, and a test dataset V of size ml, $\{x_i, y_i\}_{i \in [ml]}$.
The mp different classifier output results are input into a voting network, and the method comprises the following steps:
inputting the mp different classifier output results as a data set R to be selected into a voting network;
in the voting network, each data point $(x_m, y_m)$ of the candidate data set is analyzed;
calculating for the data in the candidate data set the set of probability predictions $P_{ij} = \{P_{pq} \mid p = 1, \dots, K,\ q = 1, \dots, C\}$, where $P_{pq}$ is the probability predicted by classifier module p that the data belongs to class q, p indexes the predictions obtained from the different classifier modules, q indexes the class to which the predicted data may belong, and K is the number of classifier modules.
Calculating the average $MP_c$ of the set of probability predictions as the voting network's predicted label output, and calculating the divergence value L of the data based on the average, by the formulas:

$$MP_c = \frac{1}{K} \sum_{p=1}^{K} P_{pc}, \qquad L = \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{C} \left( P_{ij} - MP_j \right)^2$$

where C is the number of classes, $MP_j$ is the average predicted probability for class j, and $P_{ij}$ is the probability with which classifier module i predicts the data as class j.
The data set with mean and divergence values has the following features: $\{x_i, y_i, MP_c, L\}$, where $x_i$ is a data sample, i.e., a certain sample in the historical data set; $y_i$ is the original label of the data sample; $MP_c$ is the average of the probability predictions; and L is the divergence value.
The iterative screening of the data set with mean and divergence values, with the iteration ending and a core data set being output when the stop condition is met, comprises the following steps:
according to whether the original label $y_i$ of the data agrees with the class predicted by the average $MP_c$ of the probability predictions in the voting network, dividing the candidate data set R into a consistent candidate set $TR_k$ and a divergent candidate set $FR_k$;
sorting each set in descending order of divergence value L, allocating sampling proportions, and performing stratified sampling to obtain the b data points finally screened in this iteration, which are added to $s_k$ to form the data set $s_{k+1}$ required for the next iteration;
when the stop condition is met, the iteration ends and the core data set is output.
The consistent candidate set $TR_k$ and the divergent candidate set $FR_k$ contribute the selected data in a ratio of nine to one.
The stop condition includes a precision condition and a quantity condition;
the precision condition: iterate until the following formula is satisfied:

$$\left| E - \frac{1}{ml} \sum_{i \in [ml]} \mathbb{1}\left[ f(x_i, w_s) = y_i \right] \right| < \epsilon, \qquad s.t.\ \{x_i, y_i\}_{i \in [m]} \in V$$

i.e., when the model weights $W_S$ trained on data set $s_k$ yield a model $M_S$ whose accuracy on the test data set V differs from the accuracy E of the original data set under the same conditions by less than the task-determined error tolerance ε, the data set $s_k$ is regarded as the core data set of the initial data set $s_0$ under error ε, and the iterative algorithm terminates; $f(x_i, w_i)$ is a function of data sample $x_i$ and the weights $w_i$;
the quantity condition: when the number of iterations of data set $s_k$ reaches a preset value M, the data set $s_k$ is likewise regarded as the core data set of the initial data set $s_0$ under error ε, and the iteration likewise ends.
In a second aspect, the present application proposes a data set screening system comprising: the system comprises an initial data set generation module, a classifier module, a voting module and a screening module;
the initial data set generation module, the classifier module, the voting module and the screening module are sequentially connected;
the initial data set generation module is used for generating m different initial data sets from the original data set through different sampling processes;
the classifier module is used for respectively inputting the m different initial data sets into mp classifiers to obtain mp different classifier output results;
the voting module is used for inputting the mp different classifier output results into a voting network to obtain a data set with mean and divergence values;
and the screening module is used for iteratively screening the data set with mean and divergence values, ending the iteration when the stop condition is met, and outputting a core data set.
The application has the beneficial effects that:
the application provides a data set screening method and a data set screening system, which are used for constructing a weighted subset of training approximate replacement original data sets from the perspective of data sets used for deep learning, namely a core data set, defining a core data set selection problem under CNN and designing and realizing a core data set selection algorithm under CNN. According to the application, three different data sets are used for researching the image classification problem, and experimental results show that the accuracy of a training model is only reduced by not more than 5% while the scale of a core data set is one fifth of the original scale. Meanwhile, the generalization error of the core data set screened by the method is only 0.13, and the generalization is better.
Drawings
FIG. 1 is a flow chart of a data set screening method according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a data set screening system according to an embodiment of the present application;
FIG. 3 is a graph showing comparison of the accuracy of each algorithm of the CIFAR dataset according to the embodiment of the application;
FIG. 4 is a graph showing the comparison of the accuracy of the algorithms of the Fashion-MNIST dataset of the present application;
FIG. 5 is a graph of the visual effect of the Fashion-MNIST dataset of an embodiment of the present application;
fig. 6 is a graph showing the accuracy of each algorithm of the SVHN dataset according to the embodiment of the present application;
fig. 7 is a diagram of a residual block structure of an embodiment of the present application.
Detailed Description
The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.
In a first aspect, the present application proposes a data set screening method, as shown in fig. 1, including the following steps:
step S1: generating m different initial data sets from the original data set through different sampling processes;
step S2: respectively inputting m different initial data sets into mp classifiers to obtain mp different classifier output results;
step S3: inputting the mp different classifier output results into a voting network to obtain a data set with mean and divergence values;
step S4: iteratively screening the data set with mean and divergence values; when a stop condition is met, the iteration ends and a core data set is output.
In step S1 of the present application, the initial data set is sampled. An algorithm with an iterative process is required here, and the initial state of such an algorithm often affects the final result. In general, conventional methods generate the initial data set from the original data set by uniform sampling, based on the assumption that all data are independent and identically distributed. The actual distribution of the original data set may not follow this i.i.d. assumption, so the sampling process that generates the initial data set needs to take the original data distribution into account.
Since uniform sampling assumes that all data samples are equally probable, the composition of the initial data set can vary randomly with the random initial values of the sampling method. This application therefore samples with a stratified method based on data density: first the mutual distance between every two samples in the data set is calculated from the data features, then the distribution density of the data is calculated from these distances, and finally stratified sampling by data density yields the initial data set.
The method for generating the m different initial data sets by sampling the original data set comprises the following steps:
performing data feature extraction on the original data set by using the front 174 layers of the ImageNet pre-trained ResNet50 model as a feature extraction network;
calculating the mutual distance $dist_{ij}$ between every two samples in the original data set according to the extracted data features and obtaining a distance set, the calculation formula being:

$$dist_{ij} = \sqrt{\sum_{u=1}^{n} \left( F_{iu} - F_{ju} \right)^2}$$

where i and j denote different samples in the original data set, u denotes a dimension of the sample's data feature tensor, n is the total number of data feature tensor dimensions, and $F_{iu}$, $F_{ju}$ are the data feature tensor values of samples i and j in dimension u.
Calculating the distribution density of the data according to the distance set, and obtaining a density set;
and sorting the density set in descending order of distribution density, taking the samples corresponding to the first mk data as an initial data set.
The distribution density is calculated as follows: the distance set is sorted in ascending order of the pairwise mutual distances, the mutual distances $dist_{im'}$ of the eight nearest data samples are taken, and the distribution density of data sample $x_i$ is calculated by the formula:

$$density(x_i) = \left( \frac{1}{8} \sum_{m'=1}^{8} dist_{im'} \right)^{-1}$$

where m' indexes one of the eight nearest samples and $density(x_i)$ is the distribution density of data sample $x_i$.
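The sampling procedure above can be illustrated with a short sketch. This is a minimal illustration, not the patent's reference implementation: the feature matrix is assumed to have already been extracted by the truncated ResNet50, the function name and the reciprocal-of-mean-distance form of the density are assumptions, and SciPy's cdist stands in for the pairwise-distance computation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_sample(features: np.ndarray, mk: int) -> np.ndarray:
    """Select the mk densest samples as the initial dataset s_0.

    features: (N, n) array holding each sample's feature tensor,
    e.g. the flattened output of the front 174 layers of an
    ImageNet-pretrained ResNet50.
    """
    # Pairwise mutual distances dist_ij over the n feature dimensions.
    dist = cdist(features, features)            # (N, N), Euclidean

    # Sort each row ascending; column 0 is the sample itself
    # (distance 0), so columns 1..8 are the eight nearest neighbours.
    nearest8 = np.sort(dist, axis=1)[:, 1:9]    # (N, 8)

    # Distribution density, assumed here to be the reciprocal of the
    # mean distance to the eight nearest neighbours.
    density = 1.0 / nearest8.mean(axis=1)

    # Descending sort by density; the first mk samples form s_0.
    return np.argsort(-density)[:mk]
```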
The classifier is a ResNet50 network classification model initialized with different initialization parameters.
The ability of a deep convolutional neural network to extract deep information features from images is closely related to the depth of its network structure. However, as the number of layers increases, gradients vanish during the training phase, which in practice limits the depth of the network structure and the improvement of network performance. ResNet's series-combined network solves this performance degradation problem: residual blocks with a shortcut layer-jump structure make it possible to deepen the network almost without limit, and training is faster than with other network models of adjustable depth. The structure of a residual block is shown in fig. 7.
In the figure, the weight layer is a weighted layer, F(x) denotes the mapping of x computed by the convolution mechanism, and the ReLU module is a nonlinear activation unit. The output of the residual block is F(x)+x; output gradients of different layers are merged through the shortcut layer-jump structure, so the number of layers is not limited.
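As an illustration of the shortcut structure in fig. 7, the following is a minimal PyTorch sketch of a basic residual block. The two 3×3 convolutions and the batch normalization layers are assumptions for illustration; the patent itself only specifies the F(x)+x skip connection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = relu(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two stacked weight layers with a ReLU nonlinearity
        # between them, as in fig. 7.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Shortcut layer-jump structure: the identity connection merges
        # output gradients of different layers, so depth is not limited.
        return F.relu(out + x)
```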
The classifier module herein employs a ResNet50 based classification model, whose network architecture is shown in Table 1 below.
TABLE 1 ResNet50 network architecture
The loss function and the classification regression function of the ResNet50 network classification model are described by the following formulas:

$$loss = \frac{1}{|R|} \sum_{(x_i, y_i) \in R} l(x_i, y_i; w), \qquad \eta_c(x) = p(y = c \mid x)$$

where loss is the loss function, $l(\cdot)$ represents the loss-function solving process, and R is the set of data points. The parameter w is the weight trained by the deep learning model algorithm in each iteration; throughout the solving of the trained loss function, the class-loss regression solving process $\eta_c(x)$ for any class c of the classification problem satisfies the $\lambda^{\eta}$-Lipschitz continuity condition. Here $x_i$ is a data sample and $y_i$ is the original label of the data sample.
Under the constraint of these conditions, a C-class image classification deep learning dataset may be expressed as a series of data points $\{x_i, y_i\}_{i \in [N]} \sim p_Z$ collected from the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ by a manual screening process, where $[N] = \{1, \dots, N\}$ indexes the N data items in the dataset; following the deep learning data-partition rule, the dataset is divided into two parts: a training dataset T of size n, $\{x_i, y_i\}_{i \in [n]}$, and a test dataset V of size ml, $\{x_i, y_i\}_{i \in [ml]}$.
Following the procedure design of active learning, for the first iteration the algorithm samples from the training dataset T an initial dataset $s_0 = \{s_0(j) \in [n]\}_{j \in [m]}$ containing mk items of the original data; the specific sampling method is discussed in the following subsections. The remaining data form the candidate dataset $R_0 = [n] \setminus s_0$.
In each iteration, the core dataset screening algorithm trains the model using only the data $\{x_i\}_{i \in [n]}, \{y_{s(j)}\}_{j \in [m]}$. In other words, the training set of the algorithm at iteration k is $s_k$, and the trained model $M_s$ produces the model weights $w_k$. Using the model weights $w_k$, the algorithm $A_s$ selects b data points from the candidate dataset $R_k$ and adds them to $s_k$, forming the dataset $s_{k+1}$ required for the next iteration; the iterations continue in turn until the termination condition is met, at which point the algorithm ends.
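The iterative procedure just described can be summarized schematically as follows. All helper callables (train, vote_and_score, select_b, stop_condition, density_sample) are hypothetical stand-ins for the modules described herein, not actual APIs.

```python
def coreset_screening(T, mk, b, max_iter,
                      train, vote_and_score, select_b, stop_condition):
    """Schematic core-dataset screening loop over index lists."""
    s = list(density_sample(T, mk))          # initial dataset s_0
    R = [i for i in T if i not in set(s)]    # candidate dataset R_0

    for k in range(max_iter):
        w = train(s)                     # train model M_s, weights w_k
        scored = vote_and_score(R, w)    # (MP_c, L) for every candidate
        picked = select_b(scored, b)     # b points screened this round
        s += picked                      # forms s_{k+1}
        picked_set = set(picked)
        R = [i for i in R if i not in picked_set]
        if stop_condition(s, w, k):      # precision or quantity condition
            break
    return s                             # the core dataset
```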
The mp different classifier output results are input into a voting network, and the method comprises the following steps:
inputting the mp different classifier output results as a data set R to be selected into a voting network;
in the voting network, each data point $(x_m, y_m)$ of the candidate data set is analyzed;
calculating in the candidate dataset the set of probability predictions $P_{ij} = \{P_{pq} \mid p = 1, \dots, K,\ q = 1, \dots, C\}$, where $P_{pq}$ is the probability predicted by classifier module p that the data belongs to class q, p indexes the predictions obtained from the different classifier modules, q indexes the class to which the predicted data may belong, and K is the number of classifier modules.
Calculating the average $MP_c$ of the set of probability predictions as the voting network's predicted label output, and calculating the divergence value L of the data based on the average, by the formulas:

$$MP_c = \frac{1}{K} \sum_{p=1}^{K} P_{pc}, \qquad L = \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{C} \left( P_{ij} - MP_j \right)^2$$

where C is the number of classes, $MP_j$ is the average predicted probability for class j, and $P_{ij}$ is the probability with which classifier module i predicts the data as class j.
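A compact sketch of the voting computation for a single candidate point is given below, under the assumption (consistent with the variable definitions above, though the exact weighting is an assumption) that the divergence value L is the mean squared deviation of the K classifiers' class probabilities from the ensemble mean.

```python
import numpy as np

def vote(P: np.ndarray):
    """P: (K, C) matrix; P[p, q] is classifier module p's predicted
    probability that the data point belongs to class q."""
    MP = P.mean(axis=0)          # MP_c: mean probability per class
    label = int(MP.argmax())     # voting-network predicted label
    # Divergence value L: disagreement of the K classifiers, measured
    # against the ensemble mean (assumed mean-squared-deviation form).
    L = float(((P - MP) ** 2).sum() / P.shape[0])
    return MP, label, L

# Example: three classifier modules, four classes.
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.6, 0.2, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1]])
MP, label, L = vote(P)           # label 0, with a nonzero divergence
```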
The data set with mean and divergence values has the following features: $\{x_i, y_i, MP_c, L\}$, where $x_i$ is a data sample, i.e., a certain sample in the historical data set; $y_i$ is the original label of the data sample; $MP_c$ is the average of the probability predictions; and L is the divergence value.
The iterative screening of the data set with mean and divergence values, with the iteration ending and a core data set being output when the stop condition is met, comprises the following steps:
according to whether the original label $y_i$ of the data agrees with the class predicted by the average $MP_c$ of the probability predictions in the voting network, dividing the candidate data set R into a consistent candidate set $TR_k$ and a divergent candidate set $FR_k$;
sorting each set in descending order of divergence value L, allocating sampling proportions, and performing stratified sampling to obtain the b data points finally screened in this iteration, which are added to $s_k$ to form the data set $s_{k+1}$ required for the next iteration;
when the stop condition is met, the iteration ends and the core data set is output.
The consistent candidate set $TR_k$ and the divergent candidate set $FR_k$ contribute the selected data in a ratio of nine to one.
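The screening step can be sketched as below. Each candidate is assumed to be a tuple (index, original label y_i, label predicted from MP_c, divergence L); taking the highest-divergence points within each subset is a simplification of the proportional stratified sampling described above, while the nine-to-one allocation follows the text.

```python
def screen(candidates, b):
    """Split candidates into TR_k / FR_k and draw b points at 9:1."""
    # Consistent set TR_k: original label agrees with the voting
    # prediction; divergent set FR_k: it does not.
    TR = [c for c in candidates if c[1] == c[2]]
    FR = [c for c in candidates if c[1] != c[2]]

    # Descending order of divergence value L.
    TR.sort(key=lambda c: c[3], reverse=True)
    FR.sort(key=lambda c: c[3], reverse=True)

    n_tr = round(b * 0.9)                 # nine-to-one allocation
    picked = TR[:n_tr] + FR[:b - n_tr]
    return [c[0] for c in picked]         # indices added to s_k
```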
The stop condition includes a precision condition and a quantity condition;
the precision condition: iterate until the following formula is satisfied:

$$\left| E - \frac{1}{ml} \sum_{i \in [ml]} \mathbb{1}\left[ f(x_i, w_s) = y_i \right] \right| < \epsilon, \qquad s.t.\ \{x_i, y_i\}_{i \in [m]} \in V$$

i.e., when the model weights $W_S$ trained on data set $s_k$ yield a model $M_S$ whose accuracy on the test data set V differs from the accuracy E of the original data set under the same conditions by less than the task-determined error tolerance ε, the data set $s_k$ is regarded as the core data set of the initial data set $s_0$ under error ε, and the iterative algorithm terminates; $f(x_i, w_i)$ is a function of data sample $x_i$ and the weights $w_i$;
the quantity condition: when the number of iterations of data set $s_k$ reaches a preset value M, the data set $s_k$ is likewise regarded as the core data set of the initial data set $s_0$ under error ε, and the iteration likewise ends.
The core set screening of the present application considers a fully labeled dataset and attempts to select a subset such that the model trained on the selected set falls within an acceptable range of accuracy compared with the model trained on the complete training set. For specific machine learning algorithms such methods exist, e.g., core set screening algorithms for SVM and for k-Means and k-median.
However, these methods have practical and theoretical limitations when applied to CNNs. Current common methods fall into the following categories by principle: synthetic-data methods based on data distillation, subsampling methods based on Markov chain Monte Carlo (MCMC), and weakly supervised subset selection methods based on active learning. The data distillation synthesis method performs inverse-gradient optimization on the difference between the training loss constructed from the synthetic dataset and that constructed from the original data (for a CNN with given initial values and a fixed network structure), so as to reach accuracy similar to the original data; the data synthesized in this way are highly specific to the CNN model used and have no universality. The subsampling MCMC method requires checking a constant fraction of the data in each iteration, which severely limits the computational gain. The weakly supervised subset selection method based on active learning considers the relation between data points and their neighboring points, and uses the current model's estimation and screening feedback on the selected data to find a diverse coverage of the dataset; data label information is ignored in the computation, and the selected subset is correlated with the CNN model used. In view of these problems, a divergence-based screening algorithm is proposed that trains in a supervised learning manner and lets the CNN participate in the screening decision on batches of data points through a co-training voting network framework.
The goal of active learning is to find an effective way to selectively label data under a limited labeling budget so as to optimize accuracy. It is usually an iterative process: each iteration uses a query learning method to select several points from the unlabeled data to be labeled, and the labeled data are used to train a model to improve accuracy.
Different active learning methods can be employed for different usage scenarios, such as uncertainty-sampling query learning, expected-model-change-based query learning, and error-reduction-based query learning [8]. Inspired by similar algorithms in active learning, the present algorithm is likewise designed in an iterative fashion.
Co-training is a divergence-based approach. It assumes that each data point can be classified from different angles, trains different classifiers from these different angles, uses the trained classifiers to classify unlabeled samples, and then screens the unlabeled samples deemed trustworthy and adds them to the training set [10]. Since these classifiers are trained from different angles, they complement one another and improve classification accuracy, just as things can be better understood when viewed from different angles [11]. Following co-training theory, the voting network architecture is adopted here to screen data from the candidate dataset to supplement the selected data, thereby improving the training precision of the core set.
In a second aspect, the present application proposes a data set screening system, as shown in fig. 2, comprising: the system comprises an initial data set generation module, a classifier module, a voting module and a screening module;
the initial data set generation module, the classifier module, the voting module and the screening module are sequentially connected;
the initial data set generation module is used for generating m different initial data sets from the original data set through different sampling processes;
the classifier module is used for respectively inputting the m different initial data sets into mp classifiers to obtain mp different classifier output results;
the voting module is used for inputting the mp different classifier output results into a voting network to obtain a data set with mean and divergence values;
and the screening module is used for iteratively screening the data set with mean and divergence values, ending the iteration when the stop condition is met, and outputting a core data set.
Experimental results:
(1) Experimental setup
To validate the proposed core dataset screening algorithm, three widely used image classification datasets were selected, following convention, for the related experiments: ten-class image classification experiments on the CIFAR-10 and Fashion-MNIST datasets, and a digit classification experiment on the SVHN dataset.
The CIFAR dataset family has three kinds of classification tasks: the coarse-grained task has 10 classes, while the fine-grained tasks run to 100 classes. Because each fine-grained class contains only a few hundred items, splitting such small per-class data volumes further would hamper deep learning model training, so only the coarse-grained task is processed for the CIFAR dataset. The coarse-grained task contains 60,000 images of size 32×32×3, each labeled as one of 10 image categories. The training set and test set contain 50,000 and 10,000 images, respectively, and we use the training set as the original dataset. Similarly, we select the training-set portion of the SVHN dataset, with image size 32×32, as an original dataset; for the Fashion-MNIST dataset, we interpolate the images to a size of 32×32 using image processing methods. Accuracy is tested on the respective test sets.
Here, the voting network framework inference module uses 5 ResNet50 network classification models initialized with different initialization parameters as the classifier modules for inference. Considering the scale of the datasets selected for the experiments, the size of the initial dataset $s_0$ is 1000, and the number of data points added in each screening round for the next iteration is also 1000; the number of iterations is controlled by setting corresponding termination conditions for the different datasets so as to allow experimental comparison.
To ensure fairness of the experimental comparison, all algorithms here are trained by supervised learning. Each of the following algorithms is compared experimentally:
1) Random sampling (Random): data participating in training are selected from the original dataset according to the per-iteration quota.
2) Maximum-entropy uncertainty sampling (Entropy-based): following the empirical settings in its paper, active learning is performed based on the maximum entropy of the data, with entropy computed from the softmax output. Only the best-performing variant on each dataset is used here, since the variants perform similarly.
3) Active-learning-based data selection (Active-learning): set to a supervised learning mode to match the conditions here.
4) Our algorithm with the initial dataset selected by conventional random sampling (Our (random)).
5) Our algorithm with the initial dataset selected by the sampling method designed here for this task; contrasted with method (4), this forms an ablation experiment demonstrating the impact of different sampling methods on the algorithm.
(2) CIFAR dataset experiments
A ResNet50 network classification model trained on the full CIFAR original training set with standard data augmentation and similar techniques reaches an average accuracy of 92.63% on the CIFAR test set. For this dataset, therefore, the precision termination condition is used for our algorithm, with the error tolerance ε set to 2.63% for convenience; the remaining algorithms stop at the moment our algorithm reaches the termination condition.
In the experiment, the initial labeled-point pool is initialized five times, and the average classification accuracy of the resulting algorithm on the test set is recorded; every algorithm computes over the data in each iteration in a supervised learning mode. The relationship between accuracy and the size of the training set $s_i$ is plotted in fig. 3.
Experiments on this dataset show that our algorithm and the entropy- and active-learning-based methods all outperform the random-sampling baseline. In the last iteration cycle, the accuracies reached by the entropy and active learning methods are 0.8991 and 0.9016, respectively, while the random-sampling baseline reaches 0.8762. The performance gap between these methods is consistent with the previous literature.
The algorithm designed here has clear advantages over the other algorithms in many of the early cycles and keeps the lead in accuracy in the later stage. In particular, the initial dataset generated by the sampling method designed for this task improves initial-cycle accuracy by 9.7% compared with random-sampling initialization, which verifies the effectiveness of the initial dataset generated by this sampling method. Finally, the two variants reach accuracies of 0.9048 and 0.9033, respectively, ranking first and second among the five methods; although their performance gap over the traditional methods on classification is small, the more flexible design can be applied effectively to more complex and diverse dataset tasks.
(3) Fashion-MNIST dataset experiments
A ResNet50 network classification model trained on the full Fashion-MNIST original training set with standard data augmentation and similar techniques reaches an average accuracy of 94.27% on the Fashion-MNIST test set [24]. For this dataset, therefore, the precision stop condition is used for our algorithm, with the error tolerance ε set to 4.27% for convenience. The experiment is carried out under the same conditions, and the relationship between accuracy and the size of the training set $s_i$ is plotted in fig. 4.
The trend of each accuracy curve on this dataset is similar to that on the previous dataset, but the accuracies of the two initial-dataset sampling methods are 0.5173 and 0.5064, respectively; the difference in accuracy caused by the different sampling methods is not large.
Analyzing the reason from the data: according to the PCA dimensionality-reduction analysis of the data distribution in the Fashion-MNIST dataset paper, the data in this dataset are uniformly distributed, so an initial dataset generated by uniform sampling can already represent the original distribution well. This experimental result also corroborates the analysis, in the section on initial-dataset sampling, of the effect of the sampling method on the experimental results. Fig. 5 shows a dimensionality-reduced visual distribution map of the dataset.
(4) SVHN dataset experiments
A ResNet50 network classification model trained on the full SVHN original training set with standard data augmentation and similar techniques reaches an average accuracy of 96.41% on the SVHN test set. For this dataset, therefore, the precision stop condition is used for our algorithm, with the error tolerance ε set to 1.41% for convenience. The experimental conditions are the same as in the previous section. The relationship between accuracy and the size of the training set $s_i$ is plotted in fig. 6.
The accuracy curves on this dataset again trend similarly to the previous datasets, but here the algorithms differ little from the random-sampling baseline. The likely reason is that the dataset consists of the ten digits from zero to nine and intra-class variation is small, so simple random sampling already completes the task well and the algorithms' performances cannot be distinguished.
(5) Core dataset generalization experiments
To analyze the generalization of the core datasets screened out by each algorithm, the core datasets generated on the CIFAR dataset (where the algorithms are better differentiated) are selected and used to train the differently initialized ResNet50 network classification models from our algorithm five times; the average classification accuracy on the test set is used as the recording standard and compared with the accuracy in the experiment above. All algorithms declined in the experiment, so the absolute value of the difference is recorded in Table 2 below.
TABLE 2 absolute values of the difference between the algorithms for generalization experiments
It can be seen that although the core dataset selected by the random-sampling baseline has relatively low accuracy, its difference in the generalization experiment is only 0.07; if that error is ignored, the dataset from the random-sampling baseline algorithm generalizes best.
Compared with the baseline methods, and considering together the two indicators of core-dataset accuracy and the drop in generalization use, our two variants show universality across deep learning models of the same type while maintaining good accuracy. The two different initialization methods focus respectively on accuracy improvement and on generalized usability with a smaller data volume, and can be chosen as needed in practical applications.
While the applicant has described and illustrated the embodiments of the present application in detail with reference to the drawings, it should be understood by those skilled in the art that the above embodiments are only preferred embodiments of the present application, and the detailed description is only for the purpose of helping the reader to better understand the spirit of the present application, and not to limit the scope of the present application, but any improvements or modifications based on the spirit of the present application should fall within the scope of the present application.

Claims (2)

1. A method for screening an image classification data set based on a convolutional neural network, characterized by comprising the following steps:
generating m different image classification initial data sets from the image classification original data set through different sampling processes;
the method comprises the following steps:
performing data feature extraction on the image classification original data set by using the front 174 layers of the ImageNet pre-trained ResNet50 model as a feature extraction network;
calculating the mutual distance $dist_{ij}$ between every two samples in the image classification original data set according to the extracted data features and obtaining a distance set, the calculation formula being:

$$dist_{ij} = \sqrt{\sum_{u=1}^{n} \left( F_{iu} - F_{ju} \right)^2}$$

wherein i and j denote different samples in the image classification original data set, u denotes a dimension of the sample's data feature tensor, n is the total number of data feature tensor dimensions, and $F_{iu}$, $F_{ju}$ are the data feature tensor values of samples i and j in dimension u;
calculating the distribution density of the data according to the distance set and obtaining a density set, the distribution density being calculated as follows: the distance set is sorted in ascending order of the pairwise mutual distances, the mutual distances $dist_{im'}$ of the eight nearest data samples are taken, and the distribution density of data sample $x_i$ is calculated by the formula:

$$density(x_i) = \left( \frac{1}{8} \sum_{m'=1}^{8} dist_{im'} \right)^{-1}$$

wherein m' indexes one of the eight nearest samples and $density(x_i)$ is the distribution density of data sample $x_i$;
sorting the density set in descending order of distribution density and taking the samples corresponding to the first mk data as the image classification initial data set;
respectively inputting the m different image classification initial data sets into mp classifiers to obtain mp different classifier output results, wherein each classifier is a ResNet50 network classification model initialized with different initialization parameters, and the loss function and classification regression function of the ResNet50 network classification model are described by the following formulas:

$$loss = \frac{1}{|R|} \sum_{(x_i, y_i) \in R} l(x_i, y_i; w), \qquad \eta_c(x) = p(y = c \mid x)$$

wherein loss is the loss function, $l(\cdot)$ represents the loss-function solving process, R is the set of data points, and the parameter w is the weight trained by the deep learning model algorithm in each iteration; throughout the solving of the trained loss function, the class-loss regression solving process $\eta_c(x)$ for any class c of the classification problem satisfies the $\lambda^{\eta}$-Lipschitz continuity condition;
under the constraint of the above conditions, a C-class image classification deep learning dataset is expressed as a series of data points $\{x_i, y_i\}_{i \in [N]} \sim p_Z$ collected from the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ by a manual screening process, wherein $[N] = \{1, \dots, N\}$ indexes the N data items in the dataset, which is divided according to the deep learning data-partition rule into two parts: a training dataset T of size n, $\{x_i, y_i\}_{i \in [n]}$, and a test dataset V of size ml, $\{x_i, y_i\}_{i \in [ml]}$; $x_i$ is a data sample and $y_i$ is the original label of the data sample;
inputting the mp different classifier output results into a voting network to obtain an image classification data set with mean and divergence values, the mp different classifier output results being input into the voting network as a candidate data set R, wherein the image classification data set with mean and divergence values has the following features: $\{x_i, y_i, MP_c, L\}$, wherein $x_i$ is a data sample, i.e., a certain sample in the historical data set, $y_i$ is the original label of the data sample, $MP_c$ is the average of the probability predictions, and L is the divergence value;
in the voting network, each data point $(x_m, y_m)$ of the candidate data set is analyzed;
calculating for the data in the candidate data set the set of probability predictions $P_{ij} = \{P_{pq} \mid p = 1, \dots, K,\ q = 1, \dots, C\}$;
wherein $P_{pq}$ is the probability predicted by classifier module p that the data belongs to class q, p indexes the predictions obtained from the different classifier modules, q indexes the class to which the predicted data may belong, and K is the number of classifier modules;
calculating the average $MP_c$ of the set of probability predictions as the voting network's predicted label output, and calculating the divergence value L of the data based on the average, by the formulas:

$$MP_c = \frac{1}{K} \sum_{p=1}^{K} P_{pc}, \qquad L = \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{C} \left( P_{ij} - MP_j \right)^2$$

wherein C is the number of classes, $MP_j$ is the average predicted probability for class j, and $P_{ij}$ is the probability with which classifier module i predicts the data as class j;
performing iterative screening on the image classification data set with mean and divergence values and, when a stop condition is met, ending the iteration and outputting an image classification core data set, comprising the following steps:
according to whether the original label $y_i$ of the data agrees with the class predicted by the average $MP_c$ of the probability predictions in the voting network, dividing the candidate data set R into a consistent candidate set $TR_k$ and a divergent candidate set $FR_k$;
sorting each set in descending order of divergence value L, allocating sampling proportions, and performing stratified sampling to obtain the b data points finally screened in this iteration, which are added to $s_k$ to form the data set $s_{k+1}$ required for the next iteration;
when the stop condition is met, ending the iteration and outputting the image classification core data set;
the stop condition includes a precision condition and a quantity condition;
the precision condition: iterating until the following formula is satisfied:

$$\left| E - \frac{1}{ml} \sum_{i \in [ml]} \mathbb{1}\left[ f(x_i, w_s) = y_i \right] \right| < \epsilon, \qquad s.t.\ \{x_i, y_i\}_{i \in [m]} \in V$$

i.e., when the model weights $W_S$ trained on data set $s_k$ yield a model $M_S$ whose accuracy on the test data set V differs from the accuracy E of the image classification original data set under the same conditions by less than the task-determined error tolerance ε, the data set $s_k$ is regarded as the image classification core data set of the image classification initial data set $s_0$ under error ε, at which point the iteration terminates and the algorithm ends; $f(x_i, w_i)$ is a function of data sample $x_i$ and the weights $w_i$;
the quantity condition: when the number of iterations of data set $s_k$ reaches a preset value M, the data set $s_k$ is likewise regarded as the image classification core data set of the image classification initial data set $s_0$ under error ε, and the iteration likewise ends the algorithm.
2. An image classification dataset screening system based on a convolutional neural network, characterized by comprising: an image classification initial data set generation module, a classifier module, a voting module and a screening module;
the image classification initial data set generation module, the classifier module, the voting module and the screening module being connected in sequence;
the image classification initial data set generation module is used for performing data feature extraction on the image classification original data set by using the front 174 layers of the ImageNet-pretrained ResNet50 model as a feature extraction network;
calculating the mutual distance $dist_{ij}$ between every two samples in the image classification original data set according to the extracted data features and obtaining a distance set, the calculation formula being:

$$dist_{ij} = \sqrt{\sum_{u=1}^{n} \left( F_{iu} - F_{ju} \right)^2}$$

wherein i and j denote different samples in the image classification original data set, u denotes a dimension of the sample's data feature tensor, n is the total number of data feature tensor dimensions, and $F_{iu}$, $F_{ju}$ are the data feature tensor values of samples i and j in dimension u;
calculating the distribution density of the data according to the distance set and obtaining a density set, the distribution density being calculated as follows: the distance set is sorted in ascending order of the pairwise mutual distances, the mutual distances $dist_{im'}$ of the eight nearest data samples are taken, and the distribution density of data sample $x_i$ is calculated by the formula:

$$density(x_i) = \left( \frac{1}{8} \sum_{m'=1}^{8} dist_{im'} \right)^{-1}$$

wherein m' indexes one of the eight nearest samples and $density(x_i)$ is the distribution density of data sample $x_i$;
sorting the density set in descending order of distribution density and taking the samples corresponding to the first mk data as the image classification initial data set;
the classifier module is used for respectively inputting the m different image classification initial data sets into mp classifiers to obtain mp different classifier output results, wherein each classifier is a ResNet50 network classification model initialized with different initialization parameters, and the loss function and classification regression function of the ResNet50 network classification model are described by the following formulas:

$$loss = \frac{1}{|R|} \sum_{(x_i, y_i) \in R} l(x_i, y_i; w), \qquad \eta_c(x) = p(y = c \mid x)$$

wherein loss is the loss function, $l(\cdot)$ represents the loss-function solving process, R is the set of data points, and the parameter w is the weight trained by the deep learning model algorithm in each iteration; throughout the solving of the trained loss function, the class-loss regression solving process $\eta_c(x)$ for any class c of the classification problem satisfies the $\lambda^{\eta}$-Lipschitz continuity condition;
under the constraint of the above conditions, a C-class image classification deep learning dataset is expressed as a series of data points $\{x_i, y_i\}_{i \in [N]} \sim p_Z$ collected from the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ by a manual screening process, wherein $[N] = \{1, \dots, N\}$ indexes the N data items in the dataset, which is divided according to the deep learning data-partition rule into two parts: a training dataset T of size n, $\{x_i, y_i\}_{i \in [n]}$, and a test dataset V of size ml, $\{x_i, y_i\}_{i \in [ml]}$; $x_i$ is a data sample and $y_i$ is the original label of the data sample;
the voting module is used for inputting the mp different classifier output results into a voting network to obtain an image classification data set with mean and divergence values, the mp different classifier output results being input into the voting network as a candidate data set R, wherein the image classification data set with mean and divergence values has the following features: $\{x_i, y_i, MP_c, L\}$, wherein $x_i$ is a data sample, i.e., a certain sample in the historical data set, $y_i$ is the original label of the data sample, $MP_c$ is the average of the probability predictions, and L is the divergence value;
in the voting network, each data point $(x_m, y_m)$ of the candidate data set is analyzed;
calculating for the data in the candidate data set the set of probability predictions $P_{ij} = \{P_{pq} \mid p = 1, \dots, K,\ q = 1, \dots, C\}$;
wherein $P_{pq}$ is the probability predicted by classifier module p that the data belongs to class q, p indexes the predictions obtained from the different classifier modules, q indexes the class to which the predicted data may belong, and K is the number of classifier modules;
calculating the average $MP_c$ of the set of probability predictions as the voting network's predicted label output, and calculating the divergence value L of the data based on the average, by the formulas:

$$MP_c = \frac{1}{K} \sum_{p=1}^{K} P_{pc}, \qquad L = \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{C} \left( P_{ij} - MP_j \right)^2$$

wherein C is the number of classes, $MP_j$ is the average predicted probability for class j, and $P_{ij}$ is the probability with which classifier module i predicts the data as class j;
the screening module is used for dividing the candidate data set R, according to whether the original label $y_i$ of the data agrees with the class predicted by the average $MP_c$ of the probability predictions in the voting network, into a consistent candidate set $TR_k$ and a divergent candidate set $FR_k$;
sorting each set in descending order of divergence value L, allocating sampling proportions, and performing stratified sampling to obtain the b data points finally screened in this iteration, which are added to $s_k$ to form the data set $s_{k+1}$ required for the next iteration;
when the stop condition is met, ending the iteration and outputting the image classification core data set;
the stop condition includes a precision condition and a quantity condition;
the precision condition: iterating until the following formula is satisfied:

$$\left| E - \frac{1}{ml} \sum_{i \in [ml]} \mathbb{1}\left[ f(x_i, w_s) = y_i \right] \right| < \epsilon, \qquad s.t.\ \{x_i, y_i\}_{i \in [m]} \in V$$

i.e., when the model weights $W_S$ trained on data set $s_k$ yield a model $M_S$ whose accuracy on the test data set V differs from the accuracy E of the image classification original data set under the same conditions by less than the task-determined error tolerance ε, the data set $s_k$ is regarded as the image classification core data set of the image classification initial data set $s_0$ under error ε, at which point the iteration terminates and the algorithm ends; $f(x_i, w_i)$ is a function of data sample $x_i$ and the weights $w_i$;
the quantity condition: when the number of iterations of data set $s_k$ reaches a preset value M, the data set $s_k$ is likewise regarded as the image classification core data set of the image classification initial data set $s_0$ under error ε, and the iteration likewise ends the algorithm.
CN202210942382.4A 2022-08-08 2022-08-08 Data set screening method and system Active CN115953609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210942382.4A CN115953609B (en) 2022-08-08 2022-08-08 Data set screening method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210942382.4A CN115953609B (en) 2022-08-08 2022-08-08 Data set screening method and system

Publications (2)

Publication Number Publication Date
CN115953609A CN115953609A (en) 2023-04-11
CN115953609B true CN115953609B (en) 2023-08-18

Family

ID=87281218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210942382.4A Active CN115953609B (en) 2022-08-08 2022-08-08 Data set screening method and system

Country Status (1)

Country Link
CN (1) CN115953609B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7921363B1 (en) * 2007-04-30 2011-04-05 Hewlett-Packard Development Company, L.P. Applying data thinning processing to a data set for visualization
CN107545275A (en) * 2017-07-27 2018-01-05 华南理工大学 The unbalanced data Ensemble classifier method that resampling is merged with cost sensitive learning
CN108667523A (en) * 2018-03-06 2018-10-16 苏州大学 The nonlinear fiber equalization methods of KNN algorithms based on non-data aided
CN109977994A (en) * 2019-02-02 2019-07-05 浙江工业大学 A kind of presentation graphics choosing method based on more example Active Learnings
CN110147878A (en) * 2018-11-28 2019-08-20 腾讯科技(深圳)有限公司 Data processing method, device and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7921363B1 (en) * 2007-04-30 2011-04-05 Hewlett-Packard Development Company, L.P. Applying data thinning processing to a data set for visualization
CN107545275A (en) * 2017-07-27 2018-01-05 华南理工大学 The unbalanced data Ensemble classifier method that resampling is merged with cost sensitive learning
CN108667523A (en) * 2018-03-06 2018-10-16 苏州大学 The nonlinear fiber equalization methods of KNN algorithms based on non-data aided
CN110147878A (en) * 2018-11-28 2019-08-20 腾讯科技(深圳)有限公司 Data processing method, device and equipment
CN109977994A (en) * 2019-02-02 2019-07-05 浙江工业大学 A kind of presentation graphics choosing method based on more example Active Learnings

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fast segmentation method of pulmonary nodule edges based on deep learning; 王磐; 《中国优秀硕士学位论文全文数据库_医药卫生科技辑》; E063-4 *

Also Published As

Publication number Publication date
CN115953609A (en) 2023-04-11

Similar Documents

Publication Publication Date Title
Sun et al. Evolving deep convolutional neural networks for image classification
Li et al. 2-D stochastic configuration networks for image data analytics
CN113299354B (en) Small molecule representation learning method based on transducer and enhanced interactive MPNN neural network
CN111785329A (en) Single-cell RNA sequencing clustering method based on confrontation automatic encoder
CN111724867A (en) Molecular property measurement method, molecular property measurement device, electronic apparatus, and storage medium
CN114493014A (en) Multivariate time series prediction method, multivariate time series prediction system, computer product and storage medium
Huang et al. Particle swarm optimization for compact neural architecture search for image classification
CN115131613A (en) Small sample image classification method based on multidirectional knowledge migration
Trask et al. Probabilistic partition of unity networks: clustering based deep approximation
Zou et al. Deep field relation neural network for click-through rate prediction
Tayebati et al. A hybrid machine learning framework for clad characteristics prediction in metal additive manufacturing
Lopes et al. Toward Less Constrained Macro-Neural Architecture Search
Sood et al. Neunets: An automated synthesis engine for neural network design
Sinha et al. Neural architecture search using covariance matrix adaptation evolution strategy
CN115953609B (en) Data set screening method and system
Ma et al. VNAS: Variational Neural Architecture Search
CN116956993A (en) Method, device and storage medium for constructing graph integration model
CN113835964B (en) Cloud data center server energy consumption prediction method based on small sample learning
Ishfaq et al. TVAE: Deep metric learning approach for variational autoencoder
CN115348182A (en) Long-term spectrum prediction method based on depth stack self-encoder
Jiang et al. Deep belief improved bidirectional LSTM for multivariate time series forecasting
Jain et al. Development of Surrogate Model to Predict Errors in FEM solutions using Deep Convolutional Neural Networks
Jo et al. How much a model be trained by passive learning before active learning?
Paliwal et al. Stock prediction using neural networks and evolution algorithm
Wang Engineering-driven Machine Learning Methods for System Intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant