CN112200255A - Information redundancy removing method for sample set - Google Patents

Information redundancy removing method for sample set

Info

Publication number
CN112200255A
CN112200255A (application CN202011110339.9A)
Authority
CN
China
Prior art keywords
sample
model
sample set
training
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011110339.9A
Other languages
Chinese (zh)
Other versions
CN112200255B (en)
Inventor
程战战
许昀璐
吴飞
浦世亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011110339.9A priority Critical patent/CN112200255B/en
Publication of CN112200255A publication Critical patent/CN112200255A/en
Application granted granted Critical
Publication of CN112200255B publication Critical patent/CN112200255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an information redundancy removing method for a sample set, which comprises the following steps: obtaining samples to be processed and their corresponding trainable labels to form an original sample set; performing feature extraction on each sample with a pre-trained machine learning model to obtain a feature vector set for the original sample set; inputting the feature vector set into a learnable sample selector model, which performs sample selection on the feature vector set and returns a representative feature vector subset according to a preset threshold; and retrieving the original samples corresponding to the feature vector subset as the sub-sample set with redundant information removed. With this technical scheme, the original sample set can be efficiently reduced: redundant information is removed, samples carrying valuable information are retained, and the efficiency of training algorithms on the sample set is improved.

Description

Information redundancy removing method for sample set
Technical Field
The invention relates to the technical field of data processing, in particular to an information redundancy removing method for a sample set.
Background
With the development of deep learning, machine learning methods based on large-scale data sets are continuously being proposed. In practice, however, large-scale data sets often contain a large amount of redundant information, for example an excessive number of samples of a single class, or duplicate and near-duplicate samples. At the same time, training a machine learning model on a large-scale data set requires more computing power and computing time and consumes a great deal of resources. Large-scale training tasks in different scenarios make the problem pressing: a very-large-scale computer vision classification task is often trained with tens of millions of image samples, and a very-large-scale natural language processing task with hundreds of millions of language samples, so a method for removing information redundancy from a sample set is urgently needed. Because such data sets are large, the relationships between samples are complex, and pairwise comparison and analysis of samples is computationally expensive, there is currently no directly usable technical scheme for removing information redundancy from large-scale data sets.
Disclosure of Invention
The present invention aims to solve the above problems in the prior art and provides an information redundancy removing method for a sample set, so as to achieve information redundancy removal for a data set.
The technical scheme adopted by the invention is as follows:
a method of information de-redundancy for a sample set, the method comprising:
obtaining samples to be processed and their corresponding trainable labels to form an original sample set;
performing feature extraction on each acquired original sample with a pre-prepared feature extraction model to obtain a feature vector set of the original sample set;
performing sample selection on the feature vector set with a pre-prepared learnable sample selector model, and obtaining a representative feature vector subset according to a preset threshold;
and acquiring the original samples corresponding to the feature vector subset as the sub-sample set with redundant information removed.
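A minimal end-to-end sketch of these four steps, assuming a PyTorch-style feature extractor and selector; the function name, the module interfaces, and the threshold value 0.8 are illustrative assumptions rather than part of the patent:

```python
import torch

def remove_redundancy(samples, labels, feature_extractor, selector, threshold=0.8):
    """Reduce an original sample set to a representative sub-sample set.

    samples:  tensor of original samples (e.g. images), shape (N, ...)
    labels:   corresponding trainable labels, shape (N,)
    feature_extractor: pre-trained model mapping samples to feature vectors
    selector: learnable sample selector returning one activation value per sample
    threshold: first preset threshold on the selector activation
    """
    with torch.no_grad():
        features = feature_extractor(samples)         # feature vector set (second sample set)
        activations = selector(features).squeeze(-1)  # one activation value per sample
    keep = activations > threshold                    # representative samples only
    return samples[keep], labels[keep]                # sub-sample set with redundancy removed
```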
Preferably, the step of preparing the feature extraction model in advance includes:
acquiring the sample set to be processed, recording it as a first sample set, and acquiring the trainable labels corresponding to the samples;
inputting the samples in the first sample set and their corresponding labels into a preset first machine learning model for training, to obtain a preset first model, namely the pre-prepared feature extraction model; the preset first machine learning model comprises a feature extraction part and a model constraint convergence part; the feature extraction part is used for obtaining the feature vector of each sample, yielding the feature vector set of the original sample set, recorded as a second sample set, and the model constraint convergence part is used for controlling the training of the feature extraction model until convergence.
Preferably, the step of performing sample selection on the feature vector set through the sample selector model comprises:
acquiring the second sample set, and taking the training label of the original sample corresponding to each feature vector as its trainable label; inputting the second sample set into the sample selector model, and obtaining a representative feature vector subset and the corresponding trainable labels according to a first preset threshold, recorded as a third sample set;
the sample selector model comprises a neural network and an activation function; a sample is considered representative when the activation value obtained after the sample is input into the sample selector model is larger than the first preset threshold.
Preferably, the sample selector model is optimized through a teacher-student model structure, and training stops when the convergence index of the whole training process reaches a second preset threshold, yielding the sample selector.
Preferably, the step of determining the teacher-student model comprises:
inputting the second sample set and the labels corresponding to the samples in that set into a preset second machine learning model for training, to obtain a teacher model, called the second model; the second machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the samples, and the Loss constraint part is used for optimizing the teacher model during training;
inputting the samples in the third sample set and the corresponding labels into a preset third machine learning model for training, to obtain a student model, namely the third model; the preset third machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the samples, and the Loss constraint part is used for optimizing the student model during training;
the teacher model performs knowledge distillation on the student model, transferring more valuable information to the student model and helping the student model achieve the best possible performance on the selected third sample set.
Preferably, the knowledge distillation method is layer-by-layer feature distillation, yielding a distillation Loss.
Preferably, the first preset threshold can be determined in a plurality of ways, including a) or b):
a) a preset minimum threshold is given; when the activation value obtained after a sample is input into the sample selector model is larger than this minimum threshold, the sample is considered representative and is selected;
b) the activation values obtained after all samples are input into the sample selector model are sorted, and the minimum threshold is determined from the capacity of the third sample set so that the number of samples exceeding the threshold equals that capacity.
Preferably, the learning optimization process of the sample selector model includes:
a complete sample selector training process comprises at least one forward operation and at least one backward operation;
in the forward operation, the samples in the second sample set are input into the sample selector model, the activation value corresponding to each sample is output, and the third sample set is obtained according to the first preset threshold; the third sample set is then input into the third model, which outputs the Loss of each sample;
in the backward operation, the gradient of the distillation Loss output by the third model is propagated back to the network parameters of the sample selector model; during the backward gradient operation only the Loss results generated by the third sample set, i.e. the samples exceeding the first preset threshold, are computed, and the network weights are updated; and the training Losses of the student model and the teacher model are propagated back to their respective feature extraction networks for parameter updating.
In the information redundancy removing method for a sample set provided by the invention, the sample set to be processed is first obtained; the corresponding feature vector set is then obtained through the feature extraction model; the obtained feature vector set is filtered through the sample selector model to obtain a screened feature vector subset; and finally the original samples corresponding to the feature vector subset are retrieved as the sub-sample set with redundant information removed. The method can effectively remove redundant information from a large-scale sample set and substantially reduces the scale of the original sample set while preserving the performance of models trained on the reduced sample set; in addition, training models on the reduced data set greatly reduces training cost. The method meets the practical application requirements of users and has strong applicability, covering common machine learning fields such as speech recognition, image recognition, and natural language processing. Of course, not every advantage described above needs to be achieved at the same time by any one product or method practicing the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a diagram illustrating an information redundancy removing method for a sample set according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of feature extraction in an information redundancy removal method for a sample set according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a teacher-student model in an information redundancy removing method for a sample set according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, the DNN (Deep Neural Network) used in the present invention is explained: it is a multilayer feed-forward artificial neural network whose neurons respond to surrounding units within a preset coverage range and which, through weight sharing and feature aggregation, can effectively extract feature information from samples.
Teacher-Student (Teacher Student Model): a distillation-based neural network structure in which the teacher network transfers useful information to the student network by distillation, thereby improving the student network's ability to construct information.
In order to implement redundancy removal of sample set information, an embodiment of the present invention provides an information redundancy removal method for a sample set, and referring to fig. 1, the method includes:
s101, obtaining a sample to be processed (marked as an original sample) and a corresponding trainable label to obtain an original sample set.
The samples can be any data to be processed, for example data for speech recognition, image recognition, or natural language processing.
S102, after the original sample set is obtained, feature extraction is carried out on each obtained original sample through a pre-prepared feature extraction model, and a feature vector set of the original sample set is obtained.
The feature extraction model is used for extracting features from the original sample set to obtain the corresponding feature vector set, and it ensures that the feature vectors in the feature vector set correspond one-to-one to the samples in the original sample set. In the invention the original sample set can consist of image samples, voice samples, natural language processing samples, and the like; a feature map is also regarded as a kind of feature vector. In one possible implementation of the embodiment of the present invention, only image classification samples are used as the example.
The preset feature extraction model can be a pre-trained self-encoding (autoencoder) neural network. The pre-training method consists of inputting the original image samples into the self-encoding neural network for training; after convergence, the encoder part of the trained network serves as the feature extraction module.
In a possible embodiment, the step of predetermining the preset feature extraction model includes:
step one, acquiring a plurality of original image samples from a sample set to be processed, wherein the original image samples are called a first sample set, and acquiring trainable labels corresponding to the samples.
Inputting the obtained image sample and the corresponding label into a preset self-coding neural network model for training to obtain a preset self-coding neural network model, namely the pre-prepared feature extraction model, wherein the self-coding neural network model comprises a feature coding part and a feature decoding part, the feature coding part is used for obtaining the depth high-level features of the image to obtain a basic feature vector, and the feature decoding part is used for decoding the basic feature vector to obtain an original image sample. And compared with the characteristic coding part, the characteristic decoding part can be regarded as a characteristic coding model constraint convergence part, and the normal training of the characteristic extraction model can be controlled until convergence.
And step three, when an image sample to be processed is obtained, inputting the image sample to the feature extraction model to obtain a corresponding feature vector set, which is called a second sample set.
Specifically, as shown in fig. 2, the embodiment of the present invention adopts a self-encoding neural network as the framework of the feature extraction model. The self-encoding model includes a feature encoding (Encode) part and a feature decoding (Decode) part:
Optionally, the feature encoding part is configured to compress the features of the input image sample into a feature vector (Base Vector) of the image. The feature encoding sub-module comprises a multilayer convolutional neural network; for example, a residual feature extraction network (ResNet-18, an 18-layer residual network) can be used as the neural network model of the basic feature vector extraction sub-module to perform the down-sampling from the original image to the feature vector.
Optionally, the feature decoding part is configured to restore the extracted feature vector to the original image sample. The feature decoding sub-module comprises a multilayer convolutional neural network, generally the inverse of the encoding network, performing the up-sampling from the feature vector back to the original image.
Optionally, the first sample set is constructed as the first training data set, and an objective function is constructed so that the recovered image $I'_{i,j}$ output by the neural network stays consistent with the original image $I_{i,j}$. The loss function defined over a training sample is

$$L_{rec} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(I'_{i,j}-I_{i,j}\right)^{2}$$

where H and W are the pixel height and width of the image, respectively.
The original image samples are input into the preset self-encoding neural network model for training; when the model converges or the number of training iterations reaches the preset number, the trained self-encoding neural network model is obtained, and its feature encoding part (Encode) serves as the feature extraction model.
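A minimal sketch of such an autoencoder feature extractor, assuming PyTorch with a torchvision ResNet-18 backbone as the encoder; the decoder layout, latent size, input resolution, and training hyper-parameters are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        # Encode: ResNet-18 without its classification head -> basic feature vector
        self.encoder = nn.Sequential(*list(backbone.children())[:-1], nn.Flatten())
        # Decode: upsample the feature vector back to a 3x224x224 image (illustrative layout)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (128, 7, 7)),
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),   # 7 -> 28
            nn.ConvTranspose2d(64, 32, 4, stride=4), nn.ReLU(),    # 28 -> 112
            nn.ConvTranspose2d(32, 3, 2, stride=2), nn.Sigmoid(),  # 112 -> 224
        )

    def forward(self, x):
        z = self.encoder(x)          # feature vector (Base Vector)
        return self.decoder(z), z    # recovered image and feature vector

# Reconstruction training: keep the recovered image consistent with the original image.
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.rand(8, 3, 224, 224)             # stand-in batch of image samples
recon, _ = model(images)
loss = nn.functional.mse_loss(recon, images)    # per-pixel reconstruction loss
loss.backward()
optimizer.step()
# After convergence, model.encoder alone is kept as the feature extraction model.
```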
S103, after the feature vector set is obtained, sample selection is carried out on the feature vector set through a pre-prepared learnable sample selector model, and a representative feature vector subset is obtained according to a preset threshold.
The preset sample selector model is used for selecting from the feature vector set the feature vector subset carrying high-value information.
The preset sample selector model can be a pre-trained neural network. The pre-training method consists of inputting the feature vector set and the corresponding trainable labels into the selector neural network for training and obtaining the pre-trained network after convergence. The training process can be jointly optimized through a teacher-student model structure.
In a possible embodiment, the step of predetermining the preset sample selector model comprises:
step one, a second sample set is obtained, and a training label of an original sample corresponding to the feature vector is used as a trainable label.
And step two, inputting the second sample set into a sample selector, and obtaining a representative feature vector subset and a corresponding trainable label according to a first preset threshold value, wherein the representative feature vector subset is called as a third sample set. The sample selector model comprises a multilayer neural network and an activation function, and when the activation value obtained after the sample is input into the sample selector model is larger than a first preset threshold value, the sample selector model is regarded as representative and selected.
The sample selector can perform learning optimization through a teacher-student model structure, and stops training when the convergence index of the whole training process reaches a second preset threshold value, so as to obtain the sample selector. The teacher-student model comprises a teacher network and a student network, the teacher network is trained on the basis of a second sample set to obtain a teacher model, the student network is trained on the basis of a third sample set to obtain a student model, the teacher model is used for distilling the student model to improve the knowledge extraction capability of the student model, the convergence index of the training process can be the Loss or the accuracy of the model, and a second preset threshold value can be set to be the Loss or the accuracy of the model when the Loss does not decrease.
Specifically, as shown in fig. 3, the optimization of the sample selector model by using the teacher-student model structure in the embodiment of the present invention includes:
When the second sample set is obtained, a second training data set $(V_1, V_2, \ldots, V_N)$ is constructed based on the second sample set, with corresponding trainable category labels $(Y_1, Y_2, \ldots, Y_N)$, where N is the size of the sample set.
Further, the second sample set is fed into the sample selector, which comprises a multilayer neural network and a non-linear activation function, such as sigmoid, and performs sample selection according to the first preset threshold. Specifically, there are various ways to determine the first preset threshold, including but not limited to: a) a preset minimum threshold is given, and a sample whose activation value after being input into the sample selector model exceeds this minimum threshold is considered representative and is selected; b) the activation values obtained after all samples are input into the sample selector model are sorted, and the minimum threshold is determined from the capacity of the third sample set so that the number of samples above the threshold equals that capacity. For example, the first threshold can be set on the sigmoid activation value: if the sigmoid activation is greater than 0.8, the sample is considered a high-value sample and is selected. Alternatively, the threshold can be derived from the agreed size of the selected subset: for example, if the selected subset is agreed to be 50% of the second sample set, the samples are sorted by sigmoid activation value and the top 50% constitute the third sample set.
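A minimal sketch of the selector and of both thresholding strategies, assuming PyTorch; the two-layer MLP dimensions and the 0.8 / 50% values mirror the examples above and are otherwise illustrative assumptions:

```python
import torch
import torch.nn as nn

class SampleSelector(nn.Module):
    """Multilayer network + sigmoid: one activation value per feature vector."""
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, features):
        return self.net(features).squeeze(-1)

selector = SampleSelector()
features = torch.randn(1000, 512)            # second sample set (feature vectors)
activations = selector(features)

# Strategy a): fixed minimum threshold on the activation value.
mask_a = activations > 0.8

# Strategy b): keep the top-K activations so the third sample set has a fixed capacity.
capacity = features.shape[0] // 2            # e.g. 50% of the second sample set
topk = torch.topk(activations, capacity).indices
mask_b = torch.zeros_like(activations, dtype=torch.bool)
mask_b[topk] = True

third_sample_set = features[mask_b]          # representative feature vector subset
```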
Further, when the third sample set is acquired, a third training data set $(V'_1, V'_2, \ldots, V'_M)$ is constructed based on the third sample set, with corresponding trainable category labels $(Y'_1, Y'_2, \ldots, Y'_M)$, where M is the size of that sample set.
Further, the second training data set and the labels corresponding to its samples are fed into the teacher neural network for training. The teacher neural network comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the samples, and the Loss constraint part is used for optimizing the teacher model during training. Specifically, the teacher neural network includes a multilayer neural network for deriving the predicted label values. At the same time, an objective function is constructed so that the label value predicted by the network is consistent with the true label of the corresponding image. The loss function defined over the training samples is
$$L_{T} = -\sum_{i=1}^{N} Y_i \log y_i$$

where $Y_i$ is the true label value and $y_i$ is the predicted probability value.
Further, the third training data set is fed into the student neural network for training. The student neural network comprises a feature extraction part and a Loss constraint part, where the feature extraction part is used for obtaining high-level abstract features of the samples and the Loss constraint part is used for optimizing the student model during training. Specifically, the student neural network comprises a multilayer neural network used for obtaining the predicted label values, and an objective function is constructed so that the label value predicted by the network is consistent with the true label of the corresponding image. The loss function defined over the training samples is
$$L_{S} = -\sum_{i=1}^{M} Y'_i \log y'_i$$

where $Y'_i$ is the true label value and $y'_i$ is the predicted probability value.
Optionally, the teacher model performs knowledge distillation on the student model, transferring more valuable information to the student model, improving the student model's knowledge extraction capability and helping the student model achieve the best possible performance on the selected third sample set:
$$L_{distill} = \frac{1}{H'W'}\sum_{i=1}^{H'}\sum_{j=1}^{W'}\left(F^{T}_{i,j}-F^{S}_{i,j}\right)^{2}$$

where H' and W' are, respectively, the height and width of the corresponding feature layer of the teacher and student networks, $F^{T}_{i,j}$ is the feature value of the corresponding teacher model (serving as the target), and $F^{S}_{i,j}$ is the feature value of the corresponding student model.
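A minimal sketch of this layer-by-layer feature distillation Loss, assuming PyTorch and that the distilled teacher and student feature maps match in shape; which layers are distilled is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(teacher_feats, student_feats):
    """Layer-by-layer feature distillation Loss.

    teacher_feats / student_feats: lists of feature maps, one per distilled layer,
    each of shape (batch, channels, H', W') and assumed to match in shape.
    The teacher features are treated as fixed targets (detached), so only the
    student (and, through it, the sample selector) receives gradients.
    """
    loss = 0.0
    for t, s in zip(teacher_feats, student_feats):
        loss = loss + F.mse_loss(s, t.detach())   # averaged over H' x W' (and batch, channels)
    return loss
```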
Optionally, the sample selector is trained within the teacher-student model structure; a complete sample selector training process comprises at least one forward operation and at least one backward operation:
In the forward operation, the second training data set is input into the sample selector model, the activation value corresponding to each sample is output, and the third training data set is obtained according to the preset threshold; the third training data set is then input into the student model, the Loss of each sample is output, and the student model is trained.
In the backward operation, the gradient of the distillation loss $L_{distill}$, generated when the teacher model distills knowledge into the student model, is propagated back to update the network parameters of the sample selector model; during the backward gradient operation only the Loss results generated by the third training data set, i.e. the samples exceeding the first preset threshold, are computed, and the network weights are updated. The training losses $L_{T}$ and $L_{S}$, generated by training the teacher and student networks, are propagated back to their respective feature extraction networks for parameter updating.
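A minimal sketch of one such forward/backward round, assuming PyTorch, a SampleSelector as above, and teacher/student modules that return a (logits, feature) pair; weighting the distillation term by the selector's own activations is an illustrative way to let its gradient reach the selector, since the patent does not spell out that mechanism:

```python
import torch
import torch.nn.functional as F

def training_round(selector, teacher, student, feats, labels,
                   opt_selector, opt_teacher, opt_student, threshold=0.8):
    # Forward: selector activations over the second training data set.
    activations = selector(feats)                      # shape (N,)
    mask = activations > threshold                     # third training data set
    sel_feats, sel_labels = feats[mask], labels[mask]
    sel_weights = activations[mask]                    # keeps the selector in the graph

    # Teacher trains on the full second training data set (cross-entropy L_T).
    t_logits, t_feat = teacher(feats)
    loss_teacher = F.cross_entropy(t_logits, labels)

    # Student trains on the selected third training data set (cross-entropy L_S).
    s_logits, s_feat = student(sel_feats)
    loss_student = F.cross_entropy(s_logits, sel_labels)

    # Distillation loss on the selected samples only; weighting by the selector
    # activation lets its gradient flow back to the selector parameters.
    distill = ((s_feat - t_feat[mask].detach()) ** 2).mean(dim=1)
    loss_distill = (sel_weights * distill).mean()

    opt_selector.zero_grad(); opt_teacher.zero_grad(); opt_student.zero_grad()
    (loss_teacher + loss_student + loss_distill).backward()
    opt_selector.step(); opt_teacher.step(); opt_student.step()
    return loss_teacher.item(), loss_student.item(), loss_distill.item()
```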
The second training data set and the third training data set are input into the preset neural network model for training; when the preset neural network model converges or the number of training iterations reaches the preset number, the preset sample selector model is obtained.
S104, the original image samples corresponding to the obtained feature vector subset are acquired as the selected sub-sample set with redundant information removed.
The embodiment of the invention thus realizes an efficient and robust information redundancy removing method for a sample set. The method can effectively remove redundant information from a large-scale sample set and substantially reduces the scale of the original sample set while preserving the performance of models trained on the reduced sample set; in addition, training models on the reduced data set greatly reduces training cost. The method meets the practical application requirements of users and has strong applicability.
It should be noted that, in this document, the technical features in the various alternatives can be combined to form the scheme as long as the technical features are not contradictory, and the scheme is within the scope of the disclosure. Relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A method of information de-redundancy for a sample set, the method comprising:
obtaining a sample to be processed and a corresponding trainable label to obtain an original sample set;
performing feature extraction on each acquired original sample with a pre-prepared feature extraction model to obtain a feature vector set of the original sample set;
performing sample selection on the feature vector set with a pre-prepared learnable sample selector model, and obtaining a representative feature vector subset according to a preset threshold;
and acquiring original samples corresponding to the feature vector subsets as a sub-sample set after redundant information is removed.
2. The method of claim 1, wherein the step of preparing the feature extraction model in advance comprises:
acquiring a sample set to be processed, recording the sample set as a first sample set, and acquiring a trainable label corresponding to the sample;
inputting the samples and the corresponding labels in the first sample set into a preset first machine learning model for training to obtain a preset first model, namely a pre-prepared feature extraction model; the preset first machine learning model comprises a feature extraction part and a model constraint convergence part; the characteristic extraction part is used for obtaining characteristic vectors of the samples, obtaining a characteristic vector set of an original sample set and recording the characteristic vector set as a second sample set, and the model constraint convergence part is used for controlling normal training of the characteristic extraction model until convergence.
3. The method of claim 1, wherein the step of sample selecting the set of feature vectors by the sample selector model comprises:
acquiring a second sample set, and acquiring a training label of an original sample corresponding to the feature vector as a trainable label; inputting the second sample set into a sample selector model, obtaining a representative feature vector subset and a corresponding trainable label according to a first preset threshold value, and recording as a third sample set;
the sample selector model comprises a neural network and an activation function; a sample is considered representative when the activation value obtained after the sample is input into the sample selector model is larger than the first preset threshold.
4. The method of claim 3, wherein the sample selector model is optimized for learning by a teacher-student model structure, and the training is stopped when the convergence index of the whole training process reaches a second preset threshold, so as to obtain the sample selector.
5. The method of claim 4, wherein the step of determining the teacher-student model comprises:
inputting the second sample set and the labels corresponding to the samples in the set into a preset second machine learning model for training to obtain a teacher model, which is called a second model; the second machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the sample, and the Loss constraint part is used for optimizing the teacher model to realize training;
inputting the samples in the third sample set and the corresponding labels into a preset third machine learning model for training to obtain a student model, namely a third model; the preset third machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the sample, and the Loss constraint part is used for optimizing the student model to realize training;
the teacher model performs knowledge distillation on the student model, transferring more valuable information to the student model and helping the student model achieve the best possible performance on the selected third sample set.
6. The method of claim 5, wherein the knowledge distillation method is layer-by-layer feature distillation, yielding a distillation Loss.
7. The method of claim 3, wherein the first preset threshold can be determined in a plurality of ways, including a) or b):
a) a preset minimum threshold is given; when the activation value obtained after a sample is input into the sample selector model is larger than this minimum threshold, the sample is considered representative and is selected;
b) the activation values obtained after all samples are input into the sample selector model are sorted, and the minimum threshold is determined from the capacity of the third sample set so that the number of samples exceeding the threshold equals that capacity.
8. The method of claim 4, wherein the learning optimization process of the sample selector model comprises:
a complete sample selector training process should include at least one forward operation and at least one reverse operation;
in the forward operation, the samples in the second sample set are input into a sample selector model, the corresponding activation value of each sample is output, and a third sample set is obtained according to a first preset threshold; inputting the third sample set into a third model, and outputting the Loss of each sample;
in the backward operation, the gradient of the distillation Loss output by the third model is propagated back to the network parameters of the sample selector model; during the backward gradient operation only the Loss results generated by the third sample set, i.e. the samples exceeding the first preset threshold, are computed, and the network weights are updated; and the training Losses of the student model and the teacher model are propagated back to their respective feature extraction networks for parameter updating.
CN202011110339.9A 2020-10-16 2020-10-16 Information redundancy removing method for sample set Active CN112200255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011110339.9A CN112200255B (en) 2020-10-16 2020-10-16 Information redundancy removing method for sample set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011110339.9A CN112200255B (en) 2020-10-16 2020-10-16 Information redundancy removing method for sample set

Publications (2)

Publication Number Publication Date
CN112200255A true CN112200255A (en) 2021-01-08
CN112200255B CN112200255B (en) 2021-09-14

Family

ID=74009216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011110339.9A Active CN112200255B (en) 2020-10-16 2020-10-16 Information redundancy removing method for sample set

Country Status (1)

Country Link
CN (1) CN112200255B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580847A (en) * 2023-07-14 2023-08-11 天津医科大学总医院 Modeling method and system for prognosis prediction of septic shock

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710907A (en) * 2018-05-15 2018-10-26 苏州大学 Handwritten form data classification method, model training method, device, equipment and medium
CN110991473A (en) * 2019-10-11 2020-04-10 平安信托有限责任公司 Feature selection method and device for image sample, computer equipment and storage medium
CN111259917A (en) * 2020-02-20 2020-06-09 西北工业大学 Image feature extraction method based on local neighbor component analysis
CN111768457A (en) * 2020-05-14 2020-10-13 北京航空航天大学 Image data compression method, device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710907A (en) * 2018-05-15 2018-10-26 苏州大学 Handwritten form data classification method, model training method, device, equipment and medium
CN110991473A (en) * 2019-10-11 2020-04-10 平安信托有限责任公司 Feature selection method and device for image sample, computer equipment and storage medium
CN111259917A (en) * 2020-02-20 2020-06-09 西北工业大学 Image feature extraction method based on local neighbor component analysis
CN111768457A (en) * 2020-05-14 2020-10-13 北京航空航天大学 Image data compression method, device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张国东 et al., "Research on Hyperspectral Remote Sensing Image Classification Based on Stacked Autoencoder Neural Networks", Infrared Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580847A (en) * 2023-07-14 2023-08-11 天津医科大学总医院 Modeling method and system for prognosis prediction of septic shock
CN116580847B (en) * 2023-07-14 2023-11-28 天津医科大学总医院 Method and system for predicting prognosis of septic shock

Also Published As

Publication number Publication date
CN112200255B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN110083692B (en) Text interactive matching method and device for financial knowledge question answering
CN113656570B (en) Visual question-answering method and device based on deep learning model, medium and equipment
CN112699247B (en) Knowledge representation learning method based on multi-class cross entropy contrast complement coding
CN109977250B (en) Deep hash image retrieval method fusing semantic information and multilevel similarity
CN111639240A (en) Cross-modal Hash retrieval method and system based on attention awareness mechanism
CN113516133B (en) Multi-modal image classification method and system
CN113204633B (en) Semantic matching distillation method and device
CN113822776B (en) Course recommendation method, device, equipment and storage medium
CN113094534B (en) Multi-mode image-text recommendation method and device based on deep learning
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN113239897A (en) Human body action evaluation method based on space-time feature combination regression
CN112200255B (en) Information redundancy removing method for sample set
Jiang et al. An intelligent recommendation approach for online advertising based on hybrid deep neural network and parallel computing
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
CN116663523A (en) Semantic text similarity calculation method for multi-angle enhanced network
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
CN116542080A (en) Condition generation countermeasure network topology optimization method and system based on contrast learning
CN115795035A (en) Science and technology service resource classification method and system based on evolutionary neural network and computer readable storage medium thereof
CN115455162A (en) Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion
CN113344060B (en) Text classification model training method, litigation state classification method and device
Wang et al. Hierarchical multimodal fusion network with dynamic multi-task learning
CN113989566A (en) Image classification method and device, computer equipment and storage medium
CN110659962B (en) Commodity information output method and related device
Rathod et al. Leveraging CNNs and Ensemble Learning for Automated Disaster Image Classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant