CN112200255B - Information redundancy removing method for sample set - Google Patents

Information redundancy removing method for sample set

Info

Publication number
CN112200255B
Authority
CN
China
Prior art keywords
sample
model
training
sample set
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011110339.9A
Other languages
Chinese (zh)
Other versions
CN112200255A (en)
Inventor
程战战
许昀璐
吴飞
浦世亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011110339.9A
Publication of CN112200255A
Application granted
Publication of CN112200255B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an information redundancy removing method for a sample set, which comprises the following steps: obtaining samples to be processed and corresponding trainable labels to form the original sample set to be processed; performing feature extraction on each sample with a pre-trained machine learning model to obtain a feature vector set for the original sample set; inputting the feature vector set into a learnable sample selector model, which performs sample selection on the feature vector set and obtains a representative feature vector subset according to a preset threshold; and retrieving the original samples corresponding to the feature vector subset as the sub-sample set with redundant information removed. With this technical scheme, the original sample set can be efficiently reduced: redundant information is removed, samples carrying valuable information are retained, and the training efficiency of algorithms on the sample set is improved.

Description

Information redundancy removing method for sample set
Technical Field
The invention relates to the technical field of data processing, in particular to an information redundancy removing method for a sample set.
Background
With the development of deep learning technology, machine learning methods based on large-scale datasets are continually being proposed. In practice, however, large-scale datasets often contain a large amount of redundant information, for example an excessive number of samples of a single class, or duplicate and near-duplicate samples; at the same time, a large-scale dataset makes the training of a machine learning model demand more computing power and computing time, consuming considerable resources. Faced with large-scale training tasks in different scenarios, for example very large computer-vision classification tasks trained on tens of millions of image samples, or very large natural language processing tasks trained on hundreds of millions of language samples, an information-based redundancy removal method for sample sets becomes all the more urgent. Because such datasets are large, the relationships between samples are complex, and pairwise comparison and analysis of samples is computationally expensive, there is currently no directly usable technical scheme for removing redundant information from a large-scale dataset.
Disclosure of Invention
The present invention aims to solve the above problems in the prior art by providing an information redundancy removing method for a sample set, so as to remove redundant information from a dataset.
The technical scheme adopted by the invention is as follows:
a method of information de-redundancy for a sample set, the method comprising:
obtaining samples to be processed and corresponding trainable labels to obtain an original sample set;
performing feature extraction on each acquired original sample through a pre-prepared feature extraction model to obtain a feature vector set of the original sample set;
performing sample selection on the feature vector set through a pre-prepared learnable sample selector model, and obtaining a representative feature vector subset according to a preset threshold;
acquiring the original samples corresponding to the feature vector subset as the sub-sample set after redundant information is removed.
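Taken together, the four steps read as a single pipeline. The following is a minimal sketch in PyTorch; the function and variable names (remove_redundancy, feature_extractor, sample_selector, the 0.8 threshold) are illustrative assumptions, not terms from the patent:

```python
import torch

# Hypothetical end-to-end sketch of the four steps above.
# `feature_extractor` and `sample_selector` stand for the pre-prepared models
# described later in the description; the names are illustrative.
def remove_redundancy(samples, labels, feature_extractor, sample_selector, threshold=0.8):
    """Return the sub-sample set kept after information redundancy removal."""
    with torch.no_grad():
        # Step 2: extract a feature vector for every original sample.
        features = torch.stack(
            [feature_extractor(x.unsqueeze(0)).squeeze(0) for x in samples])
        # Step 3: score every feature vector and keep the representative ones.
        keep = sample_selector(features) > threshold
    # Step 4: return the original samples (and labels) behind the kept vectors.
    kept_samples = [s for s, k in zip(samples, keep) if k]
    kept_labels = [y for y, k in zip(labels, keep) if k]
    return kept_samples, kept_labels
```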
Preferably, the step of preparing the feature extraction model in advance includes:
acquiring the sample set to be processed, recording it as a first sample set, and acquiring the trainable label corresponding to each sample;
inputting the samples in the first sample set and their corresponding labels into a preset first machine learning model for training to obtain a preset first model, namely the pre-prepared feature extraction model; the preset first machine learning model comprises a feature extraction part and a model constraint convergence part, wherein the feature extraction part is used for obtaining the feature vector of each sample, yielding the feature vector set of the original sample set, recorded as a second sample set, and the model constraint convergence part is used for controlling the training of the feature extraction model until convergence.
Preferably, the step of sample selecting the feature vector set by the sample selector model comprises:
acquiring a second sample set, and acquiring a training label of an original sample corresponding to the feature vector as a trainable label; inputting the second sample set into a sample selector model, obtaining a representative feature vector subset and a corresponding trainable label according to a first preset threshold value, and recording as a third sample set;
the sample selector model comprises a neural network and an activation function, and a sample is considered representative when the activation value obtained after it is input into the sample selector model is larger than the first preset threshold.
Preferably, the sample selector model performs learning optimization through a teacher-student model structure, and stops training when the convergence index of the whole training process reaches a second preset threshold value, so as to obtain the sample selector.
Preferably, the step of determining the teacher-student model comprises:
inputting the second sample set and the labels corresponding to the samples in the set into a preset second machine learning model for training to obtain a teacher model, which is called a second model; the second machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the sample, and the Loss constraint part is used for optimizing the teacher model to realize training;
inputting the samples in the third sample set and the corresponding labels into a preset third machine learning model for training to obtain a student model, namely a third model; the preset third machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the sample, and the Loss constraint part is used for optimizing the student model to realize training;
the teacher model performs knowledge distillation on the student model, transferring more valuable information to the student model and helping it achieve the best performance on the selected third sample set.
Preferably, the knowledge distillation method is layer-by-layer feature distillation, yielding a distillation Loss.
Preferably, there are a plurality of ways to determine the first preset threshold, including a) or b):
a) a preset lowest threshold value is given, and when the activation value obtained after the sample is input into the sample selector model is larger than the set lowest threshold value, the sample is considered as representative and selected;
b) sorting the activation values obtained after all the samples are input into the sample selector model and, given the capacity of the third sample set, determining the lowest threshold so that the number of samples exceeding it equals that capacity.
Preferably, the learning optimization process of the sample selector model includes:
a complete sample selector training process should include at least one forward operation and at least one reverse operation;
in the forward operation, the samples in the second sample set are input into the sample selector model, the activation value corresponding to each sample is output, and the third sample set is obtained according to the first preset threshold; the third sample set is then input into the third model, and the Loss of each sample is output;
in the reverse operation, the gradient of the distillation Loss output by the third model is fed back to the network parameters of the sample selector model; only the Loss results generated by the third sample set (the samples exceeding the first preset threshold) enter the reverse gradient operation, and the network weights are updated; the training Losses of the student model and the teacher model are back-propagated to their respective feature extraction networks for parameter updating.
The information redundancy removing method for a sample set provided by the invention first obtains the sample set to be processed; then acquires the corresponding feature vector set through a feature extraction model; selects from the obtained feature vector set through a sample selector model to obtain a screened feature vector subset; and finally retrieves the original samples corresponding to the feature vector subset to obtain the sub-sample set with redundant information removed. The method can effectively remove redundant information from a large-scale sample set and substantially reduces the scale of the original sample set while preserving the performance of models trained on the reduced set; moreover, model training based on the reduced dataset greatly saves training cost. The method meets practical application requirements, has strong applicability, and covers common machine learning fields such as speech recognition, image recognition and natural language processing. Of course, not all of the advantages described above need be achieved simultaneously in practicing any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a diagram illustrating an information redundancy removing method for a sample set according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of feature extraction in an information redundancy removal method for a sample set according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a teacher-student model in an information redundancy removing method for a sample set according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, the DNN (Deep Neural Network) used in the present invention is a multilayer feedforward artificial neural network whose neurons respond to surrounding units within a preset coverage range and which, through weight sharing and feature aggregation, can effectively extract feature information from a sample.
Teacher-Student (teacher-student model): a distillation-based neural network structure in which the teacher network transmits effective information to the student network by distillation, thereby improving the student network's ability to construct information.
In order to implement redundancy removal of sample set information, an embodiment of the present invention provides an information redundancy removal method for a sample set, and referring to fig. 1, the method includes:
s101, obtaining a sample to be processed (marked as an original sample) and a corresponding trainable label to obtain an original sample set.
The sample can be any data to be processed, including speech recognition, image recognition, natural language processing, and the like.
S102, after the original sample set is obtained, feature extraction is carried out on each obtained original sample through a pre-prepared feature extraction model, and a feature vector set of the original sample set is obtained.
The feature extraction model is used for extracting features from the original sample set to obtain the corresponding feature vector set, and it ensures that the feature vectors in the feature vector set correspond one-to-one to the samples in the original sample set. In the invention, the original samples can be image samples, speech samples, natural language processing samples, and the like; a feature map is also regarded as a kind of feature vector. In one possible implementation of the embodiment of the present invention, only image classification samples are used as the example.
The preset feature extraction model can be a pre-trained self-coding neural network, and the method for pre-training the self-coding neural network comprises the steps of inputting an original image sample into the self-coding neural network for training, and obtaining a pre-trained coding part neural network after convergence to serve as a feature extraction module.
In a possible embodiment, the step of predetermining the preset feature extraction model includes:
step one, acquiring a plurality of original image samples from a sample set to be processed, wherein the original image samples are called a first sample set, and acquiring trainable labels corresponding to the samples.
Step two, inputting the obtained image samples and the corresponding labels into a preset self-coding neural network model for training to obtain the trained self-coding neural network model, namely the pre-prepared feature extraction model. The self-coding neural network model comprises a feature coding part and a feature decoding part: the feature coding part obtains deep high-level features of the image to produce a basic feature vector, and the feature decoding part decodes the basic feature vector to restore the original image sample. Relative to the feature coding part, the feature decoding part can be regarded as the model constraint convergence part of the feature coding model, controlling the training of the feature extraction model until convergence.
And step three, when an image sample to be processed is obtained, inputting the image sample to the feature extraction model to obtain a corresponding feature vector set, which is called a second sample set.
Specifically, as shown in fig. 2, the embodiment of the present invention adopts a self-coding neural network as the framework of the feature extraction model, where the self-coding model includes a feature coding (Encode) part and a feature decoding (Decode) part:
Optionally, the feature encoding part is configured to compress the input image sample to obtain a base feature Vector (Base Vector) of the image; the feature coding sub-module comprises a multilayer convolutional neural network, for example a residual feature extraction network (ResNet-18, an 18-layer residual network) may be used as the neural network model of the basic feature vector extraction sub-module, completing the down-sampling from the original image to the feature vector.
Optionally, the feature decoding part is configured to restore the extracted feature vector to an original image sample; the feature decoding sub-module comprises a multilayer convolutional neural network, generally the inverse of the coding network, completing the up-sampling from the feature vector back to the original image.
Optionally, the first sample set is used to construct a first training data set. An objective function is constructed so that the restored image I'_{i,j} output by the neural network stays consistent with the original image I_{i,j}; the loss function defined over the training samples is:
L_rec = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (I'_{i,j} - I_{i,j})²
where H and W are the pixel height and width of the image, respectively.
The original image samples are input into the preset self-coding neural network model for training, and the trained model is obtained when the network converges or the number of training iterations reaches the preset count. The feature encoding part (Encode) then serves as the feature extraction model.
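As a concrete illustration, a self-coding network of this kind can be sketched as follows. A toy two-layer convolutional encoder/decoder is assumed here in place of the ResNet-18 mentioned above, and the reconstruction loss is the mean squared pixel difference as in L_rec; both are simplifying assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Toy self-coding network: Encode compresses the image, Decode restores it."""
    def __init__(self):
        super().__init__()
        # Encoder (feature coding part): down-samples the image to a feature map.
        self.encode = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder (feature decoding part): up-samples back to the original resolution.
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.rand(8, 3, 64, 64)          # a dummy batch of original image samples
restored, _ = model(images)
loss = ((restored - images) ** 2).mean()   # mean over the H x W pixels, as in L_rec above
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After convergence, the decode part is discarded and model.encode is kept as the feature extraction module.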
S103, after the feature vector set is obtained, sample selection is performed on the feature vector set through a pre-prepared learnable sample selector model, and a representative feature vector subset is obtained according to a preset threshold.
The preset sample selector model is used to select from the feature vector set and obtain the feature vector subset carrying high-value information.
The preset sample selector model can be a pre-trained neural network; the pre-training method consists of inputting the feature vector set and the corresponding trainable labels into the selector neural network for training and taking the network obtained after convergence. The training process may be jointly optimized through a teacher-student model structure.
In a possible embodiment, the step of predetermining the preset sample selector model comprises:
step one, a second sample set is obtained, and a training label of an original sample corresponding to the feature vector is used as a trainable label.
And step two, inputting the second sample set into the sample selector, and obtaining a representative feature vector subset and the corresponding trainable labels according to a first preset threshold, called the third sample set. The sample selector model comprises a multilayer neural network and an activation function, and a sample is regarded as representative and selected when the activation value obtained after it is input into the sample selector model is larger than the first preset threshold.
The sample selector can be optimized by learning through a teacher-student model structure, and training stops when the convergence index of the whole training process reaches a second preset threshold, yielding the sample selector. The teacher-student model comprises a teacher network and a student network: the teacher network is trained on the second sample set to obtain the teacher model, and the student network is trained on the third sample set to obtain the student model; the teacher model distills the student model to improve the latter's knowledge extraction capability. The convergence index of the training process can be the Loss or the accuracy of the model, and the second preset threshold can be set, for example, to the Loss or accuracy value at which the Loss no longer decreases.
Specifically, as shown in fig. 3, the optimization of the sample selector model by using the teacher-student model structure in the embodiment of the present invention includes:
in acquiring the second set of samples, a second set of training data (V) is constructed based on the second set of samples1,V2,…,VN) Its corresponding trainable category label is (Y)1,Y2,…,YN) And N is the size of the sample set.
Further, a second set of samples is fed into a sample selector, comprising a multi-layer neural network and a non-linear activation function, such as sigmoid, for sample selection according to a first preset threshold. Specifically, there are various ways to determine the first preset threshold, including but not limited to: a) a preset lowest threshold value is given, and when the activation value obtained after the sample is input into the sample selector model is larger than the set lowest threshold value, the sample is considered as representative and selected; b) and sorting the activation values obtained after all the samples are input into the sample selector model according to the capacity of the third sample set, and determining the lowest threshold value to ensure that the sample amount exceeding the lowest threshold value is equal to the capacity of the third sample set. For example, the first threshold may be set based on the activation value of sigmoid, for example, if the activation value of sigmoid >0.8 is considered as a high-value sample, the sample is selected; the predetermined threshold may also be set based on the size of the selected subset of samples, for example, the selected subset of samples is agreed to be 50% of the second set of samples, and then the top 50% is selected to constitute the third set of samples after sorting according to the sigmoid activation value size.
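A minimal sketch of such a selector and of the two threshold strategies a) and b) follows. The feature dimension, network width, and the 0.8 / 50% figures mirror the examples above; everything else is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SampleSelector(nn.Module):
    """Toy selector: a small MLP followed by a sigmoid activation."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, v):
        # One activation value in (0, 1) per input feature vector.
        return torch.sigmoid(self.net(v)).squeeze(-1)

selector = SampleSelector()
features = torch.rand(1000, 512)            # second sample set (feature vectors)
scores = selector(features)

# Strategy a): fixed minimum threshold on the activation value.
keep_a = scores > 0.8

# Strategy b): fixed capacity M for the third sample set; keep the top-M scores.
M = 500                                     # e.g. 50% of the second sample set
top_idx = scores.topk(M).indices
keep_b = torch.zeros_like(scores, dtype=torch.bool)
keep_b[top_idx] = True
```

Strategy a) yields a subset of variable size, while strategy b) fixes the capacity of the third sample set in advance.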
Further, after the third sample set is acquired, a third training data set (V'_1, V'_2, …, V'_M) is constructed from it, with corresponding trainable category labels (Y'_1, Y'_2, …, Y'_M), where M is the size of that sample set.
Further, the samples in the second training data set and their corresponding labels are fed into the teacher neural network for training. The teacher neural network comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part obtains high-level abstract features of the samples and the Loss constraint part optimizes the teacher model during training. Specifically, the teacher neural network includes a multilayer neural network that outputs predicted label values. An objective function is constructed so that the label value predicted by the network stays consistent with the true label value of the corresponding image; the loss function defined over the training samples is:
L_T = -Σ_{i=1}^{N} Y_i · log(y_i)
where Y_i is the true label value and y_i is the predicted probability value.
Further, the third training data set is fed into the student neural network for training. The student neural network comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part obtains high-level abstract features of the samples and the Loss constraint part optimizes the student model during training. Specifically, the student neural network comprises a multilayer neural network that outputs predicted label values, and an objective function is constructed so that the label value predicted by the network stays consistent with the true label value of the corresponding image; the loss function defined over the training samples is:
L_S = -Σ_{i=1}^{M} Y'_i · log(y'_i)
where Y'_i is the true label value and y'_i is the predicted probability value.
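Both objectives above can be read as standard cross-entropy losses. A minimal sketch follows, assuming integer class labels, 10 classes, and simple MLP heads over the extracted feature vectors (all illustrative assumptions rather than details from the patent):

```python
import torch
import torch.nn as nn

# Hypothetical teacher and student classifiers over the extracted feature vectors.
teacher = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()        # softmax + negative log-likelihood, i.e. -sum Y_i log(y_i)

features = torch.rand(64, 512)           # V_1 ... V_N   (second training data set)
labels = torch.randint(0, 10, (64,))     # Y_1 ... Y_N
sub_features, sub_labels = features[:32], labels[:32]   # stand-in for the third training data set

loss_teacher = criterion(teacher(features), labels)          # L_T: teacher trained on the full set
loss_student = criterion(student(sub_features), sub_labels)  # L_S: student trained on the selected subset
```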
Optionally, the teacher model performs knowledge distillation on the student model, transferring more valuable information to the student model, improving its knowledge extraction capability, and helping it achieve the best performance on the selected third sample set. The distillation loss is:
L_distill = (1 / (H' × W')) Σ_{i=1}^{H'} Σ_{j=1}^{W'} (F^T_{i,j} - F^S_{i,j})²
where H' and W' are respectively the resolution height and width of the corresponding feature layer of the teacher-student network, F^T_{i,j} is the feature value of that layer in the teacher model, and F^S_{i,j} is the feature value of that layer in the student model.
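A sketch of one layer's feature distillation term of this shape is given below. It assumes that one intermediate feature map is compared per layer and that a 1x1 convolution aligns the channel widths when they differ; neither detail is stated in the patent:

```python
import torch
import torch.nn as nn

def distill_loss(feat_teacher, feat_student, proj=None):
    """Mean squared difference between teacher and student feature maps,
    averaged over the H' x W' spatial positions as in the L_distill formula above."""
    if proj is not None:                      # optional 1x1 projection if channel widths differ
        feat_student = proj(feat_student)
    return ((feat_teacher.detach() - feat_student) ** 2).mean()

# Dummy intermediate feature maps of shape (batch, channels, H', W').
f_teacher = torch.rand(16, 64, 8, 8)
f_student = torch.rand(16, 32, 8, 8)
proj = nn.Conv2d(32, 64, kernel_size=1)
loss_kd = distill_loss(f_teacher, f_student, proj)   # one layer's contribution to the distillation Loss
```

In layer-by-layer feature distillation, one such term is computed for each matched pair of layers and the terms are summed into the overall distillation Loss.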
Optionally, the sample selector may be trained according to a teacher-student model structure, and a complete sample selector training process should include at least one forward operation and at least one backward operation:
in the forward operation, inputting the second training data set into a sample selector model, outputting an activation value corresponding to each sample, and acquiring a third training data set according to a preset threshold; and inputting the third training data set into the student model, outputting the Loss of each sample, and driving the student model to train.
In the reverse operation, the distillation loss L_distill produced by the teacher model distilling knowledge into the student model is back-propagated, and its gradient is returned to the network parameters of the sample selector model to update them; only the Loss results generated by the third training data set (the samples exceeding the first preset threshold) enter the reverse gradient operation, and the network weights are updated accordingly. The training losses L_T and L_S produced by the teacher and student networks during their own training are back-propagated to their respective feature extraction networks for parameter updating.
The second training data set and the third training data set are input into the preset neural network model for training, and the preset sample selector model is obtained when the network converges or the number of training iterations reaches the preset count.
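One forward/backward round of this joint optimization might look as follows. The sketch reuses the SampleSelector class and the student head from the earlier sketches; because the hard threshold itself is not differentiable, the selected samples' losses are weighted here by their activation values so that gradients reach the selector, which is an assumed approximation rather than a mechanism spelled out in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

selector = SampleSelector()                                   # from the earlier sketch
student = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 10))
features = torch.rand(1000, 512)                              # second training data set
labels = torch.randint(0, 10, (1000,))

opt_selector = torch.optim.Adam(selector.parameters(), lr=1e-4)
opt_student = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):                                       # until convergence / preset iterations
    # Forward: score the second training data set, form the third training data set.
    scores = selector(features)                               # activation value per sample
    keep = scores > 0.8                                       # first preset threshold
    if keep.sum() == 0:
        continue

    # Loss on the selected subset only (cross-entropy stands in here for the
    # full student + distillation Loss of the description).
    per_sample = F.cross_entropy(student(features[keep]), labels[keep], reduction='none')
    loss = (scores[keep] * per_sample).mean()                 # gradients reach the selector via the scores

    # Backward: only the selected samples contribute; update student and selector.
    opt_selector.zero_grad()
    opt_student.zero_grad()
    loss.backward()
    opt_selector.step()
    opt_student.step()
```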
S104, obtaining the original image samples corresponding to the obtained feature vector subset as the selected sub-sample set after redundant information is removed.
In the embodiment of the invention, an efficient and robust information redundancy removing method for a sample set is realized. The method can effectively remove redundant information in a large-scale sample set, and effectively reduces the scale of the original sample set while ensuring the performance of a training model of the reduced sample set; in addition, model training is carried out based on the simplified data set, so that model training cost can be greatly saved. The method can meet the actual application requirements of users and has strong applicability.
It should be noted that, in this document, the technical features in the various alternatives can be combined to form the scheme as long as the technical features are not contradictory, and the scheme is within the scope of the disclosure. Relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. A method of information de-redundancy for a sample set, the method comprising:
acquiring an image sample to be processed and a corresponding trainable label to obtain an original image sample set;
extracting the characteristics of each acquired original image sample through a pre-prepared characteristic extraction model to obtain a characteristic vector set of the original image sample set; the pre-prepared feature extraction model is a pre-trained self-coding neural network, and the specific acquisition steps of the feature vector set are as follows: acquiring a plurality of original image samples from a sample set to be processed, wherein the original image samples are called a first sample set, and acquiring trainable labels corresponding to the samples; inputting the obtained image sample and the corresponding label into a preset self-coding neural network model for training to obtain a preset self-coding neural network model, namely the preset feature extraction model, wherein the self-coding neural network model comprises a feature coding part and a feature decoding part, the feature coding part is used for obtaining the depth high-level features of the image to obtain a basic feature vector, and the feature decoding part is used for decoding the basic feature vector to obtain an original image sample; the feature decoding part is regarded as a feature coding model constraint convergence part and can control the normal training of the feature extraction model until convergence; when an image sample to be processed is obtained, inputting the image sample to the feature extraction model to obtain a corresponding feature vector set, which is called a second sample set;
performing sample selection on the feature vector set through a pre-prepared learnable sample selector model, and obtaining a representative feature vector subset according to a preset threshold; the steps of selecting samples from the feature vector set by the sample selector model are as follows: acquiring the second sample set, and acquiring the training label of the original sample corresponding to each feature vector as a trainable label; inputting the second sample set into the sample selector model, obtaining a representative feature vector subset and the corresponding trainable labels according to a first preset threshold, and recording them as a third sample set; the sample selector model comprises a neural network and an activation function, and a sample is regarded as representative when the activation value obtained after it is input into the sample selector model is larger than the first preset threshold;
and acquiring an original image sample corresponding to the feature vector subset as an image subsample set after redundant information is removed.
2. The method of claim 1, wherein the sample selector model is optimized for learning by a teacher-student model structure, and the training is stopped when the convergence index of the whole training process reaches a second preset threshold, resulting in the sample selector.
3. The method of claim 2, wherein the step of determining the teacher-student model comprises:
inputting the second sample set and the labels corresponding to the samples in the set into a preset second machine learning model for training to obtain a teacher model, which is called a second model; the second machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the sample, and the Loss constraint part is used for optimizing the teacher model to realize training;
inputting the samples in the third sample set and the corresponding labels into a preset third machine learning model for training to obtain a student model, namely a third model; the preset third machine learning model comprises a feature extraction part and a Loss constraint part, wherein the feature extraction part is used for obtaining high-level abstract features of the sample, and the Loss constraint part is used for optimizing the student model to realize training;
the teacher model performs knowledge distillation on the student model, transferring more valuable information to the student model and helping it achieve the best performance on the selected third sample set.
4. The method of claim 3, wherein the knowledge distillation method is layer-by-layer feature distillation, yielding a distillation Loss.
5. The method of claim 1, wherein the first predetermined threshold is determined in a plurality of ways, including a) or b):
a) a preset lowest threshold value is given, and when the activation value obtained after the sample is input into the sample selector model is larger than the set lowest threshold value, the sample is considered as representative and selected;
b) sorting the activation values obtained after all the samples are input into the sample selector model and, given the capacity of the third sample set, determining the lowest threshold so that the number of samples exceeding it equals that capacity.
6. The method of claim 1, wherein the learning optimization process of the sample selector model comprises:
a complete sample selector training process should include at least one forward operation and at least one reverse operation;
in the forward operation, the samples in the second sample set are input into the sample selector model, the activation value corresponding to each sample is output, and the third sample set is obtained according to the first preset threshold; the third sample set is then input into the third model, and the Loss of each sample is output;
in the reverse operation, the gradient of the distillation Loss output by the third model is fed back to the network parameters of the sample selector model; only the Loss results generated by the third sample set (the samples exceeding the first preset threshold) enter the reverse gradient operation, and the network weights are updated; and the training Losses of the student model and the teacher model are back-propagated to their respective feature extraction networks for parameter updating.
CN202011110339.9A 2020-10-16 2020-10-16 Information redundancy removing method for sample set Active CN112200255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011110339.9A CN112200255B (en) 2020-10-16 2020-10-16 Information redundancy removing method for sample set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011110339.9A CN112200255B (en) 2020-10-16 2020-10-16 Information redundancy removing method for sample set

Publications (2)

Publication Number Publication Date
CN112200255A CN112200255A (en) 2021-01-08
CN112200255B true CN112200255B (en) 2021-09-14

Family

ID=74009216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011110339.9A Active CN112200255B (en) 2020-10-16 2020-10-16 Information redundancy removing method for sample set

Country Status (1)

Country Link
CN (1) CN112200255B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580847B (en) * 2023-07-14 2023-11-28 天津医科大学总医院 Method and system for predicting prognosis of septic shock

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710907A (en) * 2018-05-15 2018-10-26 苏州大学 Handwritten form data classification method, model training method, device, equipment and medium
CN110991473A (en) * 2019-10-11 2020-04-10 平安信托有限责任公司 Feature selection method and device for image sample, computer equipment and storage medium
CN111259917A (en) * 2020-02-20 2020-06-09 西北工业大学 Image feature extraction method based on local neighbor component analysis
CN111768457A (en) * 2020-05-14 2020-10-13 北京航空航天大学 Image data compression method, device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710907A (en) * 2018-05-15 2018-10-26 苏州大学 Handwritten form data classification method, model training method, device, equipment and medium
CN110991473A (en) * 2019-10-11 2020-04-10 平安信托有限责任公司 Feature selection method and device for image sample, computer equipment and storage medium
CN111259917A (en) * 2020-02-20 2020-06-09 西北工业大学 Image feature extraction method based on local neighbor component analysis
CN111768457A (en) * 2020-05-14 2020-10-13 北京航空航天大学 Image data compression method, device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于栈式自编码神经网络对高光谱遥感图像分类研究";张国东等;《红外技术》;20190517;第41卷(第5期);全文 *

Also Published As

Publication number Publication date
CN112200255A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN112699247B (en) Knowledge representation learning method based on multi-class cross entropy contrast complement coding
CN109977250B (en) Deep hash image retrieval method fusing semantic information and multilevel similarity
CN111639240A (en) Cross-modal Hash retrieval method and system based on attention awareness mechanism
CN113516133B (en) Multi-modal image classification method and system
CN113204633B (en) Semantic matching distillation method and device
CN111930887A (en) Multi-document multi-answer machine reading understanding system based on joint training mode
CN113822776B (en) Course recommendation method, device, equipment and storage medium
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN115526236A (en) Text network graph classification method based on multi-modal comparative learning
CN113239897A (en) Human body action evaluation method based on space-time feature combination regression
CN113094534A (en) Multi-mode image-text recommendation method and device based on deep learning
CN112200255B (en) Information redundancy removing method for sample set
Jiang et al. An intelligent recommendation approach for online advertising based on hybrid deep neural network and parallel computing
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
CN115795035A (en) Science and technology service resource classification method and system based on evolutionary neural network and computer readable storage medium thereof
CN115455162A (en) Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion
CN113344060B (en) Text classification model training method, litigation state classification method and device
CN110659962B (en) Commodity information output method and related device
CN114627282A (en) Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium
CN113989566A (en) Image classification method and device, computer equipment and storage medium
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN117093728B (en) Financial field management map construction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant