CN113222139A - Neural network training method, device and equipment and computer storage medium

Neural network training method, device and equipment and computer storage medium

Info

Publication number
CN113222139A
CN113222139A
Authority
CN
China
Prior art keywords
sample
current
knowledge
knowledge transfer
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110462397.6A
Other languages
Chinese (zh)
Inventor
葛艺潇
蔡青琳
张潇
朱烽
赵瑞
李鸿升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bozhi Perceptual Interaction Research Center Co ltd
Sensetime Group Ltd
Original Assignee
Bozhi Perceptual Interaction Research Center Co ltd
Sensetime Group Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bozhi Perceptual Interaction Research Center Co ltd, Sensetime Group Ltd filed Critical Bozhi Perceptual Interaction Research Center Co ltd
Priority to CN202110462397.6A priority Critical patent/CN113222139A/en
Publication of CN113222139A publication Critical patent/CN113222139A/en
Priority to PCT/CN2021/121379 priority patent/WO2022227400A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks

Abstract

The embodiment of the disclosure discloses a neural network training method, apparatus, device, and computer storage medium, comprising: executing a loop process until soft labels are obtained for a preset number of anchor samples; and training a neural network based at least on the preset number of anchor samples and their soft labels; wherein, in each cycle of the loop process, a current anchor sample and at least one knowledge transfer sample are determined from a current training sample set; a similarity between the current anchor sample and each knowledge transfer sample, the prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample are determined based on the neural network; and a soft label of the current anchor sample is determined based on the similarity between the current anchor sample and each knowledge transfer sample, the prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample. Knowledge integration under a self-distillation algorithm is thereby realized.

Description

Neural network training method, device and equipment and computer storage medium
Technical Field
The present disclosure relates to the field of deep learning, and in particular, to a neural network training method, apparatus, device, and computer storage medium.
Background
In recent years, edge devices such as mobile phones and wearable devices have had to handle deep-learning-related tasks locally; however, edge devices are generally constrained by limited resources and power consumption as well as latency and cost. In order to promote the application of deep-learning-based products on edge devices, the related art proposes a Knowledge Distillation (KD) based model compression method.
Specifically, the knowledge-distillation-based model compression method transfers the inference and prediction capability of a trained, complex "teacher" model to a simple "student" model; that is, the soft labels predicted by the teacher model are used as training supervision to guide the training of the student model, thereby reducing the computing resources required by the student model on the edge device and improving its computing speed.
However, in order to obtain more accurate "soft labels" that further improve the network performance of the "student" model, the related art often adopts knowledge-integrated distillation algorithms built on multiple network models to provide effective training supervision for the "student" model. Such algorithms have high complexity, so the training time and space costs are large.
Disclosure of Invention
The embodiments of the disclosure provide a neural network training method, apparatus, device, and computer storage medium.
The technical scheme of the disclosure is realized as follows:
the embodiment of the present disclosure provides a neural network training method, which includes:
executing a loop process until soft labels are obtained for a preset number of anchor samples; training a neural network based at least on the preset number of anchor samples and the soft labels of the preset number of anchor samples; wherein the loop process comprises the following steps: acquiring a current training sample set, and determining the current anchor sample and at least one knowledge transfer sample from the current training sample set in each cycle of the loop process; wherein the current anchor sample is any one sample of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set other than the current anchor sample; determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge transfer samples, as well as a prediction probability of the current anchor sample and a prediction probability of each of the knowledge transfer samples; and determining a soft label for the current anchor sample based on the similarity between the current anchor sample and each of the knowledge transfer samples, the prediction probability of the current anchor sample, and the prediction probability of each of the knowledge transfer samples.
In this way, for each training sample under the neural network, the generation of the soft label of the training sample can be assisted by the similarity between other samples and the sample and the prediction probability of other samples, and then efficient training supervision is performed on the neural network based on the soft labels meeting the preset number of training samples.
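For illustration only, the loop process described above can be organized as in the following Python sketch; the helper callables (`similarity`, `predict`, `fuse`) are placeholders for the operations detailed in the later embodiments, not names defined by this disclosure.

```python
from typing import Callable, List, Sequence

def run_soft_label_loop(
    batch: Sequence,
    similarity: Callable,   # (anchor, other) -> float
    predict: Callable,      # (sample) -> probability vector
    fuse: Callable,         # (sims, p_anchor, p_others) -> soft label
) -> List:
    """One pass over a current training sample set: every sample is taken in
    turn as the current anchor sample, the remaining samples serve as knowledge
    transfer samples, and a soft label is produced for each anchor."""
    soft_labels = []
    for i, anchor in enumerate(batch):
        others = [x for j, x in enumerate(batch) if j != i]   # knowledge transfer samples
        sims = [similarity(anchor, x) for x in others]
        p_anchor = predict(anchor)
        p_others = [predict(x) for x in others]
        soft_labels.append(fuse(sims, p_anchor, p_others))
    return soft_labels
```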
In the above method, the training the neural network based on at least the soft labels of the anchor samples in the preset number and the anchor samples in the preset number includes: determining a KL divergence for each of the anchor samples based on the soft label for each of the anchor samples and the prediction probability corresponding to the anchor sample; determining a cross-entropy loss for each of the anchor samples based on the hard label for each of the anchor samples and the prediction probability for the anchor sample; training the neural network based on the cross entropy loss of the preset number of anchor samples and the KL divergence of the preset number of anchor samples.
In this way, training the model with less data and a higher learning rate can be achieved through the target loss function determined by the high-accuracy soft labels.
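A minimal PyTorch-style sketch of such a target loss is given below, assuming the network outputs raw logits and that `soft_labels` are the probabilities produced by the cross-sample knowledge integration; the temperature `T` and weighting coefficient `lam` are illustrative choices rather than values fixed by this disclosure.

```python
import torch
import torch.nn.functional as F

def training_loss(logits, hard_labels, soft_labels, T=2.0, lam=1.0):
    """Cross-entropy on hard labels plus KL divergence to the soft labels.
    logits:      (N, K) raw network outputs for the anchor samples.
    hard_labels: (N,)   integer class indices.
    soft_labels: (N, K) soft targets from cross-sample knowledge integration."""
    ce = F.cross_entropy(logits, hard_labels)
    log_p = F.log_softmax(logits / T, dim=1)
    # 'batchmean' matches the mathematical definition of KL divergence;
    # the T*T factor is the usual rescaling used with temperature softening.
    kl = F.kl_div(log_p, soft_labels, reduction="batchmean") * (T * T)
    return ce + lam * kl
```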
In the above method, the neural network includes an encoder and a classifier; the determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge transfer samples, and a prediction probability of the current anchor sample and a prediction probability of each of the knowledge transfer samples, comprises: determining, based on an encoder of the neural network, sample characteristics of the current anchor sample and sample characteristics of each of the knowledge transfer samples; determining a similarity between the current anchor sample and each of the knowledge transfer samples based on sample characteristics of the current anchor sample and sample characteristics of each of the knowledge transfer samples; based on a classifier of the neural network, a prediction probability of the current anchor sample and a prediction probability of each of the knowledge transfer samples are determined.
In this way, the encoder based on the neural network may obtain the similarity between the current anchor sample and each knowledge transfer sample, and the classifier based on the neural network obtains the prediction probability of the current anchor sample and the prediction probability of each knowledge transfer sample, so as to further realize the generation of the soft label acting on the anchor sample based on the sample similarity, the prediction probability of the current anchor sample and the prediction probability of the knowledge transfer sample.
In the above method, the determining a similarity between the current anchor sample and each knowledge transfer sample based on the sample characteristics of the current anchor sample and the sample characteristics of each knowledge transfer sample includes: carrying out normalization processing on the sample characteristics of the current anchoring sample to obtain the normalized sample characteristics of the current anchoring sample; carrying out normalization processing on the sample characteristics of each knowledge transfer sample to obtain the normalization characteristics of each knowledge transfer sample; and carrying out dot product operation processing on the normalized sample characteristics of the current anchoring sample and the normalized characteristics of each knowledge transfer sample to obtain the similarity between the current anchoring sample and each knowledge transfer sample.
Therefore, dot product operation can be carried out on the sample characteristics of the current anchoring sample and each knowledge transfer sample which are converted to the same dimension through normalization, and the pairwise similarity between every two samples is further determined.
In the above method, the determining the soft label of the current anchor sample based on the similarity between the current anchor sample and each knowledge transfer sample, the prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample includes: determining knowledge transfer parameters of each knowledge transfer sample for the current anchor sample based on the similarity between the current anchor sample and each knowledge transfer sample; and determining a soft label for the current anchor sample based on the knowledge transfer parameters of each of the knowledge transfer samples for the current anchor sample, the prediction probability of the current anchor sample, and the prediction probability of each of the knowledge transfer samples.
In the above method, the determining the soft label of the current anchor sample based on the knowledge transfer parameters of each knowledge transfer sample for the current anchor sample, the prediction probability of each knowledge transfer sample, and the prediction probability of the current anchor sample includes: performing knowledge transfer processing once based on the knowledge transfer parameters of each knowledge transfer sample for the current anchor sample and the prediction probability of each knowledge transfer sample, to obtain the initial knowledge transfer probability of the current anchor sample; performing knowledge fusion processing once based on the initial knowledge transfer probability and the prediction probability of the current anchor sample, to obtain an initial soft label of the current anchor sample; and, based on the initial soft label of the current anchor sample, executing a loop process until the prediction probability of the at least one knowledge transfer sample is smaller than a preset probability threshold value, to obtain the soft label of the current anchor sample; wherein the loop process comprises: in each cycle of the loop process, performing knowledge transfer processing based on the soft label of the current anchor sample obtained in the previous cycle and each knowledge transfer parameter, to obtain the knowledge transfer probability of the current anchor sample; and performing knowledge fusion processing based on the knowledge transfer probability of the current anchor sample and the prediction probability of the current anchor sample, to obtain the soft label of the current anchor sample for the next cycle.
Thus, for each training sample in the same batch, the similarity of the training sample can be combined with each other sample in the same batch of training samples, and the 'dark knowledge' of each other sample is weighted and integrated and transmitted to the current training sample until the knowledge is completely fused, so that the robust soft label of each training sample is obtained.
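The iterative transfer-and-fuse cycle described above can be pictured with the following sketch. It assumes the batch-level quantities defined in the later embodiments (a knowledge transfer weight matrix `A_hat` with zero diagonal and a matrix `P` of prediction probabilities), and it uses an illustrative mixing coefficient `omega`, convergence tolerance `eps`, and iteration cap, none of which are values specified by this section.

```python
import torch

def iterate_soft_labels(A_hat, P, omega=0.5, eps=1e-6, max_iters=100):
    """Iteratively transfer and fuse cross-sample knowledge (illustrative).
    A_hat: (N, N) knowledge transfer weights, rows sum to 1, zero diagonal.
    P:     (N, K) prediction probabilities of the samples in the batch.
    omega: assumed fusion weight between transferred and own knowledge."""
    Q = P.clone()                                 # start from the samples' own predictions
    for _ in range(max_iters):
        transferred = A_hat @ Q                   # knowledge transfer processing
        Q_next = omega * transferred + (1.0 - omega) * P   # knowledge fusion processing
        if torch.max(torch.abs(Q_next - Q)) < eps:          # knowledge fully fused
            return Q_next
        Q = Q_next
    return Q
```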
In the above method, the determining knowledge transfer parameters of each knowledge transfer sample for the current anchor sample based on the similarity between the current anchor sample and each knowledge transfer sample comprises: accumulating the at least one similarity between the current anchor sample and each knowledge transfer sample to obtain a similarity accumulated value; and determining the knowledge transfer parameters of each knowledge transfer sample for the current anchor sample based on the similarity between the anchor sample and each knowledge transfer sample and the similarity accumulated value.
In this way, the "knowledge" transfer weight value of each knowledge transfer sample to the anchor sample can be obtained through the normalization processing of the sample similarity, and then the weighted transfer of the prediction probability of each knowledge transfer sample is carried out according to the weight value.
In the above method, the performing knowledge transfer processing once based on the knowledge transfer parameters of each knowledge transfer sample for the current anchor sample and the prediction probability of each knowledge transfer sample, to obtain the initial knowledge transfer probability of the current anchor sample, includes: determining a knowledge transfer probability of each knowledge transfer sample for the current anchor sample based on the knowledge transfer parameters of each knowledge transfer sample for the current anchor sample and the prediction probability of each knowledge transfer sample; accumulating the at least one knowledge transfer probability of each knowledge transfer sample for the current anchor sample to obtain a knowledge transfer probability accumulated value; and performing knowledge transfer processing once based on the knowledge transfer probability accumulated value to obtain the initial knowledge transfer probability of the current anchor sample.
In this way, weighting and integration of other sample prediction probabilities for delivery to the current anchor sample may be achieved.
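As a small illustration of this weighted accumulation, the sketch below computes the initial knowledge transfer probability of one anchor sample from its transfer weights and the other samples' predictions; the tensor shapes and names are assumptions made for illustration.

```python
import torch

def initial_transfer_probability(weights, transfer_probs):
    """weights:        (M,)   knowledge transfer parameters of the M knowledge
                               transfer samples toward the current anchor (sum to 1).
    transfer_probs:    (M, K) prediction probabilities of those samples.
    Returns the (K,) initial knowledge transfer probability of the anchor."""
    # weight each sample's prediction, then accumulate over the M samples
    return (weights.unsqueeze(1) * transfer_probs).sum(dim=0)
```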
In the above method, the method further comprises: obtaining a training data set, wherein the training data set comprises at least one batch of training data subsets; and selecting, from the training data set, a batch of the training data subsets that has not been selected previously as the current training sample set.
In this way, neural network training may be performed in the form of multiple batches, i.e., mini-batch.
In the above method, random sampling processing is performed on the training data set to obtain at least one piece of first training data; a hard label of each piece of first training data is determined, and, based on the hard label of each piece of first training data, similarity sampling processing is performed on the remaining data in the training data set that have not been selected as first training data, to obtain at least one piece of second training data corresponding to each piece of first training data; and a batch of the training data subset constructed based on the at least one piece of first training data and the at least one piece of second training data corresponding to each piece of first training data is used as the current training sample set.
Therefore, each training sample in the current training sample set can be ensured to have at least one other sample similar to the visual property of the training sample, and further the weighted transfer of knowledge among the samples can be realized according to the similarity among the samples.
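One possible reading of this sampling scheme is sketched below: a few seed samples are drawn at random, and for each seed a fixed number of additional samples sharing its hard label are drawn, so that every sample in the batch has at least one visually similar companion. The grouping-by-label strategy and the parameter names (`num_seeds`, `per_seed`) are assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_batch(dataset, num_seeds=64, per_seed=3, seed=None):
    """dataset: sequence of (sample, hard_label) pairs.
    Returns one batch in which each randomly drawn seed sample is accompanied
    by up to `per_seed` other samples that share its hard label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(dataset):
        by_label[label].append(idx)

    seed_indices = rng.sample(range(len(dataset)), num_seeds)   # "first training data"
    batch = []
    for i in seed_indices:
        _, label = dataset[i]
        candidates = [j for j in by_label[label] if j != i]     # same-label, i.e. similar, data
        companions = rng.sample(candidates, min(per_seed, len(candidates)))
        batch.append(dataset[i])
        batch.extend(dataset[j] for j in companions)            # "second training data"
    return batch
```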
The embodiment of the present disclosure provides a neural network training device, including:
the acquisition unit is used for executing a loop process until soft labels are obtained for a preset number of anchor samples; wherein the loop process comprises the following steps: acquiring a current training sample set, and determining the current anchor sample and at least one knowledge transfer sample from the current training sample set in each cycle of the loop process; wherein the current anchor sample is any one sample of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set other than the current anchor sample; and determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge transfer samples, as well as a prediction probability of the current anchor sample and a prediction probability of each of the knowledge transfer samples;
and the training unit is used for training the neural network based at least on the preset number of anchor samples and the soft labels of the preset number of anchor samples.
Embodiments of the present disclosure provide a computer device, which includes a processor and a memory storing instructions executable by the processor; when the instructions are executed by the processor, the neural network training method described above is implemented.
The embodiment of the present disclosure provides a computer-readable storage medium, on which a program is stored, and the program is applied to a computer device, and when the program is executed by a processor, the program implements the neural network training method as described above.
According to the technical scheme provided by the embodiment of the disclosure, the computer device can execute a loop process until soft labels are obtained for a preset number of anchor samples, and train a neural network based at least on the preset number of anchor samples and their soft labels; wherein the loop process comprises the following steps: acquiring a current training sample set, and determining a current anchor sample and at least one knowledge transfer sample from the current training sample set in each cycle of the loop process; the current anchor sample is any one sample of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set other than the current anchor sample; determining, based on the neural network, a similarity between the current anchor sample and each knowledge transfer sample, the prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample; and determining a soft label for the current anchor sample based on the similarity between the current anchor sample and each knowledge transfer sample, the prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample. In this way, for each training sample under the neural network, the generation of the soft label of the training sample can be assisted by the similarities between other samples and this sample and the prediction probabilities of those other samples, and efficient training supervision is then performed on the neural network based on the soft labels of the preset number of training samples. It can be seen that the conventional cross-network knowledge integration is replaced by cross-sample knowledge integration under the same neural network, so that knowledge integration based on the similarity between samples is realized and effective soft labels are obtained while only a single network is used.
Drawings
FIG. 1 is a schematic diagram of the knowledge integration distillation algorithm of a multi-teacher model in the related art;
FIG. 2 is a schematic diagram illustrating the concept of the knowledge-integrated distillation algorithm of the multi-student model in the related art;
fig. 3 is a schematic flow chart illustrating a first implementation process of a neural network training method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of an implementation flow of a neural network training method according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart illustrating an implementation of a neural network training method according to an embodiment of the present disclosure;
fig. 6 is a schematic flow chart illustrating an implementation of a neural network training method according to an embodiment of the present disclosure;
fig. 7 is a schematic flow chart illustrating an implementation of a neural network training method according to an embodiment of the present disclosure;
fig. 8 is a schematic flow chart illustrating an implementation of a neural network training method according to an embodiment of the present disclosure;
fig. 9 is a seventh implementation flow diagram of a neural network training method proposed in the embodiment of the present disclosure;
fig. 10 is a schematic flow chart illustrating an implementation flow of a neural network training method according to an embodiment of the present disclosure;
fig. 11 is a schematic flow chart illustrating an implementation of a neural network training method according to an embodiment of the present disclosure;
fig. 12 is a schematic flow chart illustrating an implementation of a neural network training method according to an embodiment of the present disclosure;
fig. 13 is an eleventh schematic flow chart illustrating an implementation of a neural network training method according to an embodiment of the present disclosure;
FIG. 14 is a first schematic diagram illustrating the concept of the knowledge integration self-distillation algorithm proposed by the embodiments of the present disclosure;
FIG. 15 is a schematic diagram of a second principle of the knowledge integration self-distillation algorithm proposed by the embodiments of the present disclosure;
fig. 16 is a schematic structural diagram of a neural network training device according to an embodiment of the present disclosure;
fig. 17 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
For the purpose of making the purpose, technical solutions and advantages of the present disclosure clearer, the present disclosure will be described in further detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by a person of ordinary skill in the art without making creative efforts shall fall within the protection scope of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" are only used to distinguish similar objects and do not denote a particular order; it should be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the disclosure described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the disclosure only and is not intended to be limiting of the disclosure.
Before the embodiments of the present disclosure are described in further detail, the terms and expressions referred to in the embodiments of the present disclosure are explained; the following explanations apply to these terms and expressions.
1) Knowledge distillation: a Teacher-Student mode is adopted, in which a complex, large model serves as the teacher model (Teacher) and the student model (Student) has a simpler structure; the Teacher is used to assist the training of the Student model. The aim is to transfer the "dark" knowledge learned by the high-capacity teacher model to the student model by means of soft labels (soft targets), which can be the class probabilities or feature representations output by the teacher and contain more complete information than a single hard label.
Generally, a large model is often a single complex network or a collection of networks, and has good performance and generalization capability, while a small model has limited expression capability because of a small network size. Therefore, the knowledge learned by the large model can be used for guiding the training of the small model, so that the small model has the performance equivalent to that of the large model, but the number of parameters is greatly reduced, and the model compression and acceleration are realized.
2) Self-distillation: a special case of knowledge distillation, Self-Distillation refers to distilling knowledge from a model to itself, which can be viewed as follows: two separate models F1 and F2 are first ensemble-learned, and the ensemble is then distilled into F2. That is, the teacher model is an integrated version of the student model, which is called Self-Ensemble.
3) Knowledge integration: soft labels are enhanced by integrating knowledge from multiple pre-trained teacher models, for example MEAL (Multi-Model Ensemble via Adversarial Learning), or from multiple students, for example the multi-student-based KDCL (Knowledge Distillation via Collaborative Learning).
In recent years, deep neural networks have driven the rapid development of computer vision, with the image classification task being considered one of the most basic and important tasks. There is a great deal of work currently aimed at overcoming the bottleneck in performance improvement of image classification tasks, especially on large-scale datasets.
Recent research shows that the bottleneck of supervised image classification training derives from inaccurate "hard labels", i.e., manually annotated single labels (one-hot labels: one picture, one category). The imperfect learning target caused by this problem is a key factor that prevents classification accuracy from being further improved, and it gives supervised learning a significant limitation.
The proposal of the knowledge distillation algorithm provides a better solution to the problem, namely, a soft probability vector predicted by a pre-trained teacher model, namely a 'soft label' is used as a training supervision to guide the training of a student model.
On the other hand, edge devices that need to handle deep-learning-related tasks locally, such as mobile phones and wearable devices, are generally limited by constrained resources and power consumption as well as latency and cost. The introduction of knowledge distillation algorithms may drive the widespread use of deep-learning-based products on edge devices.
It is well known that a collection of multiple networks generally yields better predictions than a single network in the collection. Thus, in the most advanced approach, multiple teachers or students are used to encode complementary knowledge, such as by integrating the knowledge of multiple pre-trained teacher models to enhance soft tags, which are more reliable learning objectives, and we refer to such algorithms as the knowledge-integrated distillation algorithm.
Illustratively, fig. 1 is a schematic diagram of a knowledge-integrated distillation algorithm of a multi-teacher model in the related art. As shown in fig. 1, the prediction probabilities of teacher models #1 to #N for the anchor sample are {p_1, …, p_N}, and the prediction probability of the student model for the anchor sample is p_anchor. Knowledge integration, such as a weighted average, is carried out on {p_1, …, p_N}, the obtained result is taken as the soft label, and it is migrated to the student model by way of distillation.
Illustratively, fig. 2 is a schematic diagram of a knowledge-integrated distillation algorithm of a multi-student model in the related art. As shown in fig. 2, the prediction probabilities of student models #1 to #N for the anchor sample are {p_1, …, p_N}; knowledge integration is carried out on {p_1, …, p_N}, the obtained result is taken as the soft label, and the soft label is migrated to each student model by way of distillation.
However, while knowledge-integrated distillation algorithms can provide effective training supervision, they have to rely on additional networks or branches, which are highly complex and greatly increase training time and space costs.
In view of the above, how to obtain effective soft labels with less training time and space cost to perform more accurate network training supervision is a problem to be solved, which is discussed in the embodiments of the present disclosure and will be described below with reference to the following specific embodiments.
The embodiment of the disclosure provides a neural network training method, a neural network training device, a neural network training apparatus, and a computer storage medium, wherein for each training sample in a neural network, the generation of a soft label of the training sample can be assisted by using the similarity between other samples and the training sample and the prediction probability of other samples, and efficient training supervision is performed on the neural network based on the soft labels satisfying a preset number of training samples. It can be seen that the conventional cross-network knowledge integration is replaced by the cross-sample knowledge integration under the same neural network, and the knowledge integration based on the similarity between the samples is realized and the effective soft label is obtained on the basis of only utilizing a single network.
The neural network training method provided by the embodiment of the disclosure is applied to computer equipment. An exemplary application of the computer device provided by the embodiment of the present disclosure is described below, and the computer device provided by the embodiment of the present disclosure may be implemented as a mobile phone terminal, a notebook computer, a tablet computer, a desktop computer, a smart television, a vehicle-mounted device, a wearable device, an industrial device, and the like.
In the following, the technical solutions in the embodiments of the present disclosure will be clearly and completely described with reference to the drawings in the embodiments of the present disclosure.
An embodiment of the present disclosure provides a neural network training method, fig. 3 is a schematic diagram illustrating an implementation flow of the neural network training method provided in the embodiment of the present disclosure, as shown in fig. 3, in an embodiment of the present disclosure, a method for a computer device to execute neural network training may include the following steps:
and S100, executing a loop process until soft labels meeting the preset number of anchor samples are obtained.
It can be understood that the more accurate the label of the training sample is, the better the model training effect is, and in order to overcome the bottleneck and the defect brought by artificial hard labels on model training, a more robust soft label can be generated for the training sample through a knowledge distillation algorithm, so that the performance of the model is improved by performing efficient training supervision through the soft label.
In some embodiments, a soft label corresponding to each training sample in the full set of training samples may be generated by a knowledge distillation algorithm; or generating corresponding soft labels for part of training samples in the whole training sample set, such as a certain batch of training samples, by a knowledge distillation algorithm.
Fig. 4 is a schematic diagram of an implementation flow of a neural network training method provided in the embodiment of the present disclosure, and as shown in fig. 4, a loop process provided in the embodiment of the present disclosure includes the following steps:
s101, obtaining a current training sample set, and determining a current anchoring sample and at least one knowledge transfer sample from the current training sample set in each period of executing a cyclic process; the current anchor sample is any one of the current training sample set, and the at least one knowledge transfer sample is at least one other sample except the current training sample set and the current anchor sample.
In some embodiments, the current training sample set may be all data sets used for neural network training, or may be any one of a plurality of training data sets used for neural network training.
In some embodiments, the anchor sample refers to a training sample in the current training dataset that needs to be subjected to soft label generation; the knowledge transfer sample refers to at least one other training sample in the training data set for acting on anchor sample soft label generation.
In the disclosed embodiment, the computer device may determine, at each cycle, any one training sample of interest that was not previously determined to be an anchor sample from the set of training samples as a current anchor sample, and determine at least one other sample in the set of current training samples other than the current anchor sample as a knowledge transfer sample.
When different training samples in the training sample set are used as the current anchor sample, the corresponding knowledge transfer samples are different. For example, suppose the same batch of samples is {x_1, x_2, x_3, …, x_R}. If x_1 is determined as the current anchor sample, the remaining {x_2, x_3, …, x_R} serve as the knowledge transfer samples corresponding to x_1; if x_2 is determined as the current anchor sample, the remaining {x_1, x_3, …, x_R} serve as the knowledge transfer samples corresponding to x_2.
In an implementation of the disclosed embodiment, in the training sample set, each training sample has at least one other training sample with view similarity to it.
And S102, determining the similarity between the current anchor sample and each knowledge transfer sample, the prediction probability of the current anchor sample and the prediction probability of each knowledge transfer sample based on the neural network.
It should be appreciated that in embodiments of the present disclosure, to enable the transfer and integration of knowledge between samples based on view similarity of the samples, after at least one knowledge transfer sample corresponding to a current anchor sample and the anchor sample is determined from a set of training samples, a sample similarity between the anchor sample and the knowledge transfer sample may be determined, and "knowledge" used for the transfer and integration based on the similarity between the anchor sample and the knowledge transfer sample.
In some embodiments, the similarity between the current anchor sample and each knowledge transfer sample may be determined by a neural network. Here, after the current anchor sample and the at least one knowledge transfer sample are input into the neural network, the neural network may determine sample characteristics of the current anchor sample and sample characteristics of each knowledge transfer sample, respectively, and further determine a sample similarity between the current anchor sample and each knowledge transfer sample based on the sample characteristics of the current anchor sample and the sample characteristics of each knowledge transfer sample.
In some embodiments, the "knowledge" used to transfer and integrate between anchor and knowledge transfer samples may be the predicted probabilities of the samples over the tasks of image classification, target detection, image segmentation, and the like. For example, the probability that a sample belongs to a certain class on the classification task.
Wherein the predicted probability of the anchor sample and the predicted probability of each knowledge transfer sample may be determined by a neural network.
S103, determining the soft label of the current anchor sample based on the similarity between the current anchor sample and each knowledge transfer sample, the prediction probability of the current anchor sample and the prediction probability of each knowledge transfer sample.
In the embodiment of the present disclosure, after determining the similarity between the anchor sample and each knowledge transfer sample and the prediction probability of each of the anchor sample and each of the knowledge transfer samples through the neural network, the transfer and integration operation of knowledge may be further performed based on the similarity between the current anchor sample and each of the knowledge transfer samples, the prediction probability of the current anchor sample, and the prediction probability of each of the knowledge transfer samples, so as to act on the generation of the soft label of the current anchor sample.
It is understood that, in the embodiment of the present disclosure, the degree of influence of each knowledge transfer sample on the "knowledge" of the anchor sample may be characterized based on the similarity between the current anchor sample and each knowledge transfer sample, so that the prediction probability of each knowledge transfer sample performs a weighted transfer operation of the knowledge according to the degree of influence of the "knowledge", and the knowledge of each knowledge transfer sample transferred with different degrees of influence is integrated for at least one knowledge transfer sample corresponding to the current anchor sample, and jointly applied to the generation of the soft label of the current anchor sample.
Therefore, the knowledge integration between different network models under the knowledge distillation algorithm is not performed any more, but the knowledge integration between different samples is performed under the single network model based on the self-distillation algorithm, so that the training sample soft label is generated. In this way, a plurality of other network models are not needed to be used for generating the soft label, but in a single network model, for each training sample, the 'dark knowledge' of the other samples is transmitted and integrated based on the similarity between the training sample and the other samples so as to be used for generating the soft label of the training sample.
And S110, training the neural network at least based on the soft labels of the anchor samples with the preset number and the anchor samples with the preset number.
In the embodiment of the present disclosure, after the soft labels satisfying the preset number of anchor samples are generated, the objective loss function of the neural network may be updated based on at least the preset number of anchor samples and the robust soft labels of the anchor samples, so as to achieve update optimization of the neural network.
Therefore, in order to improve the generalization capability and the accuracy of the training supervision of the model, in the process of the model supervision training, the model training is not performed based on the hard label corresponding to the training sample, but the robust soft label obtained by the self-distillation method of knowledge integration among the samples is combined to perform the model training.
The embodiment of the disclosure provides a neural network training method, which can assist the generation of a soft label of a training sample by utilizing the similarity between other samples and the training sample and the prediction probability of other samples aiming at each training sample under a neural network, and further perform efficient training supervision for the neural network based on the soft label meeting a preset number of training samples. It can be seen that the conventional cross-network knowledge integration is replaced by the cross-sample knowledge integration under the same neural network, and the knowledge integration based on the similarity between the samples is realized and the effective soft label is obtained on the basis of only utilizing a single network.
In an implementation manner of the embodiment of the present disclosure, fig. 5 is a schematic flow chart illustrating an implementation flow of a neural network training method proposed in the embodiment of the present disclosure, as shown in fig. 5, the method for determining, by a computer device, a similarity between a current anchor sample and each knowledge transfer sample, and a prediction probability of the current anchor sample and a prediction probability of each knowledge transfer sample, further includes the following steps:
s102a, the neural network based encoder, determines sample characteristics of the current anchor sample and sample characteristics of each knowledge transfer sample.
And S102b, determining the similarity between the current anchor sample and each knowledge transfer sample based on the sample characteristics of the current anchor sample and the sample characteristics of each knowledge transfer sample.
In the embodiment of the disclosure, the neural network is provided with an encoder, and the encoder is used for performing feature extraction and feature encoding on each training sample to obtain sample features represented in a vector form.
In some embodiments, feature extraction may be performed on the current anchor sample and each knowledge transfer sample respectively through an encoder of the neural network, sample features of the current anchor sample and sample features of each knowledge transfer sample are obtained respectively, and feature coding may be performed on sample features thereof respectively, and the sample features are expressed in a vector form. That is, the encoder of the neural network determines the sample features characterized in the form of vectors of the current anchor sample and the sample features characterized in the form of vectors of each knowledge transfer sample.
Here, view similarity, i.e., sample similarity, between the current anchor sample and each knowledge transfer sample may be determined based on sample characteristics.
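For concreteness only, the sketch below shows one way such an encoder/classifier pair could be wired up in PyTorch; the layer sizes and input dimensions are illustrative assumptions and are not part of the disclosed method.

```python
import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    """Single network with an encoder producing feature vectors and a
    classifier producing logits over K classes (illustrative sizes)."""
    def __init__(self, in_dim=3 * 32 * 32, feat_dim=128, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        features = self.encoder(x)          # sample features in vector form
        logits = self.classifier(features)  # used for prediction probabilities
        return features, logits
```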
In an embodiment of the present disclosure, fig. 6 is a schematic diagram illustrating an implementation flow of a neural network training method provided by an embodiment of the present disclosure, and as shown in fig. 6, the method for determining a similarity between a current anchor sample and each knowledge transfer sample based on a sample feature of the current anchor sample and a sample feature of each knowledge transfer sample includes the following steps:
s102b1, carrying out normalization processing on the sample characteristics of the current anchoring sample to obtain the normalized sample characteristics of the current anchoring sample.
S102b2, carrying out normalization processing on the sample characteristics of each knowledge transfer sample to obtain the normalized characteristics of each knowledge transfer sample.
S102b3, carrying out dot product operation processing on the normalized sample characteristics of the current anchoring sample and the normalized characteristics of each knowledge transfer sample to obtain the similarity between the current anchoring sample and each knowledge transfer sample.
It can be understood that before determining the similarity between samples based on the sample features, the sample features in the form of vectors need to be converted into the same dimension, so that the calculation of the strict sample similarity has a uniform standard.
Here, the computer device may perform normalization processing on the sample features of the current anchor sample and the sample features of each knowledge transfer sample, respectively, to obtain normalized sample features of the current anchor sample and normalized features of each knowledge transfer sample, so as to implement conversion of the sample features characterized in a vector form into the same dimension.
And then, performing dot product operation processing on the normalized features of the current anchor sample and the normalized features of each knowledge transfer sample to further obtain the similarity between the current anchor sample and each knowledge transfer sample.
The calculation formula of the similarity between the samples is as follows:
A(i, j) = σ(F(x_i))^T σ(F(x_j))        (1)
wherein, in formula (1), F(x_i) is the sample feature, characterized in vector form, of the i-th anchor sample determined by the encoder of the neural network, F(x_j) is the sample feature of the j-th knowledge transfer sample corresponding to the i-th anchor sample, and σ(·) = ·/||·||_2 is the l2 normalization function.
Thus, based on formula (1), the pairwise sample similarity between the current anchor sample and each knowledge transfer sample is obtained; the similarity results are stored in an affinity matrix A, and, assuming that the number of samples in the training sample set is N, the affinity matrix representing the similarities between all samples is A ∈ R^(N×N).
Therefore, in the embodiment of the present disclosure, the dot product operation may be performed on the sample characteristics of the current anchor sample and each knowledge transfer sample, which are normalized and converted to the same dimension, so as to determine the pairwise similarity between two samples.
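Following formula (1), the pairwise similarity computation over one batch of encoder features can be sketched as follows; `features` is assumed to be the (N, d) matrix of sample features produced by the encoder.

```python
import torch
import torch.nn.functional as F

def affinity_matrix(features):
    """features: (N, d) sample features of the current batch.
    Returns the (N, N) affinity matrix A with
    A[i, j] = sigma(F(x_i))^T sigma(F(x_j)), where sigma is l2 normalization."""
    normed = F.normalize(features, p=2, dim=1)   # l2-normalize each sample feature
    return normed @ normed.t()                   # pairwise dot products
```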
S102c, the neural network based classifier, determines the prediction probability of the current anchor sample and the prediction probability of each knowledge transfer sample.
In the embodiment of the present disclosure, the neural network is further provided with a classifier, and the classifier is configured to determine a prediction probability corresponding to each training sample.
For example, under the classification task, for the current anchor sample and each knowledge transfer sample, the classifier of the neural network may calculate the prediction probability of the current anchor sample and each knowledge transfer sample through the softmax function. The calculation formula of the sample prediction probability is as follows:
p_i^(k) = exp(z_i^(k) / T) / Σ_{k'=1}^{K} exp(z_i^(k') / T)        (2)
wherein, in formula (2), p_i^(k) represents the probability that the training sample belongs to class k, T is a temperature hyper-parameter, and K is the total number of classes; z_i = [z_i^(1), …, z_i^(K)] is the logit vector of any one training sample in the training sample set, and the denominator is the result of summing the exponentiated, temperature-scaled logits over the K classes.
For example, under the classification task, the prediction probabilities of the training samples in the training sample set can be expressed as P = {p_1, …, p_N}, where the i-th anchor sample satisfies Σ_{k=1}^{K} p_i^(k) = 1, i.e. the probabilities of the i-th anchor sample belonging to the first class, …, and the K-th class sum to 1.
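A short sketch of the temperature-scaled softmax of formula (2) is given below; `logits` are assumed to be the (N, K) classifier outputs for the batch, and the default temperature is an illustrative value.

```python
import torch

def prediction_probabilities(logits, T=2.0):
    """logits: (N, K) classifier outputs; T: temperature hyper-parameter.
    Returns the (N, K) prediction probabilities of formula (2); each row sums to 1."""
    return torch.softmax(logits / T, dim=1)
```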
It can be seen that, in the embodiment of the present disclosure, a neural network-based encoder may obtain a similarity between a current anchor sample and each knowledge transfer sample, and a neural network-based classifier obtains a prediction probability of the current anchor sample and a prediction probability of each knowledge transfer sample, so as to further implement generation of a soft label acting on the anchor sample based on the sample similarity, the prediction probability of the current anchor sample, and the prediction probability of the knowledge transfer sample.
In the embodiment of the present disclosure, fig. 7 is a schematic diagram illustrating an implementation flow of a neural network training method provided by the embodiment of the present disclosure, and as shown in fig. 7, based on a similarity between a current anchor sample and each of the knowledge transfer samples, a prediction probability of the current anchor sample, and a prediction probability of each of the knowledge transfer samples, the method for determining a soft label of the current anchor sample includes the following steps:
s103a, determining knowledge transfer parameters of each knowledge transfer sample to the current anchor sample based on the similarity between the current anchor sample and each knowledge transfer sample.
In the embodiment of the present disclosure, the prediction probability of each knowledge transfer sample is to be transferred to the current anchor sample with different weight values for the generation of the soft label of the current anchor sample, and the computer device may normalize the sample similarity by the softmax function, and calculate the "knowledge" transfer weight value, i.e., the knowledge transfer parameter, of each knowledge transfer sample to the current anchor sample.
Fig. 8 is a schematic diagram illustrating a sixth implementation flow of the neural network training method according to the embodiment of the present disclosure, and as shown in fig. 8, the method for determining the knowledge transfer parameters of each knowledge transfer sample to the current anchor sample based on the similarity between the current anchor sample and each knowledge transfer sample includes the following steps:
s103a1, accumulating at least one similarity between the current anchor sample and each knowledge transfer sample to obtain a similarity accumulated value.
S103a2, determining knowledge delivery parameters of each knowledge delivery sample to the current anchor sample based on the similarity between the anchor sample and each knowledge delivery sample and the accumulated value of the similarity.
In the embodiment of the disclosure, the computer device may first perform an accumulation and summation process on the similarity between the current anchor sample and each knowledge transfer sample to obtain a summation result, i.e., a similarity accumulated value, and then calculate, for each knowledge transfer sample, the knowledge transfer parameter of each knowledge transfer sample to the current anchor sample through the softmax function in combination with the similarity between the current anchor sample and each knowledge transfer sample and the similarity accumulated value.
The calculation formula of the knowledge transfer parameter is as follows:
Â(i, j) = exp(A(i, j)) / Σ_{j'≠i} exp(A(i, j'))        (3)
wherein, in formula (3), exp(A(i, j)) characterizes the sample similarity between the i-th anchor sample and the j-th knowledge transfer sample, and Σ_{j'≠i} exp(A(i, j')) characterizes the accumulated similarity value over all knowledge transfer samples corresponding to the i-th anchor sample.
After the normalization of the sample similarities in formula (3), the "knowledge" transfer weight values of all knowledge transfer samples for the current anchor sample accumulate to 1, i.e. Σ_{j≠i} Â(i, j) = 1.
It can be seen that, in the embodiment of the present disclosure, a "knowledge" transfer weight value of each knowledge transfer sample to an anchor sample may be obtained through a normalization process on the sample similarity, and then a weighted transfer of the prediction probability of each knowledge transfer sample is performed according to the weight value.
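Combining formula (3) with the diagonal masking described in a later embodiment (keeping the diagonal of the affinity matrix equal to 0), the batch-level computation of the knowledge transfer parameters can be sketched as follows; the function name and the use of PyTorch are assumptions made for illustration.

```python
import torch

def transfer_weights(A):
    """A: (N, N) affinity matrix of the batch.
    Returns (N, N) knowledge transfer parameters: row i holds the weights with
    which each other sample j transfers knowledge to anchor i (formula (3));
    the diagonal is excluded so a sample never reinforces its own knowledge."""
    diag = torch.eye(A.size(0), dtype=torch.bool, device=A.device)
    masked = A.masked_fill(diag, float("-inf"))
    return torch.softmax(masked, dim=1)   # each row sums to 1 over j != i
```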
S103b, determining the soft label of the current anchor sample based on the knowledge transfer parameters of each knowledge transfer sample to the current anchor sample, the prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample.
In the embodiment of the present disclosure, after obtaining the knowledge transfer parameter of each knowledge transfer sample for the current anchor sample, the determination of the soft label of the current anchor sample may be performed based on the "knowledge" transfer weight value of each knowledge transfer sample for the current anchor sample, the prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample until the "dark knowledge" of the transfer of each knowledge transfer sample is completely fused to the current anchor sample, so that an accurate and robust soft label of the current anchor sample may be obtained.
Fig. 9 is a schematic diagram illustrating an implementation flow of a neural network training method according to an embodiment of the present disclosure, and as shown in fig. 9, a specific method for determining a soft label of a current anchor sample based on knowledge transfer parameters of each knowledge transfer sample to the current anchor sample, a prediction probability of the current anchor sample, and the prediction probability of each knowledge transfer sample includes the following steps:
s103b1, executing knowledge transfer processing once on the knowledge transfer parameters of the current anchor sample and the prediction probability of each knowledge transfer sample based on each knowledge transfer sample, and obtaining the initial knowledge transfer probability of the current anchor sample.
In the embodiment of the present disclosure, based on the knowledge transfer parameter of each knowledge transfer sample to the current anchor sample and the prediction probability of each knowledge transfer sample, the "dark knowledge" of at least one knowledge transfer sample may be subjected to one weighted integration transfer, so as to obtain the initial knowledge transfer probability of the current anchor sample.
In some embodiments, the prediction probability of each knowledge transfer sample may be weighted based on its "knowledge" transfer weight to the current anchor sample. Here, the knowledge transfer probability of each knowledge transfer sample for the current anchor sample may be determined first based on the knowledge transfer parameters of each knowledge transfer sample for the current anchor sample and the predicted probability of each knowledge transfer sample.
In some embodiments, each knowledge transfer sample may be integrated with the "knowledge" that needs to be weighted and transferred to the current anchor sample. Here, the at least one knowledge transfer probability of the current anchor sample may be accumulated for each knowledge transfer sample to obtain an accumulated value of the knowledge transfer probabilities, and then the accumulated value of the knowledge transfer probabilities is used to perform a knowledge transfer process for the first time to obtain an initial knowledge transfer probability that the at least one knowledge transfer sample is transferred to the current anchor sample for the first time.
Here, the calculation formula of the knowledge transfer probability is as follows:

$$\hat{q}_i = \sum_{j \neq i} \hat{A}_{ij}\, p_j \tag{4}$$

In formula (4), $\hat{q}_i$ is the knowledge transfer probability corresponding to the i-th anchor sample, $\hat{A}_{ij}$ is the normalized sample similarity (i.e. the knowledge transfer parameter) of the i-th anchor sample and the j-th knowledge transfer sample, and $p_j$ is the initial prediction probability of the j-th knowledge transfer sample.

Based on the above, the prediction probabilities of the training samples in the training sample set can be written as the matrix $P = [p_1, \ldots, p_N]^{\top} \in \mathbb{R}^{N \times K}$, where $N$ is the number of training samples in the batch and $K$ is the number of classes.
In order to avoid the enhancement of the current anchor sample's own knowledge, we use $A = A \circ (1 - I)$ to discard the diagonal entries of the affinity matrix $A$, where $I$ is an identity matrix and $\circ$ indicates element-by-element multiplication, i.e., the diagonal of the affinity matrix $A$ is kept equal to 0. Consequently, the weights $\hat{A}_{ij}$ in formula (4) give no weight to the anchor sample itself, and the transferred "dark knowledge" $\hat{q}_i$ only retains the weighted, integrated prediction probabilities of the at least one knowledge transfer sample.

Intuitively, if the i-th anchor sample and the j-th knowledge transfer sample have a higher similarity, i.e. $\hat{A}_{ij}$ is larger, then the prediction probability $p_j$ of the j-th knowledge transfer sample is transferred to $\hat{q}_i$ with a greater weight.

In the embodiment of the present disclosure, after the knowledge transfer probability of each training sample in the same batch, taken in turn as the anchor sample, is calculated based on formula (4), the "dark knowledge" in the same batch may be transferred in parallel, i.e., the knowledge transfer probabilities of all samples of the same batch are obtained at once as

$$\hat{Q} = \hat{A} P,$$

wherein $\hat{A}$ characterizes the knowledge transfer parameters of each training sample when taken as an anchor sample, and $P$ stacks the prediction probabilities of all training samples in the batch.
it can be seen that weighting and integration of other sample prediction probabilities for delivery to the current anchor sample can be implemented in embodiments of the present disclosure.
S103b2, performing knowledge fusion processing once based on the initial knowledge transfer probability and the prediction probability of the current anchor sample to obtain the initial soft label of the current anchor sample.
In the embodiment of the present disclosure, after the "dark knowledge" of at least one knowledge transfer sample is subjected to weighted integration transfer, a knowledge fusion process may be performed based on the transferred "dark knowledge" and the currently existing "knowledge" of the current anchor sample to obtain an initial soft tag of the current anchor sample.
Here, after performing one weighted integration transfer on the prediction probability of the at least one knowledge transfer sample, a knowledge fusion process may be performed on the initial knowledge transfer probability of the current anchor sample and the prediction probability of the current anchor sample based on the at least one knowledge transfer sample to obtain an initial soft label.
The generation function for the initial soft label of the current anchor sample is as follows:

$$q_i^{(1)} = \omega\, \hat{q}_i + (1-\omega)\, p_i \tag{5}$$

wherein, in formula (5), $q_i^{(1)}$ is the initial soft label of the i-th training sample, $\omega$ is a weighting factor and a hyperparameter, and $\omega \in [0, 1]$.

It should be understood that, in the case of parallel transfer of the "dark knowledge" of samples in the same batch, the initial soft label generation function for all training samples in the same batch is as follows:

$$Q^{(1)} = \omega\, \hat{A} P + (1-\omega)\, P \tag{6}$$

wherein, in formula (6), $Q^{(1)}$ contains the initial soft label of each training sample under the same batch, one row per training sample.
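A minimal sketch of formula (6), assuming the tensors `A_hat` and `P` produced by the previous sketch (the default value of `omega` is a placeholder hyperparameter, not a setting from the patent):

```python
import torch

def initial_soft_labels(A_hat: torch.Tensor, P: torch.Tensor, omega: float = 0.5) -> torch.Tensor:
    """Formula (6): one knowledge transfer (A_hat @ P) fused with the batch's own predictions P."""
    return omega * (A_hat @ P) + (1.0 - omega) * P
```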
S103b3, based on the initial soft label of the current anchoring sample, executing a loop process until the prediction probability of at least one knowledge transfer sample is smaller than a preset probability threshold value, to obtain the soft label of the current anchoring sample.
It should be noted that, in order to enable the "dark knowledge" transferred by the at least one knowledge transfer sample to fully contribute to the generation of the soft label of the current anchor sample, the transfer and integration of knowledge may be performed multiple times until the knowledge transferred by the at least one knowledge transfer sample is fully fused into the current anchor sample.
Fig. 10 is an eighth implementation flow diagram of the neural network training method provided in the embodiment of the present disclosure, and as shown in fig. 10, the loop process includes the following steps:
S103b31, in each period of the loop process, executing knowledge transfer processing based on the soft label of the current anchor sample obtained in the previous period and each knowledge transfer parameter, to obtain the knowledge transfer probability of the current anchor sample.
S103b32, executing knowledge fusion processing based on the knowledge transfer probability of the current anchor sample and the prediction probability of the current anchor sample, to obtain the soft label of the current anchor sample for the next period.
It should be understood that, in order to improve the accuracy of the soft label and to better improve the performance of the student model, the above process of knowledge weighted transfer and integration may be performed many times until convergence, so as to achieve sufficient knowledge fusion.
Wherein, the knowledge transfer and fusion process is as follows:

$$q_i^{(t)} = \omega \sum_{j \neq i} \hat{A}_{ij}\, q_j^{(t-1)} + (1-\omega)\, p_i \tag{7}$$

wherein, in formula (7), $t$ represents the $t$-th iteration of transfer and integration, and $q_j^{(t-1)}$ is the soft label obtained in the previous cycle (each sample of the batch being taken in turn as an anchor sample), with the initialization $q_i^{(0)} = p_i$, so that formula (5) corresponds to the case $t = 1$.

It can be seen that, in each cycle of the cyclic process, the knowledge transfer processing is first performed based on the soft labels obtained in the previous cycle and each knowledge transfer parameter, to obtain the knowledge transfer probability of the current anchor sample, namely $\hat{q}_i^{(t)} = \sum_{j \neq i} \hat{A}_{ij}\, q_j^{(t-1)}$; then, the knowledge fusion processing is performed based on the knowledge transfer probability $\hat{q}_i^{(t)}$ of the current anchor sample and the prediction probability $p_i$ of the current anchor sample, so as to obtain the soft label $q_i^{(t)} = \omega\, \hat{q}_i^{(t)} + (1-\omega)\, p_i$ of the current anchor sample for the next period.
Here, in the case of parallel transfer of the same batch of training samples, the knowledge transfer and fusion process is as follows:

$$Q^{(t)} = \omega\, \hat{A}\, Q^{(t-1)} + (1-\omega)\, P \tag{8}$$

wherein, in formula (8), $t$ represents the $t$-th iteration of transfer and integration, and $Q^{(t-1)}$ contains the soft labels of the training samples in the same batch from the previous cycle, with $Q^{(0)} = P$.
It can be seen that, in each cycle of the cyclic process, the soft labels of the training samples in the same batch obtained in the previous cycle are first used to perform the knowledge transfer processing, and then the knowledge fusion processing is performed with the prediction probability of each training sample, so as to obtain the soft label of each training sample for the next cycle.
When the loop process is executed many times, i.e., the knowledge transfer processing and the knowledge fusion processing are iterated an unbounded number of times and the repeatedly propagated prediction probability contribution of the at least one knowledge transfer sample of each training sample becomes smaller than a preset probability threshold (for example, in the limit where it approaches zero), we have, as $t \to \infty$,

$$(\omega \hat{A})^{t} P \to 0,$$

and at the same time,

$$Q^{(t)} = (1-\omega) \sum_{k=0}^{t-1} (\omega \hat{A})^{k} P + (\omega \hat{A})^{t} P \;\to\; (1-\omega)\,(I - \omega \hat{A})^{-1} P,$$

thereby obtaining the soft label of each training sample.

Based on the above process of multiple knowledge transfers and fusions, the generating function of the soft label of each training sample can therefore be estimated as:

$$Q = (1-\omega)\,(I - \omega \hat{A})^{-1} P \tag{9}$$
Therefore, the "dark knowledge" transferred by the at least one knowledge transfer sample of each training sample is fully fused into each training sample, and the accuracy of the soft label corresponding to each training sample is higher, approaching 100%.
For all training samples in the same batch, the resulting soft labels naturally satisfy $\sum_{k=1}^{K} Q_{ik} = 1$ for each sample, so no additional normalization process is required.
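To make the multi-step transfer and fusion concrete, the sketch below (an assumption building on the tensors of the earlier sketches, not code from the patent) iterates formula (8) and compares it with the closed-form estimate of formula (9):

```python
import torch

def propagate_soft_labels(A_hat: torch.Tensor, P: torch.Tensor,
                          omega: float = 0.5, steps: int = 30) -> torch.Tensor:
    """Iterate formula (8): Q(t) = omega * A_hat @ Q(t-1) + (1 - omega) * P, starting from Q(0) = P."""
    Q = P.clone()
    for _ in range(steps):
        Q = omega * (A_hat @ Q) + (1.0 - omega) * P
    return Q

def closed_form_soft_labels(A_hat: torch.Tensor, P: torch.Tensor, omega: float = 0.5) -> torch.Tensor:
    """Closed-form estimate of formula (9): Q = (1 - omega) * (I - omega * A_hat)^-1 @ P."""
    eye = torch.eye(A_hat.size(0), device=A_hat.device, dtype=A_hat.dtype)
    return (1.0 - omega) * torch.linalg.solve(eye - omega * A_hat, P)
```

Because the rows of `A_hat` sum to 1 and `omega` is smaller than 1, the iteration is a contraction, so both routes converge to the same soft labels, whose rows still sum to 1 without extra normalization.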
It can be seen that, in the embodiment of the present disclosure, for each training sample in the same batch, the similarity between the training sample and every other sample in the same batch may be used to weight, integrate, and transfer the "dark knowledge" of the other samples to the current training sample until the knowledge is fully fused, so as to obtain an accurate and robust soft label for each training sample.
In this disclosure, fig. 11 is a schematic diagram illustrating an implementation flow of a neural network training method provided in this disclosure, as shown in fig. 11, the neural network training method further includes the following steps:
S120, a training data set is obtained, wherein the training data set comprises at least one batch of training data subsets.
And S130, selecting, from the training data set, a batch of training data subsets that has not previously been selected as a training sample set, as the current training sample set.
In the embodiment of the present disclosure, a training data set, such as ImageNet (data set), may be obtained, but considering that the training data set is too large, and the training data cannot be loaded into a computing device at one time in practical application, we may divide the training data set into at least one training data subset with a small occupied capacity, so as to perform at least one batch of neural network training.
In one implementation of the disclosed embodiment, the neural network training may be performed in the form of multiple batches, namely mini-batch. Furthermore, any one of the training subsets of the plurality of training data subsets may be determined as the training sample set, and the soft label of the anchor sample may be obtained by performing a self-distillation algorithm of knowledge integration of S101 to S103.
When the training sample set is a batch of training data subset, the self-distillation algorithm of the knowledge integration of S101-S103 may be performed on each training sample in the training sample set to obtain the soft label corresponding to each training sample in the training sample set.
Here, after performing the above-mentioned S100-S110 with a certain batch of training data subsets as the current training sample set, another batch of training data subsets that have not been determined as the training sample set before in the training data set may be determined as the training sample set of the next round, and the self-distillation algorithm of the knowledge integration of the above-mentioned S101-S103 and the neural network training method of S100-S110 are performed to improve the network performance.
In the embodiment of the present disclosure, fig. 12 is a schematic view illustrating an implementation flow of a neural network training method provided in the embodiment of the present disclosure, as shown in fig. 12, the neural network training method further includes the following steps:
S140, random sampling processing is carried out on the training data set to obtain at least one first training data.
S150, determining a hard label of each first training data, and continuously performing similarity sampling processing on the remaining data which are not selected as the first training data in the training data set based on the hard label of each first training data to obtain at least one second training data corresponding to each first training data.
And S160, using a batch of training data subsets constructed based on at least one first training data and at least one second training data corresponding to each first training data as a current training sample set.
It will be appreciated that, in order to enable weighted transfer and integration of knowledge between samples based on the similarity between samples, it must first be ensured that each sample in the training samples has at least one other sample that is visually similar to it.
In an implementation manner of the embodiment of the present disclosure, a class of data samplers may be provided, that is, sampling of training samples based on view similarity is implemented on the basis of a general random sampling mechanism.
In the sampling process, the training data set ImageNet can be randomly sampled by the data sampler to obtain at least one first training data; an artificial hard label corresponding to each of the first training data is then determined.
Then, similarity sampling processing is performed on the remaining data in the training data set based on the hard label, that is, for each first training data, at least one second training data that is visually similar to it (i.e., has the same hard label) is sampled; a batch of training data subsets is then formed from the at least one first training data and the at least one second training data corresponding to each first training data, and this batch of training data subsets is used as a training sample set.
Thus, a plurality of batches of training data subsets in which the samples have visual similarity to one another can be selected from the training data set based on this method, and each batch of training data subsets can be used as the current training sample set when performing the self-distillation algorithm of knowledge integration in S101-S103 and the neural network training method in S100-S110.
For example, if the number of first training data obtained by random sampling is N, and M second training data with view similarity are selected for each first training data, the number of samples of the training data subset finally constituting one batch is N × (M + 1).
Therefore, by the data sampling method based on the sample similarity, each training sample in the current training sample set can be ensured to have at least one other sample similar to the visual property of the training sample, and further the weighted transfer of knowledge among the samples can be realized according to the similarity among the samples.
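A minimal sketch of such a sampler follows (an assumption of this description: samples sharing a hard label stand in for visually similar samples, and the function and parameter names are illustrative, not taken from the patent):

```python
import random
from collections import defaultdict
from typing import Dict, List

def sample_batch(labels: List[int], n_first: int, m_similar: int) -> List[int]:
    """Build one batch of indices of size n_first * (m_similar + 1):
    randomly draw n_first first samples, then for each draw m_similar
    samples sharing the same hard label (with replacement if the class is small)."""
    by_label: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)

    first = random.sample(range(len(labels)), n_first)       # random sampling: first training data
    batch: List[int] = []
    for i in first:
        pool = [j for j in by_label[labels[i]] if j != i]     # remaining data with the same hard label
        similar = random.choices(pool, k=m_similar) if pool else [i] * m_similar
        batch.extend([i] + similar)                           # first sample plus its similar samples
    return batch
```

With N first samples and M similar samples each, the resulting batch size is N x (M + 1), matching the example above.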
In this disclosure, fig. 13 is an eleventh implementation flow diagram of a neural network training method provided in this disclosure, and as shown in fig. 13, the method for training the neural network based on at least the soft labels of the anchor samples in the preset number and the anchor samples in the preset number further includes the following steps:
and S111, determining the relative entropy of each anchor sample based on the soft label of each anchor sample and the prediction probability corresponding to the anchor sample.
And S112, determining the cross entropy of each anchor sample based on the hard label of each anchor sample and the prediction probability corresponding to the anchor sample.
S113, training the neural network based on the cross entropy of the preset number of anchor samples and the relative entropy of the preset number of anchor samples.
In the embodiment of the present disclosure, the relative entropy, i.e., the KL divergence (KLD), and the cross entropy are both used to describe the difference between the true result distribution and the predicted result distribution of a sample.
In the embodiment of the present disclosure, when the soft labels of the anchor samples in the preset number are obtained, and the training of the neural network is performed based on at least the anchor samples in the preset number and the soft labels of the anchor samples, the training of the neural network may be implemented by calculating two types of differences of the training samples.
One of them uses the artificial hard label as the true result distribution of the sample and determines the cross entropy based on the difference between the artificial hard label and the prediction probability; the other uses the robust soft label as the "true" result distribution of the sample and determines the relative entropy based on the difference between the soft label and the prediction probability.
It should be appreciated that the neural network is trained to minimize both cross entropy and relative entropy, i.e., the predicted resultant distribution of the samples determined by the neural network approximates the actual resultant distribution of the samples. Here, since the robust soft labels satisfying the preset number of anchor samples are obtained through the above loop process to perform neural network training, the label accuracy is good, and then the performance of the correspondingly trained neural network is also improved.
Here, the loss function is as follows:

$$\mathcal{L} = \mathcal{L}_{ce}(p, y) + \lambda\, D_{KL}\!\left(q^{T} \,\|\, p^{T}\right) \tag{10}$$

wherein, in formula (10), $y \in \{1, \ldots, K\}$ is the hard label and $p(y) = [p(1), \ldots, p(K)]$ denotes its distribution, $p^{T}$ is the prediction probability softened with temperature $T$, $q^{T}$ is the soft label, $\lambda$ is the weight coefficient, and $D_{KL}(q^{T} \,\|\, p^{T})$ is the KL divergence.
It can be seen that the former part is the cross entropy determined according to the hard tag and the initial prediction probability, and the latter part is the KL divergence determined according to the soft tag and the initial prediction probability.
In the model training process, the cross entropy and the KL divergence value are calculated based on the formula (10), and further minimization is performed, so that the training of the neural network is realized, and the performance of the network model is improved.
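As an illustrative sketch only (assuming PyTorch tensors; the default values of `lam` and `T` are placeholders, not settings from the patent), the loss of formula (10) can be computed as follows:

```python
import torch
import torch.nn.functional as F

def training_loss(logits: torch.Tensor, hard_labels: torch.Tensor,
                  soft_labels: torch.Tensor, lam: float = 1.0, T: float = 4.0) -> torch.Tensor:
    """Cross entropy against hard labels plus KL divergence against the ensembled soft labels."""
    ce = F.cross_entropy(logits, hard_labels)                    # first term of formula (10)
    log_p_T = F.log_softmax(logits / T, dim=1)                   # temperature-softened predictions p^T
    kl = F.kl_div(log_p_T, soft_labels, reduction="batchmean")   # D_KL(q^T || p^T); soft_labels are probabilities
    # Some distillation implementations additionally scale the KL term by T**2 to balance gradients.
    return ce + lam * kl
```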
It can be seen that in the embodiments of the present disclosure, training a model with smaller data and higher learning rate can be achieved by using a highly accurate and robust soft tag.
FIG. 14 is a schematic diagram of the knowledge integration self-distillation algorithm proposed in the embodiments of the present disclosure. As shown in FIG. 14, $\{x_1, \ldots, x_N\}$ are the samples other than the anchor sample in the same mini-batch sample set; the prediction probability of the student model for the anchor sample is $p_{anchor}$, and the prediction probabilities for $\{x_1, \ldots, x_N\}$ are $\{p_1, \ldots, p_N\}$.
Further, knowledge integration may be performed on the prediction probabilities $\{p_1, \ldots, p_N\}$ corresponding to the other samples, and the obtained result may then be regarded as a soft label and migrated to the anchor sample by means of distillation.
It can be seen that, compared with Fig. 1 and Fig. 2, the knowledge integration proposed by the present application uses only one network: by aggregating the samples $\{x_1, \ldots, x_N\}$ other than the anchor sample in the same batch, the "dark knowledge" of different samples in the same batch is dynamically aggregated to generate a robust soft label, and the knowledge integration is completed within a single network, which saves memory and time cost to a great extent.
FIG. 15 is a schematic diagram of a second principle of the knowledge integration self-distillation algorithm proposed by the embodiment of the present disclosure. As shown in FIG. 15, the samples in the same batch of the training sample set include an anchor sample and at least one knowledge transfer sample $\{x_1, \ldots, x_N\}$; an encoder is applied to perform feature coding processing on the anchor sample and each knowledge transfer sample to obtain the sample feature $f_{anchor}$ of the anchor sample and the sample features $\{f_1, f_2, f_3, \ldots\}$ of the at least one knowledge transfer sample, and the similarity between the anchor sample and each knowledge transfer sample is estimated based on these sample features.
Further, the prediction probability corresponding to each sample is determined by the classifier of the current student model, including the prediction probability $p_{anchor}$ of the anchor sample and the prediction probabilities $\{p_1, \ldots, p_N\}$ of the at least one knowledge transfer sample; for the anchor sample, weighted transfer and integration are performed on $\{p_1, \ldots, p_N\}$ based on the sample similarities to form the soft label of the anchor sample, which is migrated to the anchor sample by means of distillation.
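For concreteness, a minimal encoder-plus-classifier pair in the spirit of Fig. 15 might look as follows (an illustrative architecture assumed for this description, not the patent's network; the feature dimension and class count are placeholders):

```python
import torch
import torch.nn as nn

class StudentModel(nn.Module):
    """Minimal encoder + classifier pair: features feed the similarity computation,
    probabilities feed the weighted knowledge transfer."""
    def __init__(self, feat_dim: int = 128, num_classes: int = 1000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor):
        f = self.encoder(x)                    # sample features f_anchor, f_1, f_2, ...
        p = self.classifier(f).softmax(dim=1)  # prediction probabilities p_anchor, p_1, ..., p_N
        return f, p
```

The (features, probabilities) pair returned here is exactly what the `knowledge_transfer` sketch given earlier consumes.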
Table 1 compares the effectiveness of the knowledge integration distillation algorithms, MEAL and KDCL, of the multi-teacher/student model with the knowledge weighted transfer and integrated self-distillation algorithm proposed by the embodiments of the present disclosure:
TABLE 1
Method | Number of training sessions | Label accuracy | Extra network
MEAL | 180 | 78.2 | ResNet-101 & 152
KDCL | 200 | 77.8 | ResNet-18
This scheme | 100 | 78.0 | None
Based on Table 1, it can be seen that the MEAL and KDCL methods shown in Fig. 1 and Fig. 2 both require additional networks for assistance, whereas the self-distillation algorithm with knowledge weighted transfer and integration proposed in the embodiment of the present disclosure does not. Moreover, the present scheme uses fewer training sessions in a single network, such as 100, to obtain results similar to those of the MEAL and KDCL methods of the related art, even achieving similar label performance with half the training sessions of the KDCL method.
Table 2 presents the effectiveness of the knowledge weighted transfer and integrated self-distillation algorithm proposed by the embodiments of the present disclosure on various network architectures:
TABLE 2
As can be seen from Table 2 (where the related art refers to training with the conventional cross-entropy loss), the knowledge weighted transfer and integrated self-distillation algorithm proposed in the embodiments of the present disclosure improves network performance with minimal computational overhead and little additional Graphics Processing Unit (GPU) time when applied to widely used architectures such as ResNet-50, to deeper/wider architectures such as ResNet-152 and ResNeXt-152, and to lighter architectures such as MobileNet-V2.
For example, the accuracy of the ResNet-50 architecture is improved from 76.8 to 78.0 with only 3.7% additional time.
Table 3 compares the effectiveness of the knowledge weighted transfer and integrated self-distillation algorithm proposed in the examples of the present disclosure with the self-distillation method of the related art:
TABLE 3
As can be seen from Table 3, the conventional self-distillation method and a series of label regularization algorithms such as label smoothing, Tf-KDreg, BAN, CS-KD and Tf-KDself are all based on a single network; however, the training results of the knowledge weighted transfer and integrated self-distillation algorithm provided by the embodiment of the disclosure on the ImageNet data set are superior to those of the traditional self-distillation and label regularization algorithms. For example, on the ResNet-50 architecture, the teacher-free Tf-KDreg regularization algorithm reaches a label accuracy of 77.5%, which is still 0.5% lower than that of the present scheme.
It can be seen that the self-distillation algorithm of knowledge weighted delivery and integration proposed by the embodiments of the present disclosure can not only save memory and time by implementing knowledge integration in a single network, but also generate equally strong soft labels by aggregating knowledge from a group of samples in the same small lot.
Based on the above embodiments, in an embodiment of the present disclosure, fig. 16 is a schematic structural diagram of a neural network training device provided in an embodiment of the present disclosure, and as shown in fig. 16, the neural network training device 10 includes an obtaining unit 11, a training unit 12, a selecting unit 13, a sampling unit 14, and a determining unit 15.
The acquiring unit 11 is configured to execute a loop process until soft labels satisfying a preset number of anchor samples are obtained; wherein the cyclic process comprises the steps of: acquiring a current training sample set, and determining the current anchoring sample and at least one knowledge transfer sample from the current training sample set in each period of executing a cyclic process; wherein the current anchor sample is any one of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set except the current anchor sample; determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge delivery samples, and a predicted probability of the current anchor sample and a predicted probability of each of the knowledge delivery samples; determining a soft label for the current anchor sample based on a similarity between the current anchor sample and each of the knowledge transfer samples, a prediction probability for the current anchor sample, and a prediction probability for each of the knowledge transfer samples.
The training unit 12 is configured to train the neural network based on at least the soft labels of the anchor samples in the preset number and the anchor samples in the preset number.
In some embodiments, the training unit 12 is specifically configured to determine a relative entropy of each of the anchor samples based on the soft label of each of the anchor samples and the prediction probability corresponding to the anchor sample; determining a cross entropy for each of the anchor samples based on the hard label for each of the anchor samples and the prediction probability corresponding to the anchor sample; training the neural network based on the cross entropy of the preset number of anchor samples and the relative entropy of the preset number of anchor samples.
In some embodiments, the neural network comprises an encoder and a classifier, and the obtaining unit 11 is specifically configured to determine, based on the encoder of the neural network, a sample characteristic of the current anchor sample and a sample characteristic of each knowledge transfer sample; determine a similarity between the current anchor sample and each of the knowledge transfer samples based on the sample characteristics of the current anchor sample and the sample characteristics of each of the knowledge transfer samples; and determine, based on a classifier of the neural network, a prediction probability of the current anchor sample and a prediction probability of each of the knowledge transfer samples.
In some embodiments, the obtaining unit 11 is further specifically configured to perform normalization processing on the sample feature of the current anchor sample to obtain a normalized sample feature of the current anchor sample; carrying out normalization processing on the sample characteristics of each knowledge transfer sample to obtain the normalization characteristics of each knowledge transfer sample; and carrying out dot product operation processing on the normalized sample characteristics of the current anchoring sample and the normalized characteristics of each knowledge transfer sample to obtain the similarity between the current anchoring sample and each knowledge transfer sample.
In some embodiments, the obtaining unit 11 is further specifically configured to determine a knowledge transfer parameter of each knowledge transfer sample to the current anchor sample based on a similarity between the current anchor sample and each knowledge transfer sample; determining a soft label for the current anchor sample based on knowledge delivery parameters of each of the knowledge delivery samples for the current anchor sample, the prediction probability of the current anchor sample, and the prediction probability of each of the knowledge delivery samples.
In some embodiments, the obtaining unit 11 is further specifically configured to perform accumulation processing on at least one similarity between the current anchor sample and each knowledge transfer sample to obtain a similarity accumulated value; and determining knowledge delivery parameters of each knowledge delivery sample to the current anchor sample based on the similarity between the anchor sample and each knowledge delivery sample and the accumulated value of the similarities.
In some embodiments, the obtaining unit 11 is further specifically configured to perform knowledge transfer processing on the knowledge transfer parameters of the current anchor sample and the prediction probability of each knowledge transfer sample once based on each knowledge transfer sample, so as to obtain an initial knowledge transfer probability of the current anchor sample; performing knowledge fusion processing once based on the initial knowledge transfer probability and the prediction probability of the current anchoring sample to obtain an initial soft label of the current anchoring sample; based on the initial soft label of the current anchoring sample, executing a loop process until the prediction probability of the at least one knowledge transfer sample is smaller than a preset probability threshold value to obtain the soft label of the current anchoring sample; wherein the cyclic process comprises: in each period of the cyclic process, executing knowledge transfer processing based on the soft label of the current anchoring sample obtained in the previous period and each knowledge transfer parameter to obtain the knowledge transfer probability of the current anchoring sample; and executing knowledge fusion processing based on the knowledge transfer probability of the current anchoring sample and the prediction probability of the current anchoring sample to obtain a soft label of the current anchoring sample in the next period.
In some embodiments, the obtaining unit 11 is further specifically configured to determine a knowledge transfer probability of each knowledge transfer sample for the current anchor sample based on the knowledge transfer parameter of each knowledge transfer sample for the current anchor sample and the prediction probability of each knowledge transfer sample; accumulating at least one knowledge transfer probability of the current anchoring sample for each knowledge transfer sample to obtain an accumulated value of the knowledge transfer probabilities; and carrying out primary knowledge transfer processing based on the knowledge transfer probability accumulated value to obtain the initial knowledge transfer probability of the current anchor sample.
In some embodiments, the obtaining unit 11 is further configured to obtain a training data set, where the training data set includes at least one batch of training data subsets.
In some embodiments, the selecting unit 13 is configured to select, from the training data set, a batch of the training data subset that has not been selected previously as a training sample set as the current training sample set.
In some embodiments, the sampling unit 14 is configured to perform a random sampling process on the training data set to obtain at least one first training data.
In some embodiments, the determining unit 15 is further configured to determine the hard label corresponding to each first training data.
In some embodiments, the sampling unit 14 is further configured to perform similarity sampling processing on remaining data that is not selected as the first training data in the training data set based on the hard label of each first training data, so as to obtain at least one second training data corresponding to each first training data.
In some embodiments, the determining unit 15 is further configured to use a batch of the training data subsets constructed based on the at least one first training data and at least one second training data corresponding to each first training data as the current training sample set.
In the embodiment of the present disclosure, further, fig. 17 is a schematic diagram of a composition structure of a computer device provided in the embodiment of the present disclosure, and as shown in fig. 17, the computer device 20 provided in the embodiment of the present disclosure may include a processor 21 and a memory 22 storing executable instructions of the processor 21; further, the computer device 20 may also include a communication interface 23, and a bus 24 for connecting the processor 21, the memory 22, and the communication interface 23.
In the embodiment of the present disclosure, the Processor 21 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above-described processor functions may be other devices, and the embodiments of the present disclosure are not particularly limited. The computer device 20 may further comprise a memory 22, which may be connected to the processor 21, wherein the memory 22 is adapted to store executable program code comprising computer operating instructions, and wherein the memory 22 may comprise a high speed RAM memory and may further comprise a non-volatile memory, e.g. at least two disk memories.
In the embodiment of the present disclosure, the bus 24 is used to connect the communication interface 23, the processor 21 and the memory 22, and to enable intercommunication among these devices.
In an embodiment of the present disclosure, memory 22 is used to store instructions and data.
Further, in the embodiment of the present disclosure, the processor 21 is configured to perform a loop process until a soft label satisfying a preset number of anchor samples is obtained; training a neural network based on at least the soft labels of the preset number of anchor samples and the preset number of anchor samples; wherein the cyclic process comprises the steps of: acquiring a current training sample set, and determining the current anchoring sample and at least one knowledge transfer sample from the current training sample set in each period of executing a cyclic process; wherein the current anchor sample is any one of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set except the current anchor sample; determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge delivery samples, and a predicted probability of the current anchor sample and a predicted probability of each of the knowledge delivery samples; determining a soft label for the current anchor sample based on a similarity between the current anchor sample and each of the knowledge transfer samples, a prediction probability for the current anchor sample, and a prediction probability for each of the knowledge transfer samples.
In practical applications, the Memory 22 may be a volatile Memory (volatile Memory), such as a Random-Access Memory (RAM); or a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (Hard Disk Drive, HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor 21.
In addition, each functional module in this embodiment may be integrated into one recommendation unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on the understanding that the technical solution of the present embodiment essentially or a part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the method of the present embodiment. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The disclosed embodiments provide a computer device that can perform a loop process until soft tags satisfying a preset number of anchor samples are obtained; training a neural network at least based on the soft labels of the anchoring samples with the preset number and the anchoring samples with the preset number; wherein, the circulation process comprises the following steps: acquiring a current training sample set, and determining a current anchor sample and at least one knowledge transfer sample from the current training sample set in each period of executing a cyclic process; the current anchoring sample is any one of the current training sample set, and the at least one knowledge transfer sample is at least one other sample except the current training sample set and the current anchoring sample; determining similarity between the current anchor sample and each knowledge transfer sample, and the prediction probability of the current anchor sample and the prediction probability of each knowledge transfer sample based on a neural network; determining a soft label for the current anchor sample based on a similarity between the current anchor sample and each knowledge transfer sample, a prediction probability for the current anchor sample, and a prediction probability for each knowledge transfer sample.
In this way, for each training sample under the neural network, the generation of the soft label of the training sample can be assisted by the similarity between other samples and the sample and the prediction probability of other samples, and then efficient training supervision is performed on the neural network based on the soft labels meeting the preset number of training samples. It can be seen that the conventional cross-network knowledge integration is replaced by the cross-sample knowledge integration under the same neural network, and the knowledge integration based on the similarity between the samples is realized and the effective soft label is obtained on the basis of only utilizing a single network.
The disclosed embodiments provide a computer-readable storage medium on which a program is stored, which when executed by a processor implements a neural network training method as described above.
Specifically, the program instructions corresponding to a neural network training method in this embodiment may be stored on a storage medium such as an optical disc, a hard disc, or a usb disk, and when the program instructions corresponding to a neural network training method in the storage medium are read or executed by an electronic device, the method includes the following steps:
executing a loop process until soft labels meeting a preset number of anchoring samples are obtained;
training a neural network based on at least the soft labels of the preset number of anchor samples and the preset number of anchor samples;
wherein the cyclic process comprises the steps of:
acquiring a current training sample set, and determining the current anchoring sample and at least one knowledge transfer sample from the current training sample set in each period of executing a cyclic process; wherein the current anchor sample is any one of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set except the current anchor sample; determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge delivery samples, and a predicted probability of the current anchor sample and a predicted probability of each of the knowledge delivery samples; determining a soft label for the current anchor sample based on a similarity between the current anchor sample and each of the knowledge transfer samples, a prediction probability for the current anchor sample, and a prediction probability for each of the knowledge transfer samples.
Accordingly, the disclosed embodiments further provide a computer program product, which includes computer executable instructions for implementing the steps in the neural network training method proposed by the disclosed embodiments.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable computer apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable computer apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable computer apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable computer apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks in the flowchart and/or block diagram block or blocks.
The above description is only for the preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure.

Claims (13)

1. A method of training a neural network, the method comprising:
executing a loop process until soft labels meeting a preset number of anchoring samples are obtained;
training a neural network based on at least the soft labels of the preset number of anchor samples and the preset number of anchor samples;
wherein the cyclic process comprises the steps of:
acquiring a current training sample set, and determining the current anchoring sample and at least one knowledge transfer sample from the current training sample set in each period of executing a cyclic process; wherein the current anchor sample is any one of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set except the current anchor sample;
determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge delivery samples, and a predicted probability of the current anchor sample and a predicted probability of each of the knowledge delivery samples;
determining a soft label for the current anchor sample based on a similarity between the current anchor sample and each of the knowledge transfer samples, a prediction probability for the current anchor sample, and a prediction probability for each of the knowledge transfer samples.
2. The method of claim 1, wherein training the neural network based on at least the soft labels of the preset number of anchor samples and the preset number of anchor samples comprises:
determining a relative entropy for each of the anchor samples based on the soft label for each of the anchor samples and the prediction probability corresponding to the anchor sample;
determining a cross entropy for each of the anchor samples based on the hard label for each of the anchor samples and the prediction probability corresponding to the anchor sample;
training the neural network based on the cross entropy of the preset number of anchor samples and the relative entropy of the preset number of anchor samples.
3. The method of claim 1 or 2, wherein the neural network comprises an encoder and a classifier; the determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge transfer samples, and a prediction probability of the current anchor sample and a prediction probability of each of the knowledge transfer samples, comprises:
determining, based on an encoder of the neural network, sample characteristics of the current anchor sample and sample characteristics of each of the knowledge transfer samples;
determining a similarity between the current anchor sample and each of the knowledge transfer samples based on sample characteristics of the current anchor sample and sample characteristics of each of the knowledge transfer samples;
based on a classifier of the neural network, a prediction probability of the current anchor sample and a prediction probability of each of the knowledge transfer samples are determined.
4. The method of claim 3, wherein determining the similarity between the current anchor sample and each of the knowledge transfer samples based on the sample characteristics of the current anchor sample and the sample characteristics of each of the knowledge transfer samples comprises:
carrying out normalization processing on the sample characteristics of the current anchoring sample to obtain the normalized sample characteristics of the current anchoring sample;
carrying out normalization processing on the sample characteristics of each knowledge transfer sample to obtain the normalization characteristics of each knowledge transfer sample;
and carrying out dot product operation processing on the normalized sample characteristics of the current anchoring sample and the normalized characteristics of each knowledge transfer sample to obtain the similarity between the current anchoring sample and each knowledge transfer sample.
5. The method of any one of claims 1 to 4, wherein determining the soft label for the current anchor sample based on the similarity between the current anchor sample and each of the knowledge transfer samples, the prediction probability for the current anchor sample, and the prediction probability for each of the knowledge transfer samples comprises:
determining knowledge transfer parameters of each knowledge transfer sample to the current anchor sample based on a similarity between the current anchor sample and each knowledge transfer sample;
determining a soft label for the current anchor sample based on knowledge delivery parameters for each of the knowledge delivery samples for the current anchor sample, a prediction probability for each of the knowledge delivery samples, and a prediction probability for the current anchor sample.
6. The method of claim 5, wherein determining knowledge transfer parameters for each of the knowledge transfer samples for the current anchor sample based on a similarity between the current anchor sample and each of the knowledge transfer samples comprises:
accumulating at least one similarity between the current anchor sample and each knowledge transfer sample to obtain a similarity accumulated value;
and determining knowledge delivery parameters of each knowledge delivery sample to the current anchor sample based on the similarity between the anchor sample and each knowledge delivery sample and the accumulated value of the similarities.
7. The method of claim 5 or 6, wherein the determining the soft label for the current anchor sample based on the knowledge transfer parameters of each knowledge transfer sample for the current anchor sample, the prediction probability of each knowledge transfer sample, and the prediction probability of the current anchor sample comprises:
performing knowledge transfer processing once on knowledge transfer parameters of the current anchor sample and the prediction probability of each knowledge transfer sample based on each knowledge transfer sample to obtain the initial knowledge transfer probability of the current anchor sample;
performing knowledge fusion processing once based on the initial knowledge transfer probability and the prediction probability of the current anchoring sample to obtain an initial soft label of the current anchoring sample;
based on the initial soft label of the current anchoring sample, executing a loop process until the prediction probability of the at least one knowledge transfer sample is smaller than a preset probability threshold value to obtain the soft label of the current anchoring sample;
wherein the cyclic process comprises:
in each period of the cyclic process, executing knowledge transfer processing based on the soft label of the current anchoring sample obtained in the previous period and each knowledge transfer parameter to obtain the knowledge transfer probability of the current anchoring sample;
and executing knowledge fusion processing based on the knowledge transfer probability of the current anchoring sample and the prediction probability of the current anchoring sample to obtain a soft label of the current anchoring sample in the next period.
8. The method of claim 7, wherein the performing a knowledge transfer process on the knowledge transfer parameters of the current anchor sample and the prediction probability of each knowledge transfer sample based on each knowledge transfer sample to obtain an initial knowledge transfer probability of the current anchor sample comprises:
determining a knowledge transfer probability of each knowledge transfer sample for the current anchor sample based on knowledge transfer parameters of each knowledge transfer sample for the current anchor sample and a predicted probability of each knowledge transfer sample;
accumulating at least one knowledge transfer probability of the current anchoring sample for each knowledge transfer sample to obtain an accumulated value of the knowledge transfer probabilities;
and carrying out primary knowledge transfer processing based on the knowledge transfer probability accumulated value to obtain the initial knowledge transfer probability of the current anchor sample.
9. The method according to any one of claims 1 to 8, further comprising:
obtaining a training data set, wherein the training data set comprises at least one batch of training data subsets;
selecting a batch of the training data subset that has not been selected previously as a training sample set from the training data set as the current training sample set.
10. The method of claim 9, further comprising:
performing random sampling processing on the training data set to obtain at least one first training data;
determining a hard label of each first training data, and performing similarity sampling processing on the remaining data which are not selected as the first training data in the training data set based on the hard label of each first training data to obtain at least one second training data corresponding to each first training data;
and using the training data subset of one batch constructed based on the at least one first training data and at least one second training data corresponding to each first training data as the current training sample set.
11. An apparatus for training a neural network, comprising:
the acquisition unit is used for executing a cyclic process until soft labels meeting a preset number of anchoring samples are obtained; wherein the cyclic process comprises the steps of: acquiring a current training sample set, and determining the current anchoring sample and at least one knowledge transfer sample from the current training sample set in each period of executing a cyclic process; wherein the current anchor sample is any one of the current training sample set, and the at least one knowledge transfer sample is at least one other sample in the current training sample set except the current anchor sample; determining, based on the neural network, a similarity between the current anchor sample and each of the knowledge delivery samples, and a predicted probability of the current anchor sample and a predicted probability of each of the knowledge delivery samples;
and the training unit is used for training the neural network at least based on the soft labels of the anchoring samples with the preset number and the anchoring samples with the preset number.
12. A computer device comprising a processor, a memory storing instructions executable by the processor, the instructions when executed by the processor implementing the method of any one of claims 1 to 10.
13. A computer-readable storage medium, on which a program is stored, for use in a computer device, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 10.
CN202110462397.6A 2021-04-27 2021-04-27 Neural network training method, device and equipment and computer storage medium Pending CN113222139A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110462397.6A CN113222139A (en) 2021-04-27 2021-04-27 Neural network training method, device and equipment and computer storage medium
PCT/CN2021/121379 WO2022227400A1 (en) 2021-04-27 2021-09-28 Neural network training method and apparatus, device, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110462397.6A CN113222139A (en) 2021-04-27 2021-04-27 Neural network training method, device and equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113222139A true CN113222139A (en) 2021-08-06

Family

ID=77089304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110462397.6A Pending CN113222139A (en) 2021-04-27 2021-04-27 Neural network training method, device and equipment and computer storage medium

Country Status (2)

Country Link
CN (1) CN113222139A (en)
WO (1) WO2022227400A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487614A (en) * 2021-09-08 2021-10-08 四川大学 Training method and device for fetus ultrasonic standard section image recognition network model
WO2022227400A1 (en) * 2021-04-27 2022-11-03 商汤集团有限公司 Neural network training method and apparatus, device, and computer storage medium
CN115936091A (en) * 2022-11-24 2023-04-07 北京百度网讯科技有限公司 Deep learning model training method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361658A (en) * 2023-04-07 2023-06-30 北京百度网讯科技有限公司 Model training method, task processing method, device, electronic equipment and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674880B (en) * 2019-09-27 2022-11-11 北京迈格威科技有限公司 Network training method, device, medium and electronic equipment for knowledge distillation
CN111368997B (en) * 2020-03-04 2022-09-06 支付宝(杭州)信息技术有限公司 Training method and device of neural network model
CN111507378A (en) * 2020-03-24 2020-08-07 华为技术有限公司 Method and apparatus for training image processing model
CN113222139A (en) * 2021-04-27 2021-08-06 商汤集团有限公司 Neural network training method, device and equipment and computer storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635668A (en) * 2018-11-16 2019-04-16 华中师范大学 Facial expression recognizing method and system based on soft label integrated rolled product neural network
WO2020194077A1 (en) * 2019-03-22 2020-10-01 International Business Machines Corporation Unification of models having respective target classes with distillation
WO2021056765A1 (en) * 2019-09-24 2021-04-01 北京市商汤科技开发有限公司 Image processing method and related apparatus
CN111753092A (en) * 2020-06-30 2020-10-09 深圳创新奇智科技有限公司 Data processing method, model training device and electronic equipment
CN111681059A (en) * 2020-08-14 2020-09-18 支付宝(杭州)信息技术有限公司 Training method and device of behavior prediction model

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227400A1 (en) * 2021-04-27 2022-11-03 商汤集团有限公司 Neural network training method and apparatus, device, and computer storage medium
CN113487614A (en) * 2021-09-08 2021-10-08 四川大学 Training method and device for fetus ultrasonic standard section image recognition network model
CN113487614B (en) * 2021-09-08 2021-11-30 四川大学 Training method and device for fetus ultrasonic standard section image recognition network model
CN115936091A (en) * 2022-11-24 2023-04-07 北京百度网讯科技有限公司 Deep learning model training method and device, electronic equipment and storage medium
CN115936091B (en) * 2022-11-24 2024-03-08 北京百度网讯科技有限公司 Training method and device for deep learning model, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022227400A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
Han et al. Memory-augmented dense predictive coding for video representation learning
Tang et al. Supervised deep hashing for scalable face image retrieval
CN113222139A (en) Neural network training method, device and equipment and computer storage medium
Springenberg et al. Improving deep neural networks with probabilistic maxout units
EP3143563B1 (en) Distributed model learning
JP2022066192A (en) Dynamic adaptation of deep neural networks
CN110674323B (en) Unsupervised cross-modal Hash retrieval method and system based on virtual label regression
CN113705772A (en) Model training method, device and equipment and readable storage medium
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN109272332B (en) Client loss prediction method based on recurrent neural network
CN113822776B (en) Course recommendation method, device, equipment and storage medium
Chen et al. Binarized neural architecture search for efficient object recognition
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN113632106A (en) Hybrid precision training of artificial neural networks
Gao et al. Adversarial mobility learning for human trajectory classification
Umer et al. On-Device saliency prediction based on Pseudoknowledge distillation
Passalis et al. Deep supervised hashing using quadratic spherical mutual information for efficient image retrieval
Zeng et al. Compressing and accelerating neural network for facial point localization
Lyu et al. A survey of model compression strategies for object detection
CN114625886A (en) Entity query method and system based on knowledge graph small sample relation learning model
Menon Deep learning for prediction of amyotrophic lateral sclerosis using stacked auto encoders
CN114238798A (en) Search ranking method, system, device and storage medium based on neural network
CN113822291A (en) Image processing method, device, equipment and storage medium
Shetty et al. Comparative analysis of different classification techniques
Yang et al. Pruning Convolutional Neural Networks via Stochastic Gradient Hard Thresholding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40051368; Country of ref document: HK)