CN111967597A - Neural network training and image classification method, device, storage medium and equipment


Info

Publication number
CN111967597A
CN111967597A
Authority
CN
China
Prior art keywords: feature, feature information, feature map, network, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010833576.1A
Other languages
Chinese (zh)
Inventor
李潇婕
王飞
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Lingang Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority to CN202010833576.1A priority Critical patent/CN111967597A/en
Publication of CN111967597A publication Critical patent/CN111967597A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The disclosure provides a neural network training and image classification method, device, storage medium and equipment, wherein the method comprises the following steps: inputting a sample image into a teacher network and a student network respectively to obtain a first feature map output by the teacher network and a second feature map output by the student network; determining first associated feature information based on feature information included in the first feature map, and determining second associated feature information based on feature information included in the second feature map; and performing supervised training on the student network based on the difference of the second associated feature information relative to the first associated feature information.

Description

Neural network training and image classification method, device, storage medium and equipment
Technical Field
The present disclosure relates to the field of deep learning, and in particular, to a method, an apparatus, a storage medium, and a device for neural network training and image classification.
Background
At present, knowledge distillation has become a key technique in the field of deep learning. It is generally used for model compression, and aims to transfer the sample-relation information learned by a network model with more parameters and better performance (the teacher model) to a lightweight network model with fewer parameters and faster inference (the student model), thereby improving the accuracy of the student model.
Disclosure of Invention
The disclosure provides a neural network training and image classification method, device, storage medium and equipment.
According to a first aspect of embodiments of the present disclosure, there is provided a neural network training method, including: inputting a sample image into a teacher network and a student network respectively to obtain a first feature map output by the teacher network and a second feature map output by the student network; determining first associated feature information based on feature information included in the first feature map, and determining second associated feature information based on feature information included in the second feature map; and performing supervised training on the student network based on the difference of the second associated feature information relative to the first associated feature information.
In some optional embodiments, the method further comprises: determining a target area based on the first feature map; determining a first target feature map based on the target area and the first feature map, and determining a second target feature map based on the target area and the second feature map; the determining the first associated feature information based on the feature information included in the first feature map and the determining the second associated feature information based on the feature information included in the second feature map includes: determining the first associated feature information based on feature information included in the first target feature map, and determining the second associated feature information based on feature information included in the second target feature map.
In some optional embodiments, the target region of the sample image comprises a target object, and/or a region of the sample image other than the target region comprises at least one of: a background portion, a portion of the target object that is occluded.
In some optional embodiments, the determining a target region based on the first feature map includes: inputting the first feature map into a pre-trained first neural network to obtain a pixel normalization value corresponding to each of a plurality of regions included in the sample image output by the first neural network; and taking the area of which the pixel normalization value is greater than or equal to a preset value as the target area.
In some optional embodiments, the determining a first target feature map based on the target region and the first feature map, and determining a second target feature map based on the target region and the second feature map, includes: performing point multiplication on the pixel normalization value corresponding to each of the plurality of regions and the feature information included in the first feature map to obtain a first attention feature map, and performing point multiplication on the pixel normalization value corresponding to each of the plurality of regions and the feature information included in the second feature map to obtain a second attention feature map; and obtaining the first target feature map based on the target area on the first attention feature map, and obtaining the second target feature map based on the target area on the second attention feature map.
In some optional embodiments, the first neural network comprises a plurality of first network layers and a first output layer, and the second neural network comprises a plurality of second network layers and a second output layer, wherein the network structure of the plurality of second network layers is the same as the network structure of the plurality of first network layers; the method further comprises the following steps: taking the first feature map as the input of the second neural network to obtain a classification result output by the second neural network; performing supervised training on the second neural network based on the difference of the classification result output by the second neural network relative to the classification result labeled in the sample image; and taking the network parameters of the plurality of second network layers included in the trained second neural network as the network parameters of the plurality of first network layers included in the first neural network.
In some optional embodiments, the determining, based on the feature information included in the first feature map, first associated feature information includes: determining the first associated feature information based on the similarity between feature information of different areas on the first feature map; and the determining second associated feature information based on the feature information included in the second feature map includes: determining the second associated feature information based on the similarity between feature information of different areas on the second feature map.
In some optional embodiments, the number of the sample images is plural, and the number of the first feature map and the second feature map is plural; the determining of the first associated feature information based on the feature information included in the first feature map includes: determining the first associated feature information based on the similarity between feature information of the same region on the first feature map respectively corresponding to different sample images; the determining second associated feature information based on the feature information included in the second feature map includes: and determining the second associated feature information based on the similarity between the feature information of the same region on the second feature maps respectively corresponding to different sample images.
In some optional embodiments, the number of the sample images is plural, and the number of the first feature map and the second feature map is plural; the determining of the first associated feature information based on the feature information included in the first feature map includes: determining the first associated feature information based on the similarity between feature information of different areas on the first feature map respectively corresponding to different sample images; the determining second associated feature information based on the feature information included in the second feature map includes: and determining the second associated feature information based on the similarity between the feature information of different areas on the second feature map respectively corresponding to different sample images.
In some optional embodiments, the number of the first feature maps and the second feature maps is multiple, and the plurality of first feature maps correspond to different resolutions and the plurality of second feature maps correspond to different resolutions; the determining the first associated feature information based on the feature information included in the first feature map and the determining the second associated feature information based on the feature information included in the second feature map includes: in descending order of resolution, sequentially determining the first associated feature information based on the feature information included in the first feature maps of different resolutions, and determining the second associated feature information based on the feature information included in the second feature maps of different resolutions.
In some optional embodiments, the supervised training of the student network comprises: performing supervised training on the student network by means of knowledge distillation.
In some optional embodiments, the sample image comprises a physical image labeling the classification result.
According to a second aspect of the embodiments of the present disclosure, there is provided an image classification method, the method including:
inputting an image to be detected into an image classification network to obtain a classification result of a target object included in the image to be detected and output by the image classification network; the image classification network is a student network trained by the method of any one of the first aspect.
According to a third aspect of the embodiments of the present disclosure, there is provided a neural network training device, including: an acquisition module, used for inputting a sample image into a teacher network and a student network respectively to obtain a first feature map output by the teacher network and a second feature map output by the student network; a first determining module, used for determining first associated feature information based on the feature information included in the first feature map and determining second associated feature information based on the feature information included in the second feature map; and a first training module, used for performing supervised training on the student network based on the difference of the second associated feature information relative to the first associated feature information.
In some optional embodiments, the apparatus further comprises: the second determination module is used for determining a target area based on the first feature map; a third determining module, configured to determine a first target feature map based on the target region and the first feature map, and determine a second target feature map based on the target region and the second feature map; the first determining module includes: the first determining sub-module is configured to determine the first associated feature information based on feature information included in the first target feature map, and determine the second associated feature information based on feature information included in the second target feature map.
In some optional embodiments, the target region of the sample image comprises a target object, and/or a region of the sample image other than the target region comprises at least one of: a background portion, a portion of the target object that is occluded.
In some optional embodiments, the second determining module comprises: the second determining submodule is used for inputting the first feature map into a pre-trained first neural network to obtain a pixel normalization value corresponding to each of a plurality of regions included in the sample image output by the first neural network; and the third determining submodule is used for taking the area of which the pixel normalization value is greater than or equal to a preset value as the target area.
In some optional embodiments, the third determining module comprises: a fourth determining submodule, configured to perform dot multiplication on the pixel normalization value corresponding to each of the multiple regions and the feature information included in the first feature map to obtain a first attention feature map, and perform dot multiplication on the pixel normalization value corresponding to each of the multiple regions and the feature information included in the second feature map to obtain a second attention feature map; a fifth determining submodule, configured to obtain the first target feature map based on the target region on the first attention feature map, and obtain the second target feature map based on the target region on the second attention feature map.
In some optional embodiments, the first neural network comprises a plurality of first network layers and a first output layer, and the second neural network comprises a plurality of second network layers and a second output layer, wherein the network structure of the plurality of second network layers is the same as the network structure of the plurality of first network layers; the device further comprises: the classification result acquisition module is used for taking the first feature map as the input of the second neural network to obtain a classification result output by the second neural network; the second training module is used for carrying out supervised training on the second neural network based on the difference of the classification result output by the second neural network relative to the classification result marked in the sample image; a fourth determining module, configured to use the trained network parameters of the plurality of second network layers included in the second neural network as the network parameters of the plurality of first network layers included in the first neural network.
In some optional embodiments, the first determining module comprises: a sixth determining submodule, configured to determine the first associated feature information based on similarity between feature information of different areas on the first feature map; a seventh determining sub-module, configured to determine the second associated feature information based on a similarity between feature information of different areas on the second feature map.
In some optional embodiments, the number of the sample images is plural, and the number of the first feature maps and the second feature maps is plural; the first determining module includes: an eighth determining submodule, used for determining the first associated feature information based on the similarity between feature information of the same region on the first feature maps respectively corresponding to different sample images; and a ninth determining submodule, used for determining the second associated feature information based on the similarity between feature information of the same region on the second feature maps respectively corresponding to different sample images.
In some optional embodiments, the number of the sample images is plural, and the number of the first feature maps and the second feature maps is plural; the first determining module includes: a tenth determining submodule, used for determining the first associated feature information based on the similarity between feature information of different regions on the first feature maps respectively corresponding to different sample images; and an eleventh determining submodule, used for determining the second associated feature information based on the similarity between feature information of different regions on the second feature maps respectively corresponding to different sample images.
In some optional embodiments, the number of the first feature maps and the second feature maps is multiple, and a plurality of the first feature maps correspond to different resolutions and a plurality of the second feature maps correspond to different resolutions; the first determining module includes: and the twelfth determining submodule is used for sequentially determining the first associated feature information based on the feature information included in the first feature maps with different resolutions according to the sequence of the resolutions from large to small, and determining the second associated feature information based on the feature information included in the second feature maps with different resolutions.
In some optional embodiments, the first training module comprises: a training submodule, used for performing supervised training on the student network by means of knowledge distillation.
In some optional embodiments, the sample image comprises a physical image labeling the classification result.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an image classification apparatus, the apparatus including: the image classification module is used for inputting the image to be detected into an image classification network to obtain a classification result of a target object included in the image to be detected and output by the image classification network; the image classification network is a student network trained by the method of any one of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the neural network training method according to any one of the first aspect or the image classification method according to the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to invoke executable instructions stored in the memory to implement the neural network training method of any one of the first aspect or the image classification method of the second aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the embodiment of the disclosure, the sample image is input into the teacher network and the student network respectively, so that the first feature map and the second feature map can be obtained. First associated feature information may be determined based on feature information included in the first feature map, and second associated feature information may be determined based on feature information included in the second feature map. Further, the student network is supervised-trained based on the difference of the second associated feature information relative to the first associated feature information. Since the student network is supervised-trained based on the feature information included in the first feature map output by the teacher network, the teacher network and the student network are kept consistent on local associated features, so that the finally trained student network performs better. In addition, because the student network is generally smaller in scale and more efficient than the teacher network, the student network trained by the technical scheme provided by the disclosure can be deployed on mobile devices such as mobile phones and tablet computers to realize functions such as image classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flowchart illustrating a neural network training method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart diagram of another neural network training method illustrated in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart diagram of another neural network training method illustrated in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart diagram of another neural network training method illustrated in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart diagram of another neural network training method illustrated in accordance with an exemplary embodiment of the present disclosure;
FIG. 6A is a schematic diagram of a neural network training architecture illustrating the present disclosure in accordance with an exemplary embodiment;
FIG. 6B is a schematic diagram illustrating one type of determining associated characteristic information according to an exemplary embodiment of the present disclosure;
FIG. 7 is a block diagram of a neural network training device shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram of an image classification device shown in accordance with an exemplary embodiment of the present disclosure;
fig. 9 is a schematic diagram illustrating a structure of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
Prior to introducing the knowledge-distillation-based neural network training scheme provided by the present disclosure, the relevant content of knowledge distillation (Knowledge Distillation, KD) is introduced. Knowledge distillation is essentially a form of transfer learning: a student model better suited for inference is obtained from a trained teacher model. The teacher model has more parameters and better performance, while the student model has fewer parameters but a faster inference speed.
The loss function corresponding to the commonly used KD method can be expressed by the following equation 1:

$$L_{KD} = \sum_{i=1}^{N} \tau^{2}\,\mathrm{KL}\!\left(\sigma\!\big(f_T(x_i)/\tau\big) \,\Big\|\, \sigma\!\big(f_S(x_i)/\tau\big)\right) \quad (1)$$

where τ is a hyper-parameter called the temperature, the teacher model is T, the student model is S, $x_i$ is a sample in the sample set $\{x_i\}_{i=1}^{N}$, σ(·) denotes the softmax function, and $f_T(x_i)$ and $f_S(x_i)$ are the outputs of the teacher model and the student model for $x_i$, respectively.

The student network is trained by minimizing the relative entropy (Kullback-Leibler, KL) divergence between the student model and the teacher model.
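As an illustration of this loss, the following PyTorch sketch computes the temperature-scaled KL term of equation 1; the function name and the batch-mean reduction are assumptions for illustration, not taken from the present disclosure:

```python
import torch
import torch.nn.functional as F

def kd_loss(teacher_logits: torch.Tensor,
            student_logits: torch.Tensor,
            tau: float = 4.0) -> torch.Tensor:
    """Temperature-scaled KL divergence between teacher and student logits.

    A minimal sketch of equation 1; the tau**2 factor keeps gradient
    magnitudes comparable across temperatures.
    """
    # Soften both distributions with the temperature tau.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    # KL(teacher || student), averaged over the batch.
    return (tau ** 2) * F.kl_div(log_p_student, p_teacher,
                                 reduction="batchmean")
```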
However, existing knowledge distillation methods focus on maintaining instance-level feature consistency or instance-level relation consistency, whereas local features usually contain more detailed information, which is important for distinguishing different targets. For example, an image containing a cat and an image containing a leopard may be misjudged as belonging to the same category when only the feature consistency of the whole instance is compared, because the two animals look very similar as a whole; the two can, however, be distinguished exactly by local features, such as those of the head or feet.
In addition, the sample image input to the classification network usually contains much interference information unrelated to the category, such as the image background or an object occluding the image subject. In the knowledge distillation process of existing methods, all information contained in the image is transmitted to the student model without distinction, causing great redundancy or noise.
In order to solve the above problems, embodiments of the present disclosure provide a neural network training scheme based on knowledge distillation, which enables a student model to learn local features, and improves the performance of a student network.
For example, as shown in fig. 1, fig. 1 illustrates a knowledge-distillation-based neural network training method according to an exemplary embodiment, including the following steps:
in step 101, sample images are respectively input into a teacher network and a student network, so as to obtain a first feature map output by the teacher network and a second feature map output by the student network.
In the embodiment of the present disclosure, the sample image may include a real object image labeling the classification result. In the training process, the sample image may include, but is not limited to, an animal image, a plant image, a vehicle image, and the like.
In the disclosed embodiment, the resolution of the first feature map is higher than the resolution of the second feature map, and/or the feature information included in the first feature map is richer than that of the second feature map.
In step 102, first associated feature information is determined based on feature information included in the first feature map, and second associated feature information is determined based on feature information included in the second feature map.
In the embodiment of the present disclosure, the association features between different regions of the first feature map may be determined. Taking one sample image as an example, a first feature map corresponding to the sample image includes a plurality of regions, and the association feature between every two regions may be used as the association feature between different regions of the first feature map, so that the association relationships between different regions in the same sample image can be embodied. In the case that there are multiple sample images, the association features between the same region and/or different regions of the first feature maps corresponding to different sample images may also be determined. Taking two sample images as an example, where each sample image corresponds to one first feature map, the association features between the same regions in the two first feature maps, and/or the association features between different regions in the two first feature maps, may be used as the above association features, so that the association relationships between the same region and/or different regions of different sample images can be embodied. The first associated feature information is obtained based on at least one of the above items. Similarly, the second associated feature information may be determined in the same or similar manner, which is not described herein again.
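To make the notion of associated features concrete, the following sketch treats each spatial position of a feature map as one region and computes pairwise similarities between regions; this is an illustration only, and the use of cosine similarity is an assumption, since the disclosure does not fix the similarity measure:

```python
import torch
import torch.nn.functional as F

def region_relation_matrix(feature_map: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between all spatial regions of one feature map.

    feature_map: (C, H, W); each of the H*W spatial positions is treated
    as one region vector of dimension C.
    """
    c, h, w = feature_map.shape
    regions = feature_map.reshape(c, h * w).t()   # (H*W, C) region vectors
    regions = F.normalize(regions, dim=1)         # unit-length rows
    return regions @ regions.t()                  # (H*W, H*W) similarities
```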
In step 103, supervised training is performed on the student network based on the difference of the second associated feature information relative to the first associated feature information.
In the embodiment of the disclosure, the loss function may be determined based on a difference between the second correlation characteristic information and the first correlation characteristic information, and further, the student network may be supervised trained in a knowledge distillation manner.
In the above embodiment, the student network is supervised and trained based on the feature information included in the first feature map output by the teacher network, so that the teacher network and the student network can be kept consistent in local associated features, and the performance of the student network obtained through training is better.
In some alternative embodiments, such as shown in fig. 2, the method may further include:
in step 104, a target area is determined based on the first feature map.
In the embodiment of the present disclosure, the target region of the sample image includes a target object (i.e., the content that plays a dominant role in the image classification process, such as the animal category, plant variety, or vehicle style). In order to make the knowledge distillation process focus more on the target region of the image, regions unrelated to the category features, i.e., regions other than the target region in the sample image, may be removed. These regions include, but are not limited to, the background portion of the sample image and/or the portion of the target object that is occluded.
In step 105, a first target feature map is determined based on the target region and the first feature map, and a second target feature map is determined based on the target region and the second feature map.
In the embodiment of the present disclosure, after the target region of the sample image is determined, a first target feature map, which is a feature map including feature information of the target region on the first feature map, may be determined based on the target region and the first feature map. Likewise, a second target feature map may be determined based on the target region and the second feature map. Accordingly, step 102 may include:
determining the first associated feature information based on feature information included in the first target feature map, and determining the second associated feature information based on feature information included in the second target feature map.
In the above embodiment, the target area where the target object is located may be determined based on the first feature map, and then the first target feature map and the second target feature map are determined respectively. The first associated feature information is determined based on feature information included in the first target feature map, the second associated feature information is determined based on feature information included in the second target feature map, redundancy or noise brought by information irrelevant to class features in knowledge distillation is reduced, performance of the student network after knowledge distillation is improved, and identification accuracy of the student network is improved.
In some alternative embodiments, such as shown in fig. 3, step 104 may include:
in step 104-1, the first feature map is input into a pre-trained first neural network, and a pixel normalization value corresponding to each of a plurality of regions included in the sample image output by the first neural network is obtained.
In the disclosed embodiment, the first neural network includes a plurality of sequentially connected first network layers and a first output layer. The plurality of first network layers include, but are not limited to, convolutional layers (conv), batch normalization layers (bn), and activation layers (relu); a threshold (sigmoid) layer may be connected after the plurality of first network layers as the first output layer.
Considering that in the first feature map learned by the teacher model the feature information of category-related regions is more salient and that of category-unrelated regions is less salient, the target region can be well distinguished from other regions. Therefore, the first feature map can be used as the input of the first neural network, and the first neural network outputs a pixel normalization value corresponding to each of a plurality of regions included in the sample image. The plurality of regions included in the sample image refer to part or all of the pixel regions of the whole sample image, and the pixel normalization value may be a continuous value in the range [0, 1].
In one example, the first neural network may output an image mask whose size is consistent with the sample image size, and in which each region carries the pixel normalization value corresponding to that region.
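A minimal PyTorch sketch of such a first neural network is shown below; the class name, channel widths, and layer count are illustrative assumptions, while the conv/bn/relu body followed by a sigmoid output layer follows the description above:

```python
import torch.nn as nn

class MaskNet(nn.Module):
    """Predicts a per-pixel normalization value in [0, 1] from the
    teacher's first feature map (a sketch; sizes are illustrative)."""

    def __init__(self, in_channels: int = 256):
        super().__init__()
        # "Plurality of first network layers": conv -> bn -> relu.
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        # "First output layer": a 1-channel map squashed by sigmoid.
        self.head = nn.Sequential(
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.head(self.body(x))   # (N, 1, H, W) mask in [0, 1]
```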
In step 104-2, the region where the pixel normalization value is greater than or equal to a preset value is taken as the target region.
In an example, the pixel normalization values corresponding to the plurality of regions included in the sample image may be thresholded: for example, the pixel normalization value of a region whose value is greater than or equal to the preset value is set to 1, and the pixel normalization values of other regions are set to 0. The boundary range of the target region can then be determined from the regions whose thresholded pixel normalization value is 1.
In the embodiment, the pixel normalization value corresponding to each of the plurality of regions included in the sample image can be obtained through the pre-trained first neural network, so that the target region of the sample image is determined, and the method is simple and convenient to implement and high in usability.
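A sketch of the thresholding step follows, under the assumption that the boundary range of the target region is summarized by the bounding box of the pixels that pass the preset value:

```python
import torch

def target_region_bbox(mask: torch.Tensor, thresh: float = 0.5):
    """Threshold the mask and return the bounding box of the target region.

    mask: (H, W) pixel normalization values in [0, 1]; thresh is the
    preset value. Returns (y0, y1, x0, x1), or None if nothing passes.
    """
    binary = (mask >= thresh)                    # regions set to 1 vs 0
    ys, xs = torch.nonzero(binary, as_tuple=True)
    if ys.numel() == 0:
        return None
    return (ys.min().item(), ys.max().item() + 1,
            xs.min().item(), xs.max().item() + 1)
```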
In some alternative embodiments, such as shown in FIG. 4, step 105 may include:
in step 105-1, a first attention feature map is obtained by performing point multiplication on the pixel normalization value corresponding to each of the plurality of regions and the feature information included in the first feature map, and a second attention feature map is obtained by performing point multiplication on the pixel normalization value corresponding to each of the plurality of regions and the feature information included in the second feature map.
In the embodiment of the present disclosure, a pixel normalization value corresponding to each of a plurality of regions included in the sample image may be dot-multiplied with feature information included in the first feature map, so as to obtain the first attention feature map capable of highlighting foreground information related to the category and weakening information unrelated to the category.
In the same manner, the pixel normalization value corresponding to each of the plurality of regions included in the sample image is dot-multiplied with the feature information included in the second feature map to obtain a second attention feature map.
In step 105-2, the first target feature map is obtained based on the target region on the first attention feature map, and the second target feature map is obtained based on the target region on the second attention feature map.
In the disclosed embodiment, the boundary extent of the target region is determined according to the thresholded mask, and the first candidate feature map may be cropped from the first attention feature map based on the boundary extent of the target region.
Wherein, if the size of the first candidate feature map conforms to a preset size, for example, k × k, the first candidate feature map may be directly used as the first target feature map. If the size of the first candidate feature map does not conform to the preset size, the first target feature map conforming to the preset size can be obtained by reshaping (reshape) the first candidate feature map.
In the same way, a plurality of second target feature maps can be obtained.
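The masking, cropping, and reshaping steps might be sketched as follows; using bilinear interpolation for the resize to k × k is an assumption, since the disclosure only specifies that the candidate map is reshaped to the preset size:

```python
import torch
import torch.nn.functional as F

def target_feature_map(feature_map: torch.Tensor,
                       mask: torch.Tensor,
                       bbox,
                       k: int = 7) -> torch.Tensor:
    """Mask a feature map, crop the target region, and resize to k x k.

    feature_map: (C, H, W); mask: (H, W) in [0, 1]; bbox: (y0, y1, x0, x1).
    """
    # Dot-multiply the pixel normalization values into every channel
    # to obtain the attention feature map.
    attention = feature_map * mask.unsqueeze(0)
    # Crop the target region determined from the thresholded mask.
    y0, y1, x0, x1 = bbox
    candidate = attention[:, y0:y1, x0:x1]
    # Reshape to the preset k x k size when the crop does not match.
    if candidate.shape[1:] != (k, k):
        candidate = F.interpolate(candidate.unsqueeze(0), size=(k, k),
                                  mode="bilinear",
                                  align_corners=False).squeeze(0)
    return candidate
```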
In the above embodiment, the attention mechanism may be adopted to focus on the target area where the target object is located on the feature map, so that knowledge distillation is performed on the student network according to the correlation features of the target area in the following, and the performance of the student network is improved.
In some optional embodiments, the first neural network comprises a plurality of first network layers and a first output layer. In order to obtain the trained first neural network, a second neural network may be provided, which comprises a plurality of second network layers and a second output layer, wherein the network structure of the plurality of second network layers is the same as the network structure of the plurality of first network layers. For example, if the plurality of first network layers include M1 conv layers, M2 bn layers, and M3 relu layers connected in sequence, then the plurality of second network layers also include M1 conv layers, M2 bn layers, and M3 relu layers connected in sequence. M1, M2, and M3 may be integers greater than or equal to zero: a value of zero means that no layer of that type exists, and a value greater than or equal to 2 means that there are multiple identical network layers connected in sequence; for example, if M1 is 2, the plurality of first network layers include 2 sequentially connected conv layers.
For example, as shown in fig. 5, the method may further include:
in step 100-1, the first feature map is used as an input of a second neural network, and a classification result output by a classification layer of the second neural network is obtained.
In the embodiment of the present disclosure, the second output layer included in the second neural network may adopt a classification layer, and output a classification result corresponding to the first feature map.
In step 100-2, the second neural network is supervised-trained based on the difference of the classification result output by the second neural network with respect to the classification result labeled in the sample image.
In the embodiment of the present disclosure, the classification result of the target object has been labeled in advance in the sample image, the classification result labeled in the sample image is used as a true value of the classification result, and a corresponding loss function is determined according to a difference between the classification result output by the second neural network and the true value of the classification result, so as to perform supervised training on the second neural network.
In step 100-3, the trained network parameters of the plurality of second network layers included in the second neural network are used as the network parameters of the plurality of first network layers included in the first neural network.
In the embodiment of the present disclosure, since the first neural network and the second neural network have the same structure, the trained network parameters of the plurality of second network layers included in the second neural network can be directly used as the network parameters of the plurality of first network layers included in the first neural network, so as to obtain the trained first neural network.
In the above embodiment, the classification label in the sample image may be used as a supervision, and the trained second neural network may be obtained by minimizing a difference between the classification result output by the second neural network and the true value of the classification result, so that the network parameters of the plurality of first network layers of the first neural network may be determined according to the network parameters of the second neural network, which is simple and convenient to implement and high in usability.
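The parameter transfer in step 100-3 can be sketched as below, assuming both networks expose their shared layers under a common attribute name ('body' is hypothetical); since only the output layers differ, the body parameters can be copied directly:

```python
import torch.nn as nn

def transfer_body_params(second_net: nn.Module, first_net: nn.Module) -> None:
    """Copy the trained second network layers into the first neural network.

    The plurality of second network layers share their structure with the
    plurality of first network layers, so a state-dict copy suffices.
    """
    first_net.body.load_state_dict(second_net.body.state_dict())
```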
In some optional embodiments, the step 102 of determining the first associated feature information based on the feature information included in the first feature map may include:
and determining the first associated characteristic information based on the similarity between the characteristic information of different areas on the first characteristic diagram.
Accordingly, the process of determining the second associated feature information based on the feature information included in the second feature map in step 102 may include:
and determining the second associated characteristic information based on the similarity between the characteristic information of different areas on the second characteristic diagram.
Further, in step 103, a corresponding loss function $L_{intra}$ may be determined according to the difference between the second associated feature information and the first associated feature information, and the student network is supervised-trained by means of knowledge distillation. The loss function $L_{intra}$ can be expressed by the following equation 2:

$$L_{intra} = \sum_{l=1}^{L} \sum_{i=1}^{N} \left\| G\!\left(F_T^{l}(x_i)\right) - G\!\left(F_S^{l}(x_i)\right) \right\|_2 \quad (2)$$

where L is the total number of stages of the knowledge distillation and may be a positive integer, G(·) is a function that constructs the incidence relation matrix, and $\|\cdot\|_2$ is the L2 loss function. $G\big(F_T^{l}(x_i)\big)$ represents the similarities between the feature information of different regions on the first feature map of the teacher network corresponding to the sample image $x_i$, and $G\big(F_S^{l}(x_i)\big)$ represents the similarities between the feature information of different regions on the second feature map of the student network corresponding to the sample image $x_i$.
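As an illustration of how equation 2 might be computed, the sketch below reuses the region_relation_matrix helper from the earlier sketch as the relation-construction function G(·):

```python
import torch

def intra_loss(teacher_maps, student_maps):
    """Equation-2-style loss matching intra-image region relation matrices.

    teacher_maps / student_maps: lists with one (N, C, H, W) tensor per
    distillation stage l = 1..L. G() is taken to be the cosine relation
    matrix from the region_relation_matrix sketch above.
    """
    total = torch.zeros(())
    for f_t, f_s in zip(teacher_maps, student_maps):   # stages l
        for i in range(f_t.shape[0]):                  # sample images x_i
            g_t = region_relation_matrix(f_t[i])       # teacher relations
            g_s = region_relation_matrix(f_s[i])       # student relations
            total = total + torch.norm(g_t - g_s, p=2)
    return total
```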
In some optional embodiments, the number of the sample images is plural, and the number of the first feature maps and the second feature maps is plural. In general, a sample image may correspond to one or more first feature maps and one or more second feature maps.
The process of determining the first associated feature information based on the feature information included in the first feature map in step 102 may include:
determining the first associated feature information based on the similarity between feature information of the same region on the first feature map respectively corresponding to different sample images;
accordingly, the process of determining the second associated feature information based on the feature information included in the second feature map in step 102 may include:
and determining the second associated feature information based on the similarity between the feature information of the same region on the second feature maps respectively corresponding to different sample images.
Accordingly, the loss function $L_{inter\text{-}s}$ determined in step 103 can be expressed by the following equation 3:

$$L_{inter\text{-}s} = \sum_{l=1}^{L} \sum_{i} \left\| G_T^{l}(i) - G_S^{l}(i) \right\|_2 \quad (3)$$

where $G_T^{l}(i)$ represents the similarities between the feature information of the same region i on the first feature maps corresponding to different sample images in the teacher network, and $G_S^{l}(i)$ represents the similarities between the feature information of the same region i on the second feature maps corresponding to different sample images in the student network.
Based on the loss function $L_{inter\text{-}s}$, the student network can be supervised-trained by means of knowledge distillation.
In some optional embodiments, the number of the sample images is plural, and the number of the first feature maps and the second feature maps is plural.
The process of determining the first associated feature information based on the feature information included in the first feature map in step 102 may include:
determining the first associated feature information based on the similarity between feature information of different areas on the first feature map respectively corresponding to different sample images;
accordingly, the process of determining the second associated feature information based on the feature information included in the second feature map in step 102 may include:
and determining the second associated feature information based on the similarity between the feature information of different areas on the second feature map respectively corresponding to different sample images.
Accordingly, the loss function $L_{inter\text{-}d}$ determined in step 103 can be expressed by the following equation 4:

$$L_{inter\text{-}d} = \sum_{l=1}^{L} \sum_{i \neq j} \left\| G_T^{l}(x_i, x_j) - G_S^{l}(x_i, x_j) \right\|_2 \quad (4)$$

where $G_T^{l}(x_i, x_j)$ represents the similarities between the feature information of different regions p and q on the first feature maps corresponding to different sample images $x_i$ and $x_j$ in the teacher network, and $G_S^{l}(x_i, x_j)$ represents the similarities between the feature information of different regions p and q on the second feature maps corresponding to different sample images $x_i$ and $x_j$ in the student network.
Based on the loss function $L_{inter\text{-}d}$, the student network can be supervised-trained by means of knowledge distillation.
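Both inter-image terms (equations 3 and 4) can be read off a single cross-sample relation tensor: its same-region entries correspond to equation 3 and its different-region entries to equation 4. The sketch below is an illustration under the assumption of cosine similarity:

```python
import torch
import torch.nn.functional as F

def cross_sample_relations(feats: torch.Tensor) -> torch.Tensor:
    """Relation tensor across samples and regions for one stage.

    feats: (N, C, H, W). Flattens each sample into H*W region vectors and
    returns cosine similarities between every (sample, region) pair.
    """
    n, c, h, w = feats.shape
    regions = feats.permute(0, 2, 3, 1).reshape(n * h * w, c)
    regions = F.normalize(regions, dim=1)
    return (regions @ regions.t()).reshape(n, h * w, n, h * w)

def inter_loss(f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """L2 distance between teacher and student cross-sample relations."""
    return torch.norm(cross_sample_relations(f_t)
                      - cross_sample_relations(f_s), p=2)
```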
In the above embodiment, the first associated feature information and the second associated feature information may be determined in at least one manner, so as to obtain corresponding loss functions, so as to perform supervised training on the student network, so that the trained student network focuses more on the local associated features of the image, and has higher performance.
In some alternative embodiments, the loss function used in performing the knowledge distillation may be derived from the above loss functions, as shown in equation 5:

$$L_{LKD} = L_{intra} + L_{inter\text{-}s} + L_{inter\text{-}d} \quad (5)$$
In the disclosed embodiment, since the student network is expected to focus more on the feature information of the target area, equation 5 may be transformed into equation 6:
$$\hat{L}_{LKD} = \hat{L}_{intra} + \hat{L}_{inter\text{-}s} + \hat{L}_{inter\text{-}d} \quad (6)$$

where $\hat{L}_{intra}$, $\hat{L}_{inter\text{-}s}$ and $\hat{L}_{inter\text{-}d}$ are the loss functions obtained from the differences between the first associated feature information determined based on the first target feature maps and the second associated feature information determined based on the second target feature maps.
In some alternative embodiments, to further improve the performance of the student network, equation 7 may be used as the loss function during knowledge distillation:

$$L = \hat{L}_{LKD} + \alpha L_{KD} + \beta L_{CE} \quad (7)$$

where α and β are hyper-parameters, $L_{KD}$ is the loss function determined by the common knowledge distillation method and can be obtained from equation 1, and $L_{CE}$ is the loss function determined by the cross-entropy method.
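A sketch of one training step in the spirit of equation 7 follows, reusing kd_loss, intra_loss, and inter_loss from the earlier sketches; the network interface returning (per-stage feature maps, logits) and the weight values are assumptions, and for brevity the relation losses are applied to the stage feature maps directly rather than to the masked target feature maps:

```python
import torch
import torch.nn.functional as F

def train_step(images, labels, teacher, student, optimizer,
               alpha: float = 0.5, beta: float = 1.0):
    """One supervised distillation step combining the three loss terms."""
    with torch.no_grad():                       # the teacher is fixed
        t_maps, t_logits = teacher(images)
    s_maps, s_logits = student(images)

    l_lkd = intra_loss(t_maps, s_maps) + \
            sum(inter_loss(ft, fs) for ft, fs in zip(t_maps, s_maps))
    loss = l_lkd + alpha * kd_loss(t_logits, s_logits) \
                 + beta * F.cross_entropy(s_logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```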
In some optional embodiments, the number of the first feature maps and the second feature maps is multiple, and a plurality of the first feature maps correspond to different resolutions and a plurality of the second feature maps correspond to different resolutions.
Accordingly, step 102 may include:
according to the sequence of the resolutions from large to small, the first associated feature information is determined based on the feature information included in the first feature maps with different resolutions in sequence, and the second associated feature information is determined based on the feature information included in the second feature maps with different resolutions.
In the embodiment of the present disclosure, the knowledge distillation may be performed in stages in descending order of resolution, and the manner of knowledge distillation in each stage may be the manner provided in the above embodiments.
In the above embodiment, knowledge distillation may be performed in stages, thereby improving the performance of the student network.
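The staged schedule might be sketched as follows: stages are visited in descending resolution order, with one relation-distillation term per stage (inter_loss reuses the earlier sketch; the ordering criterion is an assumption):

```python
def staged_relation_losses(t_maps, s_maps):
    """Visit distillation stages in descending resolution order (a sketch).

    t_maps / s_maps: lists of (N, C, H, W) tensors at mixed resolutions.
    """
    order = sorted(range(len(t_maps)),
                   key=lambda i: t_maps[i].shape[-2] * t_maps[i].shape[-1],
                   reverse=True)                 # highest resolution first
    return [inter_loss(t_maps[i], s_maps[i]) for i in order]
```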
In some alternative embodiments, for example, as shown in fig. 6A, a plurality of sample images with classification labels are respectively input into the teacher network and the student network, and a plurality of first feature maps and a plurality of second feature maps can be obtained respectively. The knowledge distillation process is divided into a plurality of stages, and in descending order of resolution, the first associated feature information is sequentially determined based on the feature information included in the first feature maps of different resolutions, and the second associated feature information is determined based on the feature information included in the second feature maps of different resolutions.
For example, if a sample image corresponds to a plurality of first feature maps with different resolutions and to a plurality of second feature maps with different resolutions, then the first feature map and the second feature map with the highest resolution may be used in the first stage of knowledge distillation, and the resolutions of the first and second feature maps used in subsequent stages are gradually reduced; in the final stage of knowledge distillation, the first feature map and the second feature map with the lowest resolution corresponding to the sample image are used. In fig. 6A, at any stage, target regions of different sample images are determined from the plurality of first feature maps. A dot product operation combining the target regions of different sample images with the feature information included in the corresponding first feature maps yields the corresponding first attention feature maps; similarly, the corresponding second attention feature maps are obtained by a dot product operation combining the target regions with the feature information included in the corresponding second feature maps. The target region is then cropped from the first attention feature map to obtain a first target feature map, and in the same way the target region is cropped from the second attention feature map to obtain a second target feature map. First associated feature information may be determined from the feature information included in the first target feature map, and likewise second associated feature information from the feature information included in the second target feature map. A corresponding loss function is determined based on the difference of the second associated feature information relative to the first associated feature information, and the student network is trained by means of knowledge distillation.
The student network obtained in this way focuses more on the feature information of the target region, the redundancy and noise introduced in the knowledge distillation process are reduced, and the loss function is established using local associated features, so that the trained student network performs better and yields more accurate image classification results.
In fig. 6A, each sample image is input to a teacher network and a student network, respectively, to obtain a first feature map output by the teacher network and a second feature map output by the student network.
At any stage of knowledge distillation, the first feature map can be input into a first neural network trained in advance, a pixel normalization value corresponding to each of a plurality of regions included in the sample image can be obtained, namely an image mask corresponding to the sample image is obtained, and the image mask and feature information included in the first feature map are subjected to dot multiplication to obtain a first attention feature map. A second attention feature map may be obtained in the same manner.
Furthermore, thresholding is carried out on the image mask, the boundary range of the target region is determined, the target region is intercepted from the first attention feature map, and a first target feature map is obtained. In the same manner, a second target feature map may be obtained.
In the embodiment of the present disclosure, the number of the sample images is multiple, and the number of the first target feature maps and the second target feature maps is also multiple. Further, the first associated feature information and the second associated feature information may be determined in the manner shown in fig. 6B. Fig. 6B shows that the first associated feature information is determined from the feature information included in the plurality of first target feature maps obtained at a certain knowledge distillation stage, and the second associated feature information is determined from the feature information included in the plurality of second target feature maps obtained at that stage.
At least one of the similarity between the feature information of different regions on the same first feature map, the similarity between the feature information of the same region on the first feature map corresponding to different sample images, and the similarity between the feature information of different regions on the first feature map corresponding to different sample images may be determined to obtain the first associated feature information. In the same manner, the second associated characteristic information may be obtained.
After the corresponding loss function is determined according to equation 7, the student network is supervised-trained at the different stages by means of knowledge distillation.
In the above embodiment, more attention is paid to the category-related target region of the image, and knowledge distillation is performed based on local relation features, so that the redundancy and noise in the knowledge distillation process are reduced and the performance of the student network is improved.
In some optional embodiments, the present disclosure further provides an image classification method, comprising:
in step 201, an image to be detected is input into an image classification network, and a classification result of a target object included in the image to be detected output by the image classification network is obtained.
In an embodiment of the present disclosure, the image classification network is a student network obtained by training through any one of the above methods. After the image to be detected is input into the trained student network, the student network outputs the classification result to which the target object included in the image belongs. The classification result includes, but is not limited to, the category to which the target object belongs, or the categories to which the target object may belong together with the corresponding probabilities and/or scores; for example, the probability that the target object belongs to the first category is 60%, to the second category 15%, to the third category 15%, and to the fourth category 10%, where the first, second, third, and fourth categories may be mutually dissimilar or partially similar.
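As an illustration of this inference step, a trained student network might be queried as follows; `student_net` is a hypothetical handle to the trained student network, and the four-way split merely mirrors the example probabilities above:

```python
import torch
import torch.nn.functional as F

def classify(student_net, image: torch.Tensor) -> torch.Tensor:
    """Return class probabilities for one image to be detected (sketch)."""
    student_net.eval()                            # inference mode
    with torch.no_grad():
        logits = student_net(image.unsqueeze(0))  # (1, num_classes)
    return F.softmax(logits, dim=1).squeeze(0)    # e.g. [0.60, 0.15, 0.15, 0.10]
```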
In the above embodiment, the image classification network adopts the student network obtained by the above training. Because the student network focuses during training on the target region where the target object is located and on the associated features of that region, the trained network highlights the response values of the target region in the image to be detected during inference, so the output classification result is more accurate and reliable.
Corresponding to the foregoing method embodiments, the present disclosure also provides embodiments of an apparatus.
As shown in fig. 7, fig. 7 is a block diagram of a neural network training device according to an exemplary embodiment of the present disclosure, the device including: the obtaining module 310, configured to input the sample image into a teacher network and a student network respectively to obtain a first feature map output by the teacher network and a second feature map output by the student network; a first determining module 320, configured to determine first associated feature information based on feature information included in the first feature map, and determine second associated feature information based on feature information included in the second feature map; a first training module 330, configured to perform supervised training on the student network based on a difference between the second associated feature information and the first associated feature information.
In some optional embodiments, the apparatus further comprises: the second determination module is used for determining a target area based on the first feature map; a third determining module, configured to determine a first target feature map based on the target region and the first feature map, and determine a second target feature map based on the target region and the second feature map; the first determining module includes: the first determining sub-module is configured to determine the first associated feature information based on feature information included in the first target feature map, and determine the second associated feature information based on feature information included in the second target feature map.
In some optional embodiments, the target region of the sample image comprises a target object, and/or a region of the sample image other than the target region comprises at least one of: a background portion, a portion of the target object that is occluded.
In some optional embodiments, the second determining module comprises: the second determining submodule is used for inputting the first feature map into a pre-trained first neural network to obtain a pixel normalization value corresponding to each of a plurality of regions included in the sample image output by the first neural network; and the third determining submodule is used for taking the area of which the pixel normalization value is greater than or equal to a preset value as the target area.
In some optional embodiments, the third determining module comprises: a fourth determining submodule, configured to perform dot multiplication on the pixel normalization value corresponding to each of the multiple regions and the feature information included in the first feature map to obtain a first attention feature map, and perform dot multiplication on the pixel normalization value corresponding to each of the multiple regions and the feature information included in the second feature map to obtain a second attention feature map; a fifth determining submodule, configured to obtain the first target feature map based on the target region on the first attention feature map, and obtain the second target feature map based on the target region on the second attention feature map.
In some optional embodiments, the first neural network comprises a plurality of first network layers and a first output layer, and the second neural network comprises a plurality of second network layers and a second output layer, wherein the network structure of the plurality of second network layers is the same as the network structure of the plurality of first network layers; the device further comprises: the classification result acquisition module is used for taking the first feature map as the input of the second neural network to obtain a classification result output by the second neural network; the second training module is used for carrying out supervised training on the second neural network based on the difference of the classification result output by the second neural network relative to the classification result marked in the sample image; a fourth determining module, configured to use the trained network parameters of the plurality of second network layers included in the second neural network as the network parameters of the plurality of first network layers included in the first neural network.
In some optional embodiments, the first determining module comprises: a sixth determining submodule, configured to determine the first associated feature information based on similarity between feature information of different areas on the first feature map; a seventh determining sub-module, configured to determine the second associated feature information based on a similarity between feature information of different areas on the second feature map.
In some optional embodiments, the number of the sample images is plural, and the number of the first feature map and the second feature map is plural; the first determining module includes: the eighth determining submodule is used for determining the first associated feature information based on the similarity between feature information of the same area on the first feature map respectively corresponding to different sample images; and the ninth determining submodule is used for determining the second associated characteristic information based on the similarity between the characteristic information of the same area on the second characteristic map respectively corresponding to different sample images.
In some optional embodiments, the number of the sample images is plural, and the number of the first feature map and the second feature map is plural; the first determining module includes: a tenth determining submodule, configured to determine the first associated feature information based on similarities between feature information of different areas on the first feature map respectively corresponding to different sample images; and the eleventh determining submodule is used for determining the second associated characteristic information based on the similarity between the characteristic information of different areas on the second characteristic map respectively corresponding to different sample images.
In some optional embodiments, the number of the first feature maps and the second feature maps is multiple, and a plurality of the first feature maps correspond to different resolutions and a plurality of the second feature maps correspond to different resolutions; the first determining module includes: and the twelfth determining submodule is used for sequentially determining the first associated feature information based on the feature information included in the first feature maps with different resolutions according to the sequence of the resolutions from large to small, and determining the second associated feature information based on the feature information included in the second feature maps with different resolutions.
In some optional embodiments, the first training module comprises: and the training submodule is used for carrying out supervision training on the student network in a knowledge distillation mode.
In some optional embodiments, the sample image comprises a physical image labeling the classification result.
As shown in fig. 8, fig. 8 is a block diagram of an image classification apparatus according to an exemplary embodiment, the apparatus including: the image classification module 410 is configured to input an image to be detected into an image classification network, so as to obtain a classification result of a target object included in the image to be detected output by the image classification network; the image classification network is a student network obtained by training through any one of the methods.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the present disclosure further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is used to execute any one of the above neural network training methods or the above image classification method.
In some optional embodiments, the present disclosure provides a computer program product including computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the neural network training method or the image classification method provided in any one of the above embodiments.
In some optional embodiments, the present disclosure further provides another computer program product for storing computer readable instructions, which when executed, cause a computer to perform the neural network training method or the image classification method provided in any one of the above embodiments.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a software development kit (SDK) or the like.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke executable instructions stored in the memory to implement any of the neural network training methods or the image classification method described above.
Fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure. The electronic device 510 includes a processor 511 and may further include an input device 512, an output device 513, and a memory 514. The input device 512, the output device 513, the memory 514, and the processor 511 are connected to one another via a bus.
The memory includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and is used for storing instructions and data.
The input means are for inputting data and/or signals and the output means are for outputting data and/or signals. The output means and the input means may be separate devices or may be an integral device.
The processor may include one or more processors, for example one or more central processing units (CPUs); where the processor is a single CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory is used to store program codes and data of the network device.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.
It will be appreciated that fig. 9 shows only a simplified design of an electronic device. In practical applications, the electronic device may also include other necessary components, including but not limited to any number of input/output devices, processors, controllers, memories, etc., and all devices that can implement the neural network training method or the image classification method of the embodiments of the present disclosure are within the scope of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (17)

1. A neural network training method, comprising:
respectively inputting the sample images into a teacher network and a student network to obtain a first feature diagram output by the teacher network and a second feature diagram output by the student network;
determining first associated feature information based on feature information included in the first feature map, and determining second associated feature information based on feature information included in the second feature map;
performing supervised training on the student network based on the difference of the second associated feature information relative to the first associated feature information.
2. The method of claim 1, further comprising:
determining a target area based on the first feature map;
determining a first target feature map based on the target area and the first feature map, and determining a second target feature map based on the target area and the second feature map;
the determining the first associated feature information based on the feature information included in the first feature map and the determining the second associated feature information based on the feature information included in the second feature map includes:
determining the first associated feature information based on feature information included in the first target feature map, and determining the second associated feature information based on feature information included in the second target feature map.
3. The method according to claim 2, wherein the target region of the sample image comprises a target object, and/or wherein a region of the sample image other than the target region comprises at least one of: a background portion, a portion of the target object that is occluded.
4. The method of claim 2 or 3, wherein determining a target region based on the first feature map comprises:
inputting the first feature map into a pre-trained first neural network to obtain a pixel normalization value corresponding to each of a plurality of regions included in the sample image output by the first neural network;
and taking the area of which the pixel normalization value is greater than or equal to a preset value as the target area.
5. The method of claim 4, wherein determining a first target feature map based on the target region and the first feature map, and determining a second target feature map based on the target region and the second feature map comprises:
performing point multiplication on the pixel normalization value corresponding to each of the plurality of regions and the feature information included in the first feature map to obtain a first attention feature map, and performing point multiplication on the pixel normalization value corresponding to each of the plurality of regions and the feature information included in the second feature map to obtain a second attention feature map;
and obtaining the first target feature map based on the target area on the first attention feature map, and obtaining the second target feature map based on the target area on the second attention feature map.
6. The method of claim 4 or 5, wherein the first neural network comprises a plurality of first network layers and a first output layer, and wherein the second neural network comprises a plurality of second network layers and a second output layer, wherein the network structure of the plurality of second network layers is the same as the network structure of the plurality of first network layers;
the method further comprises the following steps:
taking the first feature map as the input of the second neural network to obtain a classification result output by the second neural network;
performing supervised training on the second neural network based on the difference of the classification result output by the second neural network relative to the classification result labeled in the sample image;
and taking the network parameters of the plurality of second network layers included in the trained second neural network as the network parameters of the plurality of first network layers included in the first neural network.
7. The method according to any one of claims 1-6, wherein the determining first associated feature information based on the feature information included in the first feature map comprises:
determining the first associated feature information based on the similarity between feature information of different areas on the first feature map;
the determining second associated feature information based on the feature information included in the second feature map includes:
and determining the second associated characteristic information based on the similarity between the characteristic information of different areas on the second characteristic diagram.
8. The method according to any one of claims 1 to 7, wherein the number of the sample images is plural, and the number of the first feature map and the second feature map is plural;
the determining of the first associated feature information based on the feature information included in the first feature map includes:
determining the first associated feature information based on the similarity between feature information of the same region on the first feature map respectively corresponding to different sample images;
the determining second associated feature information based on the feature information included in the second feature map includes:
and determining the second associated feature information based on the similarity between the feature information of the same region on the second feature maps respectively corresponding to different sample images.
9. The method according to any one of claims 1 to 8, wherein the number of the sample images is plural, and the number of the first feature map and the second feature map is plural;
the determining of the first associated feature information based on the feature information included in the first feature map includes:
determining the first associated feature information based on the similarity between feature information of different areas on the first feature map respectively corresponding to different sample images;
the determining second associated feature information based on the feature information included in the second feature map includes:
and determining the second associated feature information based on the similarity between the feature information of different areas on the second feature map respectively corresponding to different sample images.
10. The method according to any one of claims 1 to 9, wherein the number of the first feature maps and the second feature maps is plural, and a plurality of the first feature maps correspond to different resolutions and a plurality of the second feature maps correspond to different resolutions;
the determining the first associated feature information based on the feature information included in the first feature map and the determining the second associated feature information based on the feature information included in the second feature map includes:
according to the sequence of the resolutions from large to small, the first associated feature information is determined based on the feature information included in the first feature maps with different resolutions in sequence, and the second associated feature information is determined based on the feature information included in the second feature maps with different resolutions.
11. The method according to any one of claims 1-10, wherein said supervised training of said student network comprises:
and carrying out supervision training on the student network by adopting a knowledge distillation mode.
12. The method of any one of claims 1-11, wherein the sample image comprises a physical image with labeled classification results.
13. A method of image classification, the method comprising:
inputting an image to be detected into an image classification network to obtain a classification result of a target object included in the image to be detected and output by the image classification network;
the image classification network is a student network trained by the method of any one of claims 1-11.
14. A neural network training device, comprising:
the acquisition module is used for respectively inputting the sample images into a teacher network and a student network to obtain a first characteristic diagram output by the teacher network and a second characteristic diagram output by the student network;
the first determining module is used for determining first associated characteristic information based on the characteristic information included in the first characteristic diagram and determining second associated characteristic information based on the characteristic information included in the second characteristic diagram;
and the first training module is used for carrying out supervised training on the student network based on the difference of the second associated characteristic information relative to the first associated characteristic information.
15. An image classification apparatus, characterized in that the apparatus comprises:
the image classification module is used for inputting the image to be detected into an image classification network to obtain a classification result of a target object included in the image to be detected and output by the image classification network;
the image classification network is a student network trained by the method of any one of claims 1-11.
16. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the neural network training method of any one of the above claims 1 to 12 or the image classification method of claim 13.
17. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke executable instructions stored in the memory to implement the neural network training method of any one of claims 1-12 or the image classification method of claim 13.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination