CN114239731A - Training method of classification network, image classification method and device - Google Patents


Info

Publication number
CN114239731A
Authority
CN
China
Prior art keywords
feature
feature map
sample image
layer
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111565700.1A
Other languages
Chinese (zh)
Inventor
李阳光
邵婧
闫俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202111565700.1A
Publication of CN114239731A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method of a classification network, an image classification method and an image classification device, which include the following steps: acquiring a sample image and annotation information of the sample image; inputting the sample image into a binarized initial feature extraction layer of a classification network to be trained, and determining a target feature map corresponding to the sample image, wherein the initial feature extraction layer includes a channel value adjustment parameter to be trained; dividing the target feature map into a plurality of feature units based on a preset size, and inputting the target feature map into a feature fusion layer, wherein the feature fusion layer includes a plurality of binarized multi-layer perception modules, and each multi-layer perception module is used for performing feature fusion and deep-layer feature extraction on the feature units; and training the classification network based on the semantic feature map output by the feature fusion layer and the annotation information of the sample image.

Description

Training method of classification network, image classification method and device
Technical Field
The disclosure relates to the technical field of neural networks, in particular to a training method of a classification network, an image classification method and an image classification device.
Background
With the development and application of neural network technology, the requirements on the computing speed and computing precision of neural networks have become higher and higher, and the network scale of neural networks has therefore grown larger and larger. Running a large-scale neural network requires large-scale computing resources, which places high hardware requirements on the deployment device. Therefore, how to compress a neural network while guaranteeing its network precision has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the disclosure at least provides a training method of a classification network, an image classification method and an image classification device.
In a first aspect, an embodiment of the present disclosure provides a method for training a classification network, including:
acquiring a sample image and annotation information of the sample image;
inputting the sample image into a binarized initial feature extraction layer of a classification network to be trained, and determining a target feature map corresponding to the sample image; the initial feature extraction layer includes a channel value adjustment parameter to be trained;
dividing the target feature map into a plurality of feature units based on a preset size, and inputting the target feature map into a feature fusion layer, wherein the feature fusion layer includes a plurality of binarized multi-layer perception modules, and each multi-layer perception module is used for performing feature fusion and deep-layer feature extraction on the feature units;
and training the classification network based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image.
In the above method, the initial feature extraction layer of the classification network to be trained includes a channel value adjustment parameter to be trained, and this parameter can adjust the channel values of the first feature map extracted from the sample image by the initial feature extraction layer. Performing binarization on the first feature map after the channel value adjustment can be understood as dynamically determining the binarization threshold, thereby reducing the precision gap between binarized feature extraction and full-precision feature extraction. Furthermore, the feature fusion layer can perform feature fusion and deep-layer feature extraction on a plurality of feature units, so that the output semantic feature map takes both local features and global features into account, and the classification network trained based on the semantic feature map and the annotation information therefore has higher precision. Since both the initial feature extraction layer and the feature fusion layer are binarized, the network scale of the classification network is small; that is, both network scale and network precision are taken into account.
In a possible implementation manner, the inputting the sample image to a binarized initial feature extraction layer of a classification network to be trained, and determining a target feature map corresponding to the sample image includes:
carrying out global average pooling on the sample image, and determining a first feature map corresponding to the sample image;
performing first adjustment on the channel value of the first feature map based on the channel value adjustment parameter, and determining an adjusted second feature map;
performing binary activation processing on the sample image and the second feature map to determine a third feature map;
and performing feature extraction on the third feature map, and determining a target feature map corresponding to the sample image.
Here, since the channel value adjustment parameter is dynamic (i.e., trainable), the embedded representation (Embedding) of the adjusted sample image is also dynamic; that is, the sample image is characterized by a dynamic Embedding, and thus the precision gap between binarized feature extraction and full-precision feature extraction can be reduced.
In a possible implementation, the determining a third feature map by performing binary activation processing based on the sample image and the second feature map includes:
and determining the third feature map based on a first threshold value of a preset activation function and a difference value of corresponding channel values of the sample image and the second feature map.
In a possible implementation manner, the performing feature extraction on the third feature map and determining a target feature map corresponding to the sample image includes:
performing feature extraction on the third feature map, and determining a fourth feature map;
performing second adjustment on the feature value of the first feature map based on the channel value adjustment parameter, and determining an adjusted fifth feature map;
and performing feature fusion on the fourth feature map and the fifth feature map to determine the target feature map.
Here, in order to avoid the influence of binarization on the accuracy of the extracted features, the features of the original image may be added to the fourth feature map, that is, the fifth feature map and the fourth feature map after the second adjustment is performed on the first feature map may be fused.
In a possible embodiment, the output of the Nth multi-layer perception module in the feature fusion layer is the input of the (N+1)th multi-layer perception module, the input of the first multi-layer perception module is the target feature map, the output of the last multi-layer perception module is the semantic feature map, and N is a positive integer.
By carrying out deep-layer feature extraction through the multi-layer perception modules, the finally extracted semantic features can be guaranteed to be deep enough, thereby improving classification accuracy.
In a possible implementation manner, for any multi-layer perception module, the multi-layer perception module is configured to perform feature fusion and deep-layer feature extraction on feature units of an input feature map input into the multi-layer perception module by the following methods:
performing binary activation processing on the input feature map, and determining a sixth feature map;
performing feature exchange on the feature units of the sixth feature map according to at least one exchange distance to obtain an exchange feature map;
respectively extracting the features of the exchange feature map and the sixth feature map, and then performing feature fusion with the input feature map to obtain a fusion feature map;
and activating the fusion characteristic diagram to obtain an output characteristic diagram of the multilayer perception module.
Global features can be fused into the current feature unit through long-distance exchange, and local features can be fused into the current feature unit through short-distance exchange, so that the feature map obtained through this implementation can combine local features and global features; by contrast, feature extraction based on convolution can only combine local features, so the features fused by this method are more comprehensive.
In a possible implementation manner, for any exchange distance, the performing feature exchange on the feature units of the sixth feature map to obtain an exchange feature map includes:
for any feature unit, determining a feature unit to be exchanged corresponding to the feature unit in the sixth feature map based on the exchange distance;
and determining the value of the characteristic unit on each channel after the characteristic exchange is carried out on the basis of the value of the characteristic unit to be exchanged on the corresponding channel.
In one possible embodiment, performing feature extraction on the exchange feature map and the sixth feature map respectively and then performing feature fusion with the input feature map to obtain a fusion feature map includes:
respectively extracting features of the exchange feature map and the sixth feature map based on a binarization multi-layer perceptron, and determining a plurality of deep feature maps;
and after normalization processing is carried out on the plurality of deep layer feature maps, feature fusion is carried out on the deep layer feature maps and the input feature map, so that the fusion feature map is obtained.
In a possible embodiment, the training the classification network based on the semantic feature map output by the feature fusion layer and the annotation information of the sample image includes:
acquiring a full-precision teacher network corresponding to the classification network to be trained;
and training the classification network based on the semantic feature map output by the feature fusion layer, the labeling information of the sample image and the full-precision teacher network.
The acquired full-precision teacher network may be a pre-trained network, and its inference target may be the same as the inference target of the classification network. Since the network parameters of the full-precision teacher network are full-precision, its network precision is higher than that of the classification network, and the network precision of the classification network can be improved by distillation-training the classification network with the full-precision teacher network.
In a possible embodiment, the training the classification network based on the semantic feature map output by the feature fusion layer, the annotation information of the sample image, and the full-precision teacher network includes:
determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image; determining a second loss value based on the semantic feature map and the annotation information of the sample image;
training the classification network based on the first loss value and the second loss value.
In one possible embodiment, the determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image includes:
determining a second prediction result of the classification network based on a distillation head module and the semantic feature map; wherein the number of parameter bits of the distillation head module is the same as the number of parameter bits of the full-precision teacher network;
determining the first loss value based on the first prediction and the second prediction.
Keeping the number of parameter bits of the distillation head module the same as that of the full-precision teacher network serves, on the one hand, to let the distillation head module receive the high-precision features of the full-precision teacher network; on the other hand, distillation training presupposes that the number of parameter bits of the head module is the same as that of the teacher network.
In one possible embodiment, the determining a second loss value based on the semantic feature map and the annotation information of the sample image includes:
determining a third prediction result of the classification network based on a supervision header module and the semantic feature map;
determining the second loss value based on the third prediction result and annotation information of the sample image.
In one possible embodiment, after the training of the classification network is completed, the method further includes:
and constructing an inference head module based on the trained distillation head module and the trained supervision head module, wherein the inference head module is used for determining an inference result based on the output of the trained feature fusion layer of the classification network during network inference.
In this way, the inference head module can be constructed directly, without extra excessive computation, and the constructed inference head module has higher precision.
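The combination rule for the two trained heads is not pinned down at this point; purely as a hedged illustration, one plausible construction of the inference head module averages the logits of the trained distillation head module and the trained supervision head module, as in the following sketch (the class name and the averaging rule are assumptions, not the disclosed method):

```python
import torch
import torch.nn as nn


class InferenceHead(nn.Module):
    """Hypothetical inference head built from the trained distillation and supervision heads;
    averaging their logits is an assumption made for illustration only."""

    def __init__(self, distill_head: nn.Module, supervise_head: nn.Module):
        super().__init__()
        self.distill_head = distill_head
        self.supervise_head = supervise_head

    def forward(self, semantic_features: torch.Tensor) -> torch.Tensor:
        # Combine the two trained heads' outputs into a single inference result.
        return 0.5 * (self.distill_head(semantic_features) + self.supervise_head(semantic_features))
```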
In a second aspect, an embodiment of the present disclosure provides an image classification method, including:
acquiring an image to be classified;
and inputting the image to be classified into a classification network obtained by training based on the training method of the classification network described in the first aspect or any one of the possible embodiments of the first aspect, and determining a classification result of the image to be classified.
In a third aspect, an embodiment of the present disclosure provides a training apparatus for a classification network, including:
the first acquisition unit is used for acquiring the sample image and the labeling information of the sample image;
the feature extraction unit is used for inputting the sample image into a binarized initial feature extraction layer of the classification network to be trained and determining a target feature map corresponding to the sample image; the initial feature extraction layer includes a channel value adjustment parameter to be trained;
the feature fusion unit is used for dividing the target feature map into a plurality of feature units based on a preset size and inputting the target feature map into a feature fusion layer, wherein the feature fusion layer includes a plurality of binarized multi-layer perception modules, and each multi-layer perception module is used for performing feature fusion and deep-layer feature extraction on the feature units;
and the training unit is used for training the classification network based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image.
In one possible implementation manner, when the sample image is input to a binarized initial feature extraction layer of a classification network to be trained, and a target feature map corresponding to the sample image is determined, the feature extraction unit is configured to:
carrying out global average pooling on the sample image, and determining a first feature map corresponding to the sample image;
performing first adjustment on the channel value of the first feature map based on the channel value adjustment parameter, and determining an adjusted second feature map;
performing binary activation processing on the sample image and the second feature map to determine a third feature map;
and performing feature extraction on the third feature map, and determining a target feature map corresponding to the sample image.
In one possible implementation, the feature extraction unit, when performing binary activation processing based on the sample image and the second feature map to determine a third feature map, is configured to:
and determining the third feature map based on a first threshold value of a preset activation function and a difference value of corresponding channel values of the sample image and the second feature map.
In one possible embodiment, when performing feature extraction on the third feature map and determining a target feature map corresponding to the sample image, the feature extraction unit is configured to:
performing feature extraction on the third feature map, and determining a fourth feature map;
performing second adjustment on the feature value of the first feature map based on the channel value adjustment parameter, and determining an adjusted fifth feature map;
and performing feature fusion on the fourth feature map and the fifth feature map to determine the target feature map.
In a possible embodiment, the output of the Nth multi-layer perception module in the feature fusion layer is the input of the (N+1)th multi-layer perception module, the input of the first multi-layer perception module is the target feature map, the output of the last multi-layer perception module is the semantic feature map, and N is a positive integer.
In a possible implementation manner, for any multi-layer perception module, the multi-layer perception module is configured to perform feature fusion and deep-layer feature extraction on feature units of an input feature map input into the multi-layer perception module by the following methods:
performing binary activation processing on the input feature map, and determining a sixth feature map;
performing feature exchange on the feature units of the sixth feature map according to at least one exchange distance to obtain an exchange feature map;
respectively extracting the features of the exchange feature map and the sixth feature map, and then performing feature fusion with the input feature map to obtain a fusion feature map;
and activating the fusion characteristic diagram to obtain an output characteristic diagram of the multilayer perception module.
In a possible implementation manner, for any exchange distance, the feature fusion unit, when performing feature exchange on the feature cells of the sixth feature map to obtain an exchange feature map, is configured to:
for any feature unit, determining a feature unit to be exchanged corresponding to the feature unit in the sixth feature map based on the exchange distance;
and determining the value of the characteristic unit on each channel after the characteristic exchange is carried out on the basis of the value of the characteristic unit to be exchanged on the corresponding channel.
In one possible embodiment, when performing feature extraction on the exchange feature map and the sixth feature map respectively and then performing feature fusion with the input feature map to obtain a fusion feature map, the feature fusion unit is configured to:
respectively extracting features of the exchange feature map and the sixth feature map based on a binarization multi-layer perceptron, and determining a plurality of deep feature maps;
and after normalization processing is carried out on the plurality of deep layer feature maps, feature fusion is carried out on the deep layer feature maps and the input feature map, so that the fusion feature map is obtained.
In one possible embodiment, the training unit, when training the classification network based on the semantic feature map output by the feature fusion layer and the annotation information of the sample image, is configured to:
acquiring a full-precision teacher network corresponding to the classification network to be trained;
and training the classification network based on the semantic feature map output by the feature fusion layer, the labeling information of the sample image and the full-precision teacher network.
In one possible embodiment, when the classification network is trained based on the semantic feature map output by the feature fusion layer, the annotation information of the sample image, and the full-precision teacher network, the training unit is configured to:
determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image; determining a second loss value based on the semantic feature map and the annotation information of the sample image;
training the classification network based on the first loss value and the second loss value.
In a possible embodiment, the training unit, when determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image, is configured to:
determining a second prediction result of the classification network based on a distillation head module and the semantic feature map; wherein the number of parameter bits of the distillation head module is the same as the number of parameter bits of the full-precision teacher network;
determining the first loss value based on the first prediction and the second prediction.
In a possible embodiment, the training unit, when determining the second loss value based on the semantic feature map and the annotation information of the sample image, is configured to:
determining a third prediction result of the classification network based on a supervision header module and the semantic feature map;
determining the second loss value based on the third prediction result and annotation information of the sample image.
In a possible implementation, after the training of the classification network is completed, the apparatus further includes a construction unit configured to:
and constructing an inference head module based on the trained distillation head module and the trained supervision head module, wherein the inference head module is used for determining an inference result based on the output of the trained feature fusion layer of the classification network during network inference.
In a fourth aspect, an embodiment of the present disclosure provides an image classification apparatus, including:
the second acquisition unit is used for acquiring the image to be classified;
and the classification unit is used for inputting the image to be classified into a classification network obtained by training based on the training method of the classification network described in the first aspect or any possible implementation manner of the first aspect, and determining a classification result of the image to be classified.
In a fifth aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any one of the possible implementations of the first aspect, or the second aspect described above.
In a sixth aspect, this disclosed embodiment also provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps in the first aspect, or any one of the possible implementations of the first aspect, or performs the steps in the second aspect.
For the description of the effects of the training apparatus, the computer device, and the computer-readable storage medium of the classification network, reference is made to the description of the training method of the classification network, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It is appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope, since those skilled in the art will be able to derive additional related drawings from them without inventive effort.
Fig. 1 is a schematic diagram illustrating a method for performing binarization compression on a convolutional neural network according to an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a training method of a classification network according to an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating a method for determining a target feature map corresponding to a sample image according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram illustrating an internal structure of an initial feature extraction layer provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a structure of a feature fusion layer provided by an embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating a method for implementing a multi-layer aware module provided by an embodiment of the present disclosure;
fig. 7a is a schematic diagram illustrating an implementation of a short-distance exchange operation provided by an embodiment of the present disclosure;
fig. 7b is a schematic diagram illustrating an implementation of a long-distance exchange operation provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating an internal structure of a multi-layered sensing module provided by an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram illustrating a training method of a classification network according to an embodiment of the present disclosure;
FIG. 10 is a flow chart illustrating an image classification method provided by an embodiment of the present disclosure;
FIG. 11 is a schematic diagram illustrating an architecture of a training apparatus for a classification network according to an embodiment of the present disclosure;
fig. 12 is a schematic diagram illustrating an architecture of an image classification apparatus provided in an embodiment of the present disclosure;
fig. 13 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
In the related art, when a classification network is compressed, binarization compression is generally performed on a full-precision Convolutional Neural Network (CNN); the network structure is as shown in fig. 1. However, in this method, on the one hand, since the binarization threshold is fixed, precision is seriously damaged when the feature map is binarized; on the other hand, extracting deep features based on convolution modules can only ensure feature depth while neglecting the correlation between features, so this compression method results in lower model accuracy.
Based on this research, the present disclosure provides a training method for a classification network. The initial feature extraction layer of the classification network to be trained includes a channel value adjustment parameter to be trained, and this parameter can adjust the channel values of the first feature map extracted from the sample image by the initial feature extraction layer. Performing binarization on the first feature map after the channel value adjustment can be understood as dynamically determining the binarization threshold, thereby reducing the precision gap between binarized feature extraction and full-precision feature extraction. Furthermore, the feature fusion layer can perform feature fusion and deep-layer feature extraction on a plurality of feature units, so that the output semantic feature map takes both local features and global features into account, and the classification network trained based on the semantic feature map and the annotation information therefore has higher precision. Since both the initial feature extraction layer and the feature fusion layer are binarized, the network scale of the classification network is small; that is, both network scale and network precision are taken into account.
The above-mentioned drawbacks are the result of the inventor's practice and careful study; therefore, the process of discovering the above problems, and the solutions proposed by the present disclosure to these problems, should both be regarded as the inventor's contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiment, first, a detailed description is given to a training method for a classification network disclosed in an embodiment of the present disclosure, where an execution subject of the training method for a classification network provided in an embodiment of the present disclosure is generally a computer device with certain computing power, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the training method of the classification network may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to fig. 2, a flowchart of a training method for a classification network provided in an embodiment of the present disclosure is shown, where the method includes steps 201 to 204, where:
step 201, obtaining a sample image and annotation information of the sample image.
Step 202, inputting the sample image into a binarized initial feature extraction layer of a classification network to be trained, and determining a target feature map corresponding to the sample image; the initial feature extraction layer includes a channel value adjustment parameter to be trained.
Step 203, dividing the target feature map into a plurality of feature units based on a preset size, and inputting the target feature map into a feature fusion layer, where the feature fusion layer includes a plurality of binarized multi-layer perception modules, and each multi-layer perception module is used for performing feature fusion and deep-layer feature extraction on the feature units.
And step 204, training the classification network based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image.
The following is a detailed description of the above-described method.
For step 201,
In a possible implementation manner, the sample image may be an image to be classified, and the labeling information of the sample image may refer to a classification result of the sample image labeled in advance.
With respect to step 202,
In a possible implementation manner, when the sample image is input to a binarized initial feature extraction layer of a classification network to be trained, and a target feature map corresponding to the sample image is determined, specifically, the method may include the following steps as shown in fig. 3:
step 301, performing global average pooling on the sample image, and determining a first feature map corresponding to the sample image.
Step 302, performing a first adjustment on the channel value of the first feature map based on the channel value adjustment parameter, and determining an adjusted second feature map.
Step 303, performing binary activation processing based on the sample image and the second feature map, and determining a third feature map.
And step 304, performing feature extraction on the third feature map, and determining a target feature map corresponding to the sample image.
In step 301, after performing Global Average Pooling (GAP) processing on the sample image, a preliminary embedded representation corresponding to the sample image may be obtained, that is, the first feature map is a preliminary Embedding of the sample image.
Then, in step 302, the channel value of the first feature map is adjusted by the channel value adjustment parameter, which may be understood as adjusting the preliminary Embedding.
Specifically, the channel value adjustment parameter may include a first adjustment parameter, a second adjustment parameter and a third adjustment parameter. When the first adjustment is performed on the channel values of the first feature map based on the channel value adjustment parameter, the channel values of the first feature map may be adjusted by the first adjustment parameter and the second adjustment parameter; for example, the value on each channel of the first feature map may be linearly combined with the first adjustment parameter and the second adjustment parameter. The third adjustment parameter is used to perform the second adjustment on the channel values of the first feature map, which will be described in detail below.
Here, since the channel value adjustment parameter is dynamic (i.e., trainable), the adjusted Embedding is also dynamic, i.e., the sample image is characterized by dynamic Embedding.
In step 303, the performing binary activation processing based on the sample image and the second feature map to determine a third feature map may be determining the third feature map based on a first threshold of a preset activation function and a difference between corresponding channel values of the sample image and the second feature map.
Specifically, a difference between the corresponding channel values of the sample image and the second feature map may be calculated: if the channel value of the sample image is greater than the channel value of the second feature map, the channel value is set to +1, and if the channel value of the sample image is less than or equal to the channel value of the second feature map, the channel value is set to −1.
Here, although the first threshold value of the activation function is fixed, since the channel value of the first feature map is adjusted in step 302 before the binarization processing based on the activation function, it is also substantially equivalent to dynamically adjusting the first threshold value of the activation function.
In step 304, when feature extraction is performed on the third feature map and a target feature map corresponding to the sample image is determined, feature extraction may be performed on the third feature map first to determine a fourth feature map; secondly, performing second adjustment on the characteristic value of the first characteristic diagram based on the channel value adjustment parameter, and determining an adjusted fifth characteristic diagram; and then carrying out feature fusion on the fourth feature map and the fifth feature map to determine the target feature map.
Specifically, when the feature value of the first feature map is adjusted based on the channel value adjustment parameter, the feature value of the first feature map may be adjusted based on the first adjustment parameter and the third adjustment parameter, and a specific adjustment method may be similar to the first adjustment method.
For the feature fusion of the fourth feature map and the fifth feature map, values on channels corresponding to the fourth feature map and the fifth feature map may be summed to determine the target feature map.
For example, the internal structure of the initial feature extraction layer may be as shown in fig. 4, where W1 denotes the first adjustment parameter, W2 denotes the second adjustment parameter, and W3 denotes the third adjustment parameter.
After the sample image is input, one branch of it is processed by GAP to obtain the first feature map, which is then adjusted by the first adjustment parameter and the second adjustment parameter to obtain the second feature map. The first feature map may be adjusted based on the first adjustment parameter by the following formula:
α(X) = GAP(X)W1 + bias_α    (1)
where X denotes the sample image (i.e., its feature points, namely pixel points), bias_α denotes a bias parameter (a hyper-parameter), α(X) denotes the adjusted feature map, and GAP(X) denotes the first feature map.
For example, the adjustment may be performed based on the second adjustment parameter by the following formula:
β(X) = α(X)W2 + bias_β    (2)
where bias_β denotes a bias parameter (a hyper-parameter), and β(X) denotes the second feature map.
Then, a binary activation process is performed based on the difference between the channel values of the sample image and the second feature map and the first threshold value, so as to determine a third feature map, which may be calculated by the following formula:
Q_b(x) = sign(x − β(X)) = { +1, if x > β(X); −1, if x ≤ β(X) }    (3)
where x denotes a channel value of the sample image, β(X) denotes the first threshold, Q_b(x) denotes the third feature map, and sign denotes the activation function.
Through the processing, the sample image can be converted into a binary embedding representation, and then the features of the third feature map can be extracted through binary convolution to obtain a fourth feature map.
Further, in order to avoid the influence of binarization on the accuracy of the extracted features, the features of the original image may be added to the fourth feature map, that is, the fifth feature map and the fourth feature map after the second adjustment is performed on the first feature map may be fused.
For example, when performing the second adjustment on the first feature map, the following formula may be used:
γ(X) = α(X)W3 + bias_γ    (4)
where bias_γ denotes a bias parameter (a hyper-parameter), and γ(X) denotes the fifth feature map.
When the fifth feature map and the fourth feature map are fused, the fusion may be performed by the following formula:
x2' = x1' + γ(X)    (5)
where x1' denotes the fourth feature map after feature extraction, and x2' denotes the fused target feature map.
In the method, the channel value of the first feature map is adjusted, and then the first feature map after the channel value is adjusted is subjected to binarization processing, which can be essentially understood as dynamically determining a binarization threshold value, so that the precision difference between the binarization feature extraction and the full-precision feature extraction is reduced.
Here, although the third feature map is a binary feature map, and the fourth feature map obtained by performing feature extraction (binary convolution) on the third feature map is also binary, the fifth feature map is not binary, so the target feature map obtained by fusing the fourth and fifth feature maps is not binary either; therefore, a binarization operation using an activation function is still needed in subsequent processing.
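Purely as an illustration of how equations (1) to (5) fit together, a minimal PyTorch-style sketch of the binarized initial feature extraction layer is given below; the module name, the use of nn.Linear layers for W1, W2 and W3, and the ordinary 1×1 convolution standing in for the binary convolution are assumptions of this sketch rather than the exact implementation of the disclosure.

```python
import torch
import torch.nn as nn


class DynamicBinaryEmbedding(nn.Module):
    """Sketch of the binarized initial feature extraction layer (equations (1) to (5))."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Trainable channel value adjustment parameters W1, W2, W3 with their bias terms.
        self.w1 = nn.Linear(in_channels, in_channels)    # alpha(X) = GAP(X)W1 + bias_alpha
        self.w2 = nn.Linear(in_channels, in_channels)    # beta(X)  = alpha(X)W2 + bias_beta
        self.w3 = nn.Linear(in_channels, out_channels)   # gamma(X) = alpha(X)W3 + bias_gamma
        # Feature extraction on the binarized map; an ordinary 1x1 convolution stands in
        # for the binary convolution of the disclosure.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gap = x.mean(dim=(2, 3))                 # first feature map: GAP(X)
        alpha = self.w1(gap)                     # equation (1)
        beta = self.w2(alpha)                    # equation (2): dynamic per-channel threshold
        gamma = self.w3(alpha)                   # equation (4): second adjustment
        # Equation (3): +1 where x exceeds beta(X), -1 otherwise.
        x3 = torch.sign(x - beta.view(b, c, 1, 1))
        x3 = torch.where(x3 == 0, torch.full_like(x3, -1.0), x3)
        x4 = self.conv(x3)                       # fourth feature map
        # Equation (5): fuse with the (non-binary) fifth feature map gamma(X).
        return x4 + gamma.view(b, -1, 1, 1)
```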
For step 203,
Here, when the target feature map is divided into a plurality of feature units, the feature units may be divided according to a preset size; each feature unit may be regarded as a token, and each token may be understood as a tensor of size c × n × m, where c denotes the number of channels and n and m denote the preset size.
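For illustration only, the division of a target feature map into c × n × m feature units (tokens) might be sketched as follows; the unfold-based helper is an assumed convenience rather than the disclosed implementation, and it presumes the spatial size is divisible by the preset size.

```python
import torch


def split_into_feature_units(feature_map: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Split a (c, H, W) target feature map into feature units (tokens) of size c x n x m.

    Assumes H is divisible by n and W by m; returns a tensor of shape
    (H // n, W // m, c, n, m), i.e. a grid of tokens.
    """
    units = feature_map.unfold(1, n, n).unfold(2, m, m)   # (c, H//n, W//m, n, m)
    return units.permute(1, 2, 0, 3, 4).contiguous()


# Usage example: a 64-channel 28x28 map split into a 7x7 grid of 64x4x4 tokens.
tokens = split_into_feature_units(torch.randn(64, 28, 28), n=4, m=4)
assert tokens.shape == (7, 7, 64, 4, 4)
```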
An exemplary structure of the feature fusion layer may be as shown in fig. 5; it includes a plurality of multi-layer perception modules, where the output of the Nth multi-layer perception module is the input of the (N+1)th multi-layer perception module, the input of the first multi-layer perception module is the target feature map, the output of the last multi-layer perception module is the semantic feature map, and N is a positive integer.
For any multi-layer perception module, in one possible implementation, feature fusion and deep-level feature extraction may be performed by the method shown in fig. 6, which includes the following steps:
step 601, performing binary activation processing on the input feature map, and determining a sixth feature map.
And step 602, performing feature exchange on the feature units of the sixth feature map according to at least one exchange distance to obtain an exchange feature map.
And 603, respectively extracting the features of the exchange feature map and the sixth feature map, and then performing feature fusion with the input feature map to obtain a fusion feature map.
And step 604, activating the fusion feature map to obtain an output feature map of the multilayer perception module.
In step 601, performing binary activation processing on the input feature map can be understood as binarizing the input feature map based on an activation function. Since the input of the first multi-layer perception module is the target feature map, which is not a binary feature map, the activation processing further compresses the network parameters; the input of each of the other multi-layer perception modules is the output of the previous multi-layer perception module, which has been processed through the above steps and is not necessarily a binary feature map, so binary activation processing is also required there.
For example, when the binary activation process is performed, the calculation may be performed by the following formula:
Q_b'(x) = sign(x − α) = { +1, if x > α; −1, if x ≤ α }    (6)
where α denotes the second threshold of the activation function, sign denotes the activation function, x denotes a feature value of the input feature map, and Q_b'(x) denotes the sixth feature map.
In step 602, the exchange distance may be a preset distance; for example, it may be the distance to an adjacent feature unit, or half the size of the feature map.
For any exchange distance, performing feature exchange on the feature units of the sixth feature map to obtain an exchange feature map may be done as follows: for any feature unit, a feature unit to be exchanged corresponding to that feature unit in the sixth feature map is determined based on the exchange distance; then the value of the feature unit on each channel after the feature exchange is determined based on the values of the feature units to be exchanged on the corresponding channels.
In a possible implementation, the exchange operation can be divided into a long-distance exchange operation and a short-distance exchange operation according to the exchange distance. The long-distance exchange operation may refer to exchanging features with units to be exchanged that are located half the feature-map size (h/2 or w/2) away from the current feature unit, and the short-distance exchange operation may refer to exchanging features with units to be exchanged adjacent to the current feature unit.
For example, as shown in fig. 7a, if the current feature unit is feature unit A (i.e., the feature unit pointed to by the arrow), the feature units adjacent to feature unit A are feature units B, C, D and E; then, for feature unit A, when the short-distance exchange operation is performed, the values of feature units B, C, D and E on the corresponding channels may be used to determine the value of feature unit A on each channel after the feature exchange.
Specifically, when determining the value of the feature unit after feature exchange based on the value of the feature unit to be exchanged on the corresponding channel, the corresponding channels of the feature unit to be exchanged at different positions may be different, for example, if the feature unit has c channels, the values on the 0 th channel to the c/4 th channel of the feature unit to be exchanged on the left side of the current feature unit may be taken as the values on the 0 th channel to the c/4 th channel of the feature unit a after feature exchange; taking values on the c/4 th channel to the c/2 th channel of the feature unit to be exchanged on the right side of the current feature unit as values on the c/4 th channel to the c/2 th channel of the feature unit a after feature exchange; taking values on the c/2 th channel to the 3c/4 th channel of the feature unit to be exchanged on the upper side of the current feature unit as values on the c/2 th channel to the 3c/4 th channel of the feature unit a after feature exchange; and taking values on 3c/4 th to c th channels of the feature unit to be exchanged on the lower side of the current feature unit as values on the 3c/4 th to c th channels of the feature unit a after feature exchange.
For example, as shown in fig. 7b, if the current feature unit is feature unit A (i.e., the feature unit pointed to by the arrow) and the size of the feature map is 7 × 7, then in the long-distance exchange operation the units to be exchanged, located half the feature-map size away from feature unit A, are feature units B, C, D and E; the remaining exchange process is similar to the exchange process described above and is not described again.
In a special case, for a feature unit at the edge, if the feature unit to be exchanged that is one exchange distance away exceeds the boundary, the feature map may be tiled again at the edge position as shown in fig. 7b, and the feature unit to be exchanged is then determined according to the exchange distance.
In a specific implementation, when determining the position coordinates of each feature unit to be exchanged, the following formula may be exemplarily used:
S(r1, r2) = {y : y = (x1 + r1, x2 + r2)}    (7)
where S(r1, r2) denotes the set of coordinates of the feature units to be exchanged, (x1, x2) denotes the position coordinates of the current feature unit, and r1 and r2 denote the horizontal and vertical exchange distances; the coordinates of the feature units to be exchanged are then (x1, x2 + r2), (x1 + r1, x2), (x1 − r1, x2) and (x1, x2 − r2), respectively.
Illustratively, when the short-distance exchange operation is performed, the calculation can be performed by the following formula:
A_b^s = Cat(A_b[0 : c/4]_S(−1,0), A_b[c/4 : c/2]_S(1,0), A_b[c/2 : 3c/4]_S(0,−1), A_b[3c/4 : c]_S(0,1))    (8)
where A_b^s denotes the current feature unit after the short-distance exchange, A_b[0 : c/4]_S(−1,0) denotes the values of the 0th to c/4-th channels of the feature unit whose coordinate difference from the current feature unit is (−1, 0), A_b[c/4 : c/2]_S(1,0) denotes the values of the c/4-th to c/2-th channels of the feature unit whose coordinate difference from the current feature unit is (1, 0), A_b[c/2 : 3c/4]_S(0,−1) denotes the values of the c/2-th to 3c/4-th channels of the feature unit whose coordinate difference from the current feature unit is (0, −1), A_b[3c/4 : c]_S(0,1) denotes the values of the 3c/4-th to c-th channels of the feature unit whose coordinate difference from the current feature unit is (0, 1), Cat denotes the concat operation, and c denotes the number of channels.
For example, when the long-distance exchange operation is performed, the calculation can be performed by the following formula:
A_b^l = Cat(A_b[0 : c/4]_S(−h/2,0), A_b[c/4 : c/2]_S(h/2,0), A_b[c/2 : 3c/4]_S(0,−w/2), A_b[3c/4 : c]_S(0,w/2))    (9)
where h denotes the length of the feature map, w denotes the width of the feature map, A_b^l denotes the current feature unit after the long-distance exchange, and the rest of the notation is the same as in equation (8), so it is not described again.
By this method, the current feature unit can be fused with global features through the long-distance exchange and with local features through the short-distance exchange, so that the feature map obtained in this way combines local and global features; by contrast, feature extraction based on convolution can only combine local features, so the features fused by this method are more comprehensive.
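As one possible reading of equations (7) to (9), the channel-sliced feature exchange can be sketched with torch.roll over the grid of feature units; the wrap-around handling of boundary units (the tiled edge of fig. 7b) and the exact sign convention of the offsets are assumptions of this sketch.

```python
import torch


def exchange_features(units: torch.Tensor, r1: int, r2: int) -> torch.Tensor:
    """Channel-sliced feature exchange over a (B, C, H, W) grid of feature units.

    Each output position takes channels [0, C/4) from the unit at coordinate offset
    (-r1, 0), [C/4, C/2) from (+r1, 0), [C/2, 3C/4) from (0, -r2) and [3C/4, C) from
    (0, +r2), mirroring equations (8) and (9); out-of-boundary neighbours wrap around.
    """
    c = units.shape[1]
    q = c // 4
    return torch.cat([
        torch.roll(units[:, 0 * q:1 * q], shifts=(r1, 0), dims=(2, 3)),   # from S(-r1, 0)
        torch.roll(units[:, 1 * q:2 * q], shifts=(-r1, 0), dims=(2, 3)),  # from S(+r1, 0)
        torch.roll(units[:, 2 * q:3 * q], shifts=(0, r2), dims=(2, 3)),   # from S(0, -r2)
        torch.roll(units[:, 3 * q:], shifts=(0, -r2), dims=(2, 3)),       # from S(0, +r2)
    ], dim=1)


# Short-distance exchange uses adjacent units (r1 = r2 = 1); long-distance exchange uses
# half the feature-map size (r1 = H // 2, r2 = W // 2), as in equations (8) and (9).
grid = torch.randn(1, 8, 7, 7)
short_swap = exchange_features(grid, 1, 1)
long_swap = exchange_features(grid, 7 // 2, 7 // 2)
```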
In step 603, there may be multiple exchange feature maps. When feature extraction is performed on the exchange feature maps and the sixth feature map respectively and the results are then fused with the input feature map to obtain a fusion feature map, the exchange feature maps and the sixth feature map may each be subjected to feature extraction based on a binarized multi-layer perceptron to determine a plurality of deep feature maps; after normalization, the deep feature maps are then fused with the input feature map to obtain the fusion feature map.
Here, the fusion with the input feature map is intended to prevent overfitting or gradient problems in the features after the preceding feature exchange and feature extraction.
Or, in another possible implementation, the values of the to-be-exchanged feature unit in each channel may be averaged, and the average value may be used as the value of the current feature unit channel.
The activation processing in step 604 may be different from the binary activation processing in steps 303 and 601; for example, it may be RPReLU processing or the like. The thresholds used in the binary activation processing in step 303 and step 601 may also differ, that is, the first threshold and the second threshold may be different.
For example, the internal structure of the multi-layer perception module may be as shown in fig. 8, where Binary MLP is a binarized multi-layer perceptron for deep feature extraction and RPReLU is an activation function.
The normalizing process performed on the plurality of deep layer feature maps may be performed after the plurality of deep layer feature maps are fused, and the step of fusing the plurality of deep layer feature maps may be calculated according to the following formula:
A_i' = MLP1(A_b^l) + MLP2(A_b^s) + MLP3(A_b)    (10)
where A_i' denotes the fused feature map, A_b denotes the sixth feature map, A_b^l denotes the deep feature map after the long-distance exchange operation, A_b^s denotes the deep feature map after the short-distance exchange operation, MLP3(A_b) denotes the feature map after feature extraction corresponding to the sixth feature map (i.e., the output of the third MLP in fig. 8), MLP1(A_b^l) denotes the feature map after feature extraction corresponding to the deep feature map after the long-distance exchange operation (i.e., the output of the first MLP in fig. 8), and MLP2(A_b^s) denotes the feature map after feature extraction corresponding to the deep feature map after the short-distance exchange operation (i.e., the output of the second MLP in fig. 8).
After the normalization processing is performed on the plurality of deep layer feature maps, feature fusion is performed on the plurality of deep layer feature maps and the input feature map to obtain a fusion feature map, which can be exemplarily calculated by the following formula:
Ai=RPReLU(BN(Ai')+Ai-1) (11)
wherein Ai denotes the fused feature map, RPReLU denotes the activation function, BN denotes the normalization processing, and Ai-1 denotes the input feature map.
Here, the fused feature map combines deeper local features and global features, so when the classification result is determined based on the fused feature map, the accuracy is higher.
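Read together, formulas (10) and (11) describe one multi-layer perception module: the input feature map is binarized, exchanged over a long and a short distance, each branch is passed through a binarized MLP, the three branch outputs are summed, normalized, added to the input feature map and activated. The sketch below strings these steps together, reusing the token_exchange, BinaryMLP and RPReLU sketches given earlier; the exchange distances, the use of batch normalization for BN and the omission of the second activation threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLayerPerceptionModule(nn.Module):
    """Sketch of one binarized multi-layer perception module (formulas (10) and (11))."""

    def __init__(self, dim: int, short_dist: int = 1, long_dist: int = 8):
        super().__init__()
        self.mlp_long = BinaryMLP(dim, dim)    # first MLP in fig. 8
        self.mlp_short = BinaryMLP(dim, dim)   # second MLP in fig. 8
        self.mlp_plain = BinaryMLP(dim, dim)   # third MLP in fig. 8
        self.bn = nn.BatchNorm1d(dim)
        self.act = RPReLU(dim)
        self.short_dist, self.long_dist = short_dist, long_dist

    def forward(self, a_prev: torch.Tensor) -> torch.Tensor:
        # a_prev: input feature map A_{i-1} of shape (batch, tokens, channels).
        sixth = torch.sign(a_prev)                              # binary activation (threshold omitted)
        long_map = token_exchange(sixth, self.long_dist)        # long-distance exchange
        short_map = token_exchange(sixth, self.short_dist)      # short-distance exchange
        # Formula (10): sum of the three binarized MLP branches.
        fused = (self.mlp_long(long_map) + self.mlp_short(short_map)
                 + self.mlp_plain(sixth))
        # Formula (11): A_i = RPReLU(BN(A_i') + A_{i-1}).
        normed = self.bn(fused.transpose(1, 2)).transpose(1, 2)
        return self.act(normed + a_prev)
```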
With respect to step 104,
In one possible implementation, when the classification network is trained based on the semantic feature map output by the feature fusion layer and the annotation information of the sample image, the classification result of the classification network may be determined based on the semantic feature map and a supervision head module (i.e. supervised head); a loss value (for example, a cross-entropy loss) may then be determined based on the classification result and the annotation information, and the classification network may be trained based on the loss value.
In another possible embodiment, in order to improve the network accuracy of the classification network, the classification network may be trained by distillation.
Specifically, when the classification network is trained based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image, the method may include the following steps:
and step A, acquiring a full-precision teacher network corresponding to the classification network to be trained.
The acquired full-precision teacher network may be a network that has already been trained, and the inference target of the full-precision teacher network may be the same as that of the classification network. Because the network parameters of the full-precision teacher network are full-precision, its network accuracy is higher than that of the classification network, so distillation training of the classification network with the full-precision teacher network can improve the network accuracy of the classification network.
And B, training the classification network based on the semantic feature map output by the feature fusion layer, the labeling information of the sample image and the full-precision teacher network.
In a possible implementation manner, when the classification network is trained based on the semantic feature map output by the feature fusion layer, the annotation information of the sample image, and the full-precision teacher network, a first loss value may be determined based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image; a second loss value may be determined based on the semantic feature map and the annotation information of the sample image, and the classification network may then be trained based on the first loss value and the second loss value.
Wherein the first loss value is used for representing distillation loss of the full-precision teacher network during distillation training, and the second loss value is used for representing classification loss of the classification network. And training the classification network by combining the first loss value and the second loss value, so that the network precision of the classification network can be improved.
Specifically, when determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image, a second prediction result of the classification network may be determined based on a distillation head module and the semantic feature map; wherein the number of parameter bits of the distillation head module is the same as the number of parameter bits of the full-precision teacher network; the first penalty value is then determined based on the first prediction and the second prediction.
The purpose of keeping the number of parameter bits of the distillation head module the same as that of the full-precision teacher network is to allow the distillation head module to receive the high-precision features of the full-precision teacher network; the distillation training is premised on the head module and the teacher network having the same number of parameter bits.
Specifically, when determining the second loss value based on the semantic feature map and the annotation information of the sample image, a third prediction result of the classification network may be determined based on a supervision header module and the semantic feature map; the second loss value is then determined based on the third prediction result and annotation information of the sample image.
Illustratively, the first loss value may be a relative entropy (Kullback-Leibler divergence, KL) loss; the second loss value may be a cross-entropy loss or the like.
In training the classification network based on the first loss value and the second loss value, the first loss value and the second loss value may be weighted and summed to determine a total loss value, and then the classification network may be trained based on the total loss value.
The total loss value may, for example, be calculated by the following formula:
L = Lcls(Ws·Z + bs, y) + Lkd(Wd·Z + bd, yt)    (12)

wherein Lcls denotes the second loss value; Lkd denotes the first loss value; Ws·Z + bs denotes the output value of the supervision head module; Ws denotes the weight of the supervision head module; bs denotes the bias parameter of the supervision head module; y denotes the annotation information of the sample image; yt denotes the output of the full-precision teacher network, i.e. the first prediction result; Wd denotes the weight of the distillation head module; bd denotes the bias parameter of the distillation head module; Z denotes the semantic feature map output by the feature fusion layer; and L denotes the total loss value.
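As a concrete illustration of formula (12) and of the weighted summation described above, the sketch below computes the cross-entropy loss of the supervision head and the KL-divergence loss of the distillation head and combines them; the weighting coefficients are illustrative assumptions (the patent only states that the two loss values may be weighted and summed).

```python
import torch
import torch.nn.functional as F

def total_loss(z, labels, teacher_logits, w_s, b_s, w_d, b_d,
               alpha: float = 1.0, beta: float = 1.0):
    """z: pooled semantic feature map of shape (batch, dim); labels: (batch,) class ids."""
    sup_out = F.linear(z, w_s, b_s)      # output of the supervision head module
    dist_out = F.linear(z, w_d, b_d)     # output of the distillation head module

    # Second loss value: cross entropy against the annotation information y.
    loss_cls = F.cross_entropy(sup_out, labels)

    # First loss value: KL divergence against the teacher prediction y_t.
    loss_kd = F.kl_div(F.log_softmax(dist_out, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")

    return alpha * loss_cls + beta * loss_kd   # weighted sum used to train the network
```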
In a possible implementation, after the training of the classification network is completed, an inference head module, that is, the parameters of the inference head module, may be constructed based on the trained distillation head module and supervision head module, so as to improve the accuracy of the classification network during network inference.
The inference head module is used for determining an inference result based on the output of the feature fusion layer of the trained classification network when network inference is performed.
Illustratively, the internal calculation process of the inference header module can be represented by the following formula:
BiMLPs(X) = ((Ws + Wd)·X + (bs + bd)) / 2    (13)

wherein BiMLPs(X) denotes the inference result of the inference head module, X denotes the output of the trained feature fusion layer, Ws denotes the weight of the supervision head module, Wd denotes the weight of the distillation head module, bs denotes the bias parameter of the supervision head module, and bd denotes the bias parameter of the distillation head module.
By this method, the inference head module can be constructed directly without excessive extra computation, and the constructed inference head module has higher accuracy.
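One simple way to realize this construction, given here only as an assumed sketch consistent with the reconstruction of formula (13) above, is to average the trained weights and biases of the supervision head and the distillation head into a single linear inference head.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def build_inference_head(sup_head: nn.Linear, dist_head: nn.Linear) -> nn.Linear:
    """Fuse the trained supervision and distillation heads into one inference head."""
    head = nn.Linear(sup_head.in_features, sup_head.out_features)
    head.weight.copy_((sup_head.weight + dist_head.weight) / 2)   # (Ws + Wd) / 2
    head.bias.copy_((sup_head.bias + dist_head.bias) / 2)         # (bs + bd) / 2
    return head
```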
The training method of the classification network will be described below with reference to a specific structural drawing. Referring to fig. 9, a structural schematic diagram of the training method of the classification network provided by the embodiment of the present disclosure is shown, which specifically includes:
After the sample images are obtained, the sample images are, on the one hand, input into the full-precision teacher network used for distillation training and, on the other hand, input into the classification network to be trained for feature extraction.
Specifically, after the input is fed into the classification network, the following steps are executed:
Step 1, inputting the sample image into the initial feature extraction layer for initial feature extraction (for the specific implementation process, refer to the description of fig. 4) to obtain a target feature map.
Step 2, dividing the target feature map into a plurality of tokens.
Step 3, inputting the target feature map into the feature fusion layer, and performing feature extraction and feature fusion between tokens.
Wherein the feature fusion layer comprises a plurality of multi-layer perception modules.
Step 4, inputting the semantic features output by the feature fusion layer into the distillation head module and calculating a KL divergence loss together with the output of the full-precision teacher network, and inputting the semantic features into the supervision head module and calculating a cross-entropy loss together with the annotation information of the sample image.
Step 5, training the classification network based on the KL divergence loss and the cross-entropy loss.
Step 6, after the training of the classification network is completed, constructing an inference head module based on the distillation head module and the supervision head module for inference.
In the above method, the initial feature extraction layer of the classification network to be trained comprises a channel value adjustment parameter to be trained, and the channel value adjustment parameter can adjust the channel values of the first feature map extracted from the sample image by the initial feature extraction layer, so that the binarization processing performed on the first feature map after the channel value adjustment can be understood as dynamically determining the binarization threshold, thereby reducing the accuracy gap between binarized feature extraction and full-precision feature extraction. Furthermore, the feature fusion layer can perform feature fusion and deep feature extraction on a plurality of feature units, so that the output semantic feature map takes both local features and global features into account, and the classification network trained based on the semantic feature map and the annotation information has higher accuracy. Since both the initial feature extraction layer and the feature fusion layer are binarized, the network scale of the classification network is smaller; that is, both the network scale and the network accuracy can be taken into account.
Based on the same concept, the present disclosure further provides an image classification method, and referring to fig. 10, the flowchart of the image classification method provided by the present disclosure includes the following steps:
step 1001, an image to be classified is obtained.
Step 1002, inputting the image to be classified into a classification network obtained by training based on the training method of the classification network described in the above embodiment, and determining a classification result of the image to be classified.
The classification network obtained by training based on the above training method has a small network scale and high accuracy, so the image to be classified can be classified quickly and the accuracy of the classification result is high.
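A minimal usage sketch of steps 1001 and 1002 is given below; the trained classification network is assumed to be available as a module, and the preprocessing (resize to 224 × 224) is an illustrative assumption rather than a requirement of the patent.

```python
import torch
from PIL import Image
from torchvision import transforms

def classify(model: torch.nn.Module, image_path: str) -> int:
    """Return the predicted class index for one image to be classified."""
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),   # assumed input size
        transforms.ToTensor(),
    ])
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    model.eval()
    with torch.no_grad():
        logits = model(image)            # forward pass through the trained network
    return int(logits.argmax(dim=-1).item())
```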
It will be understood by those skilled in the art that, in the above method, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a training apparatus for a classification network corresponding to the training method for the classification network, and since the principle of the apparatus in the embodiment of the present disclosure for solving the problem is similar to the training method for the classification network described above in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 11, there is shown a schematic architecture diagram of a training apparatus for a classification network according to an embodiment of the present disclosure, where the apparatus includes: a first acquisition unit 1101, a feature extraction unit 1102, a feature fusion unit 1103, a training unit 1104, and a construction unit 1105; wherein:
a first obtaining unit 1101 for obtaining a sample image and annotation information of the sample image;
a feature extraction unit 1102, configured to input the sample image to a binarized initial feature extraction layer of a classification network to be trained, and determine a target feature map corresponding to the sample image; the initial feature extraction layer comprises channel value adjustment parameters to be trained;
a feature fusion unit 1103, configured to divide the target feature map into a plurality of feature units based on a preset size, and input the target feature map into a feature fusion layer, where the feature fusion layer includes a plurality of binarized multi-layer perception modules, and each multi-layer perception module is configured to perform feature fusion and deep-level feature extraction on the feature units;
a training unit 1104, configured to train the classification network based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image.
In one possible implementation manner, when the sample image is input to a binarized initial feature extraction layer of a classification network to be trained, and a target feature map corresponding to the sample image is determined, the feature extraction unit 1102 is configured to:
carrying out global average pooling on the sample image, and determining a first feature map corresponding to the sample image;
performing first adjustment on the channel value of the first feature map based on the channel value adjustment parameter, and determining an adjusted second feature map;
performing binary activation processing on the sample image and the second feature map to determine a third feature map;
and performing feature extraction on the third feature map, and determining a target feature map corresponding to the sample image.
In one possible implementation manner, the feature extraction unit 1102, when performing binary activation processing based on the sample image and the second feature map to determine a third feature map, is configured to:
and determining the third feature map based on a first threshold value of a preset activation function and a difference value of corresponding channel values of the sample image and the second feature map.
In one possible implementation manner, when performing feature extraction on the third feature map and determining a target feature map corresponding to the sample image, the feature extraction unit 1102 is configured to:
performing feature extraction on the third feature map, and determining a fourth feature map;
performing second adjustment on the feature value of the first feature map based on the channel value adjustment parameter, and determining an adjusted fifth feature map;
and performing feature fusion on the fourth feature map and the fifth feature map to determine the target feature map.
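To make the flow of the initial feature extraction just described easier to follow, the sketch below strings the steps together: global average pooling to obtain the first feature map, a channel-wise adjustment to obtain the second feature map, a binary activation driven by the difference between the sample image and the second feature map, feature extraction (represented by a plain convolution here) to obtain the fourth feature map, and fusion with a second channel-wise adjustment of the first feature map. The multiplicative form of the adjustments, the convolution sizes and the omission of the first threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InitialFeatureExtraction(nn.Module):
    """Sketch of the binarized initial feature extraction layer described above."""

    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        # Channel value adjustment parameters to be trained (first and second adjustment).
        self.adjust1 = nn.Parameter(torch.ones(in_ch, 1, 1))
        self.adjust2 = nn.Parameter(torch.ones(in_ch, 1, 1))
        self.extract = nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=4, padding=3)
        self.project = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        first = image.mean(dim=(2, 3), keepdim=True)   # global average pooling -> first feature map
        second = first * self.adjust1                  # first adjustment -> second feature map
        third = torch.sign(image - second)             # binary activation (first threshold omitted)
        fourth = self.extract(third)                   # feature extraction -> fourth feature map
        fifth = first * self.adjust2                   # second adjustment -> fifth feature map
        return fourth + self.project(fifth)            # feature fusion -> target feature map
```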
In a possible embodiment, the output of the Nth multi-layer perception module in the feature fusion layer is the input of the (N + 1)th multi-layer perception module, the input of the first multi-layer perception module is the target feature map, the output of the last multi-layer perception module is the semantic feature map, and N is a positive integer.
In a possible implementation manner, for any multi-layer perception module, the multi-layer perception module is configured to perform feature fusion and deep-layer feature extraction on feature units of an input feature map input into the multi-layer perception module by the following methods:
performing binary activation processing on the input feature map, and determining a sixth feature map;
performing feature exchange on the feature units of the sixth feature map according to at least one exchange distance to obtain an exchange feature map;
respectively extracting the features of the exchange feature map and the sixth feature map, and then performing feature fusion with the input feature map to obtain a fusion feature map;
and activating the fusion characteristic diagram to obtain an output characteristic diagram of the multilayer perception module.
In a possible implementation manner, for any exchange distance, the feature fusion unit 1103, when performing feature exchange on the feature units of the sixth feature map to obtain an exchange feature map, is configured to:
for any feature unit, determining a feature unit to be exchanged corresponding to the feature unit in the sixth feature map based on the exchange distance;
and determining the value of the characteristic unit on each channel after the characteristic exchange is carried out on the basis of the value of the characteristic unit to be exchanged on the corresponding channel.
In one possible embodiment, the feature fusion unit 1103, when respectively extracting the features of the exchange feature map and the sixth feature map and then performing feature fusion with the input feature map to obtain a fusion feature map, is configured to:
respectively extracting features of the exchange feature map and the sixth feature map based on a binarization multi-layer perceptron, and determining a plurality of deep feature maps;
and after normalization processing is carried out on the plurality of deep layer feature maps, feature fusion is carried out on the deep layer feature maps and the input feature map, so that the fusion feature map is obtained.
In a possible implementation manner, the training unit 1104, when training the classification network based on the semantic feature map output by the feature fusion layer and the annotation information of the sample image, is configured to:
acquiring a full-precision teacher network corresponding to the classification network to be trained;
and training the classification network based on the semantic feature map output by the feature fusion layer, the labeling information of the sample image and the full-precision teacher network.
In one possible implementation, the training unit 1104, when training the classification network based on the semantic feature map output by the feature fusion layer, the annotation information of the sample image, and the full-precision teacher network, is configured to:
determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image; determining a second loss value based on the semantic feature map and the annotation information of the sample image;
training the classification network based on the first loss value and the second loss value.
In a possible implementation, the training unit 1104, when determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image, is configured to:
determining a second prediction result of the classification network based on a distillation head module and the semantic feature map; wherein the number of parameter bits of the distillation head module is the same as the number of parameter bits of the full-precision teacher network;
determining the first loss value based on the first prediction and the second prediction.
In one possible implementation, the training unit 1104, when determining the second loss value based on the semantic feature map and the annotation information of the sample image, is configured to:
determining a third prediction result of the classification network based on a supervision header module and the semantic feature map;
determining the second loss value based on the third prediction result and annotation information of the sample image.
In a possible implementation manner, after the training of the classification network is completed, the apparatus further includes a constructing unit 1105 configured to:
and constructing an inference head module based on the trained distillation head module and the trained supervision head module, wherein the inference head module is used for determining an inference result based on the output of the trained feature fusion layer of the classification network during network inference.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same inventive concept, an image classification device corresponding to the image classification method is also provided in the embodiments of the present disclosure, and because the principle of solving the problem of the device in the embodiments of the present disclosure is similar to the image classification method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 12, there is shown a schematic architecture diagram of an image classification apparatus according to an embodiment of the present disclosure, the apparatus includes: a second acquisition unit 1201 and a classification unit 1202; wherein:
a second obtaining unit 1201, configured to obtain an image to be classified;
a classifying unit 1202, configured to input the image to be classified into a classification network trained based on the training method of a classification network described in the first aspect or any possible implementation manner of the first aspect, and determine a classification result of the image to be classified.
Based on the same technical concept, an embodiment of the disclosure also provides a computer device. Referring to fig. 13, a schematic structural diagram of a computer device 1300 provided in the embodiment of the present disclosure is shown, which includes a processor 1301, a memory 1302, and a bus 1303. The memory 1302 is used for storing execution instructions and includes an internal memory 13021 and an external storage 13022; the internal memory 13021 is used for temporarily storing operation data in the processor 1301 and data exchanged with the external storage 13022 such as a hard disk. The processor 1301 exchanges data with the external storage 13022 through the internal memory 13021, and when the computer device 1300 runs, the processor 1301 and the memory 1302 communicate through the bus 1303, so that the processor 1301 executes the following instructions:
acquiring a sample image and annotation information of the sample image;
inputting the sample image to a binaryzation initial feature extraction layer of a classification network to be trained, and determining a target feature map corresponding to the sample image; the initial feature extraction layer comprises channel value adjustment parameters to be trained;
dividing the target feature map into a plurality of feature units based on a preset size, and inputting the target feature map into a feature fusion layer, wherein the feature fusion layer comprises a plurality of multi-layer perception modules for binarization, and each multi-layer perception module is used for performing feature fusion and deep-layer feature extraction on the feature units;
and training the classification network based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image.
Alternatively, processor 1301 may execute the following instructions:
acquiring an image to be classified;
and inputting the image to be classified into a classification network obtained by training based on the training method of the classification network described in the embodiment, and determining the classification result of the image to be classified.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the training method of the classification network and the image classification method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product bears a program code, and instructions included in the program code may be used to execute the steps of the method for training a classification network and the method for classifying an image described in the above method embodiments, which may be referred to specifically for the above method embodiments and are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK) or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (18)

1. A method for training a classification network, comprising:
acquiring a sample image and annotation information of the sample image;
inputting the sample image to a binaryzation initial feature extraction layer of a classification network to be trained, and determining a target feature map corresponding to the sample image; the initial feature extraction layer comprises channel value adjustment parameters to be trained;
dividing the target feature map into a plurality of feature units based on a preset size, and inputting the target feature map into a feature fusion layer, wherein the feature fusion layer comprises a plurality of multi-layer perception modules for binarization, and each multi-layer perception module is used for performing feature fusion and deep-layer feature extraction on the feature units;
and training the classification network based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image.
2. The method according to claim 1, wherein the inputting the sample image to a binarized initial feature extraction layer of a classification network to be trained and determining a target feature map corresponding to the sample image comprises:
carrying out global average pooling on the sample image, and determining a first feature map corresponding to the sample image;
performing first adjustment on the channel value of the first feature map based on the channel value adjustment parameter, and determining an adjusted second feature map;
performing binary activation processing on the sample image and the second feature map to determine a third feature map;
and performing feature extraction on the third feature map, and determining a target feature map corresponding to the sample image.
3. The method according to claim 2, wherein the performing a binary activation process based on the sample image and the second feature map to determine a third feature map comprises:
and determining the third feature map based on a first threshold value of a preset activation function and a difference value of corresponding channel values of the sample image and the second feature map.
4. The method according to claim 2 or 3, wherein the performing feature extraction on the third feature map and determining a target feature map corresponding to the sample image comprises:
performing feature extraction on the third feature map, and determining a fourth feature map;
performing second adjustment on the feature value of the first feature map based on the channel value adjustment parameter, and determining an adjusted fifth feature map;
and performing feature fusion on the fourth feature map and the fifth feature map to determine the target feature map.
5. The method according to any one of claims 1 to 4, wherein the output of the Nth multi-layer perception module in the feature fusion layer is the input of the (N + 1) th multi-layer perception module, the input of the first multi-layer perception module is the target feature map, the output of the last multi-layer perception module is the semantic feature map, and N is a positive integer.
6. The method according to any one of claims 1 to 5, wherein for any one of the multi-layer perception modules, the multi-layer perception module is configured to perform feature fusion and deep-level feature extraction on feature units of an input feature map input into the multi-layer perception module by:
performing binary activation processing on the input feature map, and determining a sixth feature map;
performing feature exchange on the feature units of the sixth feature map according to at least one exchange distance to obtain an exchange feature map;
respectively extracting the features of the exchange feature map and the sixth feature map, and then performing feature fusion with the input feature map to obtain a fusion feature map;
and activating the fusion characteristic diagram to obtain an output characteristic diagram of the multilayer perception module.
7. The method according to claim 6, wherein the feature exchanging the feature units of the sixth feature map for any exchange distance to obtain an exchange feature map comprises:
for any feature unit, determining a feature unit to be exchanged corresponding to the feature unit in the sixth feature map based on the exchange distance;
and determining the value of the characteristic unit on each channel after the characteristic exchange is carried out on the basis of the value of the characteristic unit to be exchanged on the corresponding channel.
8. The method according to claim 6 or 7, wherein the respectively extracting the features of the exchange feature map and the sixth feature map and then performing feature fusion with the input feature map to obtain a fusion feature map comprises:
respectively extracting features of the exchange feature map and the sixth feature map based on a binarization multi-layer perceptron, and determining a plurality of deep feature maps;
and after normalization processing is carried out on the plurality of deep layer feature maps, feature fusion is carried out on the deep layer feature maps and the input feature map, so that the fusion feature map is obtained.
9. The method according to any one of claims 1 to 8, wherein the training of the classification network based on the semantic feature map output by the feature fusion layer and the annotation information of the sample image comprises:
acquiring a full-precision teacher network corresponding to the classification network to be trained;
and training the classification network based on the semantic feature map output by the feature fusion layer, the labeling information of the sample image and the full-precision teacher network.
10. The method of claim 9, wherein the training of the classification network based on the semantic feature map output by the feature fusion layer, the annotation information for the sample image, and the full-precision teacher network comprises:
determining a first loss value based on the semantic feature map and a first prediction result of the full-precision teacher network on the sample image; determining a second loss value based on the semantic feature map and the annotation information of the sample image;
training the classification network based on the first loss value and the second loss value.
11. The method of claim 10, wherein determining a first loss value based on the semantic feature map and a first prediction of the sample image by the full-precision teacher network comprises:
determining a second prediction result of the classification network based on a distillation head module and the semantic feature map; wherein the number of parameter bits of the distillation head module is the same as the number of parameter bits of the full-precision teacher network;
determining the first loss value based on the first prediction and the second prediction.
12. The method according to claim 10 or 11, wherein the determining a second loss value based on the semantic feature map and annotation information of the sample image comprises:
determining a third prediction result of the classification network based on a supervision header module and the semantic feature map;
determining the second loss value based on the third prediction result and annotation information of the sample image.
13. The method of claim 11 or 12, wherein after the training of the classification network is completed, the method further comprises:
and constructing an inference head module based on the trained distillation head module and the trained supervision head module, wherein the inference head module is used for determining an inference result based on the output of the trained feature fusion layer of the classification network during network inference.
14. An image classification method, comprising:
acquiring an image to be classified;
inputting the image to be classified into a classification network obtained by training based on the training method of the classification network according to any one of claims 1 to 13, and determining the classification result of the image to be classified.
15. An apparatus for training a classification network, comprising:
the first acquisition unit is used for acquiring the sample image and the labeling information of the sample image;
the characteristic extraction unit is used for inputting the sample image to a binaryzation initial characteristic extraction layer of a classification network to be trained and determining a target characteristic diagram corresponding to the sample image; the initial feature extraction layer comprises channel value adjustment parameters to be trained;
the characteristic fusion unit is used for dividing the target characteristic diagram into a plurality of characteristic units based on a preset size and inputting the target characteristic diagram into a characteristic fusion layer, the characteristic fusion layer comprises a plurality of multi-layer perception modules for binaryzation, and each multi-layer perception module is used for carrying out characteristic fusion and deep-layer characteristic extraction on the characteristic units;
and the training unit is used for training the classification network based on the semantic feature map output by the feature fusion layer and the labeling information of the sample image.
16. An image classification apparatus, comprising:
the second acquisition unit is used for acquiring the image to be classified;
a classification unit, configured to input the image to be classified into a classification network trained based on the training method of a classification network according to any one of claims 1 to 13, and determine a classification result of the image to be classified.
17. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is run, the machine-readable instructions when executed by the processor performing the steps of the training method of a classification network according to any one of claims 1 to 13 or performing the steps of the image classification method according to claim 14.
18. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, is adapted to carry out the steps of the method of training a classification network according to any one of claims 1 to 13 or the steps of the method of image classification according to claim 14.
CN202111565700.1A 2021-12-20 2021-12-20 Training method of classification network, image classification method and device Pending CN114239731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565700.1A CN114239731A (en) 2021-12-20 2021-12-20 Training method of classification network, image classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111565700.1A CN114239731A (en) 2021-12-20 2021-12-20 Training method of classification network, image classification method and device

Publications (1)

Publication Number Publication Date
CN114239731A true CN114239731A (en) 2022-03-25

Family

ID=80759529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111565700.1A Pending CN114239731A (en) 2021-12-20 2021-12-20 Training method of classification network, image classification method and device

Country Status (1)

Country Link
CN (1) CN114239731A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237744A (en) * 2023-11-10 2023-12-15 之江实验室 Training method and device of image classification model, medium and electronic equipment
CN117237744B (en) * 2023-11-10 2024-01-30 之江实验室 Training method and device of image classification model, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination