CN114998694A - Method, apparatus, device, medium and program product for training image processing model - Google Patents

Method, apparatus, device, medium and program product for training image processing model

Info

Publication number
CN114998694A
CN114998694A
Authority
CN
China
Prior art keywords
feature map
feature
image processing
processing model
block set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210647722.0A
Other languages
Chinese (zh)
Inventor
杨昆霖
邱增玉
宗道明
侯军
伊帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210647722.0A priority Critical patent/CN114998694A/en
Publication of CN114998694A publication Critical patent/CN114998694A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method, apparatus, device, medium, and program product for training an image processing model. The method comprises the following steps: acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model; generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map; determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map; and training the second image processing model according to the value of the loss function.

Description

Training method, device, equipment, medium and program product of image processing model
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for training an image processing model, an electronic device, a storage medium, and a program product.
Background
Knowledge Distillation (KD) refers to distilling the knowledge contained in a trained teacher model into a student model. General knowledge distillation methods use the final output logits of the teacher model as the knowledge for the student model to learn, commonly referred to as "soft labels" or "dark knowledge". The related art also proposes feature-based knowledge distillation methods, which take the feature maps output by intermediate layers of the teacher model as the knowledge for the student model to learn.
In the related art, the performance of the distilled student model does not keep improving as the performance of the teacher model improves. That is, there is a gap between the teacher and student models (i.e., the teacher model and the student model): the improvement of the student model's performance hits a bottleneck, and it is difficult to continue optimizing it once it has improved to a certain extent.
For mobile terminals with limited computing power, usually only a small-scale student model can be deployed. In the related art, the performance of the student model is poor, so the student model is difficult to put into practical use, and a low-precision student model can hardly meet user requirements. If the gap between the teacher model and the student model can be bridged, so that the performance of the student model improves along with the performance of the teacher model, the deployment of more student models can be accelerated, bringing a better experience to users.
Disclosure of Invention
The present disclosure provides a training technical solution of an image processing model.
According to an aspect of the present disclosure, there is provided a training method of an image processing model, including:
acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map;
and training the second image processing model according to the value of the loss function.
The method obtains a first feature map and a second feature map of a training image, wherein the first feature map is output by a first image processing model and the second feature map is output by a second image processing model; generates a third feature map according to partial features in the first feature map and partial features in the second feature map; determines the value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map; and trains the second image processing model according to the value of the loss function. In this way, part of the features of the feature map extracted by the first image processing model is used as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model. The performance of the second image processing model can therefore improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Consequently, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
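For illustration only, the following is a minimal PyTorch-style sketch of one such training step, assuming the feature maps are the direct outputs of the two models; all names (teacher, student, decoder, build_mixed_feature) are hypothetical placeholders rather than elements of the disclosure, and the mean-squared-error loss is an assumed choice.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, decoder, build_mixed_feature, optimizer, image):
    """One illustrative training step of the student (second image processing model)."""
    with torch.no_grad():
        f_teacher = teacher(image)        # first feature map (from the teacher)
    f_student = student(image)            # second feature map (from the student)

    # Third feature map: partial teacher features merged with partial student features.
    f_mixed = build_mixed_feature(f_teacher, f_student)

    # Compare a reconstruction of the mixed features with the teacher features.
    f_reconstructed = decoder(f_mixed)
    loss = F.mse_loss(f_reconstructed, f_teacher)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```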
In a possible implementation manner, the generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map includes:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
In this implementation, a first feature block set used for generating a third feature map is determined from the first feature map, a second feature block set used for generating the third feature map is determined from the second feature map, and the third feature map is generated according to the first feature block set and the second feature block set, so that the first feature map and the second feature map are divided by taking the feature block as a minimum unit, and the third feature map is generated based on a part of feature blocks in the first feature map and a part of feature blocks in the second feature map, so that the second image processing model is trained by using a priori feature blocks provided by a teacher model, which helps to improve the accuracy of the trained second image processing model.
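As a hedged illustration of treating a feature map with the feature block as the minimum unit, the sketch below splits a feature map into non-overlapping blocks; the block size of 4 and the tensor layout are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def to_feature_blocks(feature_map, block_size=4):
    # feature_map: (N, C, H, W) -> (N, num_blocks, C * block_size * block_size),
    # one row per non-overlapping feature block.
    blocks = F.unfold(feature_map, kernel_size=block_size, stride=block_size)
    return blocks.transpose(1, 2)

x = torch.randn(2, 64, 16, 16)
print(to_feature_blocks(x).shape)  # torch.Size([2, 16, 1024])
```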
In a possible implementation manner, the determining, from the first feature map, a first feature block set used for generating a third feature map, and determining, from the second feature map, a second feature block set used for generating the third feature map includes:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask area according to the mask proportion;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In this implementation, a mask ratio for merging the first feature map and the second feature map is determined according to the first feature map and the second feature map, a mask region is determined according to the mask ratio, and, according to the mask region, a first feature block set used for generating a third feature map is determined from the first feature map and a second feature block set used for generating the third feature map is determined from the second feature map, wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary. Thus, the mask ratio for merging the first feature map and the second feature map is determined based on the two feature maps themselves rather than being a fixed value; that is, the mask ratio for merging a feature map pair is dynamically adjusted according to the similarity information (e.g., similarity) of that feature map pair. For example, different feature map pairs output by different intermediate layer pairs have different similarity information and may therefore adopt different mask ratios; as another example, different feature map pairs output by the same intermediate layer pair in different training rounds will likely adopt different mask ratios. Adopting a mask ratio determined dynamically from the feature maps helps reduce the difference between the second image processing model and the first image processing model, and helps improve the precision of the trained second image processing model.
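A minimal sketch of this dynamic masking is given below, assuming the similarity score lies in [0, 1] and that higher similarity maps to a smaller mask ratio; both the mapping and the range bounds r_min/r_max are assumptions made for illustration.

```python
import torch

def mask_from_similarity(similarity, num_blocks, r_min=0.1, r_max=0.9):
    # Map a similarity in [0, 1] to a mask ratio (assumed: more similar -> mask less),
    # then randomly pick the masked block positions that form the mask region.
    similarity = max(0.0, min(1.0, float(similarity)))
    mask_ratio = r_max - (r_max - r_min) * similarity
    num_masked = int(round(mask_ratio * num_blocks))
    mask = torch.zeros(num_blocks, dtype=torch.bool)
    mask[torch.randperm(num_blocks)[:num_masked]] = True  # True = inside the mask region
    return mask_ratio, mask
```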
In a possible implementation manner, the determining, according to the first feature map and the second feature map, a mask ratio for merging the first feature map and the second feature map includes:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
In this implementation, similarity information between the first feature map and the second feature map is determined, and a mask ratio for combining the first feature map and the second feature map is determined according to the similarity information, thereby facilitating improvement of the accuracy of the trained second image processing model.
In one possible implementation manner,
the determining similarity information between the first feature map and the second feature map includes: determining a Centered Kernel Alignment (CKA) similarity index between the first feature map and the second feature map;
the determining a mask ratio for merging the first feature map and the second feature map according to the similarity information includes: determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In this implementation, by determining the CKA similarity index between the first feature map and the second feature map and determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index, the accuracy of the trained second image processing model is improved.
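The CKA similarity index can be computed in several ways; a common instance is linear CKA, shown below as an illustrative sketch in which the two feature maps are assumed to have been flattened into (samples, features) matrices.

```python
import torch

def linear_cka(x, y):
    # x, y: (num_samples, num_features) matrices obtained by flattening the feature maps.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.norm(y.t() @ x, p='fro') ** 2
    return cross / (torch.norm(x.t() @ x, p='fro') * torch.norm(y.t() @ y, p='fro'))
```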
In a possible implementation manner, the determining similarity information between the first feature map and the second feature map includes:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
In this implementation, the similarity information between the first feature map and the second feature map can be determined more accurately by aligning the first feature map and the second feature map and determining the similarity information between the aligned first feature map and the aligned second feature map.
In one possible implementation, the aligning the first feature map and the second feature map includes:
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the second feature map so that the size of the bilinear-interpolated second feature map is the same as that of the first feature map;
alternatively,
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the first feature map so that the size of the bilinear-interpolated first feature map is the same as that of the second feature map.
In this implementation, in response to the number of channels of the second feature map being different from that of the first feature map, convolution processing is performed on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, whereby the channel numbers of the first feature map and the second feature map can be aligned; and in response to the size of the second feature map being different from that of the first feature map, bilinear interpolation is performed on the second feature map so that the size of the bilinear-interpolated second feature map is the same as that of the first feature map, whereby the sizes of the first feature map and the second feature map can be aligned. Alternatively, convolution processing can be performed on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or bilinear interpolation can be performed on the first feature map so that the size of the bilinear-interpolated first feature map is the same as that of the second feature map, likewise aligning the channel numbers and the sizes of the two feature maps.
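An illustrative alignment module is sketched below: a 1x1 convolution matches the channel count of the second (student) feature map to the first (teacher) feature map, and bilinear interpolation matches the spatial size. The direction of alignment and the layer choices are assumptions; the symmetric case (aligning the first feature map to the second) is analogous.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convolution, used only when the channel counts differ.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        if student_feat.shape[1] != teacher_feat.shape[1]:
            student_feat = self.proj(student_feat)
        if student_feat.shape[-2:] != teacher_feat.shape[-2:]:
            student_feat = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                                         mode='bilinear', align_corners=False)
        return student_feat
```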
In a possible implementation manner, the determining, according to the mask region, a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map includes:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In this implementation, a first feature block set used for generating a third feature map is determined from the first feature map by using position information corresponding to the mask region or position information corresponding to positions other than the mask region, and a feature block complementary to the position of the feature block in the first feature block set is selected from the second feature map to obtain a second feature block set used for generating the third feature map, so that the feature block masked in the second feature map is filled with the feature block at the corresponding position in the first feature map, and thus part of prior knowledge can be provided to the second image processing model serving as a student model by using the first image processing model serving as a teacher model.
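A sketch of this complementary selection is given below, assuming both feature maps have already been split into blocks of identical shape (see the earlier block-splitting example) and that the boolean mask marks the positions filled from the teacher; these assumptions are made only for illustration.

```python
import torch

def merge_complementary_blocks(teacher_blocks, student_blocks, mask):
    # teacher_blocks, student_blocks: (N, num_blocks, dim); mask: (num_blocks,) bool.
    # Positions where mask is True come from the first (teacher) feature map, the
    # remaining positions from the second (student) feature map, so the two selected
    # feature block sets are complementary in position.
    return torch.where(mask.view(1, -1, 1), teacher_blocks, student_blocks)
```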
In one possible implementation manner, the generating the third feature map according to the first feature block set and the second feature block set includes:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
coding the second feature block set by combining the second position information to obtain a second feature block coding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In this implementation, the first feature block set and the second feature block set are respectively position-encoded to obtain first position information corresponding to the first feature block set and second position information corresponding to the second feature block set; the first feature block set is encoded in combination with the first position information to obtain a first feature block encoding result; the second feature block set is encoded in combination with the second position information to obtain a second feature block encoding result; and a third feature map is generated according to the first feature block encoding result and the second feature block encoding result. The third feature map is thus generated in combination with the position information of the feature blocks in the first feature block set and the second feature block set, and can include the position information of the feature blocks in the first feature map and the second feature map. Therefore, the second image processing model can be trained using prior features containing position information provided by the teacher model, which helps improve the accuracy of the trained second image processing model.
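The sketch below illustrates one possible form of this step: learnable position embeddings are added to the blocks, a small transformer encoder produces the two feature block encoding results, and the results are merged by position into the third feature map in block form. The shared encoder, the embedding type, and the hyperparameters are assumptions, not the claimed design, and dim must be divisible by num_heads.

```python
import torch
import torch.nn as nn

class MixedFeatureEncoder(nn.Module):
    def __init__(self, dim, num_blocks, num_layers=2, num_heads=4):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_blocks, dim))  # position information
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, teacher_blocks, student_blocks, mask):
        # teacher_blocks, student_blocks: (N, num_blocks, dim); mask: (num_blocks,) bool.
        pos = self.pos_embed.expand(teacher_blocks.size(0), -1, -1)
        enc_teacher = self.encoder(teacher_blocks + pos)  # first feature block encoding result
        enc_student = self.encoder(student_blocks + pos)  # second feature block encoding result
        # Third feature map (in block form): teacher results at masked positions,
        # student results elsewhere.
        return torch.where(mask.view(1, -1, 1), enc_teacher, enc_student)
```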
In a possible implementation manner, the determining, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model includes:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In this implementation, the third feature map is subjected to position coding to obtain third position information corresponding to the third feature map, the third feature map to which the third position information is added is input to a decoding network to obtain a fourth feature map, and a value of a loss function corresponding to the second image processing model is determined according to the fourth feature map and the first feature map, so that the second image processing model is trained to improve the accuracy of the second image processing model.
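For illustration, assuming the third feature map is kept in block form of shape (N, num_blocks, C * block_size * block_size), the sketch below adds position information, decodes it into a fourth feature map with the same layout as the first feature map, and applies a mean-squared-error loss; the decoder and the loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(third_blocks, pos_embed, decoder, teacher_map, block_size):
    # third_blocks: (N, num_blocks, C * block_size**2); teacher_map: (N, C, H, W).
    n, c, h, w = teacher_map.shape
    decoded = decoder(third_blocks + pos_embed)               # fourth feature map, block form
    fourth_map = F.fold(decoded.transpose(1, 2), output_size=(h, w),
                        kernel_size=block_size, stride=block_size)
    return F.mse_loss(fourth_map, teacher_map)                # value of the loss function
```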
In one possible implementation manner,
the acquiring of the first feature map and the second feature map of the training image includes: obtaining at least two first feature maps corresponding to the training images extracted from at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted from at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one of the at least two feature map pairs comprises one first feature map and one second feature map;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map, including: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map, including: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In this implementation, the second image processing model is trained by using at least two feature map pairs output by at least two intermediate layer pairs of the first image processing model and the second image processing model, so that the second image processing model can learn richer hidden layer features of the first image processing model, and the difference between the second image processing model and the first image processing model can be further reduced.
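When feature maps are drawn from several intermediate layer pairs, one loss term may be computed per pair and aggregated; the sketch below simply sums the terms, which is an assumed aggregation rule, and the two callables are hypothetical placeholders.

```python
def total_distillation_loss(feature_map_pairs, build_third_map, pair_loss):
    # feature_map_pairs: iterable of (first_feature_map, second_feature_map) tuples
    # taken from the T intermediate layer pairs.
    total = 0.0
    for first_map, second_map in feature_map_pairs:
        third_map = build_third_map(first_map, second_map)
        total = total + pair_loss(third_map, first_map)
    return total
```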
In one possible implementation, the first image processing model and the second image processing model are both used for image classification;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be classified;
processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified;
and processing the characteristic graph corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
In the implementation manner, the images to be classified are processed through the trained second image processing model to obtain the feature maps corresponding to the images to be classified, and the feature maps corresponding to the images to be classified are processed through the second image processing model to obtain the classification results corresponding to the images to be classified, so that the accuracy of image classification of the images to be classified can be improved.
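After training, inference with the second image processing model is conventional; the sketch below assumes the model maps a preprocessed image tensor of shape (C, H, W) to class logits.

```python
import torch

@torch.no_grad()
def classify(student_model, image_tensor):
    # image_tensor: (C, H, W) preprocessed image to be classified.
    student_model.eval()
    logits = student_model(image_tensor.unsqueeze(0))  # feature maps feed the classifier head internally
    return logits.argmax(dim=-1).item()                # classification result (class index)
```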
In one possible implementation, the first image processing model and the second image processing model are both used for target detection;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be detected;
processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected;
and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
In the implementation mode, the second image processing model obtained through training is used for processing the image to be detected to obtain the characteristic diagram corresponding to the image to be detected, and the second image processing model is used for processing the characteristic diagram corresponding to the image to be detected to obtain the target detection result corresponding to the image to be detected, so that the accuracy of target detection on the image to be detected can be improved.
According to an aspect of the present disclosure, there is provided a training apparatus for an image processing model, including:
the acquisition module is used for acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
a generating module, configured to generate a third feature map according to a partial feature in the first feature map and a partial feature in the second feature map;
a determining module, configured to determine, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model;
and the training module is used for training the second image processing model according to the value of the loss function.
In one possible implementation, the generating module is configured to:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
In one possible implementation, the generating module is configured to:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask area according to the mask proportion;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In one possible implementation, the generating module is configured to:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
In one possible implementation, the generating module is configured to:
determining a Centered Kernel Alignment (CKA) similarity index between the first feature map and the second feature map;
determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In one possible implementation, the generating module is configured to:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
In one possible implementation, the generating module is configured to:
performing convolution processing on the second feature map in response to the number of channels of the second feature map being different from that of the first feature map, so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or performing bilinear interpolation on the second feature map in response to the size of the second feature map being different from that of the first feature map, so that the size of the bilinear-interpolated second feature map is the same as that of the first feature map;
alternatively,
performing convolution processing on the first feature map in response to the number of channels of the second feature map being different from that of the first feature map, so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or performing bilinear interpolation on the first feature map in response to the size of the second feature map being different from that of the first feature map, so that the size of the bilinear-interpolated first feature map is the same as that of the second feature map.
In one possible implementation, the determining module is configured to:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In one possible implementation, the generating module is configured to:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
coding the second feature block set by combining the second position information to obtain a second feature block coding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In one possible implementation, the determining module is configured to:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In one possible implementation manner,
the acquisition module is configured to: obtaining at least two first feature maps corresponding to the training images extracted from at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted from at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one of the at least two feature map pairs comprises one first feature map and one second feature map;
the generation module is configured to: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
the determination module is to: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In one possible implementation, the first image processing model and the second image processing model are both used for image classification;
the device further comprises:
the classification module is used for acquiring an image to be classified; processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified; and processing the feature map corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
In one possible implementation, the first image processing model and the second image processing model are both used for target detection;
the device further comprises:
the target detection module is used for acquiring an image to be detected; processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected; and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
According to an aspect of the present disclosure, there is provided an electronic device including: one or more processors; and a memory for storing executable instructions; wherein the one or more processors are configured to invoke the executable instructions stored in the memory to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
According to an aspect of the present disclosure, there is provided a computer program product including computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which, when run in an electronic device, causes a processor in the electronic device to perform the above method.
In the embodiment of the disclosure, a first feature map and a second feature map of a training image are obtained, wherein the first feature map is output by a first image processing model and the second feature map is output by a second image processing model; a third feature map is generated according to partial features in the first feature map and partial features in the second feature map; a value of a loss function corresponding to the second image processing model is determined according to the third feature map and the first feature map; and the second image processing model is trained according to the value of the loss function. In this way, part of the features of the feature map extracted by the first image processing model is used as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model. The performance of the second image processing model can therefore improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Consequently, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a training method of an image processing model provided by an embodiment of the present disclosure.
Fig. 2 shows a block diagram of a training apparatus for an image processing model provided in an embodiment of the present disclosure.
Fig. 3 shows a block diagram of a device 1900 provided by an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
The disclosed embodiments provide a training method, an apparatus, an electronic device, a storage medium, and a program product for an image processing model. A first feature map and a second feature map of a training image are obtained, wherein the first feature map is output by a first image processing model and the second feature map is output by a second image processing model; a third feature map is generated according to partial features in the first feature map and partial features in the second feature map; a value of a loss function corresponding to the second image processing model is determined according to the third feature map and the first feature map; and the second image processing model is trained according to the value of the loss function. In this way, part of the feature map extracted by the first image processing model is used as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model. The performance of the second image processing model can therefore improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Consequently, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
The following describes in detail a training method of an image processing model according to an embodiment of the present disclosure with reference to the drawings.
Fig. 1 shows a flowchart of a training method of an image processing model provided by an embodiment of the present disclosure. In a possible implementation manner, the executing entity of the training method of the image processing model may be a training apparatus of the image processing model; for example, the training method of the image processing model may be executed by a terminal device, a server, or another electronic device. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. In some possible implementations, the training method of the image processing model may be implemented by a processor calling computer readable instructions stored in a memory. As shown in Fig. 1, the training method of the image processing model includes steps S11 to S14.
In step S11, a first feature map and a second feature map of a training image are obtained, where the first feature map is output by a first image processing model and the second feature map is output by a second image processing model.
In step S12, a third feature map is generated based on the partial features in the first feature map and the partial features in the second feature map.
In step S13, a value of a loss function corresponding to the second image processing model is determined based on the third feature map and the first feature map.
In step S14, the second image processing model is trained based on the values of the loss function.
In the embodiments of the present disclosure, the first image processing model and the second image processing model are both models for image processing, and the first image processing model and the second image processing model may be used for the same image processing task. For example, both the first image processing model and the second image processing model are used for image classification; for another example, both the first image processing model and the second image processing model are used for target detection; as another example, both the first image processing model and the second image processing model are used for image segmentation; for another example, both the first image processing model and the second image processing model are used for feature extraction; and so on.
In the disclosed embodiment, the first image processing model is a teacher model and the second image processing model is a student model, the second image processing model being more lightweight than the first image processing model. For example, compared with the first image processing model, the network structure of the second image processing model is simpler and/or its number of parameters is smaller; that is, compared with the second image processing model, the network structure of the first image processing model is more complex and/or its number of parameters is larger. Owing to the more complex network structure and/or larger number of parameters, the first image processing model can usually be trained to achieve better performance, and as the scale of the first image processing model increases, its performance improves. Because the embodiment of the disclosure uses part of the features of the feature map extracted by the first image processing model as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model, the second image processing model can improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Since the performance of the second image processing model can improve along with the performance of the first image processing model, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
The embodiments of the present disclosure do not limit the network structures adopted by the first image processing model and the second image processing model. The first image processing model and the second image processing model may employ the same type of network structure, or may employ different types of network structures.
In one possible implementation, both the first image processing model and the second image processing model are used for image classification.
As an example of this implementation, the first image processing model and the second image processing model may employ the same type of network structure. For example, a first image processing model may employ ResNet152, and a second image processing model may employ ResNet 18; as another example, the first image processing model may employ WRN40-2, and the second image processing model may employ WRN 16-2; as another example, the first image processing model may employ WRN40-2, and the second image processing model may employ WRN 40-1; as another example, the first image processing model may employ ResNet56, and the second image processing model may employ ResNet 20; as another example, a first image processing model may employ ResNet110, and a second image processing model may employ ResNet 20; as another example, a first image processing model may employ ResNet110, and a second image processing model may employ ResNet 32; as another example, the first image processing model may employ VGG13, and the second image processing model may employ VGG 8; as another example, the first image processing model may employ ResNet34, and the second image processing model may employ ResNet 18; and so on.
As another example of this implementation, the first image processing model and the second image processing model may employ different types of network structures. For example, the first image processing model may employ ResNet50 and the second image processing model may employ MobileNet V2.
In another possible implementation, both the first image processing model and the second image processing model are used for object detection. As an example of this implementation, the first image processing model and the second image processing model may employ RetinaNet, respectively. As another example of this implementation, the first image processing model and the second image processing model may employ fast-RCNN, respectively. In this implementation, the first image processing model and the second image processing model may employ the same type of network structure, or may employ different types of network structures. For example, the first image processing model may employ ResNet101-FPN, and the second image processing model may employ ResNet 50-FPN; as another example, the first image processing model may employ ResNet152-FPN, and the second image processing model may employ ResNet 50-FPN; the first image processing model may employ ResNet101-FPN, and the second image processing model may employ ResNet 18-FPN; and so on.
In the embodiment of the present disclosure, the first feature map may be a feature map output by a first network layer of the first image processing model, and the second feature map may be a feature map output by a second network layer of the second image processing model, where the first network layer and the second network layer are corresponding network layers in the first image processing model and the second image processing model. The number of the first network layers can be one or more than two, and correspondingly, the number of the first characteristic diagrams can also be one or more than two; the number of the second network layers may be one or more than two, and accordingly, the number of the second feature maps may also be one or more than two.
In one possible implementation, the first network layer may include an intermediate layer of the first image processing model, and the second network layer may include an intermediate layer of the second image processing model. For example, the first network layer includes a first intermediate layer of a first image processing model and the second network layer includes a second intermediate layer of a second image processing model. The first intermediate layer and the second intermediate layer are corresponding intermediate layers in the first image processing model and the second image processing model, and the first intermediate layer and the second intermediate layer can form an intermediate layer pair. Accordingly, a first profile output by the first intermediate level and a second profile output by the second intermediate level may form a profile pair. As an example of this implementation, the first intermediate layer may be the last layer before the down-sampling layer in the first image processing model, and the second intermediate layer may be the last layer before the down-sampling layer in the second image processing model. Of course, a person skilled in the art may flexibly select the first intermediate layer and the second intermediate layer according to the requirements of the actual application scenario, which is not limited herein.
In this implementation, T intermediate layer pairs may be selected from the first image processing model and the second image processing model, and the second image processing model may be trained based on T feature map pairs output by the T intermediate layer pairs, where T is an integer greater than or equal to 1. As an example of this implementation, T first feature maps output by the last layer before each of the last T down-sampling layers of the first image processing model may be obtained, T second feature maps output by the last layer before each of the last T down-sampling layers of the second image processing model may be obtained, and the first feature maps and the second feature maps form T feature map pairs.
As an example of this implementation, 2 intermediate layer pairs may be selected from the first image processing model and the second image processing model, the 2 intermediate layer pairs including: a first intermediate layer pair consisting of the last layer before the last down-sampling layer of the first image processing model and the last layer before the last down-sampling layer of the second image processing model, and a second intermediate layer pair consisting of the last layer before the penultimate down-sampling layer of the first image processing model and the last layer before the penultimate down-sampling layer of the second image processing model. The first feature map and the second feature map output by the two layers of the first intermediate layer pair form the feature map pair corresponding to the first intermediate layer pair, and the first feature map and the second feature map output by the two layers of the second intermediate layer pair form the feature map pair corresponding to the second intermediate layer pair.
As another example of this implementation, 3 intermediate layer pairs may be selected from the first image processing model and the second image processing model, the 3 intermediate layer pairs including: a first intermediate layer pair consisting of the last layer before the last down-sampling layer of the first image processing model and the last layer before the last down-sampling layer of the second image processing model, a second intermediate layer pair consisting of the last layer before the penultimate down-sampling layer of the first image processing model and the last layer before the penultimate down-sampling layer of the second image processing model, and a third intermediate layer pair consisting of the last layer before the third-to-last down-sampling layer of the first image processing model and the last layer before the third-to-last down-sampling layer of the second image processing model. The first feature map and the second feature map output by the two layers of each intermediate layer pair form the feature map pair corresponding to that intermediate layer pair.
As another example of this implementation, 1 intermediate layer pair may be selected from the first image processing model and the second image processing model, the intermediate layer pair being: a first intermediate layer pair consisting of the last layer before the last down-sampling layer of the first image processing model and the last layer before the last down-sampling layer of the second image processing model. The first feature map output by the last layer before the last down-sampling layer of the first image processing model may be recorded as F_1^t, the second feature map output by the last layer before the last down-sampling layer of the second image processing model may be recorded as F_1^s, and the first feature map pair, corresponding to the first intermediate layer pair, may be recorded as (F_1^t, F_1^s).
In another possible implementation, the first network layer may include an output layer of a first image processing model, and the second network layer may include an output layer of a second image processing model. In this implementation, the first image processing model and the second image processing model may both be used for feature extraction, i.e., the outputs of the first image processing model and the second image processing model may both be feature maps.
In another possible implementation, the first network layer may include an intermediate layer and an output layer of the first image processing model, and the second network layer may include an intermediate layer and an output layer of the second image processing model. In this implementation, the first image processing model and the second image processing model may both be used for feature extraction, i.e., the outputs of the first image processing model and the second image processing model may both be feature maps.
In a possible implementation manner, the generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map includes: determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map; and generating the third feature map according to the first feature block set and the second feature block set.
In this implementation, a feature block may represent an image block obtained by dividing a feature map. The first feature map is divided to obtain a feature block set corresponding to the first feature map, wherein any two feature blocks in the feature block set corresponding to the first feature map do not overlap with each other; the second feature map is divided to obtain a feature block set corresponding to the second feature map, wherein any two feature blocks in the feature block set corresponding to the second feature map do not overlap with each other. The number of channels of a feature block may be the same as the number of channels of the feature map. For example, if the number of channels of the first feature map is C_1, the number of channels of any feature block in the feature block set corresponding to the first feature map may also be C_1; if the number of channels of the second feature map is C_2, the number of channels of any feature block in the feature block set corresponding to the second feature map may also be C_2.
In one example, the first feature map F_i^t and the second feature map F_i^s may each be divided into feature blocks of size P_i × P_i. For example, if the first feature map F_i^t and the second feature map F_i^s both have a size of H_i × W_i, the first feature map F_i^t and the second feature map F_i^s may each be divided into M_i feature blocks, where M_i = (H_i × W_i)/(P_i × P_i). For example, if the first feature map F_i^t and the second feature map F_i^s both have C_i channels, each feature block obtained by the division may have a size of P_i × P_i and C_i channels. In one example, a convolution kernel of size P_i × P_i with a stride of P_i may be employed to divide the first feature map F_i^t and the second feature map F_i^s separately.
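A minimal sketch of this block partition, assuming a PyTorch tensor layout (B, C, H, W) with H and W divisible by the block size; using F.unfold with kernel size and stride P_i is equivalent to the non-overlapping sliding window described above:

```python
import torch
import torch.nn.functional as F

def to_feature_blocks(feature_map: torch.Tensor, patch: int) -> torch.Tensor:
    """Split (B, C, H, W) into (B, M, C, patch, patch) non-overlapping feature blocks,
    where M = (H * W) / (patch * patch)."""
    b, c, h, w = feature_map.shape
    assert h % patch == 0 and w % patch == 0
    # unfold extracts one column of C*patch*patch values per block position
    cols = F.unfold(feature_map, kernel_size=patch, stride=patch)   # (B, C*patch*patch, M)
    m = cols.shape[-1]
    return cols.transpose(1, 2).reshape(b, m, c, patch, patch)

blocks = to_feature_blocks(torch.randn(2, 256, 28, 28), patch=7)    # M = 16 blocks
print(blocks.shape)    # torch.Size([2, 16, 256, 7, 7])
```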
In this implementation, part of the feature blocks may be selected from the feature block set corresponding to the first feature map to form a first feature block set for generating a third feature map, and part of the feature blocks may be selected from the feature block set corresponding to the second feature map to form a second feature block set for generating the third feature map, wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary. After the first feature block set and the second feature block set are obtained, the first feature block set and the second feature block set may be processed to generate the third feature map. For example, a preset network may be employed to process the first feature block set and the second feature block set to generate the third feature map. For another example, the first feature block set and the second feature block set may be stitched to generate the third feature map.
In this implementation, a first feature block set used for generating a third feature map is determined from the first feature map, a second feature block set used for generating the third feature map is determined from the second feature map, and the third feature map is generated according to the first feature block set and the second feature block set, so that the first feature map and the second feature map are divided by taking the feature blocks as minimum units, and the third feature map is generated based on part of the feature blocks in the first feature map and part of the feature blocks in the second feature map, so that the second image processing model is trained by using a priori feature blocks provided by a teacher model, which helps to improve the precision of the trained second image processing model.
As an example of this implementation, the determining, from the first feature map, a first feature block set for generating a third feature map, and determining, from the second feature map, a second feature block set for generating the third feature map includes: determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map; determining a mask area according to the mask proportion; according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In one example, if the first feature map F_i^t and the second feature map F_i^s each comprise M_i feature blocks and the mask ratio is α_i, the mask region may include M_i × α_i feature blocks.
In this example, the mask region may be randomly determined according to the mask ratio. Alternatively, the mask region may be determined according to the mask ratio and a preset mask rule.
In this example, a mask ratio for merging the first feature map and the second feature map is determined according to the first feature map and the second feature map, a mask region is determined according to the mask ratio, a first feature block set for generating a third feature map is determined from the first feature map, and a second feature block set for generating the third feature map is determined from the second feature map, wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary. In other words, the mask ratio for merging the first feature map and the second feature map is determined based on the feature maps themselves rather than being a fixed value; that is, the mask ratio used for merging a feature map pair is dynamically adjusted according to the similarity information (e.g., similarity) of that feature map pair. For example, different feature map pairs output by different intermediate layer pairs have different similarity information and may therefore adopt different mask ratios; as another example, the feature map pairs output by the same intermediate layer pair in different training rounds will likely adopt different mask ratios. Adopting a mask ratio dynamically determined from the feature maps helps to reduce the gap between the second image processing model and the first image processing model, and helps to improve the accuracy of the trained second image processing model.
In one example, the determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map includes: determining similarity information between the first feature map and the second feature map; and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information. In this example, the similarity information between the first feature map and the second feature map may be any information that can represent the similarity between the first feature map and the second feature map. In this example, by determining similarity information between the first feature map and the second feature map and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information, the accuracy of the trained second image processing model can be improved.
In other examples, the mask ratio for merging the first feature map and the second feature map may also be determined according to a correlation between the first feature map and the second feature map, or may be determined according to other information of the first feature map and the second feature map, which is not limited herein.
In one example, the determining similarity information between the first feature map and the second feature map includes: determining a CKA (Centered Kernel Alignment) similarity index between the first feature map and the second feature map; and the determining a mask ratio for merging the first feature map and the second feature map according to the similarity information includes: determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In one example, Equation 1 may be used to determine the CKA similarity index between the first feature map F_i^t and the second feature map F_i^s:

CKA(X, Y) = HSIC(X, Y) / sqrt(HSIC(X, X) × HSIC(Y, Y))    (Equation 1)

where X and Y denote the Gram matrices obtained from the spatially flattened first feature map F_i^t and second feature map F_i^s respectively, and HSIC is computed with the unbiased estimator

HSIC(X, Y) = [ tr(X'Y') + (1^T X'1 × 1^T Y'1) / ((n-1)(n-2)) - (2/(n-2)) × 1^T X'Y'1 ] / (n(n-3)),

where X' represents the matrix obtained by setting the diagonal of X to 0, Y' represents the matrix obtained by setting the diagonal of Y to 0, 1 represents the all-ones column vector, n represents the product of the length H_i and the width W_i of the feature map, and tr denotes the trace of a matrix.

In one example, the mask ratio α_i for merging the first feature map F_i^t and the second feature map F_i^s may be determined using Equation 2:

α_i = 1 - CKA    (Equation 2).
In this example, by determining a CKA similarity index between the first feature map and the second feature map, and determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index, it is possible to help improve the accuracy of the trained second image processing model.
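A sketch of this computation, under the assumption that the CKA index is evaluated on linear-kernel Gram matrices of the spatially flattened, aligned feature maps (as in the reconstruction of Equation 1 above); all tensor shapes are illustrative:

```python
import torch

def _hsic_unbiased(k: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    """Unbiased HSIC estimator for n x n Gram matrices k and l."""
    n = k.shape[0]
    k = k.clone().fill_diagonal_(0)          # X' : zero the diagonal
    l = l.clone().fill_diagonal_(0)          # Y' : zero the diagonal
    ones = torch.ones(n, 1, dtype=k.dtype)
    term1 = torch.trace(k @ l)
    term2 = (ones.T @ k @ ones) * (ones.T @ l @ ones) / ((n - 1) * (n - 2))
    term3 = 2.0 / (n - 2) * (ones.T @ k @ l @ ones)
    return ((term1 + term2 - term3) / (n * (n - 3))).squeeze()

def cka_mask_ratio(f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """Return alpha_i = 1 - CKA(F_i^t, F_i^s) for aligned (C, H, W) feature maps."""
    x = f_t.flatten(1).T                      # (n, C) with n = H * W
    y = f_s.flatten(1).T
    kx, ky = x @ x.T, y @ y.T                 # linear-kernel Gram matrices
    cka = _hsic_unbiased(kx, ky) / torch.sqrt(
        _hsic_unbiased(kx, kx) * _hsic_unbiased(ky, ky))
    return 1.0 - cka

f_t, f_s = torch.randn(64, 14, 14), torch.randn(64, 14, 14)   # aligned C, H, W
alpha = cka_mask_ratio(f_t, f_s)              # dynamic mask ratio for this pair
```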
In other examples, the similarity information between the first feature map and the second feature map may also be measured by cosine similarity, and the like, which is not limited herein.
In one example, the determining similarity information between the first feature map and the second feature map includes: aligning the first feature map with the second feature map; and determining similarity information between the aligned first feature map and the aligned second feature map. In this example, aligning the first feature map with the second feature map may indicate that the number of channels and/or the size of the first feature map and the second feature map are the same. By aligning the first feature map and the second feature map and determining the similarity information between the aligned first feature map and the aligned second feature map, the similarity information between the first feature map and the second feature map can be determined more accurately.
In one example, the aligning the first feature map with the second feature map comprises: in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the second feature map so that the size of the bilinearly interpolated second feature map is the same as that of the first feature map; or, in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the first feature map so that the size of the bilinearly interpolated first feature map is the same as that of the second feature map.
In one example, in response to the number of channels of the second feature map being different from that of the first feature map, the second feature map may be convolved so that the convolved second feature map has the same number of channels as the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, the second feature map may be bilinearly interpolated so that the bilinearly interpolated second feature map has the same size as the first feature map. For example, if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different numbers of channels, the second feature map F_i^s may be convolved so that the convolved second feature map has the same number of channels as the first feature map F_i^t, where 1 ≤ i ≤ T; if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different sizes, the second feature map F_i^s may be bilinearly interpolated so that the bilinearly interpolated second feature map has the same size as the first feature map F_i^t.
In another example, in response to the number of channels of the second feature map being different from that of the first feature map, the first feature map may be convolved so that the convolved first feature map has the same number of channels as the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, the first feature map may be bilinearly interpolated so that the bilinearly interpolated first feature map has the same size as the second feature map. For example, if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different numbers of channels, the first feature map F_i^t may be convolved so that the convolved first feature map has the same number of channels as the second feature map F_i^s, where 1 ≤ i ≤ T; if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different sizes, the first feature map F_i^t may be bilinearly interpolated so that the bilinearly interpolated first feature map has the same size as the second feature map F_i^s.
In this example, in response to the difference between the number of channels of the second feature map and the number of channels of the first feature map, the second feature map is convolved, and the number of channels of the convolved second feature map is made equal to the number of channels of the first feature map, thereby aligning the number of channels of the first feature map and the number of channels of the second feature map; performing bilinear interpolation on the second feature map in response to the fact that the second feature map and the first feature map are different in size, so that the size of the bilinear interpolated second feature map is the same as that of the first feature map, and therefore the sizes of the first feature map and the second feature map can be aligned; performing convolution processing on the first feature map in response to the fact that the second feature map and the first feature map have different channel numbers, and enabling the channel numbers of the convolution processed first feature map and the convolution processed second feature map to be the same, so that the channel numbers of the first feature map and the second feature map can be aligned; the sizes of the first feature map and the second feature map can be aligned by performing bilinear interpolation on the first feature map in response to the difference in size between the second feature map and the first feature map, and making the size of the bilinear interpolated first feature map the same as that of the second feature map.
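A minimal sketch of the first alignment variant (student aligned to teacher), assuming PyTorch; the 1×1 convolution used for channel matching is one possible choice and is trained together with the student model in this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignStudentToTeacher(nn.Module):
    """Align the student feature map F_i^s to the teacher feature map F_i^t
    in channel count (1x1 convolution) and in spatial size (bilinear interpolation)."""
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.proj = (nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
                     if student_channels != teacher_channels else nn.Identity())

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        f_s = self.proj(f_s)                                   # match channel counts
        if f_s.shape[-2:] != f_t.shape[-2:]:                   # match H x W
            f_s = F.interpolate(f_s, size=f_t.shape[-2:],
                                mode='bilinear', align_corners=False)
        return f_s

align = AlignStudentToTeacher(student_channels=256, teacher_channels=1024)
f_s_aligned = align(torch.randn(2, 256, 14, 14), torch.randn(2, 1024, 14, 14))
```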
In one example, the determining, according to the mask region, a first feature block set for generating a third feature map from the first feature map and a second feature block set for generating the third feature map from the second feature map includes: determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region; and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In one example, a first feature block set used for generating a third feature map may be determined from the first feature map corresponding to the position information of the mask region; and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map. That is, a first feature block set used for generating a third feature map may be determined according to the feature blocks belonging to the mask region in the first feature map, in accordance with the position information of the mask region; and corresponding to the position information outside the mask region, determining a second feature block set for generating the third feature map according to the feature blocks which do not belong to the mask region in the second feature map.
In another example, a first feature block set used for generating a third feature map may be determined from the first feature map corresponding to location information outside the mask region; and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map. That is, a first feature block set for generating a third feature map may be determined from the first feature map in accordance with position information other than the mask region; and determining a second feature block set used for generating the third feature map from the second feature map corresponding to the position information of the mask region.
In this example, a first feature block set for generating a third feature map is determined from the first feature map by position information corresponding to the mask region or position information corresponding to positions other than the mask region, and a feature block complementary to the position of the feature block in the first feature block set is selected from the second feature map to obtain a second feature block set for generating the third feature map, so that the feature block masked in the second feature map is filled with the feature block at the corresponding position in the first feature map, thereby providing part of a priori knowledge to the second image processing model as a student model through the first image processing model as a teacher model.
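A sketch of this complementary selection, assuming the block partition helper shown earlier and a boolean mask over block positions: teacher blocks are kept at the masked positions and student blocks everywhere else (or the reverse, for the second variant above).

```python
import torch

def split_complementary(teacher_blocks, student_blocks, num_masked: int):
    """teacher_blocks / student_blocks: (M, C, P, P) block sets of one sample.
    Returns (first_set, second_set, mask) with complementary block positions."""
    m = teacher_blocks.shape[0]
    mask = torch.zeros(m, dtype=torch.bool)
    mask[torch.randperm(m)[:num_masked]] = True       # randomly chosen mask region
    first_set = teacher_blocks[mask]                   # blocks of F_i^t inside the mask region
    second_set = student_blocks[~mask]                 # blocks of F_i^s outside the mask region
    return first_set, second_set, mask

teacher_blocks = torch.randn(16, 256, 7, 7)
student_blocks = torch.randn(16, 256, 7, 7)
alpha = 0.4                                            # mask ratio for this pair
first_set, second_set, mask = split_complementary(
    teacher_blocks, student_blocks, num_masked=int(round(16 * alpha)))
```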
As another example of this implementation, the determining, from the first feature map, a first feature block set used for generating a third feature map, and determining, from the second feature map, a second feature block set used for generating the third feature map includes: determining a mask area according to a preset mask proportion; according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map. In this example, the mask ratio may be a fixed value.
As another example of this implementation, the determining, from the first feature map, a first feature block set for generating a third feature map, and determining, from the second feature map, a second feature block set for generating the third feature map includes: randomly determining a first feature block set for generating a third feature map from the first feature map; determining a second feature block set used for generating the third feature map from the second feature map according to the position of the feature block in the first feature block set; wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary.
As another example of this implementation, the determining, from the first feature map, a first feature block set for generating a third feature map, and determining, from the second feature map, a second feature block set for generating the third feature map includes: dividing a feature block set corresponding to the first feature map into a plurality of first feature block subsets, wherein any one of the first feature block subsets comprises at least two adjacent feature blocks; respectively selecting a preset number of feature blocks from the plurality of first feature block subsets to obtain a first feature block set; and selecting a feature block complementary to the position of the feature block in the first feature block set from the feature block set corresponding to the second feature map to obtain a second feature block set. For example, each first feature block subset may include 4 adjacent feature blocks, and the preset number may be 1.
As an example of this implementation, the generating the third feature map according to the first feature block set and the second feature block set includes: respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set; coding the first feature block set by combining the first position information to obtain a first feature block coding result; combining the second position information to encode the second feature block set to obtain a second feature block encoding result; and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In this example, a position coding manner such as sine and cosine position coding may be adopted to perform position coding on the first feature block set to obtain first position information, and to perform position coding on the second feature block set to obtain second position information. The first position information may represent a position encoding result corresponding to the first feature block set, and the second position information may represent a position encoding result corresponding to the second feature block set.
In this example, a first coding network corresponding to the first image processing model may be used to encode the first feature block set to which the first position information has been added, so as to obtain a first feature block encoding result; a second coding network corresponding to the second image processing model may be used to encode the second feature block set to which the second position information has been added, so as to obtain a second feature block encoding result. For example, the first feature block encoding result may be denoted as f_i^t, and the second feature block encoding result may be denoted as f_i^s. In one example, the first coding network and the second coding network may each be a 6-layer multi-head self-attention network. Of course, the first coding network and the second coding network may also adopt other network structures, for example, the number of layers of the self-attention network may be smaller or larger, which is not limited herein. The parameters of the first coding network and the second coding network may be updated along with the training of the second image processing model, i.e., the first coding network and the second coding network may be trained together with the second image processing model.
In this example, after the first feature block encoding result and the second feature block encoding result are obtained, the first feature block encoding result and the second feature block encoding result may be combined according to the relative position information between the feature blocks in the second feature map to obtain a third feature map.
In this example, the first feature block set and the second feature block set are respectively position-encoded to obtain first position information corresponding to the first feature block set and second position information corresponding to the second feature block set, the first feature block set is encoded in combination with the first position information to obtain a first feature block encoding result, the second feature block set is encoded in combination with the second position information to obtain a second feature block encoding result, and a third feature map is generated according to the first feature block encoding result and the second feature block encoding result, so that the third feature map is generated in combination with the position information of the feature blocks in the first feature block set and the second feature block set, and the third feature map can include the position information of the feature blocks in the first feature map and the second feature map, therefore, the second image processing model can be trained by using the prior characteristics containing the position information provided by the teacher model, and the accuracy of the trained second image processing model is improved.
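One possible sketch of this step, under the following assumptions that are not stated in the disclosure: sine-cosine position codes indexed by block position, two 6-layer multi-head self-attention encoders built with nn.TransformerEncoder, block embeddings already flattened to vectors, and the merge done by scattering the encoded blocks back to their original positions:

```python
import math
import torch
import torch.nn as nn

def sincos_position_encoding(num_positions: int, dim: int) -> torch.Tensor:
    """Standard 1-D sine/cosine position codes, one row per block position."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def make_encoder(dim: int, layers: int = 6, heads: int = 8) -> nn.Module:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

dim, m = 256, 16                                    # block embedding dim, block count
first_encoder = make_encoder(dim)                   # coding network for the teacher blocks
second_encoder = make_encoder(dim)                  # coding network for the student blocks

pe = sincos_position_encoding(m, dim)               # shared position table
mask = torch.zeros(m, dtype=torch.bool); mask[:6] = True

first_tokens = torch.randn(1, int(mask.sum()), dim) + pe[mask]       # teacher block embeddings
second_tokens = torch.randn(1, int((~mask).sum()), dim) + pe[~mask]  # student block embeddings

f_t_enc = first_encoder(first_tokens)               # first feature block encoding result
f_s_enc = second_encoder(second_tokens)             # second feature block encoding result

# Merge back into a full-length token sequence: the third feature map in token form.
third = torch.zeros(1, m, dim)
third[:, mask] = f_t_enc
third[:, ~mask] = f_s_enc
```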
In another possible implementation manner, the generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map includes: determining a first pixel set used for generating a third feature map from the first feature map, and determining a second pixel set used for generating the third feature map from the second feature map, wherein the first pixel set represents a set of pixels used for generating the third feature map in the first feature map, and the second pixel set represents a set of pixels used for generating the third feature map in the second feature map; and generating the third feature map according to the first pixel set and the second pixel set. In this implementation, the first feature map and the second feature map may be divided in a pixel as a minimum unit.
In a possible implementation manner, the determining, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model includes: carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map; inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map; and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In this implementation, a position coding method such as sine and cosine position coding may be adopted to perform position coding on the third feature map, so as to obtain third position information corresponding to the third feature map. In this implementation, the decoding network may consist of 6 layers of multi-headed self-attention layers and one layer of multi-layered perceptrons. Of course, the decoding network may also adopt other network structures, and is not limited herein. The parameters of the decoding network may be updated with the training of the second image processing model, i.e. the decoding network may be trained together with the second image processing model. In this implementation, the value of the loss function corresponding to the second image processing model may be determined using the L2 loss or the L1 loss, or the like. In the case that there are at least two feature map pairs, the value of the loss function corresponding to the second image processing model may be determined from the at least two feature map pairs.
In this implementation, the third feature map is subjected to position coding to obtain third position information corresponding to the third feature map, the third feature map to which the third position information is added is input to a decoding network to obtain a fourth feature map, and a value of a loss function corresponding to the second image processing model is determined according to the fourth feature map and the first feature map, so that the second image processing model is trained to improve the accuracy of the second image processing model.
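A possible sketch of the decoding step, assuming the decoder is built from 6 self-attention layers followed by a per-token multi-layer perceptron and the reconstruction loss is the L2 distance to the teacher feature map; the exact architecture in the disclosure may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoder(nn.Module):
    """6 multi-head self-attention layers followed by a multi-layer perceptron head."""
    def __init__(self, dim: int, out_dim: int, heads: int = 8, layers: int = 6):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.attn = nn.TransformerEncoder(block, num_layers=layers)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, out_dim))

    def forward(self, third_tokens: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # third position information is added to the third feature map tokens
        return self.mlp(self.attn(third_tokens + pos))

decoder = FeatureDecoder(dim=256, out_dim=1024)      # out_dim = teacher block dimension
third = torch.randn(1, 16, 256)                      # third feature map (token form)
pos = torch.randn(1, 16, 256)                        # third position information
teacher_tokens = torch.randn(1, 16, 1024)            # first (teacher) feature map, flattened

fourth = decoder(third, pos)                         # fourth feature map
loss_fd = F.mse_loss(fourth, teacher_tokens)         # L2-style reconstruction loss
```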
In a possible implementation manner, the acquiring the first feature map and the second feature map of the training image includes: obtaining at least two first feature maps corresponding to the training images extracted by at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted by at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one feature map pair in the at least two feature map pairs comprises a first feature map and a second feature map; generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map, including: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair; determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map, including: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In this implementation, the second image processing model is trained using at least two pairs of feature maps. For any feature map pair, position coding may be performed on a third feature map corresponding to the feature map pair to obtain third position information corresponding to the third feature map, the third position information and the third feature map may be input to a decoding network to obtain a fourth feature map, and difference information between the fourth feature map and the first feature map in the feature map pair may be determined. Similarly, difference information between the fourth feature map and the first feature map corresponding to each feature map pair may be determined. After determining the difference information corresponding to each of the at least two feature map pairs, the value of the loss function corresponding to the second image processing model may be determined according to a weighted sum of the difference information corresponding to the at least two feature map pairs. In this implementation, the second image processing model is trained by using at least two feature map pairs output by at least two intermediate layer pairs of the first image processing model and the second image processing model, so that the second image processing model can learn richer hidden layer features of the first image processing model, and the difference between the second image processing model and the first image processing model can be further reduced.
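For the multi-pair case, the per-pair difference terms might be combined as sketched below; the weights and tensor shapes are hypothetical placeholders for the steps described above:

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(fourth_maps, teacher_maps, weights=None):
    """Weighted sum of per-pair differences between each fourth feature map and the
    corresponding first (teacher) feature map, one term per feature map pair."""
    if weights is None:
        weights = [1.0] * len(fourth_maps)
    loss = torch.zeros(())
    for w, fourth, teacher in zip(weights, fourth_maps, teacher_maps):
        loss = loss + w * F.mse_loss(fourth, teacher)
    return loss

# Two feature map pairs, e.g. from the last two down-sampling stages (shapes illustrative).
fourth_maps = [torch.randn(1, 16, 1024), torch.randn(1, 64, 512)]
teacher_maps = [torch.randn(1, 16, 1024), torch.randn(1, 64, 512)]
loss_1 = feature_distillation_loss(fourth_maps, teacher_maps, weights=[1.0, 1.0])
```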
In a possible implementation manner, during the training process of the second image processing model, supervision may additionally be performed in combination with a logits distillation method, so as to improve the distillation effect.
In a possible implementation manner, in the process of training the second image processing model, the second image processing model may be supervised by combining difference information between a prediction result of the second image processing model and annotation data corresponding to the training image, so as to improve the accuracy of the second image processing model.
In one possible implementation, the first image processing model and the second image processing model are both used for image classification; after the second image processing model training is completed, the method further comprises: acquiring an image to be classified; processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified; and processing the characteristic graph corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified. In this implementation, the image to be classified may be any image that needs to be classified. And processing the image to be classified through the trained second image processing model to obtain a feature map corresponding to the image to be classified, and processing the feature map corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified, so that the accuracy of image classification of the image to be classified can be improved.
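A brief usage sketch, assuming the trained second image processing model is an ordinary PyTorch classification model and that a preprocessing transform matching its training pipeline exists (both assumptions for illustration):

```python
import torch

def classify(student: torch.nn.Module, image: torch.Tensor) -> int:
    """Run the trained second image processing model on one image to be classified."""
    student.eval()
    with torch.no_grad():
        logits = student(image.unsqueeze(0))   # feature extraction + classification head
        return int(logits.argmax(dim=1).item())

# image = preprocess(raw_image)                # hypothetical preprocessing step
# label = classify(student, image)
```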
In one possible implementation, the first image processing model and the second image processing model are both used for target detection; after the second image processing model training is completed, the method further comprises: acquiring an image to be detected; processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected; and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected. In this implementation, the image to be detected may be any image that needs to be subjected to target detection. And processing the image to be detected through a second image processing model obtained through training to obtain a characteristic diagram corresponding to the image to be detected, and processing the characteristic diagram corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected, so that the accuracy of target detection on the image to be detected can be improved.
The training method of the image processing model provided by the embodiment of the disclosure can be applied to the technical fields of computer vision and the like.
The following describes a training method of an image processing model provided by an embodiment of the present disclosure through a specific application scenario. In this application scenario, the first image processing model may employ ResNet152 and the second image processing model may employ ResNet 18.
The training image may be input into the first image processing model to obtain the first feature map F_1^t output by the last layer before the last down-sampling layer of the first image processing model and the first feature map F_2^t output by the last layer before the penultimate down-sampling layer of the first image processing model. The training image may be input into the second image processing model to obtain the second feature map F_1^s output by the last layer before the last down-sampling layer of the second image processing model and the second feature map F_2^s output by the last layer before the penultimate down-sampling layer of the second image processing model. The first feature map F_1^t and the second feature map F_1^s form a first feature map pair (F_1^t, F_1^s), and the first feature map F_2^t and the second feature map F_2^s form a second feature map pair (F_2^t, F_2^s).
For the first feature map pair (F_1^t, F_1^s), in the case that the numbers of channels of F_1^t and F_1^s are different, F_1^s may be convolved so that the convolved F_1^s has the same number of channels as F_1^t. In the case that the sizes of F_1^t and F_1^s are different, F_1^s may be bilinearly interpolated so that the bilinearly interpolated F_1^s has the same size as F_1^t.
After F_1^t and F_1^s are aligned, F_1^t and F_1^s may each be divided into feature blocks of size P_1 × P_1. For example, if the aligned F_1^t and F_1^s both have a size of H_1 × W_1, F_1^t and F_1^s may each be divided into M_1 = (H_1 × W_1)/(P_1 × P_1) feature blocks.
The mask ratio α_1 for merging F_1^t and F_1^s may be determined according to Equations 1, 2 and 3 above. The mask region may be randomly determined according to the mask ratio α_1. The first feature block set used for generating the third feature map F_1 may be determined from the feature blocks of F_1^t that belong to the mask region, and the second feature block set used for generating F_1 may be determined from the feature blocks of F_1^s that do not belong to the mask region.
The first feature block set may be position-encoded to obtain first position information corresponding to the first feature block set, and a first coding network corresponding to the first image processing model may be used to encode the first feature block set to which the first position information has been added, so as to obtain the first feature block encoding result f_1^t. The second feature block set may be position-encoded to obtain second position information corresponding to the second feature block set, and a second coding network corresponding to the second image processing model may be used to encode the second feature block set to which the second position information has been added, so as to obtain the second feature block encoding result f_1^s. The first coding network and the second coding network may each be a 6-layer multi-head self-attention network. After f_1^t and f_1^s are obtained, f_1^t and f_1^s may be merged according to the relative position information between the feature blocks in the second feature map F_1^s to obtain the third feature map F_1.
F_1 may be position-encoded to obtain third position information corresponding to F_1, and F_1 with the third position information added may be input into a decoding network to obtain a fourth feature map, which may be recorded as F_1'. Similarly, for the second feature map pair (F_2^t, F_2^s), a fourth feature map F_2' can be obtained. According to the difference information between F_1' and F_1^t and the difference information between F_2' and F_2^t, the value of the first loss function L_1 corresponding to the second image processing model can be obtained.
In addition, a logits distillation method may be combined: the difference information between the logits output by the second image processing model and the logits output by the first image processing model is used to determine the value of a second loss function L_2 corresponding to the second image processing model, and the value of a third loss function L_3 corresponding to the second image processing model is obtained according to the difference information between the prediction result of the second image processing model and the annotation data corresponding to the training image.
In one example, Equation 4 may be used to determine the value of the loss function L corresponding to the second image processing model:

L = L_1 + α × L_2 + β × L_3    (Equation 4)

where α represents the weight corresponding to L_2, β represents the weight corresponding to L_3, and α and β may be determined empirically.
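A sketch of Equation 4 as reconstructed above; the temperature-scaled KL divergence used for the logits distillation term and the cross-entropy used for the ground-truth term are common choices assumed here, not details stated in the disclosure:

```python
import torch
import torch.nn.functional as F

def total_loss(loss_feature, student_logits, teacher_logits, labels,
               alpha: float = 1.0, beta: float = 1.0, temperature: float = 4.0):
    """L = L1 + alpha * L2 + beta * L3 (Equation 4, as reconstructed)."""
    # L2: logits distillation between student and teacher outputs
    loss_logits = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean') * temperature ** 2
    # L3: supervision from the annotation data of the training images
    loss_gt = F.cross_entropy(student_logits, labels)
    return loss_feature + alpha * loss_logits + beta * loss_gt

loss = total_loss(torch.tensor(0.5), torch.randn(4, 10), torch.randn(4, 10),
                  torch.randint(0, 10, (4,)))
```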
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with one another to form combined embodiments without departing from the underlying principles and logic; due to space limitations, the details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the methods of the above specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a training apparatus for an image processing model, an electronic device, a computer-readable storage medium, and a computer program product, which can be used to implement any one of the training methods for an image processing model provided by the present disclosure, and corresponding technical solutions and technical effects can be referred to in corresponding descriptions of the method sections and are not described again.
Fig. 2 shows a block diagram of a training apparatus for an image processing model provided in an embodiment of the present disclosure. As shown in fig. 2, the training apparatus for the image processing model includes:
an obtaining module 21, configured to obtain a first feature map and a second feature map of a training image, where the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
a generating module 22, configured to generate a third feature map according to a partial feature in the first feature map and a partial feature in the second feature map;
a determining module 23, configured to determine, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model;
a training module 24, configured to train the second image processing model according to the value of the loss function.
In one possible implementation, the generating module 22 is configured to:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
In one possible implementation, the generating module 22 is configured to:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask area according to the mask proportion;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In one possible implementation, the generating module 22 is configured to:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
In one possible implementation, the generating module 22 is configured to:
determining a Centered Kernel Alignment (CKA) similarity index between the first feature map and the second feature map;
determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In one possible implementation, the generating module 22 is configured to:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
In one possible implementation, the generating module 22 is configured to:
in response to the difference between the channel numbers of the second feature map and the first feature map, performing convolution processing on the second feature map to enable the channel numbers of the convolved second feature map and the first feature map to be the same, and/or in response to the difference between the sizes of the second feature map and the first feature map, performing bilinear interpolation on the second feature map to enable the size of the bilinear interpolated second feature map to be the same as the size of the first feature map;
alternatively,
and in response to the difference between the channel numbers of the second feature map and the first feature map, performing convolution processing on the first feature map to enable the channel numbers of the convoluted first feature map and the second feature map to be the same, and/or in response to the difference between the sizes of the second feature map and the first feature map, performing bilinear interpolation on the first feature map to enable the sizes of the bilinear interpolated first feature map and the second feature map to be the same.
In a possible implementation manner, the determining module 23 is configured to:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In one possible implementation, the generating module 22 is configured to:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
coding the second feature block set by combining the second position information to obtain a second feature block coding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In a possible implementation manner, the determining module 23 is configured to:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In one possible implementation,
the obtaining module 21 is configured to: obtaining at least two first feature maps corresponding to the training images extracted by at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted by at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one feature map pair in the at least two feature map pairs comprises a first feature map and a second feature map;
the generating module 22 is configured to: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
the determining module 23 is configured to: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In one possible implementation, the first image processing model and the second image processing model are both used for image classification;
the device further comprises:
the classification module is used for acquiring an image to be classified; processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified; and processing the characteristic graph corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
In one possible implementation, the first image processing model and the second image processing model are both used for target detection;
the device further comprises:
the target detection module is used for acquiring an image to be detected; processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected; and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
In the embodiment of the disclosure, a first feature map and a second feature map of a training image are obtained, wherein the first feature map is output through a first image processing model and the second feature map is output through a second image processing model; a third feature map is generated according to part of the features in the first feature map and part of the features in the second feature map; a value of a loss function corresponding to the second image processing model is determined according to the third feature map and the first feature map; and the second image processing model is trained according to the value of the loss function. In this way, part of the feature map extracted by the first image processing model is used as prior knowledge for the second image processing model, so that the second image processing model learns to mimic the features output by the first image processing model. The performance of the second image processing model can therefore keep improving as the performance of the first image processing model improves, which alleviates the teacher-student gap problem in which the performance of the student model stops improving even as the performance of the teacher model improves. As a result, a first image processing model of larger scale and better performance can be used to distill a second image processing model with better performance.
In some embodiments, functions or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementations and technical effects thereof may refer to the description of the above method embodiments, which are not described herein again for brevity.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-described method. The computer-readable storage medium may be a non-volatile computer-readable storage medium, or may be a volatile computer-readable storage medium.
Embodiments of the present disclosure also provide a computer program, which includes computer readable code that, when run in an electronic device, causes a processor in the electronic device to execute the above method.
The disclosed embodiments also provide a computer program product comprising computer readable code, or a non-volatile computer readable storage medium carrying computer readable code, which, when run in an electronic device, causes a processor in the electronic device to perform the above method.
An embodiment of the present disclosure further provides an electronic device, including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the above-described methods.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 3 shows a block diagram of an electronic device 1900 provided by an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 3, electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple graphical-user-interface-based operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute the computer-readable program instructions and implement aspects of the present disclosure by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
If the technical solutions of the embodiments of the present disclosure involve personal information, products applying these technical solutions clearly inform users of the personal information processing rules and obtain the individual's separate consent before processing the personal information. If the technical solutions involve sensitive personal information, products applying these technical solutions obtain the individual's separate consent before processing the sensitive personal information and additionally satisfy the requirement of "express consent". For example, a clear and prominent sign may be set at a personal information collection device such as a camera to inform people that they are entering the personal information collection range and that personal information will be collected; if a person voluntarily enters the collection range, the person is regarded as consenting to the collection of his or her personal information. Alternatively, on a device that processes personal information, personal authorization may be obtained, with the personal information processing rules announced by means of prominent signs or notices, through pop-up messages or by asking the person to upload his or her personal information himself or herself. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (17)

1. A method for training an image processing model, comprising:
acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map;
and training the second image processing model according to the value of the loss function.
2. The method according to claim 1, wherein generating a third feature map from the partial features in the first feature map and the partial features in the second feature map comprises:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
3. The method of claim 2, wherein determining a first set of feature blocks from the first feature map for generating a third feature map and determining a second set of feature blocks from the second feature map for generating the third feature map comprises:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask region according to the mask ratio;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
5. The method of claim 3, wherein determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map comprises:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
5. The method of claim 4,
the determining similarity information between the first feature map and the second feature map includes: determining a centered kernel alignment (CKA) similarity index between the first feature map and the second feature map;
determining a mask ratio for merging the first feature map and the second feature map according to the similarity information includes: determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
6. The method according to claim 4 or 5, wherein the determining similarity information between the first feature map and the second feature map comprises:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
7. The method of claim 6, wherein said aligning the first feature map with the second feature map comprises:
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the second feature map so that the size of the interpolated second feature map is the same as that of the first feature map;
or,
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the first feature map so that the size of the interpolated first feature map is the same as that of the second feature map.
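A minimal sketch of the first alternative of claim 7 (aligning the second feature map to the first) is given below; the 1×1 convolution is only one possible choice of "convolution processing", and in practice the projection would be a trainable module created once rather than built inside the function as in this shortened example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def align_second_to_first(f_second, f_first, proj=None):
    """Align the second feature map to the first one in channel count and spatial size."""
    if f_second.shape[1] != f_first.shape[1]:
        if proj is None:   # 1x1 convolution as the convolution processing
            proj = nn.Conv2d(f_second.shape[1], f_first.shape[1], kernel_size=1)
        f_second = proj(f_second)
    if f_second.shape[-2:] != f_first.shape[-2:]:
        f_second = F.interpolate(f_second, size=f_first.shape[-2:],
                                 mode='bilinear', align_corners=False)
    return f_second

# usage with mismatched channel counts and sizes
aligned = align_second_to_first(torch.randn(2, 128, 14, 14), torch.randn(2, 256, 28, 28))
```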
8. The method according to any one of claims 3 to 7, wherein the determining, according to the mask region, a first feature block set for generating a third feature map from the first feature map and a second feature block set for generating the third feature map from the second feature map comprises:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
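Claim 8 selects position-complementary feature blocks from the two maps according to the mask region. The sketch below shows one way this could be done with a random block-level mask; the block size, the random choice of the mask region, the zero-filling of unselected positions, and the assumption that the spatial size is divisible by the block size are all illustrative choices, not parts of the claim.

```python
import torch

def split_into_complementary_blocks(f_first, f_second, mask_ratio, block=4, generator=None):
    """Divide both (already aligned) maps into block x block feature blocks and hand each
    position to exactly one of the two maps: masked positions come from the first map,
    the remaining positions from the second map."""
    n, c, h, w = f_first.shape
    gh, gw = h // block, w // block                   # number of blocks along each axis
    num_blocks = gh * gw
    num_masked = int(round(mask_ratio * num_blocks))
    perm = torch.randperm(num_blocks, generator=generator)
    masked = torch.zeros(num_blocks, dtype=torch.bool)
    masked[perm[:num_masked]] = True                  # the "mask region" at block level
    masked = masked.view(gh, gw)
    # expand the block-level mask back to pixel resolution
    pixel_mask = masked.repeat_interleave(block, 0).repeat_interleave(block, 1)
    pixel_mask = pixel_mask.view(1, 1, h, w)
    first_set = f_first * pixel_mask                  # feature blocks taken from the first map
    second_set = f_second * (~pixel_mask)             # complementary blocks from the second map
    return first_set, second_set, pixel_mask
```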
9. The method according to any one of claims 2 to 8, wherein the generating the third feature map from the first set of feature blocks and the second set of feature blocks comprises:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
combining the second position information to encode the second feature block set to obtain a second feature block encoding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
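One way to realise the position coding and encoding of claim 9 is sketched below with a small transformer encoder; the learned positional embeddings, the one-token-per-block layout, the summation of the two complementary token sets, and all hyper-parameters are assumptions of the sketch and are not mandated by the claim.

```python
import torch
import torch.nn as nn

class FeatureBlockMixer(nn.Module):
    """Encodes the two complementary feature block sets with positional information
    and fuses them into one token sequence representing the third feature map."""
    def __init__(self, channels, block=4, num_positions=64):
        super().__init__()
        d = channels * block * block                       # one token per feature block
        self.pos_embed = nn.Embedding(num_positions, d)    # position coding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2)
        self.block, self.channels = block, channels

    def blocks_to_tokens(self, fmap):
        n, c, h, w = fmap.shape
        gh, gw = h // self.block, w // self.block
        tokens = fmap.unfold(2, self.block, self.block).unfold(3, self.block, self.block)
        return tokens.permute(0, 2, 3, 1, 4, 5).reshape(n, gh * gw, -1)

    def forward(self, first_set, second_set):
        t1 = self.blocks_to_tokens(first_set)              # first feature block set
        t2 = self.blocks_to_tokens(second_set)             # second feature block set
        pos = torch.arange(t1.shape[1], device=t1.device)
        e1 = t1 + self.pos_embed(pos)                      # encoding with first position information
        e2 = t2 + self.pos_embed(pos)                      # encoding with second position information
        return self.encoder(e1 + e2)                       # token form of the third feature map

mixer = FeatureBlockMixer(channels=64, block=4, num_positions=64)
third_tokens = mixer(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```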
10. The method according to any one of claims 1 to 9, wherein determining the value of the loss function corresponding to the second image processing model according to the third feature map and the first feature map comprises:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
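Claim 10 decodes the third feature map, together with its position information, into a fourth feature map and compares it with the first feature map. The sketch below continues the token layout of the previous example; the two-layer MLP decoder and the MSE loss are illustrative choices only, not the decoding network or loss function specified by the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockDecoder(nn.Module):
    """Decodes the mixed token sequence back into a dense (fourth) feature map."""
    def __init__(self, channels, block=4, num_positions=64):
        super().__init__()
        d = channels * block * block
        self.pos_embed = nn.Embedding(num_positions, d)    # third position information
        self.decoder = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.channels, self.block = channels, block

    def forward(self, mixed_tokens, out_hw):
        n, l, _ = mixed_tokens.shape
        pos = torch.arange(l, device=mixed_tokens.device)
        decoded = self.decoder(mixed_tokens + self.pos_embed(pos))
        h, w = out_hw
        gh, gw = h // self.block, w // self.block
        fourth = decoded.view(n, gh, gw, self.channels, self.block, self.block)
        return fourth.permute(0, 3, 1, 4, 2, 5).reshape(n, self.channels, h, w)

def distillation_loss(fourth_feature_map, first_feature_map):
    return F.mse_loss(fourth_feature_map, first_feature_map)

tokens = torch.randn(1, 64, 64 * 4 * 4)                   # e.g. output of the encoder sketch above
decoder = BlockDecoder(channels=64, block=4, num_positions=64)
fourth_map = decoder(tokens, out_hw=(32, 32))              # fourth feature map
loss_value = distillation_loss(fourth_map, torch.randn(1, 64, 32, 32))   # vs. the first feature map
```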
11. The method according to any one of claims 1 to 10,
the acquiring of the first feature map and the second feature map of the training image includes: obtaining at least two first feature maps corresponding to the training images extracted by at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted by at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one feature map pair in the at least two feature map pairs comprises a first feature map and a second feature map;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map, including: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map, including: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
12. The method of any of claims 1 to 11, wherein the first image processing model and the second image processing model are both used for image classification;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be classified;
processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified;
and processing the feature map corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
13. The method according to any of claims 1 to 11, wherein the first image processing model and the second image processing model are both used for object detection;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be detected;
processing the image to be detected through the second image processing model to obtain a feature map corresponding to the image to be detected;
and processing the feature map corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
14. An apparatus for training an image processing model, comprising:
the acquisition module is used for acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
a generating module, configured to generate a third feature map according to a partial feature in the first feature map and a partial feature in the second feature map;
a determining module, configured to determine, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model;
and the training module is used for training the second image processing model according to the value of the loss function.
15. An electronic device, comprising:
one or more processors;
a memory for storing executable instructions;
wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the method of any one of claims 1 to 13.
16. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 13.
17. A computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code which, when run in an electronic device, causes a processor in the electronic device to perform the method of any of claims 1 to 13.
CN202210647722.0A 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model Pending CN114998694A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210647722.0A CN114998694A (en) 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210647722.0A CN114998694A (en) 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model

Publications (1)

Publication Number Publication Date
CN114998694A true CN114998694A (en) 2022-09-02

Family

ID=83033251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210647722.0A Pending CN114998694A (en) 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model

Country Status (1)

Country Link
CN (1) CN114998694A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024066111A1 (en) * 2022-09-28 2024-04-04 北京大学 Image processing model training method and apparatus, image processing method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
CN112184738B (en) Image segmentation method, device, equipment and storage medium
CN108427939B (en) Model generation method and device
US11514263B2 (en) Method and apparatus for processing image
CN112712069B (en) Question judging method and device, electronic equipment and storage medium
CN112800276B (en) Video cover determining method, device, medium and equipment
CN111062964A (en) Image segmentation method and related device
CN112381717A (en) Image processing method, model training method, device, medium, and apparatus
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN112631947A (en) Application program test control method and device, electronic equipment and storage medium
CN114170468A (en) Text recognition method, storage medium and computer terminal
US20230351558A1 (en) Generating an inpainted image from a masked image using a patch-based encoder
CN114998694A (en) Method, apparatus, device, medium and program product for training image processing model
CN109816023B (en) Method and device for generating picture label model
CN114969512A (en) Object recommendation method and device and electronic equipment
CN114913061A (en) Image processing method and device, storage medium and electronic equipment
CN111898338B (en) Text generation method and device and electronic equipment
CN113610034A (en) Method, device, storage medium and electronic equipment for identifying person entity in video
CN115546769B (en) Road image recognition method, device, equipment and computer readable medium
CN114627353B (en) Image description generation method, device, equipment, medium and product
CN115757933A (en) Recommendation information generation method, device, equipment, medium and program product
CN115375656A (en) Training method, segmentation method, device, medium, and apparatus for polyp segmentation model
CN115115836A (en) Image recognition method, image recognition device, storage medium and electronic equipment
CN115272667A (en) Farmland image segmentation model training method and device, electronic equipment and medium
CN111353536B (en) Image labeling method and device, readable medium and electronic equipment
CN115100417A (en) Image processing method, storage medium, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination