CN114998694A - Method, apparatus, device, medium and program product for training image processing model - Google Patents

Method, apparatus, device, medium and program product for training image processing model

Info

Publication number
CN114998694A
CN114998694A
Authority
CN
China
Prior art keywords
feature map
feature
image processing
processing model
block set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210647722.0A
Other languages
Chinese (zh)
Inventor
杨昆霖
邱增玉
宗道明
侯军
伊帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210647722.0A priority Critical patent/CN114998694A/en
Publication of CN114998694A publication Critical patent/CN114998694A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method, apparatus, device, medium, and program product for training an image processing model. The method comprises the following steps: acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model; generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map; determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map; and training the second image processing model according to the value of the loss function.

Description

Training method, device, equipment, medium and program product of image processing model
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for training an image processing model, an electronic device, a storage medium, and a program product.
Background
Knowledge Distillation (KD) refers to distilling the knowledge contained in a trained teacher model into a student model. General knowledge distillation methods use the final output logits of the teacher model as the knowledge for the student model to learn, commonly referred to as "soft labels" or "dark knowledge". The related art also proposes feature-based knowledge distillation methods, which take the feature maps output by intermediate layers of the teacher model as the knowledge for the student model to learn.
In the related art, the performance of the distilled student model does not keep improving as the performance of the teacher model improves. That is, there is a gap between the teacher and student models (i.e., the teacher model and the student model): the improvement of the student model's performance hits a bottleneck, and it is difficult to continue optimizing it once it has improved to a certain extent.
For mobile terminals with limited computing power, usually only a small-scale student model can be deployed. In the related art, the performance of the student model is poor, so the student model is difficult to put into practical use, and a low-precision student model can hardly meet user requirements. If the gap between the teacher model and the student model can be bridged, so that the performance of the student model improves along with the performance of the teacher model, the deployment of more student models can be accelerated, bringing a better experience to users.
Disclosure of Invention
The present disclosure provides a training technical solution of an image processing model.
According to an aspect of the present disclosure, there is provided a training method of an image processing model, including:
acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map;
and training the second image processing model according to the value of the loss function.
The method obtains a first feature map and a second feature map of a training image, wherein the first feature map is output by a first image processing model and the second feature map is output by a second image processing model; generates a third feature map according to partial features in the first feature map and partial features in the second feature map; determines the value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map; and trains the second image processing model according to the value of the loss function. In this way, part of the features of the feature map extracted by the first image processing model is used as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model. The performance of the second image processing model can therefore improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Consequently, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
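For illustration only, the following is a minimal PyTorch-style sketch of one such training step, assuming the feature maps are the direct outputs of the two models; all names (teacher, student, decoder, build_mixed_feature) are hypothetical placeholders rather than elements of the disclosure, and the mean-squared-error loss is an assumed choice.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, decoder, build_mixed_feature, optimizer, image):
    """One illustrative training step of the student (second image processing model)."""
    with torch.no_grad():
        f_teacher = teacher(image)        # first feature map (from the teacher)
    f_student = student(image)            # second feature map (from the student)

    # Third feature map: partial teacher features merged with partial student features.
    f_mixed = build_mixed_feature(f_teacher, f_student)

    # Compare a reconstruction of the mixed features with the teacher features.
    f_reconstructed = decoder(f_mixed)
    loss = F.mse_loss(f_reconstructed, f_teacher)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```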
In a possible implementation manner, the generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map includes:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
In this implementation, a first feature block set used for generating a third feature map is determined from the first feature map, a second feature block set used for generating the third feature map is determined from the second feature map, and the third feature map is generated according to the first feature block set and the second feature block set, so that the first feature map and the second feature map are divided by taking the feature block as a minimum unit, and the third feature map is generated based on a part of feature blocks in the first feature map and a part of feature blocks in the second feature map, so that the second image processing model is trained by using a priori feature blocks provided by a teacher model, which helps to improve the accuracy of the trained second image processing model.
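As a hedged illustration of treating a feature map with the feature block as the minimum unit, the sketch below splits a feature map into non-overlapping blocks; the block size of 4 and the tensor layout are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def to_feature_blocks(feature_map, block_size=4):
    # feature_map: (N, C, H, W) -> (N, num_blocks, C * block_size * block_size),
    # one row per non-overlapping feature block.
    blocks = F.unfold(feature_map, kernel_size=block_size, stride=block_size)
    return blocks.transpose(1, 2)

x = torch.randn(2, 64, 16, 16)
print(to_feature_blocks(x).shape)  # torch.Size([2, 16, 1024])
```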
In a possible implementation manner, the determining, from the first feature map, a first feature block set used for generating a third feature map, and determining, from the second feature map, a second feature block set used for generating the third feature map includes:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask area according to the mask proportion;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In this implementation, a mask ratio for merging the first feature map and the second feature map is determined according to the first feature map and the second feature map, a mask region is determined according to the mask ratio, and, according to the mask region, a first feature block set used for generating a third feature map is determined from the first feature map and a second feature block set used for generating the third feature map is determined from the second feature map, wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary. Thus, the mask ratio for merging the first feature map and the second feature map is determined based on the two feature maps themselves rather than being a fixed value; that is, the mask ratio for merging a feature map pair is dynamically adjusted according to the similarity information (e.g., similarity) of that feature map pair. For example, different feature map pairs output by different intermediate layer pairs have different similarity information and may therefore adopt different mask ratios; as another example, different feature map pairs output by the same intermediate layer pair in different training rounds will likely adopt different mask ratios. Adopting a mask ratio determined dynamically from the feature maps helps reduce the difference between the second image processing model and the first image processing model, and helps improve the precision of the trained second image processing model.
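A minimal sketch of this dynamic masking is given below, assuming the similarity score lies in [0, 1] and that higher similarity maps to a smaller mask ratio; both the mapping and the range bounds r_min/r_max are assumptions made for illustration.

```python
import torch

def mask_from_similarity(similarity, num_blocks, r_min=0.1, r_max=0.9):
    # Map a similarity in [0, 1] to a mask ratio (assumed: more similar -> mask less),
    # then randomly pick the masked block positions that form the mask region.
    similarity = max(0.0, min(1.0, float(similarity)))
    mask_ratio = r_max - (r_max - r_min) * similarity
    num_masked = int(round(mask_ratio * num_blocks))
    mask = torch.zeros(num_blocks, dtype=torch.bool)
    mask[torch.randperm(num_blocks)[:num_masked]] = True  # True = inside the mask region
    return mask_ratio, mask
```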
In a possible implementation manner, the determining, according to the first feature map and the second feature map, a mask ratio for merging the first feature map and the second feature map includes:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
In this implementation, similarity information between the first feature map and the second feature map is determined, and a mask ratio for combining the first feature map and the second feature map is determined according to the similarity information, thereby facilitating improvement of the accuracy of the trained second image processing model.
In one possible implementation manner,
the determining similarity information between the first feature map and the second feature map includes: determining a Centered Kernel Alignment (CKA) similarity index between the first feature map and the second feature map;
the determining a mask ratio for merging the first feature map and the second feature map according to the similarity information includes: determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In this implementation, by determining the CKA similarity index between the first feature map and the second feature map and determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index, the accuracy of the trained second image processing model is improved.
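The CKA similarity index can be computed in several ways; a common instance is linear CKA, shown below as an illustrative sketch in which the two feature maps are assumed to have been flattened into (samples, features) matrices.

```python
import torch

def linear_cka(x, y):
    # x, y: (num_samples, num_features) matrices obtained by flattening the feature maps.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.norm(y.t() @ x, p='fro') ** 2
    return cross / (torch.norm(x.t() @ x, p='fro') * torch.norm(y.t() @ y, p='fro'))
```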
In a possible implementation manner, the determining similarity information between the first feature map and the second feature map includes:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
In this implementation, the similarity information between the first feature map and the second feature map can be determined more accurately by aligning the first feature map and the second feature map and determining the similarity information between the aligned first feature map and the aligned second feature map.
In one possible implementation, the aligning the first feature map and the second feature map includes:
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the second feature map so that the size of the bilinear-interpolated second feature map is the same as that of the first feature map;
alternatively,
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the first feature map so that the size of the bilinear-interpolated first feature map is the same as that of the second feature map.
In this implementation, in response to the number of channels of the second feature map being different from that of the first feature map, convolution processing is performed on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, whereby the channel numbers of the first feature map and the second feature map can be aligned; and in response to the size of the second feature map being different from that of the first feature map, bilinear interpolation is performed on the second feature map so that the size of the bilinear-interpolated second feature map is the same as that of the first feature map, whereby the sizes of the first feature map and the second feature map can be aligned. Alternatively, convolution processing can be performed on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or bilinear interpolation can be performed on the first feature map so that the size of the bilinear-interpolated first feature map is the same as that of the second feature map, likewise aligning the channel numbers and the sizes of the two feature maps.
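An illustrative alignment module is sketched below: a 1x1 convolution matches the channel count of the second (student) feature map to the first (teacher) feature map, and bilinear interpolation matches the spatial size. The direction of alignment and the layer choices are assumptions; the symmetric case (aligning the first feature map to the second) is analogous.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convolution, used only when the channel counts differ.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        if student_feat.shape[1] != teacher_feat.shape[1]:
            student_feat = self.proj(student_feat)
        if student_feat.shape[-2:] != teacher_feat.shape[-2:]:
            student_feat = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                                         mode='bilinear', align_corners=False)
        return student_feat
```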
In a possible implementation manner, the determining, according to the mask region, a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map includes:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In this implementation, a first feature block set used for generating a third feature map is determined from the first feature map by using position information corresponding to the mask region or position information corresponding to positions other than the mask region, and a feature block complementary to the position of the feature block in the first feature block set is selected from the second feature map to obtain a second feature block set used for generating the third feature map, so that the feature block masked in the second feature map is filled with the feature block at the corresponding position in the first feature map, and thus part of prior knowledge can be provided to the second image processing model serving as a student model by using the first image processing model serving as a teacher model.
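A sketch of this complementary selection is given below, assuming both feature maps have already been split into blocks of identical shape (see the earlier block-splitting example) and that the boolean mask marks the positions filled from the teacher; these assumptions are made only for illustration.

```python
import torch

def merge_complementary_blocks(teacher_blocks, student_blocks, mask):
    # teacher_blocks, student_blocks: (N, num_blocks, dim); mask: (num_blocks,) bool.
    # Positions where mask is True come from the first (teacher) feature map, the
    # remaining positions from the second (student) feature map, so the two selected
    # feature block sets are complementary in position.
    return torch.where(mask.view(1, -1, 1), teacher_blocks, student_blocks)
```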
In one possible implementation manner, the generating the third feature map according to the first feature block set and the second feature block set includes:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
coding the second feature block set by combining the second position information to obtain a second feature block coding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In this implementation, the first feature block set and the second feature block set are respectively position-encoded to obtain first position information corresponding to the first feature block set and second position information corresponding to the second feature block set; the first feature block set is encoded in combination with the first position information to obtain a first feature block encoding result; the second feature block set is encoded in combination with the second position information to obtain a second feature block encoding result; and a third feature map is generated according to the first feature block encoding result and the second feature block encoding result. The third feature map is thus generated in combination with the position information of the feature blocks in the first feature block set and the second feature block set, and can include the position information of the feature blocks in the first feature map and the second feature map. Therefore, the second image processing model can be trained using prior features containing position information provided by the teacher model, which helps improve the accuracy of the trained second image processing model.
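The sketch below illustrates one possible form of this step: learnable position embeddings are added to the blocks, a small transformer encoder produces the two feature block encoding results, and the results are merged by position into the third feature map in block form. The shared encoder, the embedding type, and the hyperparameters are assumptions, not the claimed design, and dim must be divisible by num_heads.

```python
import torch
import torch.nn as nn

class MixedFeatureEncoder(nn.Module):
    def __init__(self, dim, num_blocks, num_layers=2, num_heads=4):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_blocks, dim))  # position information
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, teacher_blocks, student_blocks, mask):
        # teacher_blocks, student_blocks: (N, num_blocks, dim); mask: (num_blocks,) bool.
        pos = self.pos_embed.expand(teacher_blocks.size(0), -1, -1)
        enc_teacher = self.encoder(teacher_blocks + pos)  # first feature block encoding result
        enc_student = self.encoder(student_blocks + pos)  # second feature block encoding result
        # Third feature map (in block form): teacher results at masked positions,
        # student results elsewhere.
        return torch.where(mask.view(1, -1, 1), enc_teacher, enc_student)
```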
In a possible implementation manner, the determining, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model includes:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In this implementation, the third feature map is subjected to position coding to obtain third position information corresponding to the third feature map, the third feature map to which the third position information is added is input to a decoding network to obtain a fourth feature map, and a value of a loss function corresponding to the second image processing model is determined according to the fourth feature map and the first feature map, so that the second image processing model is trained to improve the accuracy of the second image processing model.
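For illustration, assuming the third feature map is kept in block form of shape (N, num_blocks, C * block_size * block_size), the sketch below adds position information, decodes it into a fourth feature map with the same layout as the first feature map, and applies a mean-squared-error loss; the decoder and the loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(third_blocks, pos_embed, decoder, teacher_map, block_size):
    # third_blocks: (N, num_blocks, C * block_size**2); teacher_map: (N, C, H, W).
    n, c, h, w = teacher_map.shape
    decoded = decoder(third_blocks + pos_embed)               # fourth feature map, block form
    fourth_map = F.fold(decoded.transpose(1, 2), output_size=(h, w),
                        kernel_size=block_size, stride=block_size)
    return F.mse_loss(fourth_map, teacher_map)                # value of the loss function
```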
In one possible implementation manner,
the acquiring of the first feature map and the second feature map of the training image includes: obtaining at least two first feature maps corresponding to the training images extracted from at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted from at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one of the at least two feature map pairs comprises one first feature map and one second feature map;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map, including: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map, including: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In this implementation, the second image processing model is trained by using at least two feature map pairs output by at least two intermediate layer pairs of the first image processing model and the second image processing model, so that the second image processing model can learn richer hidden layer features of the first image processing model, and the difference between the second image processing model and the first image processing model can be further reduced.
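When feature maps are drawn from several intermediate layer pairs, one loss term may be computed per pair and aggregated; the sketch below simply sums the terms, which is an assumed aggregation rule, and the two callables are hypothetical placeholders.

```python
def total_distillation_loss(feature_map_pairs, build_third_map, pair_loss):
    # feature_map_pairs: iterable of (first_feature_map, second_feature_map) tuples
    # taken from the T intermediate layer pairs.
    total = 0.0
    for first_map, second_map in feature_map_pairs:
        third_map = build_third_map(first_map, second_map)
        total = total + pair_loss(third_map, first_map)
    return total
```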
In one possible implementation, the first image processing model and the second image processing model are both used for image classification;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be classified;
processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified;
and processing the characteristic graph corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
In the implementation manner, the images to be classified are processed through the trained second image processing model to obtain the feature maps corresponding to the images to be classified, and the feature maps corresponding to the images to be classified are processed through the second image processing model to obtain the classification results corresponding to the images to be classified, so that the accuracy of image classification of the images to be classified can be improved.
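After training, inference with the second image processing model is conventional; the sketch below assumes the model maps a preprocessed image tensor of shape (C, H, W) to class logits.

```python
import torch

@torch.no_grad()
def classify(student_model, image_tensor):
    # image_tensor: (C, H, W) preprocessed image to be classified.
    student_model.eval()
    logits = student_model(image_tensor.unsqueeze(0))  # feature maps feed the classifier head internally
    return logits.argmax(dim=-1).item()                # classification result (class index)
```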
In one possible implementation, the first image processing model and the second image processing model are both used for target detection;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be detected;
processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected;
and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
In the implementation mode, the second image processing model obtained through training is used for processing the image to be detected to obtain the characteristic diagram corresponding to the image to be detected, and the second image processing model is used for processing the characteristic diagram corresponding to the image to be detected to obtain the target detection result corresponding to the image to be detected, so that the accuracy of target detection on the image to be detected can be improved.
According to an aspect of the present disclosure, there is provided a training apparatus for an image processing model, including:
the acquisition module is used for acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
a generating module, configured to generate a third feature map according to a partial feature in the first feature map and a partial feature in the second feature map;
a determining module, configured to determine, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model;
and the training module is used for training the second image processing model according to the value of the loss function.
In one possible implementation, the generating module is configured to:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
In one possible implementation, the generating module is configured to:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask area according to the mask proportion;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In one possible implementation, the generating module is configured to:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
In one possible implementation, the generating module is configured to:
determining a Centered Kernel Alignment (CKA) similarity index between the first feature map and the second feature map;
determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In one possible implementation, the generating module is configured to:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
In one possible implementation, the generating module is configured to:
performing convolution processing on the second feature map in response to the number of channels of the second feature map being different from that of the first feature map, so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or performing bilinear interpolation on the second feature map in response to the size of the second feature map being different from that of the first feature map, so that the size of the bilinear-interpolated second feature map is the same as that of the first feature map;
alternatively,
performing convolution processing on the first feature map in response to the number of channels of the second feature map being different from that of the first feature map, so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or performing bilinear interpolation on the first feature map in response to the size of the second feature map being different from that of the first feature map, so that the size of the bilinear-interpolated first feature map is the same as that of the second feature map.
In one possible implementation, the determining module is configured to:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In one possible implementation, the generating module is configured to:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
coding the second feature block set by combining the second position information to obtain a second feature block coding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In one possible implementation, the determining module is configured to:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In one possible implementation manner,
the acquisition module is configured to: obtaining at least two first feature maps corresponding to the training images extracted from at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted from at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one of the at least two feature map pairs comprises one first feature map and one second feature map;
the generation module is configured to: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
the determination module is to: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In one possible implementation, the first image processing model and the second image processing model are both used for image classification;
the device further comprises:
the classification module is used for acquiring an image to be classified; processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified; and processing the feature map corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
In one possible implementation, the first image processing model and the second image processing model are both used for target detection;
the device further comprises:
the target detection module is used for acquiring an image to be detected; processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected; and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
According to an aspect of the present disclosure, there is provided an electronic device including: one or more processors; and a memory for storing executable instructions; wherein the one or more processors are configured to invoke the executable instructions stored in the memory to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
According to an aspect of the present disclosure, there is provided a computer program product including computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which, when run in an electronic device, causes a processor in the electronic device to perform the above method.
In the embodiment of the disclosure, a first feature map and a second feature map of a training image are obtained, wherein the first feature map is output by a first image processing model and the second feature map is output by a second image processing model; a third feature map is generated according to partial features in the first feature map and partial features in the second feature map; a value of a loss function corresponding to the second image processing model is determined according to the third feature map and the first feature map; and the second image processing model is trained according to the value of the loss function. In this way, part of the features of the feature map extracted by the first image processing model is used as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model. The performance of the second image processing model can therefore improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Consequently, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a training method of an image processing model provided by an embodiment of the present disclosure.
Fig. 2 shows a block diagram of a training apparatus for an image processing model provided in an embodiment of the present disclosure.
Fig. 3 shows a block diagram of a device 1900 provided by an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
The disclosed embodiments provide a training method, an apparatus, an electronic device, a storage medium, and a program product for an image processing model. A first feature map and a second feature map of a training image are obtained, wherein the first feature map is output by a first image processing model and the second feature map is output by a second image processing model; a third feature map is generated according to partial features in the first feature map and partial features in the second feature map; a value of a loss function corresponding to the second image processing model is determined according to the third feature map and the first feature map; and the second image processing model is trained according to the value of the loss function. In this way, part of the feature map extracted by the first image processing model is used as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model. The performance of the second image processing model can therefore improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Consequently, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
The following describes in detail a training method of an image processing model according to an embodiment of the present disclosure with reference to the drawings.
Fig. 1 shows a flowchart of a training method of an image processing model provided by an embodiment of the present disclosure. In a possible implementation manner, the executing entity of the training method of the image processing model may be a training apparatus of the image processing model; for example, the training method of the image processing model may be executed by a terminal device, a server, or another electronic device. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. In some possible implementations, the training method of the image processing model may be implemented by a processor calling computer readable instructions stored in a memory. As shown in Fig. 1, the training method of the image processing model includes steps S11 to S14.
In step S11, a first feature map and a second feature map of a training image are obtained, where the first feature map is output by a first image processing model and the second feature map is output by a second image processing model.
In step S12, a third feature map is generated based on the partial features in the first feature map and the partial features in the second feature map.
In step S13, a value of a loss function corresponding to the second image processing model is determined based on the third feature map and the first feature map.
In step S14, the second image processing model is trained based on the values of the loss function.
In the embodiments of the present disclosure, the first image processing model and the second image processing model are both models for image processing, and the first image processing model and the second image processing model may be used for the same image processing task. For example, both the first image processing model and the second image processing model are used for image classification; for another example, both the first image processing model and the second image processing model are used for target detection; as another example, both the first image processing model and the second image processing model are used for image segmentation; for another example, both the first image processing model and the second image processing model are used for feature extraction; and so on.
In the disclosed embodiment, the first image processing model is a teacher model and the second image processing model is a student model, the second image processing model being more lightweight than the first image processing model. For example, compared with the first image processing model, the network structure of the second image processing model is simpler and/or its number of parameters is smaller; that is, compared with the second image processing model, the network structure of the first image processing model is more complex and/or its number of parameters is larger. Owing to the more complex network structure and/or larger number of parameters, the first image processing model can usually be trained to achieve better performance, and as the scale of the first image processing model increases, its performance improves. Because the embodiment of the disclosure uses part of the features of the feature map extracted by the first image processing model as prior knowledge of the second image processing model, so that the second image processing model imitates the features output by the first image processing model, the second image processing model can improve along with the performance of the first image processing model; that is, the gap between the teacher and student models can be bridged, and the problem that the performance of the second image processing model stops improving as the performance of the first image processing model improves can be solved. Since the performance of the second image processing model can improve along with the performance of the first image processing model, a first image processing model with a larger scale and better performance can be used to distill a second image processing model with better performance.
The embodiments of the present disclosure do not limit the network structures adopted by the first image processing model and the second image processing model. The first image processing model and the second image processing model may employ the same type of network structure, or may employ different types of network structures.
In one possible implementation, both the first image processing model and the second image processing model are used for image classification.
As an example of this implementation, the first image processing model and the second image processing model may employ the same type of network structure. For example, a first image processing model may employ ResNet152, and a second image processing model may employ ResNet 18; as another example, the first image processing model may employ WRN40-2, and the second image processing model may employ WRN 16-2; as another example, the first image processing model may employ WRN40-2, and the second image processing model may employ WRN 40-1; as another example, the first image processing model may employ ResNet56, and the second image processing model may employ ResNet 20; as another example, a first image processing model may employ ResNet110, and a second image processing model may employ ResNet 20; as another example, a first image processing model may employ ResNet110, and a second image processing model may employ ResNet 32; as another example, the first image processing model may employ VGG13, and the second image processing model may employ VGG 8; as another example, the first image processing model may employ ResNet34, and the second image processing model may employ ResNet 18; and so on.
As another example of this implementation, the first image processing model and the second image processing model may employ different types of network structures. For example, the first image processing model may employ ResNet50 and the second image processing model may employ MobileNet V2.
In another possible implementation, both the first image processing model and the second image processing model are used for object detection. As an example of this implementation, the first image processing model and the second image processing model may employ RetinaNet, respectively. As another example of this implementation, the first image processing model and the second image processing model may employ fast-RCNN, respectively. In this implementation, the first image processing model and the second image processing model may employ the same type of network structure, or may employ different types of network structures. For example, the first image processing model may employ ResNet101-FPN, and the second image processing model may employ ResNet 50-FPN; as another example, the first image processing model may employ ResNet152-FPN, and the second image processing model may employ ResNet 50-FPN; the first image processing model may employ ResNet101-FPN, and the second image processing model may employ ResNet 18-FPN; and so on.
In the embodiment of the present disclosure, the first feature map may be a feature map output by a first network layer of the first image processing model, and the second feature map may be a feature map output by a second network layer of the second image processing model, where the first network layer and the second network layer are corresponding network layers in the first image processing model and the second image processing model. The number of the first network layers can be one or more than two, and correspondingly, the number of the first characteristic diagrams can also be one or more than two; the number of the second network layers may be one or more than two, and accordingly, the number of the second feature maps may also be one or more than two.
In one possible implementation, the first network layer may include an intermediate layer of the first image processing model, and the second network layer may include an intermediate layer of the second image processing model. For example, the first network layer includes a first intermediate layer of a first image processing model and the second network layer includes a second intermediate layer of a second image processing model. The first intermediate layer and the second intermediate layer are corresponding intermediate layers in the first image processing model and the second image processing model, and the first intermediate layer and the second intermediate layer can form an intermediate layer pair. Accordingly, a first profile output by the first intermediate level and a second profile output by the second intermediate level may form a profile pair. As an example of this implementation, the first intermediate layer may be the last layer before the down-sampling layer in the first image processing model, and the second intermediate layer may be the last layer before the down-sampling layer in the second image processing model. Of course, a person skilled in the art may flexibly select the first intermediate layer and the second intermediate layer according to the requirements of the actual application scenario, which is not limited herein.
In this implementation, T intermediate layer pairs may be selected from the first image processing model and the second image processing model, and the second image processing model may be trained based on T feature map pairs output by the T intermediate layer pairs, where T is an integer greater than or equal to 1. As an example of this implementation, T first feature maps output by the last layer before each of the last T down-sampling layers of the first image processing model may be obtained, T second feature maps output by the last layer before each of the last T down-sampling layers of the second image processing model may be obtained, and the first feature maps and the second feature maps form T feature map pairs.
As an example of this implementation, 2 intermediate layer pairs may be selected from the first image processing model and the second image processing model, the 2 intermediate layer pairs including: a first intermediate layer pair consisting of the last layer before the last down-sampling layer of the first image processing model and the last layer before the last down-sampling layer of the second image processing model, and a second intermediate layer pair consisting of the last layer before the penultimate down-sampling layer of the first image processing model and the last layer before the penultimate down-sampling layer of the second image processing model. The first feature map and the second feature map output by the two layers of the first intermediate layer pair form the feature map pair corresponding to the first intermediate layer pair, and the first feature map and the second feature map output by the two layers of the second intermediate layer pair form the feature map pair corresponding to the second intermediate layer pair.
As another example of this implementation, 3 intermediate layer pairs may be selected from the first image processing model and the second image processing model, the 3 intermediate layer pairs including: a first intermediate layer pair consisting of the last layer before the last down-sampling layer of the first image processing model and the last layer before the last down-sampling layer of the second image processing model, a second intermediate layer pair consisting of the last layer before the penultimate down-sampling layer of the first image processing model and the last layer before the penultimate down-sampling layer of the second image processing model, and a third intermediate layer pair consisting of the last layer before the third-to-last down-sampling layer of the first image processing model and the last layer before the third-to-last down-sampling layer of the second image processing model. The first feature map and the second feature map output by the two layers of each intermediate layer pair form the feature map pair corresponding to that intermediate layer pair.
As another example of this implementation, 1 intermediate layer pair may be selected from the first image processing model and the second image processing model, the intermediate layer pair being: a first intermediate layer pair consisting of the last layer before the last down-sampling layer of the first image processing model and the last layer before the last down-sampling layer of the second image processing model. The first feature map output by the last layer before the last down-sampling layer of the first image processing model may be recorded as F_1^t, the second feature map output by the last layer before the last down-sampling layer of the second image processing model may be recorded as F_1^s, and the first feature map pair, corresponding to the first intermediate layer pair, may be recorded as (F_1^t, F_1^s).
In another possible implementation, the first network layer may include an output layer of a first image processing model, and the second network layer may include an output layer of a second image processing model. In this implementation, the first image processing model and the second image processing model may both be used for feature extraction, i.e., the outputs of the first image processing model and the second image processing model may both be feature maps.
In another possible implementation, the first network layer may include an intermediate layer and an output layer of the first image processing model, and the second network layer may include an intermediate layer and an output layer of the second image processing model. In this implementation, the first image processing model and the second image processing model may both be used for feature extraction, i.e., the outputs of the first image processing model and the second image processing model may both be feature maps.
In a possible implementation manner, the generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map includes: determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map; and generating the third feature map according to the first feature block set and the second feature block set.
In this implementation, a feature block may represent an image block obtained by dividing a feature map. The first feature map is divided to obtain a feature block set corresponding to the first feature map, wherein any two feature blocks in the feature block set corresponding to the first feature map do not overlap with each other; the second feature map is divided to obtain a feature block set corresponding to the second feature map, wherein any two feature blocks in the feature block set corresponding to the second feature map do not overlap with each other. The number of channels of a feature block may be the same as the number of channels of the feature map. For example, if the number of channels of the first feature map is C_1, the number of channels of any feature block in the feature block set corresponding to the first feature map may also be C_1; if the number of channels of the second feature map is C_2, the number of channels of any feature block in the feature block set corresponding to the second feature map may also be C_2.
In one example, the first feature map F_i^t and the second feature map F_i^s may each be divided into feature blocks of size P_i × P_i. For example, if the first feature map F_i^t and the second feature map F_i^s both have a size of H_i × W_i, the first feature map F_i^t and the second feature map F_i^s may each be divided into M_i feature blocks, where M_i = (H_i × W_i)/(P_i × P_i). For example, if the first feature map F_i^t and the second feature map F_i^s both have C_i channels, each feature block obtained by the division may have a size of P_i × P_i and C_i channels. In one example, a convolution kernel of size P_i × P_i with a stride of P_i may be employed to divide the first feature map F_i^t and the second feature map F_i^s separately.
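A minimal sketch of this block partition, assuming a PyTorch tensor layout (B, C, H, W) with H and W divisible by the block size; using F.unfold with kernel size and stride P_i is equivalent to the non-overlapping sliding window described above:

```python
import torch
import torch.nn.functional as F

def to_feature_blocks(feature_map: torch.Tensor, patch: int) -> torch.Tensor:
    """Split (B, C, H, W) into (B, M, C, patch, patch) non-overlapping feature blocks,
    where M = (H * W) / (patch * patch)."""
    b, c, h, w = feature_map.shape
    assert h % patch == 0 and w % patch == 0
    # unfold extracts one column of C*patch*patch values per block position
    cols = F.unfold(feature_map, kernel_size=patch, stride=patch)   # (B, C*patch*patch, M)
    m = cols.shape[-1]
    return cols.transpose(1, 2).reshape(b, m, c, patch, patch)

blocks = to_feature_blocks(torch.randn(2, 256, 28, 28), patch=7)    # M = 16 blocks
print(blocks.shape)    # torch.Size([2, 16, 256, 7, 7])
```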
In this implementation, part of the feature blocks may be selected from the feature block set corresponding to the first feature map to form a first feature block set for generating a third feature map, and part of the feature blocks may be selected from the feature block set corresponding to the second feature map to form a second feature block set for generating the third feature map, wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary. After the first feature block set and the second feature block set are obtained, the first feature block set and the second feature block set may be processed to generate the third feature map. For example, a preset network may be employed to process the first feature block set and the second feature block set to generate the third feature map. For another example, the first feature block set and the second feature block set may be stitched to generate the third feature map.
In this implementation, a first feature block set used for generating a third feature map is determined from the first feature map, a second feature block set used for generating the third feature map is determined from the second feature map, and the third feature map is generated according to the first feature block set and the second feature block set, so that the first feature map and the second feature map are divided by taking the feature blocks as minimum units, and the third feature map is generated based on part of the feature blocks in the first feature map and part of the feature blocks in the second feature map, so that the second image processing model is trained by using a priori feature blocks provided by a teacher model, which helps to improve the precision of the trained second image processing model.
As an example of this implementation, the determining, from the first feature map, a first feature block set for generating a third feature map, and determining, from the second feature map, a second feature block set for generating the third feature map includes: determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map; determining a mask area according to the mask proportion; according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In one example, if the first feature map F_i^t and the second feature map F_i^s each comprise M_i feature blocks and the mask ratio is α_i, the mask region may include M_i × α_i feature blocks.
In this example, the mask region may be randomly determined according to the mask ratio. Alternatively, the mask region may be determined according to the mask ratio and a preset mask rule.
In this example, a mask ratio for merging the first feature map and the second feature map is determined according to the first feature map and the second feature map, a mask region is determined according to the mask ratio, a first feature block set for generating a third feature map is determined from the first feature map, and a second feature block set for generating the third feature map is determined from the second feature map, wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary. In other words, the mask ratio for merging the first feature map and the second feature map is determined based on the feature maps themselves rather than being a fixed value; that is, the mask ratio used for merging a feature map pair is dynamically adjusted according to the similarity information (e.g., similarity) of that feature map pair. For example, different feature map pairs output by different intermediate layer pairs have different similarity information and may therefore adopt different mask ratios; as another example, the feature map pairs output by the same intermediate layer pair in different training rounds will likely adopt different mask ratios. Adopting a mask ratio dynamically determined from the feature maps helps to reduce the gap between the second image processing model and the first image processing model, and helps to improve the accuracy of the trained second image processing model.
In one example, the determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map includes: determining similarity information between the first feature map and the second feature map; and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information. In this example, the similarity information between the first feature map and the second feature map may be any information that can represent the similarity between the first feature map and the second feature map. In this example, by determining similarity information between the first feature map and the second feature map and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information, the accuracy of the trained second image processing model can be improved.
In other examples, the mask ratio for merging the first feature map and the second feature map may also be determined according to a correlation between the first feature map and the second feature map, or may be determined according to other information of the first feature map and the second feature map, which is not limited herein.
In one example, the determining similarity information between the first feature map and the second feature map includes: determining a CKA (Centered Kernel Alignment) similarity index between the first feature map and the second feature map; and the determining a mask ratio for merging the first feature map and the second feature map according to the similarity information includes: determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In one example, Equation 1 may be used to determine the CKA similarity index between the first feature map F_i^t and the second feature map F_i^s:

CKA(X, Y) = HSIC(X, Y) / sqrt(HSIC(X, X) × HSIC(Y, Y))    (Equation 1)

where X and Y denote the Gram matrices obtained from the spatially flattened first feature map F_i^t and second feature map F_i^s respectively, and HSIC is computed with the unbiased estimator

HSIC(X, Y) = [ tr(X'Y') + (1^T X'1 × 1^T Y'1) / ((n-1)(n-2)) - (2/(n-2)) × 1^T X'Y'1 ] / (n(n-3)),

where X' represents the matrix obtained by setting the diagonal of X to 0, Y' represents the matrix obtained by setting the diagonal of Y to 0, 1 represents the all-ones column vector, n represents the product of the length H_i and the width W_i of the feature map, and tr denotes the trace of a matrix.

In one example, the mask ratio α_i for merging the first feature map F_i^t and the second feature map F_i^s may be determined using Equation 2:

α_i = 1 - CKA    (Equation 2).
In this example, by determining a CKA similarity index between the first feature map and the second feature map, and determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index, it is possible to help improve the accuracy of the trained second image processing model.
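A sketch of this computation, under the assumption that the CKA index is evaluated on linear-kernel Gram matrices of the spatially flattened, aligned feature maps (as in the reconstruction of Equation 1 above); all tensor shapes are illustrative:

```python
import torch

def _hsic_unbiased(k: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    """Unbiased HSIC estimator for n x n Gram matrices k and l."""
    n = k.shape[0]
    k = k.clone().fill_diagonal_(0)          # X' : zero the diagonal
    l = l.clone().fill_diagonal_(0)          # Y' : zero the diagonal
    ones = torch.ones(n, 1, dtype=k.dtype)
    term1 = torch.trace(k @ l)
    term2 = (ones.T @ k @ ones) * (ones.T @ l @ ones) / ((n - 1) * (n - 2))
    term3 = 2.0 / (n - 2) * (ones.T @ k @ l @ ones)
    return ((term1 + term2 - term3) / (n * (n - 3))).squeeze()

def cka_mask_ratio(f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """Return alpha_i = 1 - CKA(F_i^t, F_i^s) for aligned (C, H, W) feature maps."""
    x = f_t.flatten(1).T                      # (n, C) with n = H * W
    y = f_s.flatten(1).T
    kx, ky = x @ x.T, y @ y.T                 # linear-kernel Gram matrices
    cka = _hsic_unbiased(kx, ky) / torch.sqrt(
        _hsic_unbiased(kx, kx) * _hsic_unbiased(ky, ky))
    return 1.0 - cka

f_t, f_s = torch.randn(64, 14, 14), torch.randn(64, 14, 14)   # aligned C, H, W
alpha = cka_mask_ratio(f_t, f_s)              # dynamic mask ratio for this pair
```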
In other examples, the similarity information between the first feature map and the second feature map may also be measured by cosine similarity, and the like, which is not limited herein.
In one example, the determining similarity information between the first feature map and the second feature map includes: aligning the first feature map with the second feature map; and determining similarity information between the aligned first feature map and the aligned second feature map. In this example, aligning the first feature map with the second feature map may indicate that the number of channels and/or the size of the first feature map and the second feature map are the same. By aligning the first feature map and the second feature map and determining the similarity information between the aligned first feature map and the aligned second feature map, the similarity information between the first feature map and the second feature map can be determined more accurately.
In one example, the aligning the first feature map with the second feature map comprises: in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the second feature map so that the size of the bilinearly interpolated second feature map is the same as that of the first feature map; or, in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the first feature map so that the size of the bilinearly interpolated first feature map is the same as that of the second feature map.
In one example, in response to the number of channels of the second feature map being different from that of the first feature map, the second feature map may be convolved so that the convolved second feature map has the same number of channels as the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, the second feature map may be bilinearly interpolated so that the bilinearly interpolated second feature map has the same size as the first feature map. For example, if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different numbers of channels, the second feature map F_i^s may be convolved so that the convolved second feature map has the same number of channels as the first feature map F_i^t, where 1 ≤ i ≤ T; if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different sizes, the second feature map F_i^s may be bilinearly interpolated so that the bilinearly interpolated second feature map has the same size as the first feature map F_i^t.
In another example, in response to the number of channels of the second feature map being different from that of the first feature map, the first feature map may be convolved so that the convolved first feature map has the same number of channels as the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, the first feature map may be bilinearly interpolated so that the bilinearly interpolated first feature map has the same size as the second feature map. For example, if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different numbers of channels, the first feature map F_i^t may be convolved so that the convolved first feature map has the same number of channels as the second feature map F_i^s, where 1 ≤ i ≤ T; if the first feature map F_i^t and the second feature map F_i^s in the feature map pair (F_i^t, F_i^s) have different sizes, the first feature map F_i^t may be bilinearly interpolated so that the bilinearly interpolated first feature map has the same size as the second feature map F_i^s.
In this example, in response to the difference between the number of channels of the second feature map and the number of channels of the first feature map, the second feature map is convolved, and the number of channels of the convolved second feature map is made equal to the number of channels of the first feature map, thereby aligning the number of channels of the first feature map and the number of channels of the second feature map; performing bilinear interpolation on the second feature map in response to the fact that the second feature map and the first feature map are different in size, so that the size of the bilinear interpolated second feature map is the same as that of the first feature map, and therefore the sizes of the first feature map and the second feature map can be aligned; performing convolution processing on the first feature map in response to the fact that the second feature map and the first feature map have different channel numbers, and enabling the channel numbers of the convolution processed first feature map and the convolution processed second feature map to be the same, so that the channel numbers of the first feature map and the second feature map can be aligned; the sizes of the first feature map and the second feature map can be aligned by performing bilinear interpolation on the first feature map in response to the difference in size between the second feature map and the first feature map, and making the size of the bilinear interpolated first feature map the same as that of the second feature map.
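A minimal sketch of the first alignment variant (student aligned to teacher), assuming PyTorch; the 1×1 convolution used for channel matching is one possible choice and is trained together with the student model in this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignStudentToTeacher(nn.Module):
    """Align the student feature map F_i^s to the teacher feature map F_i^t
    in channel count (1x1 convolution) and in spatial size (bilinear interpolation)."""
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.proj = (nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
                     if student_channels != teacher_channels else nn.Identity())

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        f_s = self.proj(f_s)                                   # match channel counts
        if f_s.shape[-2:] != f_t.shape[-2:]:                   # match H x W
            f_s = F.interpolate(f_s, size=f_t.shape[-2:],
                                mode='bilinear', align_corners=False)
        return f_s

align = AlignStudentToTeacher(student_channels=256, teacher_channels=1024)
f_s_aligned = align(torch.randn(2, 256, 14, 14), torch.randn(2, 1024, 14, 14))
```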
In one example, the determining, according to the mask region, a first feature block set for generating a third feature map from the first feature map and a second feature block set for generating the third feature map from the second feature map includes: determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region; and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In one example, a first feature block set used for generating a third feature map may be determined from the first feature map corresponding to the position information of the mask region; and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map. That is, a first feature block set used for generating a third feature map may be determined according to the feature blocks belonging to the mask region in the first feature map, in accordance with the position information of the mask region; and corresponding to the position information outside the mask region, determining a second feature block set for generating the third feature map according to the feature blocks which do not belong to the mask region in the second feature map.
In another example, a first feature block set used for generating a third feature map may be determined from the first feature map corresponding to location information outside the mask region; and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map. That is, a first feature block set for generating a third feature map may be determined from the first feature map in accordance with position information other than the mask region; and determining a second feature block set used for generating the third feature map from the second feature map corresponding to the position information of the mask region.
In this example, a first feature block set for generating a third feature map is determined from the first feature map by position information corresponding to the mask region or position information corresponding to positions other than the mask region, and a feature block complementary to the position of the feature block in the first feature block set is selected from the second feature map to obtain a second feature block set for generating the third feature map, so that the feature block masked in the second feature map is filled with the feature block at the corresponding position in the first feature map, thereby providing part of a priori knowledge to the second image processing model as a student model through the first image processing model as a teacher model.
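A sketch of this complementary selection, assuming the block partition helper shown earlier and a boolean mask over block positions: teacher blocks are kept at the masked positions and student blocks everywhere else (or the reverse, for the second variant above).

```python
import torch

def split_complementary(teacher_blocks, student_blocks, num_masked: int):
    """teacher_blocks / student_blocks: (M, C, P, P) block sets of one sample.
    Returns (first_set, second_set, mask) with complementary block positions."""
    m = teacher_blocks.shape[0]
    mask = torch.zeros(m, dtype=torch.bool)
    mask[torch.randperm(m)[:num_masked]] = True       # randomly chosen mask region
    first_set = teacher_blocks[mask]                   # blocks of F_i^t inside the mask region
    second_set = student_blocks[~mask]                 # blocks of F_i^s outside the mask region
    return first_set, second_set, mask

teacher_blocks = torch.randn(16, 256, 7, 7)
student_blocks = torch.randn(16, 256, 7, 7)
alpha = 0.4                                            # mask ratio for this pair
first_set, second_set, mask = split_complementary(
    teacher_blocks, student_blocks, num_masked=int(round(16 * alpha)))
```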
As another example of this implementation, the determining, from the first feature map, a first feature block set used for generating a third feature map, and determining, from the second feature map, a second feature block set used for generating the third feature map includes: determining a mask area according to a preset mask proportion; according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map. In this example, the mask ratio may be a fixed value.
As another example of this implementation, the determining, from the first feature map, a first feature block set for generating a third feature map, and determining, from the second feature map, a second feature block set for generating the third feature map includes: randomly determining a first feature block set for generating a third feature map from the first feature map; determining a second feature block set used for generating the third feature map from the second feature map according to the position of the feature block in the first feature block set; wherein the positions of the feature blocks in the first feature block set and the second feature block set are complementary.
As another example of this implementation, the determining, from the first feature map, a first feature block set for generating a third feature map, and determining, from the second feature map, a second feature block set for generating the third feature map includes: dividing a feature block set corresponding to the first feature map into a plurality of first feature block subsets, wherein any one of the first feature block subsets comprises at least two adjacent feature blocks; respectively selecting a preset number of feature blocks from the plurality of first feature block subsets to obtain a first feature block set; and selecting a feature block complementary to the position of the feature block in the first feature block set from the feature block set corresponding to the second feature map to obtain a second feature block set. For example, each first feature block subset may include 4 adjacent feature blocks, and the preset number may be 1.
As an example of this implementation, the generating the third feature map according to the first feature block set and the second feature block set includes: respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set; coding the first feature block set by combining the first position information to obtain a first feature block coding result; combining the second position information to encode the second feature block set to obtain a second feature block encoding result; and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In this example, a position coding manner such as sine and cosine position coding may be adopted to perform position coding on the first feature block set to obtain first position information, and to perform position coding on the second feature block set to obtain second position information. The first position information may represent a position encoding result corresponding to the first feature block set, and the second position information may represent a position encoding result corresponding to the second feature block set.
In this example, a first coding network corresponding to the first image processing model may be used to encode the first feature block set to which the first position information has been added, so as to obtain a first feature block encoding result; a second coding network corresponding to the second image processing model may be used to encode the second feature block set to which the second position information has been added, so as to obtain a second feature block encoding result. For example, the first feature block encoding result may be denoted as f_i^t, and the second feature block encoding result may be denoted as f_i^s. In one example, the first coding network and the second coding network may each be a 6-layer multi-head self-attention network. Of course, the first coding network and the second coding network may also adopt other network structures, for example, the number of layers of the self-attention network may be smaller or larger, which is not limited herein. The parameters of the first coding network and the second coding network may be updated along with the training of the second image processing model, i.e., the first coding network and the second coding network may be trained together with the second image processing model.
In this example, after the first feature block encoding result and the second feature block encoding result are obtained, the first feature block encoding result and the second feature block encoding result may be combined according to the relative position information between the feature blocks in the second feature map to obtain a third feature map.
In this example, the first feature block set and the second feature block set are respectively position-encoded to obtain first position information corresponding to the first feature block set and second position information corresponding to the second feature block set, the first feature block set is encoded in combination with the first position information to obtain a first feature block encoding result, the second feature block set is encoded in combination with the second position information to obtain a second feature block encoding result, and a third feature map is generated according to the first feature block encoding result and the second feature block encoding result, so that the third feature map is generated in combination with the position information of the feature blocks in the first feature block set and the second feature block set, and the third feature map can include the position information of the feature blocks in the first feature map and the second feature map, therefore, the second image processing model can be trained by using the prior characteristics containing the position information provided by the teacher model, and the accuracy of the trained second image processing model is improved.
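One possible sketch of this step, under the following assumptions that are not stated in the disclosure: sine-cosine position codes indexed by block position, two 6-layer multi-head self-attention encoders built with nn.TransformerEncoder, block embeddings already flattened to vectors, and the merge done by scattering the encoded blocks back to their original positions:

```python
import math
import torch
import torch.nn as nn

def sincos_position_encoding(num_positions: int, dim: int) -> torch.Tensor:
    """Standard 1-D sine/cosine position codes, one row per block position."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def make_encoder(dim: int, layers: int = 6, heads: int = 8) -> nn.Module:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

dim, m = 256, 16                                    # block embedding dim, block count
first_encoder = make_encoder(dim)                   # coding network for the teacher blocks
second_encoder = make_encoder(dim)                  # coding network for the student blocks

pe = sincos_position_encoding(m, dim)               # shared position table
mask = torch.zeros(m, dtype=torch.bool); mask[:6] = True

first_tokens = torch.randn(1, int(mask.sum()), dim) + pe[mask]       # teacher block embeddings
second_tokens = torch.randn(1, int((~mask).sum()), dim) + pe[~mask]  # student block embeddings

f_t_enc = first_encoder(first_tokens)               # first feature block encoding result
f_s_enc = second_encoder(second_tokens)             # second feature block encoding result

# Merge back into a full-length token sequence: the third feature map in token form.
third = torch.zeros(1, m, dim)
third[:, mask] = f_t_enc
third[:, ~mask] = f_s_enc
```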
In another possible implementation manner, the generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map includes: determining a first pixel set used for generating a third feature map from the first feature map, and determining a second pixel set used for generating the third feature map from the second feature map, wherein the first pixel set represents a set of pixels used for generating the third feature map in the first feature map, and the second pixel set represents a set of pixels used for generating the third feature map in the second feature map; and generating the third feature map according to the first pixel set and the second pixel set. In this implementation, the first feature map and the second feature map may be divided in a pixel as a minimum unit.
In a possible implementation manner, the determining, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model includes: carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map; inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map; and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In this implementation, a position coding method such as sine and cosine position coding may be adopted to perform position coding on the third feature map, so as to obtain third position information corresponding to the third feature map. In this implementation, the decoding network may consist of 6 layers of multi-headed self-attention layers and one layer of multi-layered perceptrons. Of course, the decoding network may also adopt other network structures, and is not limited herein. The parameters of the decoding network may be updated with the training of the second image processing model, i.e. the decoding network may be trained together with the second image processing model. In this implementation, the value of the loss function corresponding to the second image processing model may be determined using the L2 loss or the L1 loss, or the like. In the case that there are at least two feature map pairs, the value of the loss function corresponding to the second image processing model may be determined from the at least two feature map pairs.
In this implementation, the third feature map is subjected to position coding to obtain third position information corresponding to the third feature map, the third feature map to which the third position information is added is input to a decoding network to obtain a fourth feature map, and a value of a loss function corresponding to the second image processing model is determined according to the fourth feature map and the first feature map, so that the second image processing model is trained to improve the accuracy of the second image processing model.
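A possible sketch of the decoding step, assuming the decoder is built from 6 self-attention layers followed by a per-token multi-layer perceptron and the reconstruction loss is the L2 distance to the teacher feature map; the exact architecture in the disclosure may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoder(nn.Module):
    """6 multi-head self-attention layers followed by a multi-layer perceptron head."""
    def __init__(self, dim: int, out_dim: int, heads: int = 8, layers: int = 6):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.attn = nn.TransformerEncoder(block, num_layers=layers)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, out_dim))

    def forward(self, third_tokens: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # third position information is added to the third feature map tokens
        return self.mlp(self.attn(third_tokens + pos))

decoder = FeatureDecoder(dim=256, out_dim=1024)      # out_dim = teacher block dimension
third = torch.randn(1, 16, 256)                      # third feature map (token form)
pos = torch.randn(1, 16, 256)                        # third position information
teacher_tokens = torch.randn(1, 16, 1024)            # first (teacher) feature map, flattened

fourth = decoder(third, pos)                         # fourth feature map
loss_fd = F.mse_loss(fourth, teacher_tokens)         # L2-style reconstruction loss
```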
In a possible implementation manner, the acquiring the first feature map and the second feature map of the training image includes: obtaining at least two first feature maps corresponding to the training images extracted by at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted by at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one feature map pair in the at least two feature map pairs comprises a first feature map and a second feature map; generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map, including: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair; determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map, including: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In this implementation, the second image processing model is trained using at least two pairs of feature maps. For any feature map pair, position coding may be performed on a third feature map corresponding to the feature map pair to obtain third position information corresponding to the third feature map, the third position information and the third feature map may be input to a decoding network to obtain a fourth feature map, and difference information between the fourth feature map and the first feature map in the feature map pair may be determined. Similarly, difference information between the fourth feature map and the first feature map corresponding to each feature map pair may be determined. After determining the difference information corresponding to each of the at least two feature map pairs, the value of the loss function corresponding to the second image processing model may be determined according to a weighted sum of the difference information corresponding to the at least two feature map pairs. In this implementation, the second image processing model is trained by using at least two feature map pairs output by at least two intermediate layer pairs of the first image processing model and the second image processing model, so that the second image processing model can learn richer hidden layer features of the first image processing model, and the difference between the second image processing model and the first image processing model can be further reduced.
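For the multi-pair case, the per-pair difference terms might be combined as sketched below; the weights and tensor shapes are hypothetical placeholders for the steps described above:

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(fourth_maps, teacher_maps, weights=None):
    """Weighted sum of per-pair differences between each fourth feature map and the
    corresponding first (teacher) feature map, one term per feature map pair."""
    if weights is None:
        weights = [1.0] * len(fourth_maps)
    loss = torch.zeros(())
    for w, fourth, teacher in zip(weights, fourth_maps, teacher_maps):
        loss = loss + w * F.mse_loss(fourth, teacher)
    return loss

# Two feature map pairs, e.g. from the last two down-sampling stages (shapes illustrative).
fourth_maps = [torch.randn(1, 16, 1024), torch.randn(1, 64, 512)]
teacher_maps = [torch.randn(1, 16, 1024), torch.randn(1, 64, 512)]
loss_1 = feature_distillation_loss(fourth_maps, teacher_maps, weights=[1.0, 1.0])
```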
In a possible implementation manner, during the training process of the second image processing model, supervision may additionally be performed in combination with a logits distillation method, so as to improve the distillation effect.
In a possible implementation manner, in the process of training the second image processing model, the second image processing model may be supervised by combining difference information between a prediction result of the second image processing model and annotation data corresponding to the training image, so as to improve the accuracy of the second image processing model.
In one possible implementation, the first image processing model and the second image processing model are both used for image classification; after the second image processing model training is completed, the method further comprises: acquiring an image to be classified; processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified; and processing the characteristic graph corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified. In this implementation, the image to be classified may be any image that needs to be classified. And processing the image to be classified through the trained second image processing model to obtain a feature map corresponding to the image to be classified, and processing the feature map corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified, so that the accuracy of image classification of the image to be classified can be improved.
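A brief usage sketch, assuming the trained second image processing model is an ordinary PyTorch classification model and that a preprocessing transform matching its training pipeline exists (both assumptions for illustration):

```python
import torch

def classify(student: torch.nn.Module, image: torch.Tensor) -> int:
    """Run the trained second image processing model on one image to be classified."""
    student.eval()
    with torch.no_grad():
        logits = student(image.unsqueeze(0))   # feature extraction + classification head
        return int(logits.argmax(dim=1).item())

# image = preprocess(raw_image)                # hypothetical preprocessing step
# label = classify(student, image)
```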
In one possible implementation, the first image processing model and the second image processing model are both used for target detection; after the second image processing model training is completed, the method further comprises: acquiring an image to be detected; processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected; and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected. In this implementation, the image to be detected may be any image that needs to be subjected to target detection. And processing the image to be detected through a second image processing model obtained through training to obtain a characteristic diagram corresponding to the image to be detected, and processing the characteristic diagram corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected, so that the accuracy of target detection on the image to be detected can be improved.
The training method of the image processing model provided by the embodiment of the disclosure can be applied to the technical fields of computer vision and the like.
The following describes a training method of an image processing model provided by an embodiment of the present disclosure through a specific application scenario. In this application scenario, the first image processing model may employ ResNet152 and the second image processing model may employ ResNet 18.
The training image may be input into the first image processing model to obtain the first feature map F_1^t output by the last layer before the last down-sampling layer of the first image processing model and the first feature map F_2^t output by the last layer before the penultimate down-sampling layer of the first image processing model. The training image may be input into the second image processing model to obtain the second feature map F_1^s output by the last layer before the last down-sampling layer of the second image processing model and the second feature map F_2^s output by the last layer before the penultimate down-sampling layer of the second image processing model. The first feature map F_1^t and the second feature map F_1^s form a first feature map pair (F_1^t, F_1^s), and the first feature map F_2^t and the second feature map F_2^s form a second feature map pair (F_2^t, F_2^s).
For the first feature map pair (F_1^t, F_1^s), in the case that the numbers of channels of F_1^t and F_1^s are different, F_1^s may be convolved so that the convolved F_1^s has the same number of channels as F_1^t. In the case that the sizes of F_1^t and F_1^s are different, F_1^s may be bilinearly interpolated so that the bilinearly interpolated F_1^s has the same size as F_1^t.
After F_1^t and F_1^s are aligned, F_1^t and F_1^s may each be divided into feature blocks of size P_1 × P_1. For example, if the aligned F_1^t and F_1^s both have a size of H_1 × W_1, F_1^t and F_1^s may each be divided into M_1 = (H_1 × W_1)/(P_1 × P_1) feature blocks.
The mask ratio α_1 for merging F_1^t and F_1^s may be determined according to Equations 1, 2 and 3 above. The mask region may be randomly determined according to the mask ratio α_1. The first feature block set used for generating the third feature map F_1 may be determined from the feature blocks of F_1^t that belong to the mask region, and the second feature block set used for generating F_1 may be determined from the feature blocks of F_1^s that do not belong to the mask region.
The first feature block set may be position-encoded to obtain first position information corresponding to the first feature block set, and a first coding network corresponding to the first image processing model may be used to encode the first feature block set to which the first position information has been added, so as to obtain the first feature block encoding result f_1^t. The second feature block set may be position-encoded to obtain second position information corresponding to the second feature block set, and a second coding network corresponding to the second image processing model may be used to encode the second feature block set to which the second position information has been added, so as to obtain the second feature block encoding result f_1^s. The first coding network and the second coding network may each be a 6-layer multi-head self-attention network. After f_1^t and f_1^s are obtained, f_1^t and f_1^s may be merged according to the relative position information between the feature blocks in the second feature map F_1^s to obtain the third feature map F_1.
F_1 may be position-encoded to obtain third position information corresponding to F_1, and F_1 with the third position information added may be input into a decoding network to obtain a fourth feature map, which may be recorded as F_1'. Similarly, for the second feature map pair (F_2^t, F_2^s), a fourth feature map F_2' can be obtained. According to the difference information between F_1' and F_1^t and the difference information between F_2' and F_2^t, the value of the first loss function L_1 corresponding to the second image processing model can be obtained.
In addition, a logits distillation method may be combined: the difference information between the logits output by the second image processing model and the logits output by the first image processing model is used to determine the value of a second loss function L_2 corresponding to the second image processing model, and the value of a third loss function L_3 corresponding to the second image processing model is obtained according to the difference information between the prediction result of the second image processing model and the annotation data corresponding to the training image.
In one example, Equation 4 may be used to determine the value of the loss function L corresponding to the second image processing model:

L = L_1 + α × L_2 + β × L_3    (Equation 4)

where α represents the weight corresponding to L_2, β represents the weight corresponding to L_3, and α and β may be determined empirically.
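A sketch of Equation 4 as reconstructed above; the temperature-scaled KL divergence used for the logits distillation term and the cross-entropy used for the ground-truth term are common choices assumed here, not details stated in the disclosure:

```python
import torch
import torch.nn.functional as F

def total_loss(loss_feature, student_logits, teacher_logits, labels,
               alpha: float = 1.0, beta: float = 1.0, temperature: float = 4.0):
    """L = L1 + alpha * L2 + beta * L3 (Equation 4, as reconstructed)."""
    # L2: logits distillation between student and teacher outputs
    loss_logits = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean') * temperature ** 2
    # L3: supervision from the annotation data of the training images
    loss_gt = F.cross_entropy(student_logits, labels)
    return loss_feature + alpha * loss_logits + beta * loss_gt

loss = total_loss(torch.tensor(0.5), torch.randn(4, 10), torch.randn(4, 10),
                  torch.randint(0, 10, (4,)))
```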
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with one another to form combined embodiments without departing from the underlying principles and logic; due to space limitations, the details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the methods of the above specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a training apparatus for an image processing model, an electronic device, a computer-readable storage medium, and a computer program product, which can be used to implement any one of the training methods for an image processing model provided by the present disclosure, and corresponding technical solutions and technical effects can be referred to in corresponding descriptions of the method sections and are not described again.
Fig. 2 shows a block diagram of a training apparatus for an image processing model provided in an embodiment of the present disclosure. As shown in fig. 2, the training apparatus for the image processing model includes:
an obtaining module 21, configured to obtain a first feature map and a second feature map of a training image, where the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
a generating module 22, configured to generate a third feature map according to a partial feature in the first feature map and a partial feature in the second feature map;
a determining module 23, configured to determine, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model;
a training module 24, configured to train the second image processing model according to the value of the loss function.
In one possible implementation, the generating module 22 is configured to:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
In one possible implementation, the generating module 22 is configured to:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask area according to the mask proportion;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
In one possible implementation, the generating module 22 is configured to:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
In one possible implementation, the generating module 22 is configured to:
determining a Centered Kernel Alignment (CKA) similarity index between the first feature map and the second feature map;
determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
In one possible implementation, the generating module 22 is configured to:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
In one possible implementation, the generating module 22 is configured to:
in response to the difference between the channel numbers of the second feature map and the first feature map, performing convolution processing on the second feature map to enable the channel numbers of the convolved second feature map and the first feature map to be the same, and/or in response to the difference between the sizes of the second feature map and the first feature map, performing bilinear interpolation on the second feature map to enable the size of the bilinear interpolated second feature map to be the same as the size of the first feature map;
alternatively,
and in response to the difference between the channel numbers of the second feature map and the first feature map, performing convolution processing on the first feature map to enable the channel numbers of the convoluted first feature map and the second feature map to be the same, and/or in response to the difference between the sizes of the second feature map and the first feature map, performing bilinear interpolation on the first feature map to enable the sizes of the bilinear interpolated first feature map and the second feature map to be the same.
In a possible implementation manner, the determining module 23 is configured to:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
In one possible implementation, the generating module 22 is configured to:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
coding the second feature block set by combining the second position information to obtain a second feature block coding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
In a possible implementation manner, the determining module 23 is configured to:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
In one possible implementation,
the obtaining module 21 is configured to: obtaining at least two first feature maps corresponding to the training images extracted by at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted by at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one feature map pair in the at least two feature map pairs comprises a first feature map and a second feature map;
the generating module 22 is configured to: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
the determining module 23 is configured to: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
In one possible implementation, the first image processing model and the second image processing model are both used for image classification;
the device further comprises:
the classification module is used for acquiring an image to be classified; processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified; and processing the characteristic graph corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
In one possible implementation, the first image processing model and the second image processing model are both used for target detection;
the device further comprises:
the target detection module is used for acquiring an image to be detected; processing the image to be detected through the second image processing model to obtain a characteristic diagram corresponding to the image to be detected; and processing the characteristic graph corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
In the embodiment of the disclosure, a first feature map and a second feature map of a training image are obtained, wherein the first feature map is output through a first image processing model and the second feature map is output through a second image processing model; a third feature map is generated according to part of the features in the first feature map and part of the features in the second feature map; a value of a loss function corresponding to the second image processing model is determined according to the third feature map and the first feature map; and the second image processing model is trained according to the value of the loss function. In this way, part of the feature map extracted by the first image processing model is used as prior knowledge for the second image processing model, so that the second image processing model learns to mimic the features output by the first image processing model. The performance of the second image processing model can therefore keep improving as the performance of the first image processing model improves, which alleviates the teacher-student gap problem in which the performance of the student model stops improving even as the performance of the teacher model improves. As a result, a first image processing model of larger scale and better performance can be used to distill a second image processing model with better performance.
In some embodiments, functions or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementations and technical effects thereof may refer to the description of the above method embodiments, which are not described herein again for brevity.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-described method. The computer-readable storage medium may be a non-volatile computer-readable storage medium, or may be a volatile computer-readable storage medium.
Embodiments of the present disclosure also provide a computer program, which includes computer readable code that, when run in an electronic device, causes a processor in the electronic device to execute the above method.
The disclosed embodiments also provide a computer program product comprising computer readable code, or a non-volatile computer readable storage medium carrying computer readable code, which, when run in an electronic device, causes a processor in the electronic device to perform the above method.
An embodiment of the present disclosure further provides an electronic device, including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the above-described methods.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 3 shows a block diagram of an electronic device 1900 provided by an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 3, electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple graphical-user-interface-based operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute the computer-readable program instructions and implement aspects of the present disclosure by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
If the technical solutions of the embodiments of the present disclosure involve personal information, products applying these technical solutions clearly inform users of the personal information processing rules and obtain the individual's separate consent before processing the personal information. If the technical solutions involve sensitive personal information, products applying these technical solutions obtain the individual's separate consent before processing the sensitive personal information and additionally satisfy the requirement of "express consent". For example, a clear and prominent sign may be set at a personal information collection device such as a camera to inform people that they are entering the personal information collection range and that personal information will be collected; if a person voluntarily enters the collection range, the person is regarded as consenting to the collection of his or her personal information. Alternatively, on a device that processes personal information, personal authorization may be obtained, with the personal information processing rules announced by means of prominent signs or notices, through pop-up messages or by asking the person to upload his or her personal information himself or herself. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (17)

1. A method for training an image processing model, comprising:
acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map;
and training the second image processing model according to the value of the loss function.
2. The method according to claim 1, wherein generating a third feature map from the partial features in the first feature map and the partial features in the second feature map comprises:
determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the first feature block set represents a set of feature blocks used for generating the third feature map in the first feature map, the first feature block set comprises partial features of the first feature map, the second feature block set represents a set of feature blocks used for generating the third feature map in the second feature map, and the second feature block set comprises partial features of the second feature map;
and generating the third feature map according to the first feature block set and the second feature block set.
3. The method of claim 2, wherein determining a first set of feature blocks from the first feature map for generating a third feature map and determining a second set of feature blocks from the second feature map for generating the third feature map comprises:
determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map;
determining a mask region according to the mask ratio;
according to the mask region, determining a first feature block set used for generating a third feature map from the first feature map, and determining a second feature block set used for generating the third feature map from the second feature map, wherein the positions of feature blocks in the first feature block set and the second feature block set are complementary.
5. The method of claim 3, wherein determining a mask ratio for merging the first feature map and the second feature map according to the first feature map and the second feature map comprises:
determining similarity information between the first feature map and the second feature map;
and determining a mask ratio for merging the first feature map and the second feature map according to the similarity information.
5. The method of claim 4,
the determining similarity information between the first feature map and the second feature map includes: determining a centered kernel alignment (CKA) similarity index between the first feature map and the second feature map;
determining a mask ratio for merging the first feature map and the second feature map according to the similarity information includes: determining a mask ratio for merging the first feature map and the second feature map according to the CKA similarity index.
6. The method according to claim 4 or 5, wherein the determining similarity information between the first feature map and the second feature map comprises:
aligning the first feature map with the second feature map;
and determining similarity information between the aligned first feature map and the aligned second feature map.
7. The method of claim 6, wherein said aligning the first feature map with the second feature map comprises:
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the second feature map so that the number of channels of the convolved second feature map is the same as that of the first feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the second feature map so that the size of the interpolated second feature map is the same as that of the first feature map;
or,
in response to the number of channels of the second feature map being different from that of the first feature map, performing convolution processing on the first feature map so that the number of channels of the convolved first feature map is the same as that of the second feature map, and/or, in response to the size of the second feature map being different from that of the first feature map, performing bilinear interpolation on the first feature map so that the size of the interpolated first feature map is the same as that of the second feature map.
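A minimal sketch of the first alternative of claim 7 (aligning the second feature map to the first) is given below; the 1×1 convolution is only one possible choice of "convolution processing", and in practice the projection would be a trainable module created once rather than built inside the function as in this shortened example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def align_second_to_first(f_second, f_first, proj=None):
    """Align the second feature map to the first one in channel count and spatial size."""
    if f_second.shape[1] != f_first.shape[1]:
        if proj is None:   # 1x1 convolution as the convolution processing
            proj = nn.Conv2d(f_second.shape[1], f_first.shape[1], kernel_size=1)
        f_second = proj(f_second)
    if f_second.shape[-2:] != f_first.shape[-2:]:
        f_second = F.interpolate(f_second, size=f_first.shape[-2:],
                                 mode='bilinear', align_corners=False)
    return f_second

# usage with mismatched channel counts and sizes
aligned = align_second_to_first(torch.randn(2, 128, 14, 14), torch.randn(2, 256, 28, 28))
```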
8. The method according to any one of claims 3 to 7, wherein the determining, according to the mask region, a first feature block set for generating a third feature map from the first feature map and a second feature block set for generating the third feature map from the second feature map comprises:
determining a first feature block set used for generating a third feature map from the first feature map corresponding to the position information of the mask region or corresponding to the position information outside the mask region;
and selecting feature blocks complementary to the positions of the feature blocks in the first feature block set from the second feature map to obtain a second feature block set for generating a third feature map.
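Claim 8 selects position-complementary feature blocks from the two maps according to the mask region. The sketch below shows one way this could be done with a random block-level mask; the block size, the random choice of the mask region, the zero-filling of unselected positions, and the assumption that the spatial size is divisible by the block size are all illustrative choices, not parts of the claim.

```python
import torch

def split_into_complementary_blocks(f_first, f_second, mask_ratio, block=4, generator=None):
    """Divide both (already aligned) maps into block x block feature blocks and hand each
    position to exactly one of the two maps: masked positions come from the first map,
    the remaining positions from the second map."""
    n, c, h, w = f_first.shape
    gh, gw = h // block, w // block                   # number of blocks along each axis
    num_blocks = gh * gw
    num_masked = int(round(mask_ratio * num_blocks))
    perm = torch.randperm(num_blocks, generator=generator)
    masked = torch.zeros(num_blocks, dtype=torch.bool)
    masked[perm[:num_masked]] = True                  # the "mask region" at block level
    masked = masked.view(gh, gw)
    # expand the block-level mask back to pixel resolution
    pixel_mask = masked.repeat_interleave(block, 0).repeat_interleave(block, 1)
    pixel_mask = pixel_mask.view(1, 1, h, w)
    first_set = f_first * pixel_mask                  # feature blocks taken from the first map
    second_set = f_second * (~pixel_mask)             # complementary blocks from the second map
    return first_set, second_set, pixel_mask
```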
9. The method according to any one of claims 2 to 8, wherein the generating the third feature map from the first set of feature blocks and the second set of feature blocks comprises:
respectively carrying out position coding on the first characteristic block set and the second characteristic block set to obtain first position information corresponding to the first characteristic block set and second position information corresponding to the second characteristic block set;
coding the first feature block set by combining the first position information to obtain a first feature block coding result;
combining the second position information to encode the second feature block set to obtain a second feature block encoding result;
and generating a third feature map according to the first feature block coding result and the second feature block coding result.
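One way to realise the position coding and encoding of claim 9 is sketched below with a small transformer encoder; the learned positional embeddings, the one-token-per-block layout, the summation of the two complementary token sets, and all hyper-parameters are assumptions of the sketch and are not mandated by the claim.

```python
import torch
import torch.nn as nn

class FeatureBlockMixer(nn.Module):
    """Encodes the two complementary feature block sets with positional information
    and fuses them into one token sequence representing the third feature map."""
    def __init__(self, channels, block=4, num_positions=64):
        super().__init__()
        d = channels * block * block                       # one token per feature block
        self.pos_embed = nn.Embedding(num_positions, d)    # position coding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2)
        self.block, self.channels = block, channels

    def blocks_to_tokens(self, fmap):
        n, c, h, w = fmap.shape
        gh, gw = h // self.block, w // self.block
        tokens = fmap.unfold(2, self.block, self.block).unfold(3, self.block, self.block)
        return tokens.permute(0, 2, 3, 1, 4, 5).reshape(n, gh * gw, -1)

    def forward(self, first_set, second_set):
        t1 = self.blocks_to_tokens(first_set)              # first feature block set
        t2 = self.blocks_to_tokens(second_set)             # second feature block set
        pos = torch.arange(t1.shape[1], device=t1.device)
        e1 = t1 + self.pos_embed(pos)                      # encoding with first position information
        e2 = t2 + self.pos_embed(pos)                      # encoding with second position information
        return self.encoder(e1 + e2)                       # token form of the third feature map

mixer = FeatureBlockMixer(channels=64, block=4, num_positions=64)
third_tokens = mixer(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```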
10. The method according to any one of claims 1 to 9, wherein determining the value of the loss function corresponding to the second image processing model according to the third feature map and the first feature map comprises:
carrying out position coding on the third feature map to obtain third position information corresponding to the third feature map;
inputting the third position information and the third feature map into a decoding network to obtain a fourth feature map;
and determining the value of the loss function corresponding to the second image processing model according to the fourth feature map and the first feature map.
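Claim 10 decodes the third feature map, together with its position information, into a fourth feature map and compares it with the first feature map. The sketch below continues the token layout of the previous example; the two-layer MLP decoder and the MSE loss are illustrative choices only, not the decoding network or loss function specified by the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockDecoder(nn.Module):
    """Decodes the mixed token sequence back into a dense (fourth) feature map."""
    def __init__(self, channels, block=4, num_positions=64):
        super().__init__()
        d = channels * block * block
        self.pos_embed = nn.Embedding(num_positions, d)    # third position information
        self.decoder = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.channels, self.block = channels, block

    def forward(self, mixed_tokens, out_hw):
        n, l, _ = mixed_tokens.shape
        pos = torch.arange(l, device=mixed_tokens.device)
        decoded = self.decoder(mixed_tokens + self.pos_embed(pos))
        h, w = out_hw
        gh, gw = h // self.block, w // self.block
        fourth = decoded.view(n, gh, gw, self.channels, self.block, self.block)
        return fourth.permute(0, 3, 1, 4, 2, 5).reshape(n, self.channels, h, w)

def distillation_loss(fourth_feature_map, first_feature_map):
    return F.mse_loss(fourth_feature_map, first_feature_map)

tokens = torch.randn(1, 64, 64 * 4 * 4)                   # e.g. output of the encoder sketch above
decoder = BlockDecoder(channels=64, block=4, num_positions=64)
fourth_map = decoder(tokens, out_hw=(32, 32))              # fourth feature map
loss_value = distillation_loss(fourth_map, torch.randn(1, 64, 32, 32))   # vs. the first feature map
```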
11. The method according to any one of claims 1 to 10,
the acquiring of the first feature map and the second feature map of the training image includes: obtaining at least two first feature maps corresponding to the training images extracted by at least two first intermediate layers of the first image processing model and at least two second feature maps corresponding to the training images extracted by at least two second intermediate layers of the second image processing model, wherein the at least two first feature maps and the at least two second feature maps form at least two feature map pairs, and any one feature map pair in the at least two feature map pairs comprises a first feature map and a second feature map;
generating a third feature map according to the partial features in the first feature map and the partial features in the second feature map, including: for any one of the at least two feature map pairs, generating a third feature map corresponding to the feature map pair according to a partial feature in a first feature map of the feature map pair and a partial feature in a second feature map of the feature map pair;
determining a value of a loss function corresponding to the second image processing model according to the third feature map and the first feature map, including: and determining the value of the loss function corresponding to the second image processing model according to the third feature map corresponding to the at least two feature map pairs and the first feature map in the at least two feature map pairs.
12. The method of any of claims 1 to 11, wherein the first image processing model and the second image processing model are both used for image classification;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be classified;
processing the image to be classified through the second image processing model to obtain a feature map corresponding to the image to be classified;
and processing the feature map corresponding to the image to be classified through the second image processing model to obtain a classification result corresponding to the image to be classified.
13. The method according to any of claims 1 to 11, wherein the first image processing model and the second image processing model are both used for object detection;
after the second image processing model training is completed, the method further comprises:
acquiring an image to be detected;
processing the image to be detected through the second image processing model to obtain a feature map corresponding to the image to be detected;
and processing the feature map corresponding to the image to be detected through the second image processing model to obtain a target detection result corresponding to the image to be detected.
14. An apparatus for training an image processing model, comprising:
the acquisition module is used for acquiring a first feature map and a second feature map of a training image, wherein the first feature map is output through a first image processing model, and the second feature map is output through a second image processing model;
a generating module, configured to generate a third feature map according to a partial feature in the first feature map and a partial feature in the second feature map;
a determining module, configured to determine, according to the third feature map and the first feature map, a value of a loss function corresponding to the second image processing model;
and the training module is used for training the second image processing model according to the value of the loss function.
15. An electronic device, comprising:
one or more processors;
a memory for storing executable instructions;
wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the method of any one of claims 1 to 13.
16. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 13.
17. A computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code which, when run in an electronic device, causes a processor in the electronic device to perform the method of any of claims 1 to 13.
CN202210647722.0A 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model Pending CN114998694A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210647722.0A CN114998694A (en) 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210647722.0A CN114998694A (en) 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model

Publications (1)

Publication Number Publication Date
CN114998694A true CN114998694A (en) 2022-09-02

Family

ID=83033251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210647722.0A Pending CN114998694A (en) 2022-06-08 2022-06-08 Method, apparatus, device, medium and program product for training image processing model

Country Status (1)

Country Link
CN (1) CN114998694A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024066111A1 (en) * 2022-09-28 2024-04-04 北京大学 Image processing model training method and apparatus, image processing method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
CN112184738B (en) Image segmentation method, device, equipment and storage medium
CN108427939B (en) Model generation method and device
US11514263B2 (en) Method and apparatus for processing image
CN112712069B (en) Question judging method and device, electronic equipment and storage medium
CN112800276B (en) Video cover determining method, device, medium and equipment
CN111062964A (en) Image segmentation method and related device
CN112381717A (en) Image processing method, model training method, device, medium, and apparatus
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN112631947A (en) Application program test control method and device, electronic equipment and storage medium
CN114170468A (en) Text recognition method, storage medium and computer terminal
US20230351558A1 (en) Generating an inpainted image from a masked image using a patch-based encoder
CN114998694A (en) Method, apparatus, device, medium and program product for training image processing model
CN109816023B (en) Method and device for generating picture label model
CN114969512A (en) Object recommendation method and device and electronic equipment
CN114913061A (en) Image processing method and device, storage medium and electronic equipment
CN111898338B (en) Text generation method and device and electronic equipment
CN113610034A (en) Method, device, storage medium and electronic equipment for identifying person entity in video
CN115546769B (en) Road image recognition method, device, equipment and computer readable medium
CN114627353B (en) Image description generation method, device, equipment, medium and product
CN115757933A (en) Recommendation information generation method, device, equipment, medium and program product
CN115375656A (en) Training method, segmentation method, device, medium, and apparatus for polyp segmentation model
CN115115836A (en) Image recognition method, image recognition device, storage medium and electronic equipment
CN115272667A (en) Farmland image segmentation model training method and device, electronic equipment and medium
CN111353536B (en) Image labeling method and device, readable medium and electronic equipment
CN115100417A (en) Image processing method, storage medium, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination