WO2024066111A1 - Training of an image processing model, image processing method, apparatus, device and medium - Google Patents

Training of an image processing model, image processing method, apparatus, device and medium

Info

Publication number: WO2024066111A1 (PCT/CN2022/143756)
Authority: WIPO (PCT)
Prior art keywords: image, feature, image processing, loss, model
Application number: PCT/CN2022/143756
Other languages: English (en), French (fr)
Inventors: 王勇涛, 刘子炜
Original Assignee: 北京大学
Application filed by 北京大学
Publication of WO2024066111A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Definitions

  • the present application belongs to the field of computer vision technology, and relates to deep learning technologies such as computer vision, neural network model compression, and neural network knowledge distillation based on intermediate features, and in particular to the training of an image processing model, an image processing method, an apparatus, a device, and a medium.
  • in order to obtain models with low complexity that are easy to deploy, neural network model compression technology is needed.
  • Knowledge distillation is an important method in the current neural network model compression technology. This method uses a large-scale neural network as a teacher network and a small-scale neural network as a student network. The knowledge of the teacher network is transferred to the student network, thereby obtaining a neural network with low complexity, good performance, and easy deployment, thereby achieving the purpose of model compression.
  • the mainstream knowledge distillation methods are divided into output response-based and intermediate feature-based knowledge distillation.
  • the output response-based knowledge distillation method uses the prediction results of the teacher model's tail layer as supervision information to guide the student model to imitate the teacher model's behavior.
  • the intermediate feature-based knowledge distillation method uses the features of the teacher model's intermediate hidden layer as supervision signals to guide the student model training.
  • the present application proposes an image processing model training, image processing method, device, equipment and medium, which can be used to improve the training effect of the image processing model.
  • a knowledge distillation method based on learnable feature transformation, as shown in FIG1, comprises the following steps:
  • the student model is trained to achieve knowledge distillation.
  • the multi-layer perceptron module is a multi-layer perceptron structure with 1 hidden layer and a ReLU activation function.
  • the second feature map is aligned with the first feature map in terms of spatial dimension and channel dimension through bilinear interpolation and 1×1 convolution.
  • downstream tasks of the student model are obtained, the objective function of the model is matched according to the downstream task type, and the objective function and the knowledge distillation loss function are combined to train the student model.
  • the hyperparameters of the distillation loss function are adjusted according to the teacher model, the student model, and the downstream task, the regression loss function, the classification loss function, and the knowledge distillation loss function in the objective function are summed to obtain the total loss function of the student model training, and the student model is trained according to the total loss function.
  • the present application provides a knowledge distillation method based on learnable feature transformation, which aligns the features of the teacher model and the student model to improve the distillation effect. At the same time, it does not need to design complex feature transformation modules for different tasks, does not introduce complex hyperparameters, and eliminates tedious parameter adjustment steps. It improves the versatility of knowledge distillation in multiple tasks and can achieve good results in a variety of computer vision tasks.
  • the present application embodiment provides a method for training an image processing model, the method comprising:
  • the training loss is used to update the parameters of the student image processing model to obtain a target image processing model.
  • acquiring a feature difference loss based on a difference between the first image feature and the third image feature includes:
  • the feature difference loss is obtained based on the first difference loss and the second difference loss.
  • the step of obtaining the second image feature output by the student image processing model includes:
  • the obtaining of the training loss based on the feature difference loss includes:
  • the training loss is obtained based on the feature difference loss and the processing result loss.
  • the student image processing model is used to process the image to match the computer vision task, and the obtaining of the second image feature output by the student image processing model and the prediction processing result include:
  • the obtaining of the processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image includes:
  • the processing result loss is obtained based on the difference between the predicted processing result matching the computer vision task and the standard processing result corresponding to the sample image and matching the computer vision task.
  • the computer vision task includes an image classification task
  • the predicted processing result matching the computer vision task includes a predicted classification result
  • the standard processing result matching the computer vision task includes a standard classification result
  • the processing result loss is obtained based on a difference between the predicted classification result and the standard classification result
  • the computer vision task includes a semantic segmentation task, the predicted processing result matched with the computer vision task includes a predicted segmentation result, the standard processing result matched with the computer vision task includes a standard segmentation result, and the processing result loss is obtained based on the predicted segmentation result and the standard segmentation result; or,
  • the computer vision task includes a target detection task, the prediction processing result matching the computer vision task includes a detection position prediction result and a detection category prediction result, the standard processing result matching the computer vision task includes a detection position standard result and a detection category standard result, and the processing result loss is obtained based on the difference between the detection position prediction result and the detection position standard result, and the difference between the detection category prediction result and the detection category standard result.
  • the updating the parameters of the student image processing model using the training loss to obtain a target image processing model includes:
  • the updated student image processing model is trained based on the updated first feature transformation model to obtain the target image processing model.
  • aligning the second image feature with the first image feature to obtain an aligned image feature includes:
  • the number of channels of the intermediate image feature is aligned with the number of channels of the first image feature through channel transformation convolution to obtain the aligned image feature.
  • the present application also provides an image processing method, the method comprising:
  • the target image is input into a target image processing model to obtain a target processing result output by the target image processing model; wherein the target image processing model is trained using any of the above-mentioned image processing model training methods.
  • the present application also provides a training device for an image processing model, the device comprising:
  • a first acquisition unit used for acquiring a sample image
  • a second acquisition unit is used to input the sample image into the teacher image processing model to obtain the first image feature output by the teacher image processing model;
  • a third acquisition unit configured to input the sample image into the student image processing model, acquire a second image feature output by the student image processing model, and align the second image feature with the first image feature to obtain an aligned image feature;
  • a transformation unit configured to transform the aligned image features using a first feature transformation model to obtain a third image feature, wherein parameters of the first feature transformation model are learned based on a training process of an image processing model;
  • a fourth acquisition unit configured to acquire a feature difference loss based on a difference between the first image feature and the third image feature; and acquire a training loss based on the feature difference loss;
  • An updating unit is used to update the parameters of the student image processing model using the training loss to obtain a target image processing model.
  • the fourth acquisition unit is used to keep the number of channels of the aligned image feature unchanged, adjust the size of the aligned image feature from a first size to a second size, and obtain an adjusted image feature; transform the adjusted image feature using a second feature transformation model to obtain a transformed image feature, and restore the size of the transformed image feature from the second size to the first size to obtain a fourth image feature, wherein the parameters of the second feature transformation model are learned based on the training process of the image processing model; obtain a first difference loss based on the difference between the first image feature and the third image feature; obtain a second difference loss based on the difference between the first image feature and the fourth image feature; and obtain the feature difference loss based on the first difference loss and the second difference loss.
  • the third acquisition unit is used to acquire the second image feature and the prediction processing result output by the student image processing model
  • the fourth acquisition unit is used to obtain the processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image; and to obtain the training loss based on the feature difference loss and the processing result loss.
  • the student image processing model is used to process the image to match the computer vision task
  • the third acquisition unit is used to obtain the second image feature output by the student image processing model and the prediction processing result matching the computer vision task
  • the fourth acquisition unit is used to acquire the processing result loss based on the difference between the predicted processing result matching the computer vision task and the standard processing result corresponding to the sample image and matching the computer vision task.
  • the computer vision task includes an image classification task
  • the predicted processing result matching the computer vision task includes a predicted classification result
  • the standard processing result matching the computer vision task includes a standard classification result
  • the processing result loss is obtained based on the difference between the predicted classification result and the standard classification result
  • the computer vision task includes a semantic segmentation task, the predicted processing result matched with the computer vision task includes a predicted segmentation result, the standard processing result matched with the computer vision task includes a standard segmentation result, and the processing result loss is obtained based on the predicted segmentation result and the standard segmentation result; or,
  • the computer vision task includes a target detection task, the prediction processing result matching the computer vision task includes a detection position prediction result and a detection category prediction result, the standard processing result matching the computer vision task includes a detection position standard result and a detection category standard result, and the processing result loss is obtained based on the difference between the detection position prediction result and the detection position standard result, and the difference between the detection category prediction result and the detection category standard result.
  • the updating unit is used to update the parameters of the student image processing model using the training loss to obtain an updated student image processing model; if the current training process does not meet the training termination condition, the parameters of the first feature transformation model are updated using the feature difference loss to obtain an updated first feature transformation model; and the updated student image processing model is trained based on the updated first feature transformation model to obtain the target image processing model.
  • the third acquisition unit is used to align the size of the second image feature with the size of the first image feature through linear interpolation to obtain an intermediate image feature; and align the number of channels of the intermediate image feature with the number of channels of the first image feature through channel transformation convolution to obtain the aligned image feature.
  • the present application also provides an image processing device, the device comprising:
  • a first acquisition unit used for acquiring a target image to be processed
  • the second acquisition unit is used to input the target image into a target image processing model to obtain a target processing result output by the target image processing model; wherein the target image processing model is trained using any of the above-mentioned image processing model training methods.
  • An embodiment of the present application also provides a computer device, which includes a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor so that the computer device implements any of the above-mentioned image processing model training methods or image processing methods.
  • a computer-readable storage medium in which at least one computer program is stored.
  • the at least one computer program is loaded and executed by a processor so that a computer implements any of the above-mentioned image processing model training methods or image processing methods.
  • a computer program product which includes a computer program or computer instructions, and the computer program or computer instructions are loaded and executed by a processor so that a computer implements any of the above-mentioned image processing model training methods or image processing methods.
  • the technical solution provided in the embodiment of the present application first transforms the aligned image features using the first feature transformation model, and then compares the third image features obtained after the transformation with the first image features output by the teacher image processing model to obtain the training loss.
  • the student image processing model is trained using this training loss, which can make the third image features as close to the first image features as possible.
  • the third image features and the aligned image features have been transformed by the first feature transformation model, even if the third image features are very close to the first image features, it is possible to ensure that there is a certain gap between the aligned image features and the first image features, thereby avoiding the problem of the second image features output by the student image processing model overfitting the first image features output by the teacher image processing model, and helping the student image processing model to have more learning space to focus on the characteristics of its own model while learning the image features output by the teacher image processing model, thereby improving the training effect of the student image processing model.
  • the parameters of the first feature transformation model are parameters learned based on the training process of the image processing model, thereby ensuring the matching degree between the feature transformation process and the model training process, thereby ensuring the reliability of the feature transformation, improving the reliability of the training loss, and further improving the training effect of the image processing model.
  • FIG1 is a schematic diagram of a process of a knowledge distillation method based on learnable feature transformation in the present application
  • FIG2 is a schematic diagram of the training process architecture of the student model of the embodiment of the present application.
  • FIG3 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • FIG4 is a flow chart of a method for training an image processing model provided in an embodiment of the present application.
  • FIG5 is a flow chart of an image processing method provided by an embodiment of the present application.
  • FIG6 is a schematic diagram of a training device for an image processing model provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of an image processing device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
  • RetinaNet-RX101 pre-trained on this dataset is used as the teacher model, and RetinaNet-R50 is selected as the student model, to illustrate how to perform knowledge distillation on the object detection task through the learnable transformation module, as shown in Figure 1.
  • Step S1 inputting input data into a teacher model to obtain a first feature map output by an intermediate layer of the teacher model, and inputting the input data into a student model to obtain a second feature map output by an intermediate layer of the student model, specifically comprising:
  • S11 Input any batch of original training images into the teacher model RetinaNet-RX101, and obtain the first feature map output by an intermediate layer in the FPN part of the teacher model.
  • S12 Input the training image into the student model RetinaNet-R50, and obtain the second feature map output by the intermediate layer in the FPN part of the student model.
  • Step S2 using a multi-layer perceptron module to obtain a third feature map and a fourth feature map, specifically including:
  • S21 Align the second feature map with the first feature map in terms of spatial dimension and channel dimension through bilinear interpolation and 1×1 convolution to obtain an aligned feature map.
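  • as a concrete illustration of this alignment step, the following is a minimal PyTorch sketch (module and tensor names, channel counts and sizes are illustrative assumptions, not taken from the application) that aligns a student feature map to the teacher's spatial size via bilinear interpolation and to the teacher's channel count via a 1×1 convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    """Aligns a student feature map to the teacher's spatial size and channel count."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # A 1x1 convolution only changes the number of channels, not the spatial size.
        self.channel_proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # Spatial alignment: bilinear interpolation to the teacher's (H, W).
        aligned = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        # Channel alignment: 1x1 convolution to the teacher's channel count.
        return self.channel_proj(aligned)

# Illustrative shapes only: one student FPN level [2, 128, 32, 32] vs teacher [2, 256, 64, 64].
student_feat = torch.randn(2, 128, 32, 32)
teacher_feat = torch.randn(2, 256, 64, 64)
aligned_feat = FeatureAligner(128, 256)(student_feat, teacher_feat)
print(aligned_feat.shape)  # torch.Size([2, 256, 64, 64])
```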
  • Step S3 According to the first feature map, the third feature map and the fourth feature map, the spatial feature loss and the channel feature loss between the teacher model and the student model are calculated, and the weighted sum of the spatial feature loss and the channel feature loss is used as the knowledge distillation loss function between the teacher model and the student model, specifically including:
  • feat_T is the first feature map
  • the weighting hyperparameters of the spatial feature loss and the channel feature loss are set to 2e-5 and 1e-6 respectively in this embodiment.
  • Step S4 According to the knowledge distillation loss function, the student model is trained to achieve knowledge distillation.
  • the training process architecture of the student model can be shown in Figure 2.
  • the input image is input into the teacher model to obtain the first feature map (also known as the teacher feature) output by the intermediate layer of the teacher model, and the input image is input into the student model to obtain the second feature map (also known as the student feature) output by the intermediate layer of the student model.
  • the student feature is aligned with the teacher feature, and the aligned feature map is passed through a multi-layer perceptron to obtain a third feature map; the shape of the aligned feature map is adjusted by expansion and transposition operations, and the adjusted feature map is passed through another multi-layer perceptron to obtain a transformed feature map, and then the shape of the transformed feature map is restored to the original shape to obtain a fourth feature map.
  • based on the third feature map and the teacher feature, the spatial feature distillation loss is obtained; based on the fourth feature map and the teacher feature, the channel feature distillation loss is obtained; the channel feature distillation loss and the spatial feature distillation loss are weighted and summed to obtain the distillation loss; based on the distillation loss, the knowledge of the teacher model is transferred to the student model to realize the training of the student model.
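  • the two learnable transformation branches and the weighted distillation loss can be sketched as follows in PyTorch; this is a minimal interpretation that assumes the spatial branch applies a one-hidden-layer MLP along the flattened spatial axis and the channel branch applies another MLP along the channel axis after the expand-and-transpose reshaping, with module names, hidden sizes and the default weights (2e-5 and 1e-6, as stated in this embodiment) being illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerMLP(nn.Module):
    """One-hidden-layer perceptron with ReLU, applied to the last tensor dimension."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def distillation_loss(aligned_feat, teacher_feat, spatial_mlp, channel_mlp,
                      alpha=2e-5, beta=1e-6):
    n, c, h, w = aligned_feat.shape
    # Spatial branch: MLP over the flattened H*W axis (per channel) -> third feature map.
    spatial_in = aligned_feat.reshape(n, c, h * w)
    third = spatial_mlp(spatial_in).reshape(n, c, h, w)
    # Channel branch: reshape to [N, 1, H*W, C], MLP over C, restore -> fourth feature map.
    channel_in = aligned_feat.reshape(n, c, h * w).transpose(1, 2).unsqueeze(1)
    fourth = channel_mlp(channel_in).squeeze(1).transpose(1, 2).reshape(n, c, h, w)
    # Weighted sum of the spatial and channel feature losses.
    loss_spatial = F.mse_loss(third, teacher_feat)
    loss_channel = F.mse_loss(fourth, teacher_feat)
    return alpha * loss_spatial + beta * loss_channel

# Illustrative usage with a fixed 64x64, 256-channel feature map.
teacher_feat = torch.randn(2, 256, 64, 64)
aligned_feat = torch.randn(2, 256, 64, 64, requires_grad=True)
spatial_mlp = TwoLayerMLP(dim=64 * 64, hidden=1024)
channel_mlp = TwoLayerMLP(dim=256, hidden=1024)
print(distillation_loss(aligned_feat, teacher_feat, spatial_mlp, channel_mlp))
```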
  • a downstream task of the student model is obtained.
  • the downstream task is a target detection task.
  • Step S5 Match the model objective function according to the downstream task type.
  • the objective function of the model is divided into a regression loss function and a classification loss function.
  • the regression loss function expression is:
  • t_i is the predicted deviation between each anchor and the Ground Truth (GT), and t_i* is the true deviation between each anchor and the GT.
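  • RetinaNet conventionally uses a smooth L1 loss over these deviations; an assumed reconstruction of the regression loss in that standard form is:

$$ L_{reg} = \sum_i \mathrm{smooth}_{L_1}\!\left(t_i - t_i^{*}\right), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} $$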
  • the classification loss function adopts Focal Loss, which is expressed as:
  • p_t is the probability that the sample is correctly classified
  • α_t and γ are hyperparameters, which are set to 0.25 and 2.0 respectively in this embodiment.
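  • with p_t, α_t and γ as defined above, the standard form of the Focal Loss is:

$$ FL(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t) $$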
  • Step S6 According to the teacher model, the student model, and the downstream task, the hyperparameters of the distillation loss function are adjusted, and the objective function, the knowledge distillation loss function, and the hyperparameters are used to obtain the total loss function of the student model training; the student model is trained according to the total loss function, wherein the expression of the total loss function is:
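  • a plausible form of this total loss, assuming the distillation weights are already absorbed into the distillation term as described in step S3, is:

$$ L_{total} = L_{cls} + L_{reg} + L_{distill} $$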
  • the results on the ImageNet dataset show that using ResNet34 as the teacher model and ResNet18 as the student model, and adopting the distillation method proposed in this application for knowledge distillation, the Top-1 accuracy on the test set can be improved from 69.9% to 71.4%;
  • the results on the MSCOCO dataset show that using RetinaNet-RX101 as the teacher model and RetinaNet-R50 as the student model, and adopting the knowledge distillation method proposed in this application, the mAP of the student model can be improved from 37.4% to 41.0%;
  • the results on the CityScapes dataset show that, using PSPNet-ResNet34 as the teacher model and PSPNet-ResNet18 as the student model, the knowledge distillation method proposed in this application can improve the mIoU of the student model from 69.9% to 74.2% (Note: ImageNet is a large-scale image classification dataset, and Top-1 accuracy is used to evaluate classification performance; Cifar100 is a small-scale image classification dataset).
  • FIG3 shows a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment includes: a terminal 11 and a server 12 .
  • the training method of the image processing model provided in the embodiment of the present application can be executed by the terminal 11, can be executed by the server 12, or can be jointly executed by the terminal 11 and the server 12, and the embodiment of the present application does not limit this.
  • the server 12 undertakes the main computing work and the terminal 11 undertakes the secondary computing work; or, the server 12 undertakes the secondary computing work and the terminal 11 undertakes the main computing work; or, the server 12 and the terminal 11 adopt a distributed computing architecture for collaborative computing.
  • the image processing method provided in the embodiment of the present application can be executed by the terminal 11, can be executed by the server 12, or can be executed jointly by the terminal 11 and the server 12, and the embodiment of the present application does not limit this.
  • the server 12 undertakes the main computing work and the terminal 11 undertakes the secondary computing work; or, the server 12 undertakes the secondary computing work and the terminal 11 undertakes the main computing work; or, the server 12 and the terminal 11 adopt a distributed computing architecture for collaborative computing.
  • the execution device of the training method of the image processing model and the execution device of the image processing method may be the same as or different from each other, and the embodiments of the present application are not limited to this.
  • the terminal 11 may be any electronic product that can interact with a user through one or more methods such as a keyboard, a touch pad, a touch screen, a remote control, voice interaction or a handwriting device, such as a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a PPC (Pocket PC), a tablet computer, a smart in-vehicle device, a smart TV, a smart speaker, a smart voice interaction device, a smart home appliance, an in-vehicle terminal, etc.
  • the server 12 may be a server, or a server cluster composed of multiple servers, or a cloud computing service center.
  • the terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
  • terminal 11 and server 12 are only examples, and other existing or future terminals or servers that are applicable to the present application should also be included in the protection scope of the present application and are included here by reference.
  • the present application embodiment provides a training method for an image processing model, and the training method for the image processing model is executed by a computer device, and the computer device can be a terminal 11 or a server 12, which is not limited in the present application embodiment.
  • the training method for the image processing model provided in the present application embodiment can include the following steps 401 to 406.
  • step 401 a sample image is acquired.
  • the sample image is an image based on which the parameters of the student image processing model are updated once, and the number of sample images is one or more.
  • the number of sample images is usually multiple to ensure the training effect of the student image processing model.
  • the number of channels and size of the sample image can be set based on experience, and can also be flexibly adjusted according to the application scenario, which is not limited in the embodiment of the present application. It should be noted that the sample image in the embodiment of the present application is equivalent to the input data in the above-mentioned embodiment.
  • the sample image may be extracted from a sample image library, may be crawled from the network, may be sent to a computer device by other devices, and the like.
  • the sample image may refer to an image in an open source image dataset
  • the open source image dataset may refer to an image dataset matching a computer vision task.
  • the image dataset may refer to a COCO (Common Objects in Context) dataset
  • the image dataset may refer to an ImageNet dataset (an image classification dataset)
  • the image dataset may refer to a CityScapes dataset (a semantic segmentation dataset).
  • step 402 the sample image is input into the teacher image processing model to obtain the first image feature output by the teacher image processing model.
  • the teacher image processing model is a model for providing supervision information for the training process of the student image processing model, that is, a model for guiding the training process of the student image processing model.
  • the "student image processing model” and “teacher image processing model” in the embodiment of the present application are named based on their respective functions, wherein the “student image processing model” can learn image processing knowledge from other models, and the “teacher image processing model” can transfer the learned image processing knowledge to other models.
  • the "student image processing model” and the “teacher image processing model” can also be named in other ways, which is not limited by the embodiment of the present application.
  • the teacher image processing model in the embodiment of the present application is equivalent to the teacher model in the above-mentioned embodiment, and the first image feature in the embodiment of the present application is equivalent to the first feature map in the above-mentioned embodiment.
  • the teacher image processing model and the student image processing model constitute a knowledge distillation architecture, and the teacher image processing model is used to distill the learned knowledge into the student image processing model to realize the training of the student image processing model.
  • in the knowledge distillation architecture, a large-scale neural network is used as the teacher image processing model, and a small-scale neural network is used as the student image processing model.
  • the knowledge of the teacher image processing model is transferred to the student image processing model, thereby obtaining a student image processing model with low complexity, good performance, and easy deployment, thereby achieving the purpose of model compression.
  • the teacher image processing model includes a feature extraction layer, and the feature extraction layer of the teacher image processing model is used to extract features of the image input to the teacher image processing model.
  • the number of feature extraction layers can be one or more, and each feature extraction layer can output an image feature.
  • the first feature extraction layer is used to extract features of the image input to the teacher image processing model; starting from the second feature extraction layer, each feature extraction layer is used to extract features from the image features output by the previous feature extraction layer, or from a fusion of those image features and other features (such as the input image or the image features output by earlier feature extraction layers).
  • multiple feature extraction layers can be organized in the form of an FPN (Feature Pyramid Network).
  • the teacher image processing model may include a task processing layer in addition to the feature extraction layer.
  • the task processing layer in the teacher image processing model is used to process the image features extracted by the last feature extraction layer of the teacher image processing model, or the fusion features of the image features extracted by the last feature extraction layer and other features (such as the input image or the image features output by the previous feature extraction layer) to output the prediction processing result.
  • the teacher image processing model is used to process the image to match the computer vision task.
  • the model structure of the teacher image processing model can be set according to experience, or it can be flexibly adjusted according to the type of computer vision task, and the embodiments of the present application do not limit this.
  • for the case where the computer vision task is a target detection task, the model structure of the teacher image processing model can refer to the RetinaNet-RX101 model (a model for image processing); for the case where the computer vision task is an image classification task, the model structure of the teacher image processing model can refer to the ResNet34 model (a model for image processing); for the case where the computer vision task is a semantic segmentation task, the model structure of the teacher image processing model can refer to the PSPNet-ResNet34 model (an image processing model).
  • the model structure of the teacher image processing model can also be other structures, such as a structure consisting of some layers selected from the above-mentioned model, etc., and the embodiments of the present application will not be repeated here one by one.
  • the training method of the image processing model provided in the embodiment of the present application is a knowledge distillation method based on intermediate features, that is, the image features output by the teacher image processing model are used to provide guidance information for the training of the student image processing model.
  • the first image feature output by the teacher image processing model can be obtained, and then the first image feature is used to provide guidance information for the training of the student image processing model.
  • the first image feature output by the teacher image processing model refers to the image feature output by the feature extraction layer of the teacher image processing model.
  • the number of feature extraction layers of the teacher image processing model may be one or more.
  • in the case where the number of feature extraction layers of the teacher image processing model is one, the feature extracted by that feature extraction layer is directly used as the first image feature, and the number of first image features is one; in the case where the number of feature extraction layers of the teacher image processing model is more than one, a reference number of image features can be selected from the multiple image features extracted by the multiple feature extraction layers as the first image feature.
  • the reference number is not greater than the total number of feature extraction layers, and the reference number can be set based on experience or flexibly adjusted according to the application scenario.
  • sizes of different first image features may be the same or different; numbers of channels of different image features may be the same or different.
  • step 403 the sample image is input into the student image processing model, the second image feature output by the student image processing model is obtained, and the second image feature is aligned with the first image feature to obtain the aligned image feature.
  • the student image processing model refers to an image processing model to be trained. After the sample image is input into the student image processing model, the second image feature output by the student image processing model can be obtained.
  • the student image processing model also includes a feature extraction layer. After the sample image is input into the student image processing model, the second image feature output by the feature extraction layer of the student image processing model can be obtained. It should be noted that the student image processing model in the embodiment of the present application is equivalent to the student model in the above embodiment, and the second image feature in the embodiment of the present application is equivalent to the second feature map in the above embodiment.
  • the number of feature extraction layers included in the student image processing model may be the same as the number of feature extraction layers included in the teacher image processing model, or may be different from the number of feature extraction layers included in the teacher image processing model, and this is not limited in the embodiments of the present application. However, in either case, it is necessary to ensure that the number of second image features is the same as the number of first image features, that is, image features with the same number as the first image features are selected from the image features output by each feature extraction layer of the student image processing model as the second image features.
  • the model structure of the student image processing model can be set according to experience, or it can be flexibly adjusted according to the type of computer vision task, and the embodiments of the present application do not limit this.
  • in the case where the model structure of the teacher image processing model refers to the RetinaNet-RX101 model, the structure of the student image processing model can refer to the RetinaNet-R50 model (an image processing model);
  • in the case where the model structure of the teacher image processing model refers to the ResNet34 model, the structure of the student image processing model can refer to the ResNet18 model (a model for image processing);
  • in the case where the model structure of the teacher image processing model refers to the PSPNet-ResNet34 model, the structure of the student image processing model can refer to the PSPNet-ResNet18 model (an image processing model).
  • the model structure of the student image processing model can also be other structures, such as a structure consisting of some layers selected from the above-mentioned models, etc.
  • a corresponding relationship between the second image feature and the first image feature may be established, and the first image feature in a set of corresponding features is used to provide supervision information for the second image feature in the set of features.
  • after acquiring the second image feature, the second image feature is aligned with the first image feature to obtain an aligned image feature.
  • the size of the aligned image feature is the same as the size of the first image feature, and the number of channels of the aligned image feature is the same as the number of channels of the first image feature.
  • the aligned image feature in the embodiment of the present application is equivalent to the aligned feature map in the above embodiment.
  • aligning the second image feature with the first image feature refers to aligning each second image feature with the first image feature corresponding to each second image feature.
  • the principle of aligning each second image feature with the first image feature corresponding to each second image feature is the same, and the present application takes the number of first image features (or second image features) as one as an example for explanation.
  • the second image feature is aligned with the first image feature to obtain the aligned image feature, including: aligning the size of the second image feature with the size of the first image feature through linear interpolation to obtain an intermediate image feature; aligning the number of channels of the intermediate image feature with the number of channels of the first image feature through channel transformation convolution to obtain the aligned image feature.
  • the size of the second image feature can be transformed into the size of the first image feature to achieve alignment of the spatial dimension, and the image feature obtained after the alignment of the spatial dimension is used as the intermediate image feature.
  • the linear interpolation method can be set based on experience or flexibly adjusted according to the application scenario.
  • the linear interpolation method can refer to bilinear interpolation, bicubic interpolation, area interpolation, etc.
  • bicubic interpolation refers to a more complex interpolation method that can create smoother image edges than bilinear interpolation.
  • through channel transformation convolution, the number of channels of the intermediate image feature can be transformed into the number of channels of the first image feature to achieve alignment of the channel dimension, and the image feature obtained after the alignment of the spatial dimension and the channel dimension is used as the aligned image feature.
  • Channel transformation convolution can be implemented by a convolution kernel that does not change the size of the image feature but only changes the number of channels of the image feature.
  • a channel transformation convolution of the intermediate image feature can be implemented by a convolution kernel of size 1×1.
  • the above-mentioned implementation method of aligning the second image feature with the first image feature to obtain the aligned image feature is only an illustrative example, and the embodiments of the present application are not limited to this.
  • the implementation method of aligning the second image feature with the first image feature to obtain the aligned image feature may also refer to: aligning the number of channels of the second image feature with the number of channels of the first image feature through channel transformation convolution to obtain the intermediate image feature; aligning the size of the intermediate image feature with the size of the first image feature through linear interpolation to obtain the aligned image feature.
  • the implementation method of aligning the second image feature with the first image feature to obtain the aligned image feature may also refer to: inputting the second image feature and the first image feature into an alignment network to obtain the aligned image feature output by the alignment network, wherein the alignment network is used to align the input second image feature with the input first image feature as a reference.
  • the predicted processing results output by the student image processing model can also be obtained.
  • the student image processing model includes a task processing layer in addition to a feature extraction layer.
  • the predicted processing results output by the task processing layer of the student image processing model can also be obtained.
  • the student image processing model is used to process the image to match the computer vision task.
  • the computer vision task can be regarded as a downstream task of the student image processing model.
  • obtaining the prediction processing result output by the student image processing model means obtaining the prediction processing result output by the student image processing model that matches the computer vision task.
  • the computer vision task includes any one of an image classification task, a semantic segmentation task, and a target detection task.
  • the image classification task is used to determine the category corresponding to the entire image
  • the semantic segmentation task is used to determine the category corresponding to each pixel in the image
  • the target detection task is used to detect the position of the target in the image and determine the category of the detected target.
  • in the case where the computer vision task includes an image classification task, the task processing layer of the student image processing model may include a branch, and the branch is used to output the predicted classification result. In this case, the predicted processing result matching the computer vision task includes the predicted classification result.
  • in the case where the computer vision task includes a semantic segmentation task, the task processing layer of the student image processing model may include a branch, and the branch is used to output the predicted segmentation result. In this case, the predicted processing result matching the computer vision task includes the predicted segmentation result.
  • in the case where the computer vision task includes a target detection task, the task processing layer of the student image processing model may include two branches, one of which is used to output the detection position prediction result, and the other branch is used to output the detection category prediction result. In this case, the predicted processing result matching the computer vision task includes the detection position prediction result and the detection category prediction result.
  • step 404 the aligned image features are transformed using a first feature transformation model to obtain third image features, and the parameters of the first feature transformation model are learned based on a training process of an image processing model.
  • the aligned image features are transformed using the first feature transformation model to obtain the third image features, and then the training loss is calculated based on the comparison between the third image features and the first image features output by the teacher image processing model.
  • the student image processing model is trained based on the training loss obtained in this way, so that the third image features can be made as close to the first image features as possible.
  • the third image features and the aligned image features have been transformed by the first feature transformation model, even if the third image features are very close to the first image features, it is possible to ensure that there is a certain gap between the second image features based on which the aligned image features are obtained and the first image features, thereby avoiding the problem of the student image processing model overfitting the teacher image processing model, and helping the student image processing model to have more learning space to focus on the characteristics of its own model while learning the image features output by the teacher image processing model, thereby improving the training effect of the student image processing model.
  • the first feature transformation model is used to transform the input image features based on learnable parameters, that is, the parameters of the first feature transformation model are learned based on the training process of the image processing model, which can ensure the matching degree between the transformation process of the first feature transformation model and the training process of the image processing model, thereby ensuring the reliability of the feature transformation and the reliability of the training of the image processing model based on the third image feature obtained after the transformation.
  • the first feature transformation model can also be called a learnable transformation module, a learnable transformation model, etc.
  • the third image feature in the embodiment of the present application is equivalent to the third feature map in the above embodiment.
  • the parameters of the first feature transformation model are learned based on the training process of the image processing model, which means that the parameters of the first feature transformation model are continuously updated as the training process of the image processing model iterates. That is to say, the parameters of the first feature transformation model used in the N-th (N is an integer not less than 1) training process of the image processing model are learned based on the previous (N-1) training processes of the image processing model. In each of the previous (N-1) training processes of the image processing model, the parameters of the first feature transformation model are updated once according to the feature difference loss obtained in that training process.
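  • a self-contained toy sketch of this joint update is given below; toy convolution layers stand in for the real teacher and student networks, and a tiny per-pixel MLP built from 1×1 convolutions stands in for the first feature transformation model, so only the update pattern (not the actual architecture) is illustrated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins; the embodiment's actual models are RetinaNet-RX101 (teacher) and RetinaNet-R50 (student).
teacher = nn.Conv2d(3, 8, 3, padding=1)
student = nn.Conv2d(3, 8, 3, padding=1)
# Learnable transformation (per-pixel MLP implemented with 1x1 convolutions).
transform = nn.Sequential(nn.Conv2d(8, 8, 1), nn.ReLU(), nn.Conv2d(8, 8, 1))

# Both the student and the transformation model are optimized with the same loss.
optimizer = torch.optim.SGD(
    list(student.parameters()) + list(transform.parameters()), lr=0.01)

for step in range(3):  # a few toy iterations
    images = torch.randn(2, 3, 16, 16)
    with torch.no_grad():
        teacher_feat = teacher(images)       # first image feature (teacher is frozen)
    student_feat = student(images)           # second image feature (already aligned here)
    third_feat = transform(student_feat)     # transformed feature compared with the teacher
    feature_difference_loss = F.mse_loss(third_feat, teacher_feat)
    optimizer.zero_grad()
    feature_difference_loss.backward()       # updates student and transformation parameters
    optimizer.step()
    print(step, feature_difference_loss.item())
```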
  • the structure of the first feature transformation model can be set according to experience, or it can be flexibly adjusted according to the application scenario, as long as the first feature transformation model has learnable parameters.
  • the first feature transformation model can refer to a multi-layer perceptron, which has a relatively simple structure and can reduce the amount of calculation required for feature transformation and reduce the complexity of parameter adjustment.
  • the number of hidden layers of the multi-layer perceptron and the type of activation function used by the multi-layer perceptron can be set according to experience, or flexibly adjusted according to the application scenario.
  • the number of hidden layers of the multi-layer perceptron can be 1 or 2.
  • the activation function used by the multi-layer perceptron can refer to ReLU (Rectified Linear Unit, linear rectification function) or Sigmoid (S-type) function, etc.
  • the transformation process of the first feature transformation model does not change the size and number of channels of the image feature. That is to say, the size and number of channels of the third image feature are respectively the same as the size and number of channels of the aligned image feature. Since the size and number of channels of the aligned image feature are respectively the same as the size and number of channels of the first image feature, the size and number of channels of the third image feature are respectively the same as the size and number of channels of the first image feature, so as to facilitate measuring the difference between the third image feature and the first image feature.
  • step 405 a feature difference loss is obtained based on the difference between the first image feature and the third image feature; and a training loss is obtained based on the feature difference loss.
  • the feature difference loss is used to provide the student image processing model with supervision information for feature extraction.
  • obtaining the feature difference loss may be implemented by: obtaining the first difference loss based on the difference between the first image feature and the third image feature; and obtaining the feature difference loss based on the first difference loss.
  • the difference between the two image features can be reflected by the result calculated by substituting the two image features into the loss function.
  • the type of loss function can be selected based on experience.
  • the type of loss function can include but is not limited to cross entropy loss function, mean square error loss function, KL (Kullback-Leibler) divergence loss function, etc.
  • the process of obtaining the first difference loss includes: substituting the first image feature and the third image feature into the loss function for calculation, and obtaining the first difference loss based on the calculated result.
  • the calculated result is used as the first difference loss, or the calculated result is processed (such as rounding, multiplying by a positive number, adding a positive number, etc.), and the processed result is used as the first difference loss.
  • the first difference loss can be calculated based on Formula 1:
  • Loss_Spatial represents the first difference loss
  • MSELoss(·,·) represents the mean square error loss function, which is used to calculate the mean square error loss between the two pieces of information in the brackets
  • feat_T represents the first image feature
  • the first difference loss may also be referred to as a spatial feature loss.
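  • a plausible written-out form of Formula 1, consistent with the definitions above (feat_S3 is used here as an assumed symbol for the third image feature), is:

$$ \mathrm{Loss}_{Spatial} = \mathrm{MSELoss}\big(feat_T,\ feat_{S3}\big) \qquad \text{(Formula 1)} $$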
  • the feature difference loss is obtained based on the first difference loss.
  • the feature difference loss can be obtained by: using the first difference loss as the feature difference loss, which can improve the efficiency of obtaining the feature difference loss.
  • the feature difference loss can also be obtained by: obtaining the second difference loss based on the difference between the first image feature and the fourth image feature; obtaining the feature difference loss based on the first difference loss and the second difference loss.
  • the fourth image feature is another feature that is obtained on the basis of the second image feature for comparison with the first image feature, and that is different from the third image feature.
  • the fourth image feature may be obtained by: keeping the number of channels of the aligned image feature unchanged, adjusting the size of the aligned image feature from the first size to the second size, and obtaining the adjusted image feature; transforming the adjusted image feature using the second feature transformation model to obtain the transformed image feature, and restoring the size of the transformed image feature from the second size to the first size to obtain the fourth image feature.
  • the fourth image feature in the embodiment of the present application is equivalent to the fourth feature map in the above embodiment.
  • the first size is the original size of the aligned image features
  • the second size is the size of the adjusted image features.
  • the relationship between the first size and the second size can be set according to experience, or flexibly adjusted according to the application scenario.
  • the relationship between the first size and the second size can be that the product of the width and height in the first size is the same as the product of the width and height in the second size.
  • the first size may refer to a width of W and a height of H
  • the second size may refer to a width of (W*H) and a height of 1, or a width of 1 and a height of (W*H).
  • the process of adjusting the size of the aligned image features from the first size to the second size can be achieved by cropping and splicing.
  • a transposition operation may also be performed.
  • the dimension of the aligned image feature may be expressed as [N, C, H, W].
  • the dimension of the adjusted image feature may be expressed as [N, (H*W), 1, C] or [N, 1, (H*W), C].
  • N is a positive integer, C is a positive integer, H is a positive number, and W is a positive number.
  • the adjusted image feature may be regarded as weakening the information of the size dimension of the image feature and paying more attention to the information of the channel dimension of the image feature.
  • the adjusted image features are transformed using the second feature transformation model to obtain transformed image features.
  • the parameters of the second feature transformation model are learned based on the training process of the image processing model. In other words, the parameters of the second feature transformation model are continuously updated with the iteration of the training process of the image processing model, thereby ensuring the matching degree between the feature transformation process of the second feature transformation model and the training process of the image processing model, and improving the transformation reliability of the second feature transformation model.
  • the structure of the second feature transformation model can be set according to experience, or it can be flexibly adjusted according to the application scenario.
  • the second feature transformation model can refer to a multi-layer perceptron, which has a relatively simple structure and can reduce the amount of calculation required for feature transformation and reduce the complexity of parameter adjustment.
  • the number of hidden layers of the multi-layer perceptron and the type of activation function used by the multi-layer perceptron can be set according to experience, or flexibly adjusted according to the application scenario.
  • the number of hidden layers of the multi-layer perceptron can be 1, or 2, etc.
  • the activation function used by the multi-layer perceptron can refer to ReLU, or it can refer to Sigmoid function, etc.
  • the structure of the second feature transformation model can be the same as the structure of the first feature transformation model, or it can be different from the structure of the first feature transformation model.
  • the second feature transformation model is also a model with learnable parameters, that is, the parameters of the second feature transformation model can be continuously updated during the training process of the image processing model to ensure the matching degree between the feature transformation process and the training process and improve the reliability of the feature transformation.
  • the transformation process of the second feature transformation model does not change the size and number of channels of the image feature, that is, the size and number of channels of the transformed image feature are respectively the same as the size and number of channels of the adjusted image feature. Since the size of the adjusted image feature is different from the aligned image feature, after obtaining the transformed image feature, the transformed image feature needs to be restored in the size dimension to restore the size of the transformed image feature from the second size to the first size, and the image feature obtained after the size dimension restoration is used as the fourth image feature.
  • the size and number of channels of the fourth image feature are respectively the same as the size and number of channels of the aligned image feature, and since the size and number of channels of the aligned image feature are respectively the same as the size and number of channels of the first image feature, the size and number of channels of the fourth image feature are respectively the same as the size and number of channels of the first image feature, so as to facilitate the measurement of the difference between the fourth image feature and the first image feature.
  • the dimension of the adjusted image feature is expressed as [N, (H*W), 1, C] or [N, 1, (H*W), C]
  • the dimension of the fourth image feature can be expressed as [N, C, H, W].
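Exemplarily, the channel-branch processing described above (adjusting the size from the first size to the second size, transforming with the second feature transformation model, and restoring the first size to obtain the fourth image feature) may be sketched in PyTorch as follows; the module name, hidden width and tensor shapes are illustrative assumptions and do not limit the embodiment of the present application.

```python
import torch
import torch.nn as nn

class ChannelBranchMLP(nn.Module):
    """One-hidden-layer MLP with ReLU applied along the channel dimension.

    A minimal sketch of the second feature transformation model; the hidden
    width (2 * C here) is an assumption, not a value fixed by the embodiment.
    """
    def __init__(self, channels: int, hidden: int = None):
        super().__init__()
        hidden = hidden or 2 * channels
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, aligned_feat: torch.Tensor) -> torch.Tensor:
        # aligned_feat: [N, C, H, W] (first size: H x W)
        n, c, h, w = aligned_feat.shape
        # Keep the channel count C unchanged, adjust the size from the first size
        # (H x W) to the second size (1 x (H*W)), and transpose so that C becomes
        # the last dimension: [N, C, H, W] -> [N, 1, H*W, C]
        adjusted = aligned_feat.flatten(2).permute(0, 2, 1).unsqueeze(1)
        # Transform along the channel dimension; the shape is preserved.
        transformed = self.mlp(adjusted)
        # Restore the size from the second size back to the first size:
        # [N, 1, H*W, C] -> [N, C, H, W], giving the fourth image feature.
        fourth_feat = transformed.squeeze(1).permute(0, 2, 1).reshape(n, c, h, w)
        return fourth_feat

# Hypothetical usage: a batch of 2 aligned features with 256 channels of size 32x32.
aligned = torch.randn(2, 256, 32, 32)
fourth = ChannelBranchMLP(channels=256)(aligned)
assert fourth.shape == aligned.shape
```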
  • the principle of obtaining the second difference loss based on the difference between the first image feature and the fourth image feature is the same as the principle of obtaining the first difference loss based on the difference between the first image feature and the third image feature, and will not be repeated here.
  • the second difference loss can be calculated based on Formula 2: Loss_Channel = MSELoss(feat_T, feat_S^C)
  • Loss_Channel represents the second difference loss
  • MSELoss(,) represents the expression of the mean square error loss function, which is used to calculate the mean square error loss between the two pieces of information in the brackets
  • feat_T represents the first image feature
  • feat_S^C represents the fourth image feature
  • the second difference loss may also be referred to as a channel feature loss.
  • the characteristic difference loss may be obtained by taking the sum of the first difference loss and the second difference loss as the characteristic difference loss, or by taking the weighted sum of the first difference loss and the second difference loss as the characteristic difference loss.
  • the weights corresponding to the first difference loss and the second difference loss may be set based on experience or flexibly adjusted based on the application scenario.
  • the process of obtaining the feature difference loss can be implemented based on Formula 3: L_distill = α·Loss_Spatial + β·Loss_Channel
  • L_distill represents the feature difference loss
  • Loss_Spatial represents the first difference loss
  • Loss_Channel represents the second difference loss
  • α represents the weight corresponding to the first difference loss
  • β represents the weight corresponding to the second difference loss
  • α and β are hyperparameters that can be flexibly set based on experience. For example, α and β can be set to 2e-5 and 1e-6 respectively.
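Exemplarily, Formulas 1 to 3 may be combined into a single helper as in the following sketch, where the mean square error loss is used for both the first and second difference losses; the function and tensor names are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def feature_difference_loss(feat_t: torch.Tensor,
                            third_feat: torch.Tensor,
                            fourth_feat: torch.Tensor,
                            alpha: float = 2e-5,
                            beta: float = 1e-6) -> torch.Tensor:
    """Weighted sum of the spatial (first) and channel (second) difference losses.

    feat_t      -- first image feature output by the teacher model
    third_feat  -- third image feature (spatial branch of the student)
    fourth_feat -- fourth image feature (channel branch of the student)
    """
    loss_spatial = F.mse_loss(third_feat, feat_t)   # first difference loss (Formula 1)
    loss_channel = F.mse_loss(fourth_feat, feat_t)  # second difference loss (Formula 2)
    # Formula 3: L_distill = alpha * Loss_Spatial + beta * Loss_Channel
    return alpha * loss_spatial + beta * loss_channel
```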
  • the training loss is obtained based on the feature difference loss.
  • the training loss is the loss directly based on which the parameters of the student image processing model are updated.
  • the implementation method of obtaining the training loss based on the feature difference loss can be set according to experience or flexibly adjusted according to the application scenario, and the embodiment of the present application does not limit this.
  • a method of obtaining the training loss based on the feature difference loss may be: using the feature difference loss as the training loss. This method can improve the efficiency of obtaining the training loss.
  • in some cases, after the sample image is input into the student image processing model, the predicted processing result output by the student image processing model is also obtained in addition to the second image feature.
  • the method of obtaining the training loss based on the feature difference loss can also be: obtaining the processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image; obtaining the training loss based on the feature difference loss and the processing result loss.
  • the standard processing result refers to the actual processing result corresponding to the sample image, which is used to provide supervision information for the predicted processing result output by the student image processing model.
  • the standard processing result can be determined by a technician.
  • the standard processing result can refer to the standard processing result that matches the computer vision task.
  • the type of standard processing result that matches the computer vision task is related to the type of computer vision task.
  • for the case where the computer vision task includes an image classification task, the standard processing result that matches the computer vision task includes a standard classification result; for the case where the computer vision task includes a semantic segmentation task, the standard processing result that matches the computer vision task includes a standard segmentation result; for the case where the computer vision task includes a target detection task, the standard processing result that matches the computer vision task includes a detection position standard result and a detection category standard result.
  • obtaining the processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image means obtaining the processing result loss based on the difference between the predicted processing result matching the computer vision task and the standard processing result matching the computer vision task corresponding to the sample image.
  • the processing result loss is used to measure the difference between the predicted processing result output by the student image processing model and the standard processing result. The greater the difference between the predicted processing result output by the student image processing model and the standard processing result, the greater the processing result loss.
  • the method of obtaining the processing result loss is related to the type of computer vision task.
  • in the case where the computer vision task includes an image classification task, the processing result loss is obtained based on the difference between the predicted classification result and the standard classification result; in the case where the computer vision task includes a semantic segmentation task, the processing result loss is obtained based on the difference between the predicted segmentation result and the standard segmentation result; in the case where the computer vision task includes an object detection task, the processing result loss is obtained based on the difference between the detection position prediction result and the detection position standard result, and the difference between the detection category prediction result and the detection category standard result.
  • the difference between the two results can be reflected by the result obtained by substituting the two results into the loss function.
  • the loss function based on which the difference between the two different results is calculated can be the same or different, and this embodiment of the application is not limited to this.
  • the process of obtaining the processing result loss based on the difference between the predicted classification result and the standard classification result may be: substituting the predicted classification result and the standard classification result into the loss function corresponding to the image classification task, and obtaining the processing result loss based on the calculated result.
  • the loss function corresponding to the image classification task may include but is not limited to the cross entropy loss function, the mean square error loss function, and the like.
  • the process of obtaining the processing result loss based on the difference between the predicted segmentation result and the standard segmentation result may be: substituting the predicted segmentation result and the standard segmentation result into the loss function corresponding to the semantic segmentation task, and obtaining the processing result loss based on the calculated result.
  • the loss function corresponding to the semantic segmentation task may include but is not limited to a cross entropy loss function, a mean square error loss function, and the like.
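Exemplarily, for the image classification and semantic segmentation cases, the processing result loss may be instantiated with a cross entropy loss as in the following sketch; the class counts, batch size and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

# Image classification: predicted classification result vs. standard classification result.
cls_logits = torch.randn(8, 1000)          # [batch, num_classes], hypothetical shapes
cls_labels = torch.randint(0, 1000, (8,))  # standard classification result
classification_result_loss = F.cross_entropy(cls_logits, cls_labels)

# Semantic segmentation: per-pixel predicted segmentation result vs. standard segmentation result.
seg_logits = torch.randn(8, 19, 64, 64)          # [batch, num_classes, H, W], hypothetical
seg_labels = torch.randint(0, 19, (8, 64, 64))   # standard segmentation result
segmentation_result_loss = F.cross_entropy(seg_logits, seg_labels)
```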
  • the method for obtaining the processing result loss based on the difference between the detection position prediction result and the detection position standard result, and the difference between the detection category prediction result and the detection category standard result can be: substituting the detection position prediction result and the detection position standard result into the first loss function corresponding to the target detection task, and obtaining the first detection loss based on the calculated result; substituting the detection category prediction result and the detection category standard result into the second loss function corresponding to the target detection task, and obtaining the second detection loss based on the calculated result; obtaining the processing result loss based on the first detection loss and the second detection loss.
  • the first loss function is a loss function used to measure the accuracy of the position of the target detected in the target detection task
  • the second loss function is a loss function used to measure the accuracy of the category of the target detected in the target detection task.
  • the first loss function includes but is not limited to L1 Loss (L1 norm loss function), L2 Loss (L2 norm loss function), Smooth L1 Loss (stable L1 norm loss function), IOU (Intersection over Union) loss function, etc.
  • the second loss function includes but is not limited to cross entropy loss function, Focal Loss (focused loss function), etc.
  • the first loss function can also be called a regression loss function
  • the second loss function can also be called a classification loss function.
  • the first detection loss can be calculated based on Formula 4: L_reg = Σ_i SmoothL1(t_i, t_i^*)
  • the second detection loss can be calculated based on Formula 5: L_cls = -α_t·(1-p_t)^γ·log(p_t)
  • L_reg represents the first detection loss, and SmoothL1(,) represents the expression of the Smooth L1 Loss
  • t_i represents the element i in the detection position prediction result
  • t_i^* represents the element i in the detection position standard result
  • element i is an element of x, y, w and h
  • x and y represent the coordinates of a certain point of the detection position (such as the upper left corner, upper right corner, center point, etc.)
  • w and h represent the width and height of the detection position.
  • L_cls represents the second detection loss
  • p_t represents the closeness between the detection category prediction result and the detection category standard result.
  • the expression of p_t is shown in Formula 6; α_t and γ are hyperparameters, which can be set according to experience or flexibly adjusted according to the application scenario. For example, α_t and γ can be set to 0.25 and 2.0 respectively.
  • p represents the probability value that the detected target is correctly classified
  • obtaining the processing result loss may refer to taking the sum of the first detection loss and the second detection loss as the processing result loss, or may refer to taking the weighted sum of the first detection loss and the second detection loss as the processing result loss, etc.
  • the weights corresponding to the first detection loss and the second detection loss may be set based on experience or flexibly adjusted based on the application scenario.
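Exemplarily, the first and second detection losses and their combination into the processing result loss may be sketched as follows, using the Smooth L1 loss and torchvision's sigmoid_focal_loss as one possible Focal Loss implementation; the shapes and the unweighted sum are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_result_loss(box_pred: torch.Tensor,
                          box_target: torch.Tensor,
                          cls_logits: torch.Tensor,
                          cls_target: torch.Tensor,
                          alpha_t: float = 0.25,
                          gamma: float = 2.0) -> torch.Tensor:
    """Processing result loss for target detection: first + second detection loss.

    box_pred / box_target  : [num_anchors, 4] regression offsets (x, y, w, h)
    cls_logits / cls_target: [num_anchors, num_classes] logits and one-hot float targets
    """
    # Formula 4: Smooth L1 loss between detection position prediction and standard result.
    l_reg = F.smooth_l1_loss(box_pred, box_target, reduction="sum")
    # Formula 5: Focal Loss between detection category prediction and standard result.
    l_cls = sigmoid_focal_loss(cls_logits, cls_target,
                               alpha=alpha_t, gamma=gamma, reduction="sum")
    # Processing result loss taken here as the (unweighted) sum of the two detection losses.
    return l_reg + l_cls
```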
  • the training loss based on which the parameters of the student image processing model are updated is obtained based on the feature difference loss and the processing result loss.
  • the feature difference loss can also be called the knowledge distillation loss
  • the processing result loss can also be called the downstream task loss
  • the training loss can also be called the total loss.
  • the process of obtaining the training loss can be regarded as a process of adjusting the hyperparameters of the distillation loss according to the downstream task to obtain the total loss of the student image processing model.
  • the sum of the feature difference loss and the processing result loss can be used as the training loss, or the weighted sum of the feature difference loss and the processing result loss can be used as the training loss, etc.
  • the weights corresponding to the feature difference loss and the processing result loss can be set based on experience or flexibly adjusted according to the application scenario.
  • the training loss can be calculated based on Formula 7: L_total = L_reg + L_cls + L_distill
  • L_total represents the training loss
  • L_reg represents the first detection loss
  • L_cls represents the second detection loss
  • L_reg + L_cls represents the processing result loss
  • L_distill represents the feature difference loss
  • the implementation method of obtaining training loss based on feature difference loss can also be: obtaining the reference processing result output by the teacher image processing model and the predicted processing result output by the student image processing model; based on the difference between the reference processing result and the predicted processing result, obtaining the result difference loss; based on the result difference loss and the feature difference loss, obtaining the training loss.
  • the computer vision task corresponding to the teacher image processing model is the same type as the computer vision task corresponding to the student image processing model.
  • the implementation method of obtaining training loss based on feature difference loss can also be: obtaining the reference processing result output by the teacher image processing model and the predicted processing result output by the student image processing model; based on the difference between the reference processing result and the predicted processing result, obtaining the result difference loss; based on the difference between the predicted processing result and the standard processing result corresponding to the sample image, obtaining the processing result loss; based on the feature difference loss, the result difference loss and the processing result loss, obtaining the training loss.
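Exemplarily, the result difference loss between the reference processing result output by the teacher image processing model and the predicted processing result output by the student image processing model may be measured in several ways; the sketch below uses a temperature-softened KL divergence, which is a common choice for classification-style outputs and is an assumption of this illustration rather than a requirement of the embodiment.

```python
import torch
import torch.nn.functional as F

def result_difference_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           temperature: float = 4.0) -> torch.Tensor:
    """KL divergence between softened teacher and student predictions.

    One possible instantiation of the result difference loss; the temperature
    value is a hypothetical hyperparameter.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```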
  • in step 406, the parameters of the student image processing model are updated using the training loss to obtain a target image processing model.
  • the parameters of the student image processing model are updated using the training loss to complete a training of the student image processing model.
  • the process of updating the parameters of the student image processing model using the training loss can be: based on the training loss, calculating the update gradient of the parameters of the student image processing model, and updating the parameters of the student image processing model according to the update gradient.
  • the update gradient of the parameters of the student image processing model can be calculated using the gradient descent method.
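Exemplarily, one parameter update of the student image processing model (together with the feature transformation models, whose parameters are learned during the same training process) may be sketched as follows; the SGD optimizer, learning rate and placeholder modules are assumptions of this illustration.

```python
import torch
import torch.nn as nn

# Tiny placeholder modules standing in for the student image processing model and
# the two feature transformation models (real architectures would be used in practice).
student_model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
first_transform = nn.Conv2d(8, 8, kernel_size=1)
second_transform = nn.Conv2d(8, 8, kernel_size=1)

# A single optimizer carries the student parameters and the parameters of both
# feature transformation models so that all of them are updated during training.
params = (list(student_model.parameters())
          + list(first_transform.parameters())
          + list(second_transform.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)

def training_step(l_reg: torch.Tensor, l_cls: torch.Tensor,
                  l_distill: torch.Tensor) -> torch.Tensor:
    """One parameter update using the total training loss of Formula 7."""
    l_total = l_reg + l_cls + l_distill   # L_total = L_reg + L_cls + L_distill
    optimizer.zero_grad()
    l_total.backward()                    # compute the update gradients by backpropagation
    optimizer.step()                      # gradient descent update of all carried parameters
    return l_total.detach()
```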
  • the process of using training loss to update the parameters of the student image processing model to obtain the target image processing model includes: using the training loss to update the parameters of the student image processing model to obtain an updated student image processing model; judging whether the current training process satisfies the training termination condition; if the current training process satisfies the training termination condition, using the updated student image processing model as the target image processing model; if the current training process does not meet the training termination condition, training the updated student image processing model until the current training process meets the training termination condition, and using the image processing model obtained when the training termination condition is met as the target image processing model.
  • the current training process satisfies the training termination condition, which is set according to experience or flexibly adjusted according to the application scenario, and the embodiments of the present application do not limit this.
  • the current training process satisfies the training termination condition, including but not limited to the number of image processing model trainings executed in the current training process reaching the number threshold, the training loss obtained in the current training process for updating the parameters of the student image processing model is less than the loss threshold, and the training loss obtained in the current training process for updating the parameters of the student image processing model converges.
  • Both the number threshold and the loss threshold are set according to experience or flexibly adjusted according to the application scenario.
  • the training process of the image processing model is terminated, and the updated student image processing model obtained by training at this time is used as the target image processing model. If the current training process does not meet the training termination condition, it is necessary to continue training the updated student image processing model.
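Exemplarily, the training termination conditions listed above (reaching a number threshold of training iterations, the training loss falling below a loss threshold, or the training loss converging) may be organized into a loop such as the following sketch; the threshold values and the convergence test are illustrative assumptions.

```python
def train_until_done(run_one_training_step,
                     max_steps: int = 10000,      # number threshold (assumed value)
                     loss_threshold: float = 1e-3,
                     convergence_eps: float = 1e-6):
    """Repeat training steps until a termination condition is met.

    run_one_training_step: callable returning the scalar training loss of one step.
    """
    previous_loss = None
    for step in range(1, max_steps + 1):
        loss = float(run_one_training_step())
        # Termination condition 1: training loss below the loss threshold.
        if loss < loss_threshold:
            break
        # Termination condition 2: training loss has converged (change is negligible).
        if previous_loss is not None and abs(previous_loss - loss) < convergence_eps:
            break
        previous_loss = loss
    # Termination condition 3 (reaching the number threshold) is the loop bound itself.
    return step, loss
```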
  • the process of training the updated student image processing model includes: updating the parameters of the first feature transformation model using the feature difference loss to obtain the updated first feature transformation model; and training the updated student image processing model based on the updated first feature transformation model. Since the feature difference loss is obtained based on the image features transformed by the first feature transformation model, and involves the processing process of the first feature transformation model, the feature difference loss is used to update the parameters of the first feature transformation model. For example, the update gradient of the parameters of the first feature transformation model is calculated based on the feature difference loss, and the parameters of the first feature transformation model are updated using the update gradient.
  • the student image processing model is changed to the updated student image processing model
  • the first feature transformation model is changed to the updated first feature transformation model.
  • the sample image used when training the updated student image processing model based on the updated first feature transformation model may be the same as, or different from, the sample image used when obtaining the updated student image processing model.
  • the second feature transformation model is also utilized.
  • the feature difference loss is also utilized to update the parameters of the second feature transformation model to obtain the updated second feature transformation model.
  • the updated student image processing model is trained based on the updated first feature transformation model and the updated second feature transformation model.
  • the teacher image processing model may or may not change.
  • the teacher image processing model is a pre-trained model
  • the teacher image processing model does not change, that is, the sample image is still input into the original teacher image processing model for processing.
  • the teacher image processing model is trained in real time with the training process of the student image processing model
  • the teacher image processing model changes, that is, the sample image is input into the updated teacher image processing model for processing.
  • the teacher image processing model changes in the following manner: obtaining a reference processing result output by the teacher image processing model; obtaining a loss corresponding to the teacher image processing model based on the difference between the reference processing result and the standard processing result corresponding to the sample image; and updating the parameters of the teacher image processing model using the loss corresponding to the teacher image processing model to obtain an updated teacher image processing model.
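Exemplarily, for the case where the teacher image processing model is trained in real time alongside the student, its update from the difference between the reference processing result and the standard processing result may be sketched as follows; the placeholder teacher module, the cross entropy loss and the optimizer are assumptions of this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_model = nn.Linear(16, 10)  # placeholder for a real teacher image processing model
teacher_optimizer = torch.optim.SGD(teacher_model.parameters(), lr=0.01)

def update_teacher(sample_features: torch.Tensor,
                   standard_result: torch.Tensor) -> torch.Tensor:
    """Update the teacher from the difference between its reference processing
    result and the standard processing result corresponding to the sample image."""
    reference_result = teacher_model(sample_features)            # reference processing result
    teacher_loss = F.cross_entropy(reference_result, standard_result)
    teacher_optimizer.zero_grad()
    teacher_loss.backward()
    teacher_optimizer.step()                                     # updated teacher model
    return teacher_loss.detach()
```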
  • the process of training the updated student image processing model based on the updated first feature transformation model may include: obtaining a sample image; inputting the sample image into the teacher image processing model to obtain the first image feature output by the teacher image processing model; inputting the sample image into the updated student image processing model to obtain the fifth image feature output by the updated student image processing model, aligning the fifth image feature with the first image feature to obtain an updated aligned image feature; transforming the updated aligned image feature using the updated first feature transformation model to obtain a sixth image feature; based on the difference between the first image feature and the sixth image feature, obtaining an updated feature difference loss; based on the updated feature difference loss, obtaining an updated training loss; using the updated training loss to update the parameters of the updated student image processing model to obtain a further updated student image processing model; if the current training process meets the training termination condition, using the further updated student image processing model as the target image processing model.
  • the process of training the updated student image processing model based on the updated first feature transformation model and the updated second feature transformation model may also include: obtaining a sample image; inputting the sample image into the teacher image processing model to obtain a first image feature output by the teacher image processing model; inputting the sample image into the updated student image processing model to obtain a fifth image feature output by the updated student image processing model, and aligning the fifth image feature with the first image feature to obtain an updated aligned image feature; transforming the updated aligned image feature using the updated first feature transformation model to obtain a sixth image feature; keeping the number of channels of the updated aligned image feature unchanged and adjusting the size of the updated aligned image feature from the first size to the second size to obtain an updated adjusted image feature; transforming the updated adjusted image feature using the updated second feature transformation model to obtain an updated transformed image feature, and restoring the size of the updated transformed image feature from the second size to the first size to obtain a seventh image feature; obtaining an updated feature difference loss based on the difference between the first image feature and the sixth image feature and the difference between the first image feature and the seventh image feature; obtaining an updated training loss based on the updated feature difference loss; and updating the parameters of the updated student image processing model using the updated training loss.
  • the image is processed using the target image processing model.
  • the process is detailed in the embodiment shown in FIG5 and will not be described in detail here.
  • the training method of the image processing model realizes a learnable transformation of features based on feature transformation models with learnable parameters, which improves the knowledge distillation effect and the training effect of the image processing model. It does not need complex feature transformation models designed separately for different tasks, does not introduce complex hyperparameters, and avoids cumbersome parameter adjustment steps, thereby improving the versatility of knowledge distillation across multiple tasks and avoiding the cumbersomeness of manually designed structures while improving the training effect of the image processing model. It can achieve performance improvements on a variety of computer vision tasks (such as image classification tasks, target detection tasks, semantic segmentation tasks, etc.) and achieves good task processing effects.
  • the training method of the image processing model provided in the embodiment of the present application first transforms the aligned image features using the first feature transformation model, and then compares the third image features obtained after the transformation with the first image features output by the teacher image processing model to obtain the training loss.
  • the student image processing model is trained using this training loss, so that the third image features can be made as close to the first image features as possible.
  • the third image features and the aligned image features have been transformed by the first feature transformation model, even if the third image features are very close to the first image features, it is possible to ensure that there is a certain gap between the aligned image features and the first image features, thereby avoiding the problem of the second image features output by the student image processing model overfitting the first image features output by the teacher image processing model, and helping the student image processing model to have more learning space to focus on the characteristics of its own model while learning the image features output by the teacher image processing model, thereby improving the training effect of the student image processing model.
  • the parameters of the first feature transformation model are parameters learned based on the training process of the image processing model, thereby ensuring the matching degree between the feature transformation process and the model training process, thereby ensuring the reliability of the feature transformation, improving the reliability of the training loss, and further improving the training effect of the image processing model.
  • the present application embodiment provides an image processing method, which is executed by a computer device, which may be a terminal 11 or a server 12, and the present application embodiment does not limit this.
  • the image processing method provided in the present application embodiment may include the following steps 501 and 502.
  • in step 501, a target image to be processed is obtained.
  • the target image refers to an image that needs to be processed using the target image processing model.
  • the method for acquiring the target image in the embodiment of the present application is not limited, for example, crawling the target image from the network; extracting the target image from a database; acquiring the target image using an image acquisition device; intercepting the target image from a video stream; receiving the target image sent or uploaded by other devices, etc.
  • the number of channels and the size of the target image are respectively the same as the number of channels and the size of the sample image to ensure the processing effect of the target image processing model.
  • in step 502, the target image is input into the target image processing model, and the target processing result output by the target image processing model is obtained.
  • the target image processing model is trained using any one of the image processing model training methods in the embodiments shown in FIG. 4 .
  • the implementation principle of step 502 is the same as that of inputting the sample image into the student image processing model and obtaining the predicted processing result output by the student image processing model in the embodiment shown in FIG. 4, and will not be described in detail here.
  • the target image processing model includes a task processing layer
  • the target processing result is the processing result output by the task processing layer of the target image processing model.
  • the target processing result output by the target image processing model can be regarded as a processing result corresponding to the target image with high reliability.
  • the type of the target processing result is related to the type of computer vision task. If the type of computer vision task is an image classification task, the target processing result is a classification result; if the type of computer vision task is a semantic segmentation task, the target processing result is a segmentation result; if the type of computer vision task is a target detection task, the target processing result includes a detection position result and a detection category result.
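Exemplarily, once the target image processing model has been obtained, processing a target image reduces to a single forward pass, as in the following sketch; the placeholder model and the preprocessed image tensor are assumptions of this illustration.

```python
import torch
import torch.nn as nn

# Placeholder standing in for a trained target image processing model (image classification case).
target_model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1),
                             nn.AdaptiveAvgPool2d(1),
                             nn.Flatten(),
                             nn.Linear(8, 10))
target_model.eval()

target_image = torch.randn(1, 3, 224, 224)  # hypothetical preprocessed target image
with torch.no_grad():
    target_result = target_model(target_image)   # target processing result
predicted_class = target_result.argmax(dim=1)     # classification result for the target image
```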
  • the image processing method provided in the embodiment of the present application uses a target image processing model with good training effect to process the target image, which is conducive to ensuring the accuracy of image processing of the target image.
  • an embodiment of the present application provides a training device for an image processing model, the device comprising:
  • a first acquisition unit 601 is used to acquire a sample image
  • a second acquisition unit 602 is used to input the sample image into the teacher image processing model to obtain the first image feature output by the teacher image processing model;
  • the third acquisition unit 603 is used to input the sample image into the student image processing model, obtain the second image feature output by the student image processing model, align the second image feature with the first image feature, and obtain the aligned image feature;
  • a transformation unit 604 is used to transform the aligned image features using a first feature transformation model to obtain a third image feature, wherein the parameters of the first feature transformation model are learned based on a training process of an image processing model;
  • a fourth acquisition unit 605 is used to acquire a feature difference loss based on a difference between the first image feature and the third image feature; and acquire a training loss based on the feature difference loss;
  • the updating unit 606 is used to update the parameters of the student image processing model using the training loss to obtain the target image processing model.
  • the fourth acquisition unit 605 is used to keep the number of channels of the aligned image feature unchanged, adjust the size of the aligned image feature from the first size to the second size, and obtain the adjusted image feature; transform the adjusted image feature using the second feature transformation model to obtain the transformed image feature, restore the size of the transformed image feature from the second size to the first size to obtain the fourth image feature, and the parameters of the second feature transformation model are learned based on the training process of the image processing model; based on the difference between the first image feature and the third image feature, obtain the first difference loss; based on the difference between the first image feature and the fourth image feature, obtain the second difference loss; based on the first difference loss and the second difference loss, obtain the feature difference loss.
  • the third acquisition unit 603 is used to acquire the second image feature output by the student image processing model and the prediction processing result
  • the fourth acquisition unit 605 is used to acquire the processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image; and acquire the training loss based on the feature difference loss and the processing result loss.
  • the student image processing model is used to process the image to match the computer vision task
  • the third acquisition unit 603 is used to obtain the second image feature output by the student image processing model and the prediction processing result matching the computer vision task
  • the fourth acquisition unit 605 is used to acquire the processing result loss based on the difference between the predicted processing result matching the computer vision task and the standard processing result matching the computer vision task corresponding to the sample image.
  • the computer vision task includes an image classification task
  • the predicted processing result matching the computer vision task includes a predicted classification result
  • the standard processing result matching the computer vision task includes a standard classification result
  • the processing result loss is obtained based on a difference between the predicted classification result and the standard classification result
  • the computer vision task includes a semantic segmentation task, the prediction processing result matching the computer vision task includes a prediction segmentation result, the standard processing result matching the computer vision task includes a standard segmentation result, and the processing result loss is obtained based on the prediction segmentation result and the standard segmentation result; or,
  • the computer vision task includes a target detection task, the prediction processing results matching the computer vision task include a detection position prediction result and a detection category prediction result, the standard processing results matching the computer vision task include a detection position standard result and a detection category standard result, and the processing result loss is obtained based on the difference between the detection position prediction result and the detection position standard result, and the difference between the detection category prediction result and the detection category standard result.
  • the updating unit 606 is used to update the parameters of the student image processing model using the training loss to obtain an updated student image processing model; if the current training process does not meet the training termination condition, the parameters of the first feature transformation model are updated using the feature difference loss to obtain an updated first feature transformation model; the updated student image processing model is trained based on the updated first feature transformation model to obtain a target image processing model.
  • the third acquisition unit 603 is used to align the size of the second image feature with the size of the first image feature through linear interpolation to obtain an intermediate image feature; and align the number of channels of the intermediate image feature with the number of channels of the first image feature through channel transformation convolution to obtain an aligned image feature.
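Exemplarily, the alignment performed by the third acquisition unit 603 (size alignment via linear interpolation followed by channel alignment via a channel transformation convolution) may be sketched as follows; bilinear interpolation and a 1x1 convolution are used here as concrete choices, and the feature shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    """Aligns a student feature to the teacher feature's size and channel count."""
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # Channel transformation convolution: a 1x1 kernel changes only the channel count.
        self.channel_align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, second_feat: torch.Tensor, first_feat: torch.Tensor) -> torch.Tensor:
        # Size alignment via (bi)linear interpolation -> intermediate image feature.
        intermediate = F.interpolate(second_feat, size=first_feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
        # Channel alignment -> aligned image feature.
        return self.channel_align(intermediate)

# Hypothetical shapes: a 128x28x28 student feature aligned to a 256x56x56 teacher feature.
aligned = FeatureAligner(128, 256)(torch.randn(2, 128, 28, 28), torch.randn(2, 256, 56, 56))
assert aligned.shape == (2, 256, 56, 56)
```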
  • an embodiment of the present application provides an image processing device, the device comprising:
  • a first acquisition unit 701 is used to acquire a target image to be processed
  • the second acquisition unit 702 is used to input the target image into the target image processing model to obtain the target processing result output by the target image processing model; wherein the target image processing model is trained using any of the above-mentioned image processing model training methods.
  • when the device provided in the above embodiment implements its functions, the division into the above functional units is merely used as an example for illustration.
  • in practical applications, the above functions can be assigned to different functional units as needed, that is, the internal structure of the device can be divided into different functional units to complete all or part of the functions described above.
  • the device provided in the above embodiment belongs to the same concept as the method embodiment, and its specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • the effect achieved by the device provided in the above embodiment is the same as the effect achieved by the method embodiment, which will not be repeated here.
  • a computer device is also provided, see Fig. 8, the computer device includes a processor 801 and a memory 802, and at least one computer program is stored in the memory 802. The at least one computer program is loaded and executed by one or more processors 801, so that the computer device implements any of the above-mentioned image processing model training methods or image processing methods.
  • a computer-readable storage medium in which at least one computer program is stored.
  • the at least one computer program is loaded and executed by a processor of a computer device so that the computer implements any of the above-mentioned image processing model training methods or image processing methods.
  • the above-mentioned computer readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a compact disc (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, etc.
  • a computer program product which includes a computer program or computer instructions, which are loaded and executed by a processor to enable a computer to implement any of the above-mentioned image processing model training methods or image processing methods.


Abstract

Training of an image processing model, image processing method, apparatus, device and medium, belonging to the field of computer vision technology. The training method of the image processing model includes: obtaining a sample image (401); inputting the sample image into a teacher image processing model to obtain a first image feature (402); inputting the sample image into a student image processing model to obtain a second image feature, and aligning the second image feature with the first image feature to obtain an aligned image feature (403); transforming the aligned image feature using a first feature transformation model to obtain a third image feature (404); obtaining a feature difference loss based on the difference between the first image feature and the third image feature, and obtaining a training loss based on the feature difference loss (405); and updating the parameters of the student image processing model using the training loss to obtain a target image processing model (406).

Description

图像处理模型的训练、图像处理方法、装置、设备及介质
本申请要求于2022年09月28日提交的申请号为202211196707.5、发明名称为“基于可学习特征变换的神经网络知识蒸馏方法”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请属于计算机视觉技术领域,涉及计算机视觉、神经网络模型压缩、基于中间特征的神经网络知识蒸馏等深度学习技术,特别涉及一种图像处理模型的训练、图像处理方法、装置、设备及介质。
背景技术
近年来,随着深度学习技术的不断发展,深度卷积神经网络被广泛应用于诸如图像分类、目标检测、语义分割等计算机视觉任务上,并在这些任务上取得了越来越好的表现。而在取得更好表现的背后,深度卷积神经网络模型的复杂度也越来越高,对计算资源和存储资源的需求日渐增大,使其难以在资源受限的设备,如移动设备和嵌入式平台上进行部署。为解决这一问题,需要使用到神经网络模型压缩技术。
知识蒸馏是目前神经网络模型压缩技术中一种重要的方法,该方法将大规模神经网络作为教师网络,将小规模神经网络作为学生网络,将教师网络的知识传递到学生网络中,进而获得一个复杂度低、性能好、易于部署的神经网络,达到模型压缩的目的。
目前,主流的知识蒸馏方法分为基于输出响应和基于中间特征的知识蒸馏,基于输出响应的知识蒸馏方法将教师模型尾层的预测结果作为监督信息,指导学生模型对教师模型的行为进行模仿。基于中间特征的知识蒸馏方法则将教师模型中间隐藏层的特征作为监督信号指导学生模型训练。
发明内容
本申请提出了一种图像处理模型的训练、图像处理方法、装置、设备及介质,可用于提高图像处理模型的训练效果。
本申请提供的技术方案是:
一种基于可学习特征变换的知识蒸馏方法,如图1所示,其步骤包括:
1)将输入数据输入教师模型,所述教师模型的中间层输出第一特征图,将所述输入数据输入学生模型,所述学生模型的中间层输出第二特征图;
2)将第二特征图与第一特征图进行空间维度和通道维度上的对齐,对齐后的特征图通过一个多层感知机模块得到第三特征图;同时,对对齐后的特征图的形状展开和转置,再通过另一个多层感知机模块得到变换后的特征图,再将变换后的特征图形状恢复成变换前的形状,得到第四特征图;
3)计算第一特征图和第三特征图间的均方差损失作为空间特征损失,计算第一特征图和第四特征图间的均方差损失作为通道特征损失,将所述空间特征损失和所述通道特征损失加权求和作为教师模型与学生模型间的知识蒸馏损失函数;
4)根据所述知识蒸馏损失函数,对学生模型进行训练实现知识蒸馏。
可选地,所述多层感知机模块为隐藏层数为1,激活函数为ReLU的多层感知机结构。
可选地,通过双线性插值和1×1卷积将所述第二特征图与所述第一特征图进行空间维度和通道维度上的对齐。
进一步,取得所述学生模型的下游任务,根据下游任务类型匹配模型的目标函数,将目 标函数和知识蒸馏损失函数组合对学生模型进行训练。
进一步,根据所述教师模型、所述学生模型、所述下游任务调整所述蒸馏损失函数的超参数,将所述目标函数中的回归损失函数、分类损失函数和知识蒸馏损失函数求和获得所述学生模型训练的总损失函数,根据该总损失函数对所述学生模型进行训练。
本申请提供一种基于可学习特征变换的知识蒸馏方法,对齐教师模型和学生模型的特征,提高蒸馏效果,同时无需针对不同任务设计复杂的特征变换模块,不引入复杂的超参数,免去了繁琐的参数调整步骤,提高了知识蒸馏在多个任务上的通用性,在多种计算机视觉任务上均能取得不错的效果。
本申请实施例提供了一种图像处理模型的训练方法,所述方法包括:
获取样本图像;
将所述样本图像输入教师图像处理模型,获取所述教师图像处理模型输出的第一图像特征;
将所述样本图像输入学生图像处理模型,获取所述学生图像处理模型输出的第二图像特征,将所述第二图像特征与所述第一图像特征对齐,得到对齐图像特征;
利用第一特征变换模型对所述对齐图像特征进行变换,得到第三图像特征,所述第一特征变换模型的参数基于图像处理模型的训练过程学习得到;
基于所述第一图像特征和所述第三图像特征之间的差异,获取特征差异损失;
基于所述特征差异损失,获取训练损失;
利用所述训练损失更新所述学生图像处理模型的参数,得到目标图像处理模型。
在一种可能实现方式中,所述基于所述第一图像特征和所述第三图像特征之间的差异,获取特征差异损失,包括:
保持所述对齐图像特征的通道数不变,将所述对齐图像特征的尺寸由第一尺寸调整为第二尺寸,得到调整图像特征;
利用第二特征变换模型对所述调整图像特征进行变换,得到变换图像特征,将所述变换图像特征的尺寸由所述第二尺寸恢复为所述第一尺寸,得到第四图像特征,所述第二特征变换模型的参数基于所述图像处理模型的训练过程学习得到;
基于所述第一图像特征和所述第三图像特征之间的差异,获取第一差异损失;
基于所述第一图像特征和所述第四图像特征之间的差异,获取第二差异损失;
基于所述第一差异损失和所述第二差异损失,获取所述特征差异损失。
在一种可能实现方式中,所述获取所述学生图像处理模型输出的第二图像特征,包括:
获取所述学生图像处理模型输出的第二图像特征以及预测处理结果;
所述基于所述特征差异损失,获取训练损失,包括:
基于所述预测处理结果和所述样本图像对应的标准处理结果之间的差异,获取处理结果损失;
基于所述特征差异损失和所述处理结果损失,获取所述训练损失。
在一种可能实现方式中,所述学生图像处理模型用于对图像进行与计算机视觉任务匹配的处理,所述获取所述学生图像处理模型输出的第二图像特征以及预测处理结果,包括:
获取所述学生图像处理模型输出的第二图像特征以及与所述计算机视觉任务匹配的预测处理结果;
所述基于所述预测处理结果和所述样本图像对应的标准处理结果之间的差异,获取处理结果损失,包括:
基于所述与所述计算机视觉任务匹配的预测处理结果和所述样本图像对应的与所述计算机任务匹配的标准处理结果之间的差异,获取所述处理结果损失。
在一种可能实现方式中,所述计算机视觉任务包括图像分类任务,所述与所述计算机视觉任务匹配的预测处理结果包括预测分类结果,所述与所述计算机视觉任务匹配的标准处理结果包括标准分类结果,所述处理结果损失基于所述预测分类结果和所述标准分类结果之间 的差异获取;或者,
所述计算机视觉任务包括语义分割任务,所述与所述计算机视觉任务匹配的预测处理结果包括预测分割结果,所述与所述计算机视觉任务匹配的标准处理结果包括标准分割结果,所述处理结果损失基于所述预测分割结果和所述标准分割结果获取;或者,
所述计算机视觉任务包括目标检测任务,所述与所述计算机视觉任务匹配的预测处理结果包括检测位置预测结果和检测类别预测结果,所述与所述计算机视觉任务匹配的标准处理结果包括检测位置标准结果和检测类别标准结果,所述处理结果损失基于所述检测位置预测结果和所述检测位置标准结果之间的差异,以及所述检测类别预测结果和所述检测类别标准结果之间的差异获取。
在一种可能实现方式中,所述利用所述训练损失更新所述学生图像处理模型的参数,得到目标图像处理模型,包括:
利用所述训练损失更新所述学生图像处理模型的参数,得到更新后的学生图像处理模型;
若当前训练过程不满足训练终止条件,利用所述特征差异损失更新所述第一特征变换模型的参数,得到更新后的第一特征变换模型;
基于所述更新后的第一特征变换模型对所述更新后的学生图像处理模型进行训练,得到所述目标图像处理模型。
在一种可能实现方式中,所述将所述第二图像特征与所述第一图像特征对齐,得到对齐图像特征,包括:
通过线性插值将所述第二图像特征的尺寸与所述第一图像特征的尺寸对齐,得到中间图像特征;
通过通道变换卷积将所述中间图像特征的通道数与所述第一图像特征的通道数对齐,得到所述对齐图像特征。
本申请实施例还提供了一种图像处理方法,所述方法包括:
获取待处理的目标图像;
将所述目标图像输入目标图像处理模型,获取所述目标图像处理模型输出的目标处理结果;其中,所述目标图像处理模型利用上述任一所述的图像处理模型的训练方法训练得到。
本申请实施例还提供了一种图像处理模型的训练装置,所述装置包括:
第一获取单元,用于获取样本图像;
第二获取单元,用于将所述样本图像输入教师图像处理模型,获取所述教师图像处理模型输出的第一图像特征;
第三获取单元,用于将所述样本图像输入学生图像处理模型,获取所述学生图像处理模型输出的第二图像特征,将所述第二图像特征与所述第一图像特征对齐,得到对齐图像特征;
变换单元,用于利用第一特征变换模型对所述对齐图像特征进行变换,得到第三图像特征,所述第一特征变换模型的参数基于图像处理模型的训练过程学习得到;
第四获取单元,用于基于所述第一图像特征和所述第三图像特征之间的差异,获取特征差异损失;基于所述特征差异损失,获取训练损失;
更新单元,用于利用所述训练损失更新所述学生图像处理模型的参数,得到目标图像处理模型。
在一种可能实现方式中,所述第四获取单元,用于保持所述对齐图像特征的通道数不变,将所述对齐图像特征的尺寸由第一尺寸调整为第二尺寸,得到调整图像特征;利用第二特征变换模型对所述调整图像特征进行变换,得到变换图像特征,将所述变换图像特征的尺寸由所述第二尺寸恢复为所述第一尺寸,得到第四图像特征,所述第二特征变换模型的参数基于所述图像处理模型的训练过程学习得到;基于所述第一图像特征和所述第三图像特征之间的差异,获取第一差异损失;基于所述第一图像特征和所述第四图像特征之间的差异,获取第二差异损失;基于所述第一差异损失和所述第二差异损失,获取所述特征差异损失。
在一种可能实现方式中,所述第三获取单元,用于获取所述学生图像处理模型输出的第 二图像特征以及预测处理结果;
所述第四获取单元,用于基于所述预测处理结果和所述样本图像对应的标准处理结果之间的差异,获取处理结果损失;基于所述特征差异损失和所述处理结果损失,获取所述训练损失。
在一种可能实现方式中,所述学生图像处理模型用于对图像进行与计算机视觉任务匹配的处理,所述第三获取单元,用于获取所述学生图像处理模型输出的第二图像特征以及与所述计算机视觉任务匹配的预测处理结果;
所述第四获取单元,用于基于所述与所述计算机视觉任务匹配的预测处理结果和所述样本图像对应的与所述计算机任务匹配的标准处理结果之间的差异,获取所述处理结果损失。
在一种可能实现方式中,所述计算机视觉任务包括图像分类任务,所述与所述计算机视觉任务匹配的预测处理结果包括预测分类结果,所述与所述计算机视觉任务匹配的标准处理结果包括标准分类结果,所述处理结果损失基于所述预测分类结果和所述标准分类结果之间的差异获取;或者,
所述计算机视觉任务包括语义分割任务,所述与所述计算机视觉任务匹配的预测处理结果包括预测分割结果,所述与所述计算机视觉任务匹配的标准处理结果包括标准分割结果,所述处理结果损失基于所述预测分割结果和所述标准分割结果获取;或者,
所述计算机视觉任务包括目标检测任务,所述与所述计算机视觉任务匹配的预测处理结果包括检测位置预测结果和检测类别预测结果,所述与所述计算机视觉任务匹配的标准处理结果包括检测位置标准结果和检测类别标准结果,所述处理结果损失基于所述检测位置预测结果和所述检测位置标准结果之间的差异,以及所述检测类别预测结果和所述检测类别标准结果之间的差异获取。
在一种可能实现方式中,所述更新单元,用于利用所述训练损失更新所述学生图像处理模型的参数,得到更新后的学生图像处理模型;若当前训练过程不满足训练终止条件,利用所述特征差异损失更新所述第一特征变换模型的参数,得到更新后的第一特征变换模型;基于所述更新后的第一特征变换模型对所述更新后的学生图像处理模型进行训练,得到所述目标图像处理模型。
在一种可能实现方式中,所述第三获取单元,用于通过线性插值将所述第二图像特征的尺寸与所述第一图像特征的尺寸对齐,得到中间图像特征;通过通道变换卷积将所述中间图像特征的通道数与所述第一图像特征的通道数对齐,得到所述对齐图像特征。
本申请实施例还提供了一种图像处理装置,所述装置包括:
第一获取单元,用于获取待处理的目标图像;
第二获取单元,用于将所述目标图像输入目标图像处理模型,获取所述目标图像处理模型输出的目标处理结果;其中,所述目标图像处理模型利用上述任一所述的图像处理模型的训练方法训练得到。
本申请实施例还提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行,以使所述计算机设备实现上述任一所述的图像处理模型的训练方法或者图像处理方法。
另一方面,还提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述至少一条计算机程序由处理器加载并执行,以使计算机实现上述任一所述的图像处理模型的训练方法或者图像处理方法。
另一方面,还提供了一种计算机程序产品,所述计算机程序产品包括计算机程序或计算机指令,所述计算机程序或所述计算机指令由处理器加载并执行,以使计算机实现上述任一所述的图像处理模型的训练方法或者图像处理方法。
本申请实施例提供的技术方案,先利用第一特征变换模型对对齐图像特征进行变换,然后再将变换后得到的第三图像特征与教师图像处理模型输出的第一图像特征进行比对来获取训练损失,利用此种训练损失对学生图像处理模型进行训练,能够使第三图像特征尽可能接 近第一图像特征,但由于第三图像特征与对齐图像特征之间经过了第一特征变换模型的变换,所以即使第三图像特征与第一图像特征非常接近,也能够保证对齐图像特征与第一图像特征之间有一定的差距,从而能够避免学生图像处理模型输出的第二图像特征过度拟合教师图像处理模型输出的第一图像特征的问题,帮助学生图像处理模型在学习教师图像处理模型输出的图像特征的同时,有更多的学习空间来关注自身模型的特点,进而提高学生图像处理模型的训练效果。
此外,第一特征变换模型的参数是基于图像处理模型的训练过程学习得到的参数,从而能够保证特征变换过程与模型训练过程的匹配度,进而保证特征变换的可靠性,提高训练损失的可靠性,进一步提高图像处理模型的训练效果。
附图说明
图1为本申请基于可学习特征变换的知识蒸馏方法的流程示意图;
图2为本申请实施例学生模型的训练过程架构示意图;
图3为本申请实施例提供的一种实施环境的示意图;
图4为本申请实施例提供的一种图像处理模型的训练方法的流程图;
图5为本申请实施例提供的一种图像处理方法的流程图;
图6为本申请实施例提供的一种图像处理模型的训练装置的示意图;
图7为本申请实施例提供的一种图像处理装置的示意图;
图8为本申请实施例提供的一种计算机设备的结构示意图。
具体实施方式
下面结合附图,通过实例进一步描述本申请,但不以任何方式限制本申请的范围。
以大规模目标检测数据集COCO为例,以在该数据上预训练好的RetinNet-rx101作为教师模型,并选取RetinaNet-R50作为学生模型来说明如何通过可学习变换模块进行目标检测任务上的知识蒸馏,如图1所示。
步骤S1:将输入数据输入教师模型得到所述教师模型的中间层输出的第一特征图,将所述输入数据输入学生模型得到所述学生模型的中间层输出的第二特征图,具体包括:
S11:将任意一批原始的训练图片输入进教师模型RetinNet-rx101中,在所述教师模型的FPN部分得到中间层输出的第一特征图。
S12:将所述训练图片输入进学生模型RetinaNet-R50中,在所述学生模型的FPN部分得到中间层输出的第二特征图。
步骤S2:利用多层感知机模块得到第三特征图和第四特征图,具体包括:
S21:通过双线性插值和1×1卷积将所述第二特征图与所述第一特征图进行空间维度和通道维度上的对齐,得到对齐后的特征图。
S22:将所述对齐后的特征图通过一个隐藏层数为1,激活函数为ReLU的多层感知机模块得到第三特征图。
S23:设所述对齐后的特征图形状为[N,C,H,W],将该特征图的形状通过展开和转置操作调整为[N,(H*W),C],将调整后的特征图通过一个隐藏层数为1,激活函数为ReLU的多层感知机模块得到变换后的特征图,再将变换后的特征图形状调整为[N,C,H,W]得到所述第四特征图。
步骤S3:根据所述第一特征图、第三特征图和第四特征图,计算所述教师模型和所述学生模型间的空间特征损失和通道特征损失,将所述空间特征损失和所述通道特征损失加权求和作为所述教师模型与所述学生模型间的知识蒸馏损失函数,具体包括:
S31:计算所述第一特征图和所述第三特征图间的均方差损失作为所述空间特征损失,其表达式为:
Loss_Spatial = MSELoss(feat_T, feat_S^S)
其中,feat_T为所述第一特征图,feat_S^S为所述第三特征图。
S32:计算所述第一特征图和所述第四特征图间的均方差损失作为所述通道特征损失,其表达式为:
Loss_Channel = MSELoss(feat_T, feat_S^C)
其中,feat_T为所述第一特征图,feat_S^C为所述第四特征图。
S33:将所述空间特征损失和所述通道特征损失加权求和得到所述知识蒸馏损失函数,其表达式为:
L distill=αLoss Spatial+βLoss Channel
其中,α,β为超参数,在本实施例中分别设定为2e-5和1e-6。
步骤S4:根据所述知识蒸馏损失函数,对学生模型进行训练实现知识蒸馏。
示例性地,学生模型的训练过程架构可以如图2所示。将输入图像输入教师模型得到教师模型的中间层输出的第一特征图(也即教师特征),将输入图像输入学生模型得到学生模型的中间层输出的第二特征图(也即学生特征)。将学生特征与教师特征进行对齐,将对齐后的特征图通过一个多层感知机得到第三特征图;将对齐后的特征图的形状通过展开和转置操作进行调整,将调整后的特征图通过另外一个多层感知机得到变换后的特征图,再将变换后的特征图的形状恢复为原来的形状,得到第四特征图。基于第三特征图和教师特征,获取空间特征蒸馏损失;基于第四特征图和教师特征,获取通道特征蒸馏损失;对通道特征蒸馏损失和空间特征蒸馏损失进行加权求和,得到蒸馏损失;基于蒸馏损失将教师模型的知识传递给学生模型,以实现对学生模型的训练。
进一步,取得所述学生模型的下游任务,在本实施例中,下游任务为目标检测任务。
步骤S5:根据所述下游任务类型匹配模型目标函数,在本实施例中,模型的目标函数分为回归损失函数和分类损失函数,所述回归损失函数表达式为:
L_reg = Σ_i SmoothL1(t_i, t_i^*)
其中t_i为预测的每一个anchor与Ground Truth(GT)的偏差,而t_i^*为每一个anchor与GT的真实偏差。
在本实施例中,所述分类损失函数采用Focal Loss,其表达式为:
L cls=-α t(1-p t) γlog(p t)
其中p t为样本被正确分类的概率值,α t,γ为超参数,在本实施例中分别设定为0.25,2.0。
步骤S6:根据教师模型、学生模型、下游任务调整所述蒸馏损失函数的超参数,目标函数、知识蒸馏损失函数和超参数获得所述学生模型训练的总损失函数;根据所述总损失函数对所述学生模型进行训练,其中所述总损失函数的表达式为:
L total=L reg+L cls+L distill
对于图像分类任务,在ImageNet数据集上的结果表明,使用ResNet34作为教师模型,ResNet18作为学生模型,采用本申请所提出的蒸馏方法进行知识蒸馏,可以将测试集上的Top-1准确率从69.9%提升到了71.4%;对于目标检测任务,在MSCOCO数据集上的结果表明,使用RetinaNet-RX101作为教师模型,RetinaNet-R50作为学生模型,采用本申请所提的知识蒸馏方法,可以将学生模型的mAP从37.4%提升到41.0%;对于语义分割任务,在CityScapes数据集上的结果表明,使用PSPNet-ResNet34作为教师模型,PSPNet-ResNet18作为学生模型,采用本申请所提的知识蒸馏方法,可以将学生模型的mIoU从69.9%提升到74.2%(注:ImageNet是一个大规模图像分类数据集,Top1-accuracy用于衡量图像分类准确率;MSCOCO是一个大规模数据集,包含目标检测等任务,bbox的mAP是衡量目标检测性能的一个指标;CityScapes是一个语义分割数据集,mIoU是衡量语义分割性能的一个指标。)此外,本申请也可用于实现跨模型的知识蒸馏,并能取得不错的效果。例如,对于图像分类任务,在Cifar-100数据集上,使用基于卷积神经网络架构的ResNet56作为教师模型,基于Transformer架 构的ViT-tiny作为学生模型,可以将学生模型的Top1-accuracy由57.8%提升至77.5%(注:Cifar100是一个小规模图像分类数据集)。
图3示出了本申请实施例提供的实施环境的示意图。该实施环境包括:终端11和服务器12。
本申请实施例提供的图像处理模型的训练方法可以由终端11执行,也可以由服务器12执行,还可以由终端11和服务器12共同执行,本申请实施例对此不加以限定。对于本申请实施例提供的图像处理模型的训练方法由终端11和服务器12共同执行的情况,服务器12承担主要计算工作,终端11承担次要计算工作;或者,服务器12承担次要计算工作,终端11承担主要计算工作;或者,服务器12和终端11二者之间采用分布式计算架构进行协同计算。
本申请实施例提供的图像处理方法可以由终端11执行,也可以由服务器12执行,还可以由终端11和服务器12共同执行,本申请实施例对此不加以限定。对于本申请实施例提供的图像处理方法由终端11和服务器12共同执行的情况,服务器12承担主要计算工作,终端11承担次要计算工作;或者,服务器12承担次要计算工作,终端11承担主要计算工作;或者,服务器12和终端11二者之间采用分布式计算架构进行协同计算。
图像处理模型的训练方法的执行设备与图像处理方法的执行设备可以相同,也可以不同,本申请实施例对此不加以限定。
在一种可能实现方式中,终端11可以是任何一种可与用户通过键盘、触摸板、触摸屏、遥控器、语音交互或手写设备等一种或多种方式进行人机交互的电子产品,例如PC(Personal Computer,个人计算机)、手机、智能手机、PDA(Personal Digital Assistant,个人数字助手)、可穿戴设备、PPC(Pocket PC,掌上电脑)、平板电脑、智能车机、智能电视、智能音箱、智能语音交互设备、智能家电、车载终端等。服务器12可以是一台服务器,也可以是由多台服务器组成的服务器集群,或者是一个云计算服务中心。终端11与服务器12通过有线或无线网络建立通信连接。
本领域技术人员应能理解上述终端11和服务器12仅为举例,其他现有的或今后可能出现的终端或服务器如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
基于上述图3所示的实施环境,本申请实施例提供一种图像处理模型的训练方法,该图像处理模型的训练方法由计算机设备执行,该计算机设备可以为终端11,也可以为服务器12,本申请实施例对此不加以限定。如图4所示,本申请实施例提供的图像处理模型的训练方法可以包括如下步骤401至步骤406。
在步骤401中,获取样本图像。
样本图像是对学生图像处理模型的参数更新一次所依据的图像,样本图像的数量为一个或多个。示例性地,样本图像的数量通常为多个,以保证学生图像处理模型的训练效果。样本图像的通道数和尺寸可以根据经验设置,也可以根据应用场景灵活调整,本申请实施例对此不加以限定。需要说明的是,本申请实施例中的样本图像等同于上述实施例中的输入数据。
示例性地,样本图像可以从样本图像库中提取,也可以从网络中爬取得到,还可以由其他设备发送给计算机设备等。
示例性地,样本图像可以是指开源的图像数据集中的图像,该开源的图像数据集可以是指与计算机视觉任务匹配的图像数据集。例如,若计算机视觉任务为目标检测任务,则图像数据集可以是指COCO(Common Objects in Context,上下文中的公共对象)数据集;若计算机视觉任务为图像分类任务,则图像数据集可以是指ImageNet数据集(一种图像分类数据集);若计算机视觉任务为语义分割任务,则图像数据集可以是指CityScapes数据集(一种语义分割数据集)。
在步骤402中,将样本图像输入教师图像处理模型,获取教师图像处理模型输出的第一 图像特征。
教师图像处理模型是用于为学生图像处理模型的训练过程提供监督信息的模型,也即指导学生图像处理模型的训练过程的模型。需要说明的是,本申请实施例中的“学生图像处理模型”和“教师图像处理模型”是基于其各自的功能进行命名的,其中,“学生图像处理模型”能够从其他模型中学习图像处理知识,“教师图像处理模型”能够将学习到的图像处理知识迁移给其他模型。在一些实施例中,“学生图像处理模型”和“教师图像处理模型”还可以通过其他方式进行命名,本申请实施例对此不加以限定。本申请实施例中的教师图像处理模型等同于上述实施例中的教师模型,本申请实施例中的第一图像特征等同于上述实施例中的第一特征图。
教师图像处理模型和学生图像处理模型构成一个知识蒸馏架构,教师图像处理模型用于将学习到的知识蒸馏到学生图像处理模型中,以实现对学生图像处理模型的训练。示例性地,该在知识蒸馏架构中,将大规模神经网络作为教师图像处理模型,将小规模神经网络作为学生图像处理模型,将教师图像处理模型的知识传递到学生图像处理模型中,进而获得一个复杂度低、性能好、易于部署的学生图像处理模型,达到模型压缩的目的。
示例性地,教师图像处理模型包括特征提取层,教师图像处理模型的特征提取层用于对输入教师图像处理模型的图像进行特征提取,特征提取层的数量可以为一个,也可以为多个,每个特征提取层均能够输出一个图像特征。示例性地,对于教师图像处理模型中的特征提取层的数量为多个的情况,第一个特征提取层用于对输入教师图像处理模型的图像进行特征提取,从第二个特征提取层开始,下一个特征提取层用于对前一个特征提取层输出的图像特征,或者用于对前一个特征提取层输出的图像特征与其他特征(如,输入的图像或者前边的特征提取层输出的图像特征)的融合特征进行特征提取。示例性地,多个特征提取层可以构成FPN(Feature Pyramid Networks,特征金字塔)形式。
示例性地,教师图像处理模型除包括特征提取层外,还可以包括任务处理层。教师图像处理模型中的任务处理层用于对教师图像处理模型的最后一个特征提取层提取的图像特征,或者最后一个特征提取层提取的图像特征与其他特征(如,输入的图像或者前边的特征提取层输出的图像特征)的融合特征进行处理,以输出预测处理结果。
示例性地,教师图像处理模型用于对图像进行与计算机视觉任务匹配的处理。教师图像处理模型的模型结构可以根据经验设置,也可以根据计算机视觉任务的类型灵活调整,本申请实施例对此不加以限定。示例性地,对于计算机视觉任务为目标检测任务的情况,教师图像处理模型的模型结构可以是指RetinNet-RX101模型(一种用于图像处理模型);对于计算机视觉任务为图像分类任务的情况,教师图像处理模型的模型结构可以是指ResNet34模型(一种用于图像处理模型);对于计算机视觉任务为语义分割任务的情况,教师图像处理模型的模型结构可以是指PSPNet-ResNet34模型(一种图像处理模型)。当然,教师图像处理模型的模型结构还可以为其他结构,如,从上述模型中选取部分层构成的结构等,本申请实施例在此不再一一赘述。
本申请实施例提供的图像处理模型的训练方法是一种基于中间特征的知识蒸馏方法,也就是说,用于对学生图像处理模型的训练提供指导信息的包括教师图像处理模型输出的图像特征。本申请实施例中,在将样本图像输入教师图像处理模型后,能够获取教师图像处理模型输出的第一图像特征,进而利用该第一图像特征为学生图像处理模型的训练提供指导信息。示例性地,教师图像处理模型输出的第一图像特征是指教师图像处理模型的特征提取层输出的第一图像特征。
示例性地,教师图像处理模型的特征提取层的数量可能为一个,也可能为多个。对于教师图像处理模型的特征提取层的数量为一个的情况,直接将该一个特征提取层提取的特征作为第一图像特征,此种情况下,第一图像特征的数量为一个;对于教师图像处理模型的特征提取层的数量为多个的情况,可以从多个特征提取层提取的多个图像特征中选取参考数量个图像特征作为第一图像特征。参考数量不大于特征提取层的总数量,参考数量可以根据经验设置,或者根据应用场景灵活调整。
示例性地,对于第一图像特征的数量为多个的情况,不同第一图像特征的尺寸可以相同,也可以不同;不同图像特征的通道数可以相同,也可以不同。
在步骤403中,将样本图像输入学生图像处理模型,获取学生图像处理模型输出的第二图像特征,将第二图像特征与第一图像特征对齐,得到对齐图像特征。
学生图像处理模型是指待训练的图像处理模型,在将样本图像输入学生图像处理模型后,能够获取学生图像处理模型输出的第二图像特征。示例性地,学生图像处理模型同样包括特征提取层,在将样本图像输入学生图像处理模型后,能够获取学生图像处理模型的特征提取层输出的第二图像特征。需要说明的是,本申请实施例中的学生图像处理模型等同于上述实施例中的学生模型,本申请实施例中的第二图像特征等同于上述实施例中的第二特征图。
示例性地,学生图像处理模型包括的特征提取层的数量可能与教师图像处理模型包括的特征提取层的数量相同,也可能与教师图像处理模型包括的特征提取层的数量不同,本申请实施例对此不加以限定。但是,无论哪种情况,均需要保证第二图像特征的数量与第一图像特征的数量相同,也即,从学生图像处理模型的各个特征提取层输出的图像特征中选取与第一图像特征的数量相同的图像特征作为第二图像特征。
学生图像处理模型的模型结构可以根据经验设置,也可以根据计算机视觉任务的类型灵活调整,本申请实施例对此不加以限定。示例性地,对于计算机视觉任务为目标检测任务,图像处理模型的模型结构是指RetinNet-RX101模型的情况,学生图像处理模型的结构可以是指RetinaNet-R50模型(一种图像处理模型);对于计算机视觉任务为图像分类任务,教师图像处理模型的模型结构是指ResNet34模型的情况,学生图像处理模型的结构可以是指ResNet18模型(一种用于图像处理模型);对于计算机视觉任务为语义分割任务,教师图像处理模型的模型结构是指PSPNet-ResNet34模型的情况,学生图像处理模型的结构可以是指PSPNet-ResNet18模型(一种图像处理模型)。当然,学生图像处理模型的模型结构还可以为其他结构,如,从上述模型中选取部分层构成的结构等,本申请实施例在此不再一一赘述。
在获取第二图像特征之后,可以建立第二图像特征和第一图像特征之间的对应关系,相互对应的一组特征中的第一图像特征用于为该组特征中的第二图像特征提供监督信息。
在获取第二图像特征后,将第二图像特征与第一图像特征对齐,以得到对齐图像特征。对齐图像特征的尺寸和第一图像特征的尺寸相同,对齐图像特征的通道数与第一图像特征的通道数相同。需要说明的是,本申请实施例中的对齐图像特征等同于上述实施例中的对齐后的特征图。对于第一图像特征(或第二图像特征)的数量为多个的情况,将第二图像特征与第一图像特征对齐是指将每个第二图像特征分别与每个第二图像特征对应的第一图像特征对齐。将每个第二图像特征分别与每个第二图像特征对应的第一图像特征对齐的原理相同,本申请以第一图像特征(或第二图像特征)的数量为一个为例进行说明。
在一种可能实现方式中,将第二图像特征与第一图像特征对齐,得到对齐图像特征的实现方式包括:通过线性插值将第二图像特征的尺寸与第一图像特征的尺寸对齐,得到中间图像特征;通过通道变换卷积将中间图像特征的通道数与第一图像特征的通道数对齐,得到对齐图像特征。
通过线性插值能够将第二图像特征的尺寸变换为第一图像特征的尺寸,以实现空间维度的对齐,将实现了空间维度的对齐后得到的图像特征作为中间图像特征。线性插值方式可以根据经验设置,也可以根据应用场景灵活调整,例如,线性插值方式可以是指双线性插值、双三次插值、area(区域)插值等。其中,双三次插值是指一种更加复杂的插值方法,它能创造出比双线性插值更平滑的图像边缘。
通过通道变换卷积能够将中间图像特征的通道数变换为第一图像特征的通道数,以实现通道维度的对齐,将实现了空间维度的对齐和通道维度的对齐后得到的图像特征作为对齐图像特征。通道变换卷积可以通过不改变图像特征的尺寸,仅改变图像特征的通道数的卷积核实现,例如,通过尺寸为1×1的卷积核实现对中间图像特征的通道变换卷积。
需要说明的是,以上所述将第二图像特征与第一图像特征对齐,得到对齐图像特征的实 现方式仅为示例性举例,本申请实施例并不局限于此。在一些实施例中,将第二图像特征与第一图像特征对齐,得到对齐图像特征的实现方式还可以是指:通过通道变换卷积将第二图像特征的通道数与第一图像特征的通道数对齐,得到中间图像特征;通过线性插值将中间图像特征的尺寸与第一图像特征的尺寸对齐,得到对齐图像特征。在另一些实施例中,将第二图像特征与第一图像特征对齐,得到对齐图像特征的实现方式还可以是指:将第二图像特征和第一图像特征输入对齐网络,得到对齐网络输出的对齐图像特征,其中,对齐网络用于以输入的第一图像特征为基准,对输入的第二图像特征进行对齐。
在示例性实施例中,在将样本图像输入学生图像处理模型后,除了能够获取学生图像处理模型输出的第二图像特征外,还能够获取学生图像处理模型输出的预测处理结果。示例性地,学生图像处理模型除了包括特征提取层外,还包括任务处理层。此种情况下,在将样本图像输入学生图像处理模型后,除了能够获取学生图像处理模型的特征提取层输出的第二图像特征外,还能够获取学生图像处理模型的任务处理层输出的预测处理结果。
示例性地,学生图像处理模型用于对图像进行与计算机视觉任务匹配的处理。计算机视觉任务可视为学生图像处理模型的下游任务。此种情况下,获取学生图像处理模型输出的预测处理结果是指获取学生图像处理模型输出的与计算机视觉任务匹配的预测处理结果。
示例性地,计算机视觉任务包括图像分类任务、语义分割任务和目标检测任务中的任一种。其中,图像分类任务用于确定出整个图像对应的类别,语义分割任务用于确定出图像中的各个像素分别对应的类别,目标检测任务用于检测出图像中的目标物所处的位置以及确定检测出的目标物的类别。示例性地,若计算机视觉任务包括图像分类任务,则学生图像处理模型的任务处理层可以包括一个分支,该一个分支用于输出预测分类结果,此种情况下,与计算机视觉任务匹配的预测处理结果包括预测分类结果。若计算机视觉任务包括语义分割任务,则学生图像处理模型的任务处理层可以包括一个分支,该一个分组用于输出预测分割结果,此种情况下,与计算机视觉任务匹配的预测处理结果包括预测分割结果。若计算机视觉任务包括目标检测任务,则学生图像处理模型的任务处理层可以包括两个分支,其中一个分支用于输出检测位置预测结果,另外一个分支用于输出检测类别预测结果,此种情况下,与计算机视觉任务匹配的预测处理结果包括检测位置预测结果和检测类别预测结果。
在步骤404中,利用第一特征变换模型对对齐图像特征进行变换,得到第三图像特征,第一特征变换模型的参数基于图像处理模型的训练过程学习得到。
本申请实施例中,在获取对齐图像特征后,利用第一特征变换模型对对齐图像特征进行变换,得到第三图像特征,然后再根据第三图像特征和教师图像处理模型输出的第一图像特征进行比对来计算训练损失,基于根据此种方式获取的训练损失对学生图像处理模型进行训练,能够使第三图像特征尽可能接近第一图像特征,但由于第三图像特征与对齐图像特征之间经过了第一特征变换模型的变换,所以即使第三图像特征与第一图像特征非常接近,也能够保证获取对齐图像特征所依据的第二图像特征与第一图像特征之间有一定的差距,从而能够避免学生图像处理模型过度拟合教师图像处理模型的问题,帮助学生图像处理模型在学习教师图像处理模型输出的图像特征的同时,有更多的学习空间来关注自身模型的特点,进而提高学生图像处理模型的训练效果。
第一特征变换模型用于基于可学习的参数对输入的图像特征进行变换,也就是说,第一特征变换模型的参数基于图像处理模型的训练过程学习得到,能够保证第一特征变换模型的变换过程与图像处理模型的训练过程的匹配度,从而保证特征变换的可靠性,保证根据变换后得到的第三图像特征进行图像处理模型的训练的可靠性。示例性地,第一特征变换模型还可以称为可学习变换模块、可学习变换模型等。需要说明的是,本申请实施例中的第三图像特征等同于上述实施例中的第三特征图。
示例性地,第一特征变换模型的参数基于图像处理模型的训练过程学习得到是指第一特征变换模型的参数随着图像处理模型的训练过程的迭代不断更新。也就是说,第N(N为不小于1的整数)次图像处理模型的训练过程所利用的第一特征变换模型的参数是基于前(N- 1)次图像处理模型的训练过程学习得到的。在前(N-1)次图像处理模型的训练过程中,每执行一次图像处理模型的训练过程,则根据当前次图像处理模型的训练过程中所获取的特征差异损失更新一次第一特征变换模型的参数。
第一特征变换模型的结构可以根据经验设置,也可以根据经验场景灵活调整,只要保证第一特征变换模型具有可学习的参数即可。示例性地,第一特征变换模型可以是指一个多层感知机,多层感知机的结构较为简单,能够减少特征变换所需的计算量,降低参数调整的复杂性。示例性地,多层感知机的隐藏层的数量以及多层感知机所利用的激活函数的类型均可以根据经验设置,或者根据应用场景灵活调整。例如,多层感知机的隐藏层的数量可以为1,也可以为2等。多层感知机所利用的激活函数可以是指ReLU(Rectified Linear Unit,线性整流函数),也可以是指Sigmoid(S型)函数等。
需要说明的是,第一特征变换模型的变换过程不改变图像特征的尺寸和通道数,也就是说,第三图像特征的尺寸和通道数分别与对齐图像特征的尺寸和通道数相同,又由于对齐图像特征的尺寸和通道数分别与第一图像特征的尺寸和通道数相同,所以,第三图像特征的尺寸和通道数分别与第一图像特征的尺寸和通道数相同,以便于衡量第三图像特征和第一图像特征之间的差异。
在步骤405中,基于第一图像特征和第三图像特征之间的差异,获取特征差异损失;基于特征差异损失,获取训练损失。
特征差异损失用于为学生图像处理模型提供特征提取方面的监督信息。
在一种可能实现方式中,基于第一图像特征和第三图像特征之间的差异,获取特征差异损失的实现方式可以为:基于第一图像特征和第三图像特征之间的差异,获取第一差异损失;基于第一差异损失,获取特征差异损失。
两个图像特征之间的差异可以通过将两个图像特征代入损失函数后计算得到的结果体现,损失函数的类型可以根据经验选定,例如,损失函数的类型可以包括但不限于交叉熵损失函数、均方误差损失函数、KL(Kullback-Leibler)散度损失函数等。
示例性地,基于第一图像特征和第三图像特征之间的差异,获取第一差异损失的过程包括:将第一图像特征和第三图像特征代入损失函数进行计算,基于计算得到的结果获取第一差异损失。例如,将计算得到的结果作为第一差异损失,或者,对计算得到的结果进行处理(如,取整、乘以一个正数、加上一个正数等),将处理后得到的结果作为第一差异损失。
例如,以基于均方误差损失函数计算第一差异损失为例,第一差异损失可以基于公式1计算得到:
Loss_Spatial = MSELoss(feat_T, feat_S^S)   (公式1)
其中,Loss_Spatial表示第一差异损失;MSELoss(,)表示均方误差损失函数的表达式,用于计算括号内的两项信息之间的均方误差损失;feat_T表示第一图像特征;feat_S^S表示第三图像特征。在一些实施例中,第一差异损失还可以称为空间特征损失。
在获取第一差异损失后,基于第一差异损失,获取特征差异损失。在示例性实施例中,基于第一差异损失,获取特征差异损失的方式可以为:将第一差异损失作为特征差异损失,此种方式能够提高获取特征差异损失的效率。在示例性实施例中,基于第一差异损失,获取特征差异损失的方式还可以为:基于第一图像特征和第四图像特征之间的差异,获取第二差异损失;基于第一差异损失和第二差异损失,获取特征差异损失。其中,第四图像特征是在第二图像特征的基础上获取的用于与第一图像特征比对的与第三图像特征不同的特征。通过综合考虑第一差异损失和第二差异损失来获取特征差异损失,有利于提高特征差异损失的全面性和可靠性,进而提高训练损失的可靠性,以及提高利用训练损失对学生图像处理模型的训练效果。
示例性地,第四图像特征的获取方式可以为:保持对齐图像特征的通道数不变,将对齐图像特征的尺寸由第一尺寸调整为第二尺寸,得到调整图像特征;利用第二特征变换模型对 调整图像特征进行变换,得到变换图像特征,将变换图像特征的尺寸由第二尺寸恢复为第一尺寸,得到第四图像特征。需要说明的是,本申请实施例中的第四图像特征等同于上述实施例中的第四特征图。
调整图像特征与对齐图像特征相比,通道数保持不变,尺寸发生了变化。第一尺寸为对齐图像特征的原尺寸,第二尺寸为调整图像特征的尺寸,第一尺寸和第二尺寸的关系可以根据经验设置,或者根据应用场景灵活调整。示例性地,第一尺寸和第二尺寸的关系可以为第一尺寸中的宽度和高度的乘积与第二尺寸中的宽度和高度的乘积相同。例如,第一尺寸可以是指宽度为W高度为H,第二尺寸可以是指宽度为(W*H)高度为1,或者宽度为1高度为(W*H)。示例性地,将对齐图像特征的尺寸由第一尺寸调整为第二尺寸的过程可以通过裁剪以及拼接实现。
示例性地,在将对齐图像特征的尺寸由第一尺寸调整为第二尺寸的过程中,还可以执行转置操作,例如,对齐图像特征的维度可以表示为[N,C,H,W],将对齐图像特征图的尺寸由第一尺寸调整为第二尺寸以及执行转置操作后,调整图像特征的维度可以表示为[N,(H*W),1,C]或[N,1,(H*W),C]。其中,N(N为正整数)表示样本图像的数量,C(C为正整数)表示对齐图像特征的通道数,H(H为正数)表示对齐图像特征的高度,W(W为正数)表示对齐图像特征的宽度。示例性地,由于调整图像特征与对齐图像特征的通道数是相同的,仅在尺寸维度进行了调整,因此,调整图像特征可视为弱化图像特征的尺寸维度的信息,更加关注图像特征的通道维度的信息的特征。
在获取调整图像特征后,利用第二特征变换模型对调整图像特征进行变换,得到变换图像特征。其中,第二特征变换模型的参数基于图像处理模型的训练过程学习得到。也就是说,第二特征变换模型的参数随着图像处理模型的训练过程的迭代不断更新,从而保证第二特征变换模型的特征变换过程与图像处理模型的训练过程的匹配度,提高第二特征变换模型的变换可靠性。
第二特征变换模型的结构可以根据经验设置,也可以根据经验场景灵活调整。示例性地,第二特征变换模型可以是指一个多层感知机,多层感知机的结构较为简单,能够减少特征变换所需的计算量,降低参数调整的复杂性。示例性地,多层感知机的隐藏层的数量以及多层感知机所利用的激活函数的类型均可以根据经验设置,或者根据应用场景灵活调整。例如,多层感知机的隐藏层的数量可以为1,也可以为2等。多层感知机所利用的激活函数可以是指ReLU,也可以是指Sigmoid函数等。示例性地,第二特征变换模型的结构可以与第一特征变换模型的结构相同,也可以与第一特征变换模型的结构不同。示例性地,第二特征变换模型同样为一个参数可学习的模型,也即第二特征变换模型的参数可以在图像处理模型的训练过程中不断更新,以保证特征变换过程与训练过程的匹配度,提高特征变换的可靠性。
It should be noted that the transformation performed by the second feature transformation model does not change the size or the channel count of the image feature; that is, the transformed image feature has the same size and channel count as the adjusted image feature. Because the adjusted image feature differs in size from the aligned image feature, after the transformed image feature is obtained it must be restored in the size dimensions: its size is restored from the second size to the first size, and the feature obtained after this restoration is taken as the fourth image feature. The channel count stays unchanged during the restoration; that is, the fourth image feature matches the aligned image feature, and therefore the first image feature, in both size and channel count, which makes it convenient to measure the difference between the fourth image feature and the first image feature. For example, if the dimensions of the adjusted image feature are [N, (H*W), 1, C] or [N, 1, (H*W), C], the dimensions of the fourth image feature may be [N, C, H, W].
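The following sketch illustrates one way to obtain the fourth image feature from the aligned image feature (PyTorch assumed; mlp2 stands for the second feature transformation model, here assumed to be an MLP acting on the last dimension, e.g. nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C)); the particular reshape and transpose order is one admissible choice rather than a prescribed one):

```python
import torch
import torch.nn as nn

def channel_branch(feat_aligned: torch.Tensor, mlp2: nn.Module) -> torch.Tensor:
    # feat_aligned: [N, C, H, W]  (first size: H x W)
    n, c, h, w = feat_aligned.shape
    # Resize with a transpose to the second size: [N, 1, (H*W), C]
    adjusted = feat_aligned.reshape(n, c, h * w).permute(0, 2, 1).unsqueeze(1)
    # Second feature transformation model; it changes neither size nor channel count
    transformed = mlp2(adjusted)
    # Restore from the second size back to the first size: fourth image feature [N, C, H, W]
    return transformed.squeeze(1).permute(0, 2, 1).reshape(n, c, h, w)
```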
The principle of obtaining the second difference loss based on the difference between the first image feature and the fourth image feature is the same as that of obtaining the first difference loss based on the difference between the first image feature and the third image feature, and is not repeated here.
For example, again taking the MSE loss function as an illustration, the second difference loss can be computed by Formula 2:
Loss_Channel = MSELoss(feat_T, feat''_S)   (Formula 2)
where Loss_Channel denotes the second difference loss; MSELoss(·,·) denotes the mean squared error loss between the two terms in parentheses; feat_T denotes the first image feature; and feat''_S denotes the fourth image feature. In some embodiments, the second difference loss may also be called the channel feature loss.
In an exemplary embodiment, the feature difference loss may be obtained from the first difference loss and the second difference loss either as their sum or as their weighted sum. When the weighted sum of the first difference loss and the second difference loss is used as the feature difference loss, the weights of the two losses can be set empirically or adjusted according to the application scenario.
For example, the feature difference loss may be obtained from the first difference loss and the second difference loss according to Formula 3:
L_distill = α·Loss_Spatial + β·Loss_Channel   (Formula 3)
where L_distill denotes the feature difference loss; Loss_Spatial denotes the first difference loss; Loss_Channel denotes the second difference loss; α denotes the weight of the first difference loss; and β denotes the weight of the second difference loss. α and β are hyperparameters that can be set flexibly according to experience; for example, α and β may be set to 2e-5 and 1e-6, respectively.
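Combining the above, a non-authoritative sketch of Formulas 1 to 3 (PyTorch assumed, reusing spatial_feature_loss and channel_branch from the sketches above; the default weights are simply the example values mentioned here):

```python
import torch
import torch.nn.functional as F

def feature_difference_loss(feat_teacher, feat_aligned, transform1, mlp2,
                            alpha: float = 2e-5, beta: float = 1e-6) -> torch.Tensor:
    loss_spatial = spatial_feature_loss(feat_teacher, feat_aligned, transform1)  # Formula 1
    loss_channel = F.mse_loss(channel_branch(feat_aligned, mlp2), feat_teacher)  # Formula 2
    return alpha * loss_spatial + beta * loss_channel                            # Formula 3
```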
After the feature difference loss is obtained, the training loss is obtained on its basis. The training loss is the loss used directly to update the parameters of the student image processing model. How the training loss is obtained from the feature difference loss can be set empirically or adjusted according to the application scenario, and is not limited in the embodiments of the present application.
In an exemplary embodiment, the feature difference loss may simply be used as the training loss, which improves the efficiency of obtaining the training loss.
In an exemplary embodiment, for the case where, after the sample image is input into the student image processing model, a predicted processing result is obtained in addition to the second image feature output by the student image processing model, the training loss may also be obtained as follows: obtain a processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image, and obtain the training loss based on the feature difference loss and the processing result loss. Considering both the feature difference loss and the processing result loss makes the training loss more comprehensive and reliable, which in turn improves the training effect of the image processing model.
The standard processing result is the ground-truth processing result corresponding to the sample image, and it provides supervision information for the predicted processing result output by the student image processing model. The standard processing result can be determined by a technician. Illustratively, when the student image processing model is used to process images in a manner matching a computer vision task, the standard processing result is the standard processing result matching that computer vision task, and its type depends on the type of the task. When the computer vision task includes an image classification task, the standard processing result matching the task includes a standard classification result; when the computer vision task includes a semantic segmentation task, the standard processing result matching the task includes a standard segmentation result; when the computer vision task includes an object detection task, the standard processing result matching the task includes a standard detection position result and a standard detection category result.
When the predicted processing result is one matching the computer vision task and the standard processing result is one matching the computer vision task, obtaining the processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image means obtaining the processing result loss based on the difference between the task-matched predicted processing result and the task-matched standard processing result corresponding to the sample image.
The processing result loss measures the difference between the predicted processing result output by the student image processing model and the standard processing result: the larger this difference, the larger the processing result loss.
Illustratively, when the processing result loss is obtained from the difference between the task-matched predicted processing result and the task-matched standard processing result corresponding to the sample image, the way the processing result loss is obtained depends on the type of the computer vision task. When the computer vision task includes an image classification task, the processing result loss is obtained from the difference between the predicted classification result and the standard classification result; when the computer vision task includes a semantic segmentation task, from the difference between the predicted segmentation result and the standard segmentation result; and when the computer vision task includes an object detection task, from the difference between the predicted detection position result and the standard detection position result together with the difference between the predicted detection category result and the standard detection category result.
The difference between two results can be expressed by the value obtained by substituting the two results into a loss function. The loss functions used to compute the differences between different pairs of results may be the same or different, which is not limited in the embodiments of the present application.
Illustratively, obtaining the processing result loss from the difference between the predicted classification result and the standard classification result may be: substituting the predicted classification result and the standard classification result into the loss function corresponding to the image classification task, and obtaining the processing result loss from the computed value. Loss functions corresponding to the image classification task include, but are not limited to, the cross-entropy loss function and the MSE loss function.
Illustratively, obtaining the processing result loss from the difference between the predicted segmentation result and the standard segmentation result may be: substituting the predicted segmentation result and the standard segmentation result into the loss function corresponding to the semantic segmentation task, and obtaining the processing result loss from the computed value. Loss functions corresponding to the semantic segmentation task include, but are not limited to, the cross-entropy loss function and the MSE loss function.
Illustratively, obtaining the processing result loss from the difference between the predicted and standard detection position results and the difference between the predicted and standard detection category results may be: substituting the predicted detection position result and the standard detection position result into a first loss function corresponding to the object detection task and obtaining a first detection loss from the computed value; substituting the predicted detection category result and the standard detection category result into a second loss function corresponding to the object detection task and obtaining a second detection loss from the computed value; and obtaining the processing result loss based on the first detection loss and the second detection loss.
The first loss function measures the accuracy of the positions of the objects detected in the object detection task, and the second loss function measures the accuracy of the categories of the detected objects. Illustratively, the first loss function includes, but is not limited to, L1 Loss (L1-norm loss), L2 Loss (L2-norm loss), Smooth L1 Loss, and the IoU (Intersection over Union) loss; the second loss function includes, but is not limited to, the cross-entropy loss function and Focal Loss. In some embodiments, the first loss function is also called the regression loss function and the second loss function the classification loss function.
For example, taking Smooth L1 Loss as the first loss function and Focal Loss as the second loss function, the first detection loss can be computed by Formula 4 and the second detection loss by Formula 5:
L_reg = Σ_{i∈{x, y, w, h}} SmoothL1(t_i, t*_i)   (Formula 4)
L_cls = -α_t·(1 - p_t)^γ·log(p_t)   (Formula 5)
where L_reg denotes the first detection loss; SmoothL1(·,·) denotes the Smooth L1 Loss expression; t_i denotes element i of the predicted detection position result; t*_i denotes element i of the standard detection position result; element i is one of x, y, w and h, where x and y denote the coordinates of a reference point of the detection position (e.g., the top-left corner, the top-right corner, or the center point), and w and h denote the width and height of the detection position.
L_cls denotes the second detection loss; p_t measures how close the predicted detection category result is to the standard detection category result, and its expression is given in Formula 6; α_t and γ are hyperparameters that can be set empirically or adjusted according to the application scenario, for example 0.25 and 2.0, respectively.
p_t = p, if y = 1; p_t = 1 - p, otherwise   (Formula 6)
where p denotes the probability that the detected object is correctly classified, and y = 1 indicates that the detected object is correctly classified.
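As an illustrative sketch only of Formulas 4 to 6 (PyTorch assumed; encoding the boxes as flat tensors of x, y, w, h elements and applying the binary form of p_t elementwise are simplifying assumptions, not details prescribed here):

```python
import torch
import torch.nn.functional as F

def detection_losses(box_pred, box_gt, cls_prob, cls_gt,
                     alpha_t: float = 0.25, gamma: float = 2.0):
    # Formula 4: Smooth L1 summed over the box elements (x, y, w, h)
    l_reg = F.smooth_l1_loss(box_pred, box_gt, reduction="sum")
    # Formula 6: p_t = p where the label is positive (y = 1), 1 - p otherwise
    p_t = torch.where(cls_gt == 1, cls_prob, 1.0 - cls_prob)
    # Formula 5: focal loss
    l_cls = (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()
    return l_reg, l_cls
```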
Illustratively, obtaining the processing result loss based on the first detection loss and the second detection loss may mean taking their sum as the processing result loss, or taking their weighted sum as the processing result loss. When the weighted sum of the first detection loss and the second detection loss is used as the processing result loss, the weights of the two detection losses can be set empirically or adjusted according to the application scenario.
After the feature difference loss and the processing result loss are obtained, the training loss used to update the parameters of the student image processing model is obtained on their basis. Illustratively, the feature difference loss may also be called the knowledge distillation loss, the processing result loss the downstream task loss, and the training loss the total loss. Obtaining the training loss based on the feature difference loss and the processing result loss can be regarded as adjusting the hyperparameters of the distillation loss according to the downstream task, so as to obtain the total loss of the student image processing model.
Illustratively, the sum of the feature difference loss and the processing result loss may be used as the training loss, or their weighted sum may be used as the training loss. When the weighted sum of the feature difference loss and the processing result loss is used as the training loss, the weights of the two losses can be set empirically or adjusted according to the application scenario.
For example, taking the object detection task as the computer vision task, the training loss can be computed by Formula 7:
L_total = L_reg + L_cls + L_distill   (Formula 7)
where L_total denotes the training loss; L_reg denotes the first detection loss; L_cls denotes the second detection loss; L_reg + L_cls denotes the processing result loss; and L_distill denotes the feature difference loss.
It should be noted that the above ways of obtaining the training loss based on the feature difference loss are only examples, and the embodiments of the present application are not limited to them. In some embodiments, the training loss may also be obtained as follows: obtain a reference processing result output by the teacher image processing model and a predicted processing result output by the student image processing model; obtain a result difference loss based on the difference between the reference processing result and the predicted processing result; and obtain the training loss based on the result difference loss and the feature difference loss, where the computer vision task corresponding to the teacher image processing model is of the same type as that corresponding to the student image processing model. In some embodiments, the training loss may further be obtained as follows: obtain the reference processing result output by the teacher image processing model and the predicted processing result output by the student image processing model; obtain a result difference loss based on the difference between the reference processing result and the predicted processing result; obtain a processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image; and obtain the training loss based on the feature difference loss, the result difference loss, and the processing result loss.
In step 406, the parameters of the student image processing model are updated using the training loss to obtain a target image processing model.
After the training loss is obtained, the parameters of the student image processing model are updated with it, which completes one training iteration of the student image processing model. Illustratively, updating the parameters of the student image processing model with the training loss may be: computing, based on the training loss, the update gradients of the parameters of the student image processing model, and updating the parameters according to those update gradients. For example, based on the training loss, the update gradients of the parameters of the student image processing model can be computed with gradient descent.
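A minimal sketch of one such training iteration under Formula 7 (PyTorch assumed; teacher, student, align, transform1 and mlp2 denote assumed module instances, the student returning both its image feature and its detection-head outputs is an assumption, and placing the learnable transforms in the same optimizer is one way to realize the updates: since only the feature difference loss depends on the transform parameters, their gradients from the total loss coincide with those from the feature difference loss):

```python
import torch

optimizer = torch.optim.SGD(
    list(student.parameters()) + list(transform1.parameters()) + list(mlp2.parameters()),
    lr=0.01, momentum=0.9,
)

def train_step(sample_images, box_gt, cls_gt):
    with torch.no_grad():                                         # pre-trained, fixed teacher
        feat_teacher = teacher(sample_images)                     # first image feature
    feat_student, (box_pred, cls_prob) = student(sample_images)   # second image feature + prediction
    feat_aligned = align(feat_student, feat_teacher)              # aligned image feature
    l_distill = feature_difference_loss(feat_teacher, feat_aligned, transform1, mlp2)
    l_reg, l_cls = detection_losses(box_pred, box_gt, cls_prob, cls_gt)
    l_total = l_reg + l_cls + l_distill                           # Formula 7
    optimizer.zero_grad()
    l_total.backward()
    optimizer.step()
    return l_total.item()
```

Here, align stands for the alignment of the second image feature to the first image feature (size alignment by linear interpolation plus a channel-conversion convolution); a sketch of it appears further below.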
In an exemplary embodiment, updating the parameters of the student image processing model with the training loss to obtain the target image processing model includes: updating the parameters of the student image processing model with the training loss to obtain an updated student image processing model; judging whether the current training process satisfies a training termination condition; if the current training process satisfies the training termination condition, taking the updated student image processing model as the target image processing model; if the current training process does not satisfy the training termination condition, training the updated student image processing model until the training termination condition is satisfied, and taking the image processing model obtained when the condition is satisfied as the target image processing model.
The training termination condition can be set empirically or adjusted according to the application scenario, which is not limited in the embodiments of the present application. Illustratively, the current training process satisfying the training termination condition includes, but is not limited to, any of the following: the number of training iterations of the image processing model executed in the current training process reaches a threshold number; the training loss obtained in the current training process for updating the parameters of the student image processing model is smaller than a loss threshold; or that training loss has converged. Both the threshold number and the loss threshold are set empirically or adjusted according to the application scenario.
If the current training process satisfies the training termination condition, the training of the image processing model ends and the updated student image processing model obtained at that point is taken as the target image processing model. If the current training process does not satisfy the training termination condition, the updated student image processing model needs to be trained further.
In an exemplary embodiment, training the updated student image processing model includes: updating the parameters of the first feature transformation model with the feature difference loss to obtain an updated first feature transformation model, and training the updated student image processing model on the basis of the updated first feature transformation model. Because the feature difference loss is derived from image features produced by the first feature transformation model and its computation therefore involves that model, the feature difference loss is used to update the parameters of the first feature transformation model. For example, the update gradients of the parameters of the first feature transformation model are computed from the feature difference loss, and those gradients are used to update the parameters.
Illustratively, training the updated student image processing model on the basis of the updated first feature transformation model differs from the process that produced the updated student image processing model only in that the student image processing model has been replaced by the updated student image processing model and the first feature transformation model by the updated first feature transformation model. The sample images used in the two processes may be the same or different.
Illustratively, when the feature difference loss is obtained using the second feature transformation model in addition to the first feature transformation model, the parameters of the second feature transformation model are also updated with the feature difference loss to obtain an updated second feature transformation model. In this case, the updated student image processing model is trained on the basis of both the updated first feature transformation model and the updated second feature transformation model.
It should be noted that, while the updated student image processing model is being trained, the teacher image processing model may or may not change. If the teacher image processing model is a pre-trained model, it does not change during the training of the updated student image processing model, that is, the sample image is still input into the original teacher image processing model for processing. If the teacher image processing model is trained in real time alongside the training of the student image processing model, it changes during the training of the updated student image processing model, that is, the sample image is input into the updated teacher image processing model for processing.
Illustratively, the teacher image processing model changes as follows: obtain a reference processing result output by the teacher image processing model; obtain the loss corresponding to the teacher image processing model based on the difference between the reference processing result and the standard processing result corresponding to the sample image; and update the parameters of the teacher image processing model with that loss to obtain an updated teacher image processing model.
Illustratively, taking the case where the teacher image processing model does not change during the training of the updated student image processing model, training the updated student image processing model on the basis of the updated first feature transformation model may include: obtaining a sample image; inputting the sample image into the teacher image processing model and obtaining the first image feature it outputs; inputting the sample image into the updated student image processing model, obtaining the fifth image feature it outputs, and aligning the fifth image feature with the first image feature to obtain an updated aligned image feature; transforming the updated aligned image feature with the updated first feature transformation model to obtain a sixth image feature; obtaining an updated feature difference loss based on the difference between the first image feature and the sixth image feature; obtaining an updated training loss based on the updated feature difference loss; updating the parameters of the updated student image processing model with the updated training loss to obtain a further-updated student image processing model; and, if the current training process satisfies the training termination condition, taking the further-updated student image processing model as the target image processing model.
Illustratively, again taking the case where the teacher image processing model does not change during the training of the updated student image processing model, training the updated student image processing model on the basis of the updated first feature transformation model and the updated second feature transformation model may include: obtaining a sample image; inputting the sample image into the teacher image processing model and obtaining the first image feature it outputs; inputting the sample image into the updated student image processing model, obtaining the fifth image feature it outputs, and aligning the fifth image feature with the first image feature to obtain an updated aligned image feature; transforming the updated aligned image feature with the updated first feature transformation model to obtain a sixth image feature; keeping the channel count of the updated aligned image feature unchanged and resizing it from the first size to the second size to obtain an updated adjusted image feature; transforming the updated adjusted image feature with the updated second feature transformation model to obtain an updated transformed image feature and restoring its size from the second size to the first size to obtain a seventh image feature; obtaining an updated first difference loss based on the difference between the first image feature and the sixth image feature; obtaining an updated second difference loss based on the difference between the first image feature and the seventh image feature; obtaining an updated feature difference loss based on the updated first difference loss and the updated second difference loss; obtaining an updated training loss based on the updated feature difference loss; updating the parameters of the updated student image processing model with the updated training loss to obtain a further-updated student image processing model; and, if the current training process satisfies the training termination condition, taking the further-updated student image processing model as the target image processing model.
In either case, after the target image processing model is obtained, it is used to process images; this process is described in detail in the embodiment shown in FIG. 5 and is not elaborated here.
In the related art, a wide variety of knowledge distillation methods have been derived for different computer vision tasks, and these methods often contain many hand-designed components such as loss functions and feature masks. Such hand-designed components reduce the generality of the distillation methods and introduce additional hyperparameters, which makes parameter tuning more difficult. In contrast, the training method for an image processing model provided by the embodiments of the present application realizes a learnable transformation of features based on a general feature transformation model with learnable parameters, improving the knowledge distillation effect and the training effect of the image processing model. It requires neither complex, task-specific feature transformation models nor complex hyperparameters, dispenses with tedious parameter-tuning steps, and improves the generality of knowledge distillation across multiple tasks. While improving the training effect of the image processing model, it avoids the burden of hand-designed structures and can achieve performance gains and good task-processing results on a variety of computer vision tasks (such as image classification, object detection, and semantic segmentation).
With the training method for an image processing model provided by the embodiments of the present application, the aligned image feature is first transformed by the first feature transformation model, and the resulting third image feature is then compared with the first image feature output by the teacher image processing model to obtain the training loss. Training the student image processing model with this training loss drives the third image feature to approach the first image feature as closely as possible; but because the third image feature is obtained from the aligned image feature through the transformation of the first feature transformation model, a certain gap between the aligned image feature and the first image feature is preserved even when the third image feature is very close to the first image feature. This avoids the problem of the second image feature output by the student image processing model overfitting the first image feature output by the teacher image processing model, and gives the student model more room to attend to its own characteristics while learning the image features output by the teacher model, thereby improving the training effect of the student image processing model.
In addition, the parameters of the first feature transformation model are learned during the training process of the image processing model, which keeps the feature transformation consistent with the model training process, ensures the reliability of the feature transformation, improves the reliability of the training loss, and further improves the training effect of the image processing model.
Based on the implementation environment shown in FIG. 3 above, an embodiment of the present application provides an image processing method. The image processing method is executed by a computer device, which may be the terminal 11 or the server 12, and this is not limited in the embodiments of the present application. As shown in FIG. 5, the image processing method provided by the embodiments of the present application may include the following steps 501 and 502.
In step 501, a target image to be processed is obtained.
The target image is an image that needs to be processed by the target image processing model. The embodiments of the present application do not limit how the target image is obtained; for example, the target image may be crawled from the Internet, extracted from a database, captured by an image acquisition device, cut from a video stream, or received from another device that sends or uploads it.
Illustratively, the channel count and size of the target image are the same as those of the sample image, so as to guarantee the processing effect of the target image processing model.
In step 502, the target image is input into the target image processing model, and the target processing result output by the target image processing model is obtained.
The target image processing model is trained with any of the training methods for an image processing model in the embodiment shown in FIG. 4.
The implementation principle of step 502 is the same as that of inputting the sample image into the student image processing model and obtaining the predicted processing result output by the student image processing model in the embodiment shown in FIG. 4, and is not repeated here. Illustratively, the target image processing model includes a task processing layer, and the target processing result is the processing result output by the task processing layer of the target image processing model.
Because the target image processing model is trained in a relatively reliable manner, the target processing result it outputs can be regarded as a highly reliable processing result corresponding to the target image. Illustratively, the type of the target processing result depends on the type of the computer vision task: if the computer vision task is an image classification task, the target processing result is a classification result; if it is a semantic segmentation task, the target processing result is a segmentation result; if it is an object detection task, the target processing result includes a detection position result and a detection category result.
With the image processing method provided by the embodiments of the present application, the target image is processed by a target image processing model with a good training effect, which helps guarantee the accuracy of the image processing of the target image.
Referring to FIG. 6, an embodiment of the present application provides a training apparatus for an image processing model, the apparatus including:
a first obtaining unit 601, configured to obtain a sample image;
a second obtaining unit 602, configured to input the sample image into a teacher image processing model and obtain a first image feature output by the teacher image processing model;
a third obtaining unit 603, configured to input the sample image into a student image processing model, obtain a second image feature output by the student image processing model, and align the second image feature with the first image feature to obtain an aligned image feature;
a transformation unit 604, configured to transform the aligned image feature with a first feature transformation model to obtain a third image feature, the parameters of the first feature transformation model being learned during the training process of the image processing model;
a fourth obtaining unit 605, configured to obtain a feature difference loss based on the difference between the first image feature and the third image feature, and obtain a training loss based on the feature difference loss; and
an updating unit 606, configured to update the parameters of the student image processing model with the training loss to obtain a target image processing model.
In one possible implementation, the fourth obtaining unit 605 is configured to: keep the channel count of the aligned image feature unchanged and resize the aligned image feature from a first size to a second size to obtain an adjusted image feature; transform the adjusted image feature with a second feature transformation model to obtain a transformed image feature, and restore the size of the transformed image feature from the second size to the first size to obtain a fourth image feature, the parameters of the second feature transformation model being learned during the training process of the image processing model; obtain a first difference loss based on the difference between the first image feature and the third image feature; obtain a second difference loss based on the difference between the first image feature and the fourth image feature; and obtain the feature difference loss based on the first difference loss and the second difference loss.
In one possible implementation, the third obtaining unit 603 is configured to obtain the second image feature output by the student image processing model and a predicted processing result;
and the fourth obtaining unit 605 is configured to obtain a processing result loss based on the difference between the predicted processing result and the standard processing result corresponding to the sample image, and obtain the training loss based on the feature difference loss and the processing result loss.
In one possible implementation, the student image processing model is used to process images in a manner matching a computer vision task, and the third obtaining unit 603 is configured to obtain the second image feature output by the student image processing model and a predicted processing result matching the computer vision task;
and the fourth obtaining unit 605 is configured to obtain the processing result loss based on the difference between the predicted processing result matching the computer vision task and the standard processing result, corresponding to the sample image, matching the computer vision task.
In one possible implementation, the computer vision task includes an image classification task, the predicted processing result matching the computer vision task includes a predicted classification result, the standard processing result matching the computer vision task includes a standard classification result, and the processing result loss is obtained based on the difference between the predicted classification result and the standard classification result; or
the computer vision task includes a semantic segmentation task, the predicted processing result matching the computer vision task includes a predicted segmentation result, the standard processing result matching the computer vision task includes a standard segmentation result, and the processing result loss is obtained based on the predicted segmentation result and the standard segmentation result; or
the computer vision task includes an object detection task, the predicted processing result matching the computer vision task includes a predicted detection position result and a predicted detection category result, the standard processing result matching the computer vision task includes a standard detection position result and a standard detection category result, and the processing result loss is obtained based on the difference between the predicted detection position result and the standard detection position result and the difference between the predicted detection category result and the standard detection category result.
In one possible implementation, the updating unit 606 is configured to: update the parameters of the student image processing model with the training loss to obtain an updated student image processing model; if the current training process does not satisfy a training termination condition, update the parameters of the first feature transformation model with the feature difference loss to obtain an updated first feature transformation model; and train the updated student image processing model on the basis of the updated first feature transformation model to obtain the target image processing model.
In one possible implementation, the third obtaining unit 603 is configured to align the size of the second image feature with the size of the first image feature through linear interpolation to obtain an intermediate image feature, and align the channel count of the intermediate image feature with the channel count of the first image feature through a channel-conversion convolution to obtain the aligned image feature.
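A minimal sketch of this alignment (PyTorch assumed; bilinear interpolation and a 1x1 convolution are used here as one interpretation of the linear interpolation and the channel-conversion convolution, not as the only admissible choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    """Aligns the student feature to the teacher feature in size and channel count."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # Channel-conversion convolution (1x1) mapping student channels to teacher channels
        self.channel_conv = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, feat_student: torch.Tensor, feat_teacher: torch.Tensor) -> torch.Tensor:
        # Size alignment by (bi)linear interpolation -> intermediate image feature
        intermediate = F.interpolate(feat_student, size=feat_teacher.shape[-2:],
                                     mode="bilinear", align_corners=False)
        # Channel alignment -> aligned image feature
        return self.channel_conv(intermediate)
```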
Referring to FIG. 7, an embodiment of the present application provides an image processing apparatus, the apparatus including:
a first obtaining unit 701, configured to obtain a target image to be processed; and
a second obtaining unit 702, configured to input the target image into a target image processing model and obtain a target processing result output by the target image processing model, the target image processing model being trained with any of the training methods for an image processing model described above.
It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the functional units above is only used as an example; in practical applications, the above functions can be assigned to different functional units as needed, that is, the internal structure of the device can be divided into different functional units to complete all or some of the functions described above. In addition, the apparatus provided in the above embodiments belongs to the same concept as the method embodiments; its specific implementation process is detailed in the method embodiments and is not repeated here. The effects achieved by the apparatus provided in the above embodiments are the same as those achieved by the method embodiments and are likewise not repeated here.
In an exemplary embodiment, a computer device is further provided. Referring to FIG. 8, the computer device includes a processor 801 and a memory 802, and at least one computer program is stored in the memory 802. The at least one computer program is loaded and executed by one or more processors 801, so that the computer device implements any of the above training methods for an image processing model or image processing methods.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which at least one computer program is stored; the at least one computer program is loaded and executed by a processor of a computer device, so that the computer implements any of the above training methods for an image processing model or image processing methods.
In one possible implementation, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is further provided, the computer program product including a computer program or computer instructions that are loaded and executed by a processor, so that the computer implements any of the above training methods for an image processing model or image processing methods.
It should be noted that the terms "first", "second", and the like in the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present application described here can be implemented in an order other than those illustrated or described here. The implementations described in the above exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application. It should be understood that "a plurality of" mentioned herein means two or more. "And/or" describes the association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the three cases of A alone, both A and B, and B alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The present application has been described above through detailed embodiments. Researchers and technicians in the art may make insubstantial changes in form or content on the basis of the above steps without departing from the scope substantively protected by the present application. Therefore, the present application is not limited to the content disclosed in the above embodiments, and the protection scope of the present application shall be subject to the claims.

Claims (12)

  1. A training method for an image processing model, wherein the method comprises:
    obtaining a sample image;
    inputting the sample image into a teacher image processing model, and obtaining a first image feature output by the teacher image processing model;
    inputting the sample image into a student image processing model, obtaining a second image feature output by the student image processing model, and aligning the second image feature with the first image feature to obtain an aligned image feature;
    transforming the aligned image feature with a first feature transformation model to obtain a third image feature, wherein parameters of the first feature transformation model are learned during a training process of the image processing model;
    obtaining a feature difference loss based on a difference between the first image feature and the third image feature;
    obtaining a training loss based on the feature difference loss; and
    updating parameters of the student image processing model with the training loss to obtain a target image processing model.
  2. The method according to claim 1, wherein the obtaining a feature difference loss based on a difference between the first image feature and the third image feature comprises:
    keeping a channel count of the aligned image feature unchanged, and resizing the aligned image feature from a first size to a second size to obtain an adjusted image feature;
    transforming the adjusted image feature with a second feature transformation model to obtain a transformed image feature, and restoring a size of the transformed image feature from the second size to the first size to obtain a fourth image feature, wherein parameters of the second feature transformation model are learned during the training process of the image processing model;
    obtaining a first difference loss based on the difference between the first image feature and the third image feature;
    obtaining a second difference loss based on a difference between the first image feature and the fourth image feature; and
    obtaining the feature difference loss based on the first difference loss and the second difference loss.
  3. The method according to claim 1, wherein the obtaining a second image feature output by the student image processing model comprises:
    obtaining the second image feature output by the student image processing model and a predicted processing result;
    and the obtaining a training loss based on the feature difference loss comprises:
    obtaining a processing result loss based on a difference between the predicted processing result and a standard processing result corresponding to the sample image; and
    obtaining the training loss based on the feature difference loss and the processing result loss.
  4. The method according to claim 3, wherein the student image processing model is used to process images in a manner matching a computer vision task, and the obtaining the second image feature output by the student image processing model and a predicted processing result comprises:
    obtaining the second image feature output by the student image processing model and a predicted processing result matching the computer vision task;
    and the obtaining a processing result loss based on a difference between the predicted processing result and a standard processing result corresponding to the sample image comprises:
    obtaining the processing result loss based on a difference between the predicted processing result matching the computer vision task and a standard processing result, corresponding to the sample image, matching the computer vision task.
  5. The method according to claim 4, wherein the computer vision task comprises an image classification task, the predicted processing result matching the computer vision task comprises a predicted classification result, the standard processing result matching the computer vision task comprises a standard classification result, and the processing result loss is obtained based on a difference between the predicted classification result and the standard classification result; or
    the computer vision task comprises a semantic segmentation task, the predicted processing result matching the computer vision task comprises a predicted segmentation result, the standard processing result matching the computer vision task comprises a standard segmentation result, and the processing result loss is obtained based on the predicted segmentation result and the standard segmentation result; or
    the computer vision task comprises an object detection task, the predicted processing result matching the computer vision task comprises a predicted detection position result and a predicted detection category result, the standard processing result matching the computer vision task comprises a standard detection position result and a standard detection category result, and the processing result loss is obtained based on a difference between the predicted detection position result and the standard detection position result and a difference between the predicted detection category result and the standard detection category result.
  6. The method according to claim 1, wherein the updating parameters of the student image processing model with the training loss to obtain a target image processing model comprises:
    updating the parameters of the student image processing model with the training loss to obtain an updated student image processing model;
    if a current training process does not satisfy a training termination condition, updating the parameters of the first feature transformation model with the feature difference loss to obtain an updated first feature transformation model; and
    training the updated student image processing model on the basis of the updated first feature transformation model to obtain the target image processing model.
  7. The method according to any one of claims 1 to 6, wherein the aligning the second image feature with the first image feature to obtain an aligned image feature comprises:
    aligning a size of the second image feature with a size of the first image feature through linear interpolation to obtain an intermediate image feature; and
    aligning a channel count of the intermediate image feature with a channel count of the first image feature through a channel-conversion convolution to obtain the aligned image feature.
  8. An image processing method, wherein the method comprises:
    obtaining a target image to be processed; and
    inputting the target image into a target image processing model, and obtaining a target processing result output by the target image processing model, wherein the target image processing model is trained with the training method for an image processing model according to any one of claims 1 to 7.
  9. A training apparatus for an image processing model, wherein the apparatus comprises:
    a first obtaining unit, configured to obtain a sample image;
    a second obtaining unit, configured to input the sample image into a teacher image processing model, and obtain a first image feature output by the teacher image processing model;
    a third obtaining unit, configured to input the sample image into a student image processing model, obtain a second image feature output by the student image processing model, and align the second image feature with the first image feature to obtain an aligned image feature;
    a transformation unit, configured to transform the aligned image feature with a first feature transformation model to obtain a third image feature, wherein parameters of the first feature transformation model are learned during a training process of the image processing model;
    a fourth obtaining unit, configured to obtain a feature difference loss based on a difference between the first image feature and the third image feature, and obtain a training loss based on the feature difference loss; and
    an updating unit, configured to update parameters of the student image processing model with the training loss to obtain a target image processing model.
  10. An image processing apparatus, wherein the apparatus comprises:
    a first obtaining unit, configured to obtain a target image to be processed; and
    a second obtaining unit, configured to input the target image into a target image processing model, and obtain a target processing result output by the target image processing model, wherein the target image processing model is trained with the training method for an image processing model according to any one of claims 1 to 7.
  11. A computer device, wherein the computer device comprises a processor and a memory, at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor, so that the computer device implements the training method for an image processing model according to any one of claims 1 to 7 or the image processing method according to claim 8.
  12. A computer-readable storage medium, wherein at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor, so that a computer implements the training method for an image processing model according to any one of claims 1 to 7 or the image processing method according to claim 8.
PCT/CN2022/143756 2022-09-28 2022-12-30 Training of image processing model, image processing method, apparatus, device and medium WO2024066111A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211196707.5A CN115565021A (zh) 2022-09-28 2022-09-28 Neural network knowledge distillation method based on learnable feature transformation
CN202211196707.5 2022-09-28

Publications (1)

Publication Number Publication Date
WO2024066111A1 true WO2024066111A1 (zh) 2024-04-04

Family

ID=84743371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143756 WO2024066111A1 (zh) 2022-09-28 2022-12-30 Training of image processing model, image processing method, apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN115565021A (zh)
WO (1) WO2024066111A1 (zh)


Also Published As

Publication number Publication date
CN115565021A (zh) 2023-01-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22960702

Country of ref document: EP

Kind code of ref document: A1