CN116597260A - Image processing method, electronic device, storage medium, and computer program product

Info

Publication number
CN116597260A
Authority
CN
China
Prior art keywords
teacher
student
feature
processing
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310300351.3A
Other languages
Chinese (zh)
Inventor
李帅霖 (Li Shuailin)
贾凡 (Jia Fan)
汪天才 (Wang Tiancai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Kuangyun Technology Co ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Nanjing Kuangyun Technology Co ltd
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Kuangyun Technology Co ltd, Beijing Megvii Technology Co Ltd filed Critical Nanjing Kuangyun Technology Co ltd
Priority to CN202310300351.3A
Publication of CN116597260A
Legal status: Pending

Classifications

    • G06V10/778 Active pattern-learning, e.g. online learning of image or video features
    • G06N3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/091 Active learning
    • G06N3/096 Transfer learning
    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V20/64 Three-dimensional objects
    • G06V2201/07 Target detection
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the present application provide an image processing method, an electronic device, a storage medium, and a computer program product. The method comprises: acquiring an image to be processed; and performing image processing on the image to be processed using a trained student processing model. The student processing model is obtained through the following training operations: acquiring a first sample image and corresponding first labeling information; inputting the first sample image into a teacher encoding module of an initial teacher processing model for encoding; acquiring a first initial teacher query feature; combining the first initial teacher query feature with a first real query feature; inputting the first teacher encoding feature and the combined first teacher query feature into a teacher decoding module of the initial teacher processing model for decoding to obtain a first teacher processing result; training the initial teacher processing model based at least on the first teacher processing result and the first labeling information; and performing distillation training on a first initial student processing model using the trained teacher processing model. This approach requires little training data.

Description

Image processing method, electronic device, storage medium, and computer program product
Technical Field
The present application relates to the field of image processing technology, and more particularly, to an image processing method, an electronic device, a storage medium, and a computer program product.
Background
Image processing tasks such as three-dimensional object detection and three-dimensional instance segmentation are mainly performed by corresponding object detection models and instance segmentation models, but model performance is often limited by the scale of data labeling and by model size. To reduce this limitation, the prior art trains the image processing model (object detection model or instance segmentation model) with knowledge distillation: in the training stage, a teacher network with better performance and stronger feature learning ability assists the student network, and the student network is then used for inference in the inference stage (also called the testing stage).
In the prior art, for example in the application scenario of target detection, a common distillation approach for target detection models is to distill using data of different modalities, which generally include three-dimensional point cloud data and the like. Three-dimensional point cloud data is large in volume and expensive to acquire, so such methods have high training cost and poor practicability. A new image processing scheme is therefore needed to solve these technical problems.
Disclosure of Invention
The present application has been made in view of the above-described problems. The application provides an image processing method, an electronic device, a storage medium and a computer program product.
According to an aspect of the present application, there is provided an image processing method including: acquiring an image to be processed; and performing image processing on the image to be processed using a trained student processing model to obtain a processing result of the image to be processed, wherein the image processing includes three-dimensional target detection and/or three-dimensional instance segmentation, and the processing result includes a target detection result and/or an instance segmentation result. The trained student processing model is obtained through the following training operations: acquiring a first sample image and corresponding first labeling information, wherein the first labeling information indicates the three-dimensional position of a target object contained in the first sample image; inputting the first sample image into a teacher encoding module of an initial teacher processing model for encoding to obtain a first teacher encoding feature; acquiring a first initial teacher query feature, wherein the first initial teacher query feature includes at least one feature vector corresponding one-to-one to at least one potential target object; combining the first initial teacher query feature with a first real query feature to obtain a first teacher query feature, wherein the first real query feature is obtained by performing position encoding based on the first labeling information; inputting the first teacher encoding feature and the first teacher query feature into a teacher decoding module of the initial teacher processing model for decoding to obtain a first teacher processing result of the first sample image, wherein the first teacher processing result includes a target detection result and/or an instance segmentation result; training the initial teacher processing model based at least on the first teacher processing result and the first labeling information to obtain a trained teacher processing model; and performing distillation training on a first initial student processing model using the trained teacher processing model to obtain the trained student processing model.
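For illustration only, the training operations above can be sketched in PyTorch-style code. All names (train_teacher_step, teacher.encode, decode_features, etc.) are hypothetical placeholders rather than identifiers from this application, and the mean-squared-error distillation distance is an assumption:

```python
import torch
import torch.nn.functional as F

def train_teacher_step(teacher, optimizer, image, gt_boxes, gt_labels):
    enc = teacher.encode(image)                    # first teacher encoding feature
    q_init = teacher.initial_queries()             # learnable first initial teacher query feature
    q_real = teacher.encode_gt(gt_boxes)           # first real query feature (position-encoded labels)
    queries = torch.cat([q_init, q_real], dim=0)   # combined first teacher query feature
    result = teacher.decode(enc, queries)          # first teacher processing result
    loss = teacher.loss(result, gt_boxes, gt_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

def distill_step(student, frozen_teacher, optimizer, image):
    with torch.no_grad():                          # teacher parameters are fixed
        t_feat = frozen_teacher.decode_features(image)   # teacher decoding feature
    s_feat = student.decode_features(image)              # student decoding feature
    loss = F.mse_loss(s_feat, t_feat)                    # distillation loss (distance is assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```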
Illustratively, before the distillation training of the first initial student processing model with the trained teacher processing model to obtain the trained student processing model, the training operations further include: acquiring a second sample image and corresponding second labeling information, wherein the second labeling information indicates the three-dimensional position of a target object contained in the second sample image; inputting the second sample image into a student encoding module of a second initial student processing model for encoding to obtain a first student encoding feature; acquiring a first student query feature, wherein the first student query feature includes at least one feature vector corresponding one-to-one to at least one potential target object; inputting the first student encoding feature and the first student query feature into a student decoding module of the second initial student processing model for decoding to obtain a first student processing result of the second sample image, wherein the first student processing result includes a target detection result and/or an instance segmentation result; and training the second initial student processing model based at least on the first student processing result and the second labeling information to obtain the first initial student processing model.
Illustratively, the teacher encoding module includes a feature extraction module and a position encoder module, and inputting the first sample image into the teacher encoding module of the initial teacher processing model for encoding to obtain the first teacher encoding feature includes: inputting the first sample image into the feature extraction module of the initial teacher processing model for feature extraction to obtain a first teacher image feature; acquiring a first position embedding feature corresponding to the first sample image; and inputting the first teacher image feature and the first position embedding feature into the position encoder module of the initial teacher processing model for position encoding to obtain the first teacher encoding feature. The student encoding module likewise includes a feature extraction module and a position encoder module, and inputting the second sample image into the student encoding module of the second initial student processing model for encoding to obtain the first student encoding feature includes: inputting the second sample image into the feature extraction module of the second initial student processing model for feature extraction to obtain a first student image feature; acquiring a second position embedding feature corresponding to the second sample image; and inputting the first student image feature and the second position embedding feature into the position encoder module of the second initial student processing model for position encoding to obtain the first student encoding feature. The feature extraction module of the initial teacher processing model and the feature extraction module of the second initial student processing model are the same shared feature extraction module.
Illustratively, the teacher decoding module includes a decoder module and a processing head, the processing head of the teacher decoding module including a detection head for outputting a target detection result and/or a segmentation head for outputting an instance segmentation result, and inputting the first teacher encoding feature and the first teacher query feature into the teacher decoding module of the initial teacher processing model for decoding to obtain the first teacher processing result of the first sample image includes: inputting the first teacher encoding feature and the first teacher query feature into the decoder module of the initial teacher processing model to obtain a first teacher decoding feature; and inputting the first teacher decoding feature into the processing head of the initial teacher processing model to obtain the first teacher processing result. The student decoding module likewise includes a decoder module and a processing head, the processing head of the student decoding module including a detection head for outputting a target detection result and/or a segmentation head for outputting an instance segmentation result, and inputting the first student encoding feature and the first student query feature into the student decoding module of the second initial student processing model for decoding to obtain the first student processing result of the second sample image includes: inputting the first student encoding feature and the first student query feature into the decoder module of the second initial student processing model to obtain a first student decoding feature; and inputting the first student decoding feature into the processing head of the second initial student processing model to obtain the first student processing result. The processing head of the initial teacher processing model and the processing head of the second initial student processing model are the same shared processing head.
Illustratively, the second sample image is identical to the first sample image. In this case, during training of the initial teacher processing model based at least on the first teacher processing result and the first labeling information, a first prediction loss is determined based on the first teacher processing result and the first labeling information; during training of the second initial student processing model based at least on the first student processing result and the second labeling information, a second prediction loss is determined based on the first student processing result and the second labeling information; and the initial teacher processing model and the second initial student processing model are trained synchronously based on a first total loss, which is obtained from the first prediction loss and the second prediction loss.
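A minimal sketch of this synchronous training, assuming the first total loss is a weighted sum of the two prediction losses (the weights alpha and beta are assumptions; the application only states that the first total loss is obtained from the first and second prediction losses):

```python
import torch

def synchronous_step(teacher, student, criterion, optimizer, image, annotations,
                     alpha=1.0, beta=1.0):
    loss_teacher = criterion(teacher(image), annotations)   # first prediction loss
    loss_student = criterion(student(image), annotations)   # second prediction loss
    total = alpha * loss_teacher + beta * loss_student      # first total loss
    optimizer.zero_grad()
    total.backward()   # one backward pass updates both models and any shared modules
    optimizer.step()
    return total.detach()
```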
Illustratively, performing distillation training on the first initial student processing model with the trained teacher processing model to obtain the trained student processing model includes: acquiring a third sample image; inputting the third sample image into the teacher encoding module of the trained teacher processing model for encoding to obtain a second teacher encoding feature; acquiring a second initial teacher query feature, wherein the second initial teacher query feature includes at least one feature vector corresponding one-to-one to at least one potential target object; combining the second initial teacher query feature with a second real query feature to obtain a second teacher query feature, wherein the second real query feature is obtained by performing position encoding based on third labeling information, the third labeling information indicating the three-dimensional position of a target object contained in the third sample image; and inputting the second teacher encoding feature and the second teacher query feature into the teacher decoding module of the trained teacher processing model for decoding to obtain a second teacher processing result of the third sample image, wherein the second teacher processing result includes a target detection result and/or an instance segmentation result. The teacher decoding module includes a decoder module and a processing head, and this decoding includes: inputting the second teacher encoding feature and the second teacher query feature into the decoder module of the trained teacher processing model to obtain a second teacher decoding feature; and inputting the second teacher decoding feature into the processing head of the trained teacher processing model to obtain the second teacher processing result. The distillation training further includes: inputting the third sample image into the student encoding module of the first initial student processing model for encoding to obtain a second student encoding feature; acquiring a second student query feature; and inputting the second student encoding feature and the second student query feature into the student decoding module of the first initial student processing model for decoding to obtain a second student processing result of the third sample image, wherein the second student processing result includes a target detection result and/or an instance segmentation result. The student decoding module includes a decoder module and a processing head, and this decoding includes: inputting the second student encoding feature and the second student query feature into the decoder module of the first initial student processing model to obtain a second student decoding feature; and inputting the second student decoding feature into the processing head of the first initial student processing model to obtain the second student processing result. Finally, a third prediction loss is determined based at least on the second teacher decoding feature and the second student decoding feature; a second total loss is determined based at least on the third prediction loss; and parameters in the first initial student processing model are optimized based on the second total loss to obtain the trained student processing model.
Illustratively, the decoder module in the teacher decoding module and the decoder module in the student decoding module each include N attention layers, N being an integer greater than 1. Inputting the second teacher encoding feature and the second teacher query feature into the decoder module of the trained teacher processing model to obtain the second teacher decoding feature includes: for each current attention layer in the decoder module of the trained teacher processing model, inputting input features into the current attention layer to obtain output features, wherein, when the current attention layer is the first of the N attention layers, the input features include the second teacher encoding feature and the second teacher query feature; when the current attention layer is any attention layer other than the first, the input features include the second teacher encoding feature and the output features of the previous attention layer; and the output features of the last attention layer are the second teacher decoding feature. Inputting the second student encoding feature and the second student query feature into the decoder module of the first initial student processing model to obtain the second student decoding feature proceeds analogously: when the current attention layer is the first of the N attention layers, the input features include the second student encoding feature and the second student query feature; when the current attention layer is any later attention layer, the input features include the second student encoding feature and the output features of the previous attention layer; and the output features of the last attention layer are the second student decoding feature. Determining the third prediction loss based at least on the second teacher decoding feature and the second student decoding feature includes: calculating an i-th sub-prediction loss based on the output features of the i-th attention layer of the decoder module of the trained teacher processing model and the output features of the i-th attention layer of the decoder module of the first initial student processing model, where i = 1, 2, 3, …, N; and summing or averaging the calculated N sub-prediction losses to obtain the third prediction loss.
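The per-layer loss can be sketched as follows, assuming mean-squared error as the distance between corresponding layer outputs (the application does not fix a specific distance function):

```python
import torch
import torch.nn.functional as F

def third_prediction_loss(teacher_layer_outputs, student_layer_outputs, average=True):
    """teacher_layer_outputs / student_layer_outputs: lists of N tensors, where the
    i-th entry is the output feature of the i-th attention layer (i = 1..N)."""
    assert len(teacher_layer_outputs) == len(student_layer_outputs)
    sub_losses = [
        F.mse_loss(s, t.detach())   # i-th sub-prediction loss; gradients stop at the teacher
        for t, s in zip(teacher_layer_outputs, student_layer_outputs)
    ]
    total = torch.stack(sub_losses).sum()
    return total / len(sub_losses) if average else total   # averaged or summed
```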
According to another aspect of the present application, there is also provided an electronic device including a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the above-described image processing method.
According to still another aspect of the present application, there is also provided a storage medium on which program instructions are stored, wherein the program instructions are for executing the above-described image processing method at run-time.
According to a further aspect of the present application, there is also provided a computer program product comprising a computer program, wherein the computer program is adapted to perform the above-mentioned image processing method when run.
According to the image processing method, the electronic device, the storage medium and the computer program product of the embodiments of the application, a trained student processing model is used for image processing. The student processing model is obtained by distillation training based on a teacher processing model. When the teacher processing model is trained, the first initial teacher query feature and the first real query feature are combined to obtain the first teacher query feature, so that the teacher processing model can be trained without additional multi-modal labeling data such as three-dimensional point clouds, greatly reducing the required training data. In addition, the position information contained in the first real query feature is highly accurate, so the performance of the teacher processing model trained with the first teacher query feature is improved. Distillation training of the first initial student processing model with the trained teacher processing model improves the feature learning capability and performance of the trained student processing model, giving it better generalization and practicability. Performing image processing with a student processing model trained in this way therefore improves the accuracy of the processing result.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application together with its embodiments and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing image processing methods and apparatus in accordance with embodiments of the application;
FIG. 2 shows a schematic flow chart of an image processing method according to one embodiment of the application;
FIG. 3 shows a flow diagram of a training operation according to one embodiment of the application;
FIG. 4 shows a training schematic of a teacher processing model in accordance with one embodiment of the application;
FIG. 5 illustrates a schematic diagram of the synchronous training of an initial teacher processing model and a second initial student processing model in accordance with one embodiment of the application;
FIG. 6 shows a schematic diagram of N attention layers according to one embodiment of the application;
FIG. 7 shows a schematic block diagram of an image processing apparatus according to an embodiment of the present application; and
FIG. 8 shows a schematic block diagram of an electronic device according to an embodiment of the application.
Detailed Description
In recent years, research into artificial-intelligence-based technologies such as computer vision, deep learning, machine learning, image processing and image recognition has advanced significantly. Artificial intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques and application systems for simulating and extending human intelligence. Artificial intelligence is a comprehensive discipline involving many technical categories, such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning and neural networks. Computer vision, an important branch of artificial intelligence concerned specifically with enabling machines to perceive the world, generally includes technologies such as face recognition, image processing, fingerprint recognition and anti-counterfeit verification, biometric feature recognition, face detection, pedestrian detection, object detection, pedestrian recognition, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and positioning. With the research and progress of artificial intelligence technology, its applications have expanded into many fields, such as urban management, traffic management, building management, park management, face-based access, face-based attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone imaging, cloud services, smart homes, wearable devices, unmanned driving, automatic driving, smart medical treatment, face payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile Internet, live streaming, beauty, make-up, medical cosmetology and intelligent temperature measurement.
In order to make the objects, technical solutions and advantages of the present application more apparent, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some, not all, embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein. Based on the embodiments described in the present application, all other embodiments obtained by a person skilled in the art without inventive effort shall fall within the scope of the application.
The embodiments of the application provide an image processing method, an electronic device, a storage medium and a computer program product. In the image processing method according to the embodiments of the application, a trained student processing model is used for image processing. The student processing model is obtained by distillation training based on a teacher processing model. During training of the teacher processing model, real query features can be combined into the teacher query features, so that the student processing model can be trained without additional data (such as three-dimensional point cloud data) while its performance is guaranteed. The image processing techniques according to embodiments of the present application may be applied to any field involving three-dimensional object detection and/or three-dimensional instance segmentation, including but not limited to the autonomous driving field, the SLAM field, and the like.
First, an example electronic apparatus 100 for implementing the image processing method and apparatus according to an embodiment of the present application is described with reference to fig. 1.
As shown in fig. 1, the electronic device 100 includes one or more processors 102, one or more storage devices 104. Optionally, the electronic device 100 may also include an input device 106, an output device 108, and an image capture device 110, which are interconnected by a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structures of the electronic device 100 shown in fig. 1 are exemplary only and not limiting, as the electronic device may have other components and structures as desired.
The processor 102 may be implemented in at least one hardware form of a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic array (PLA), or a microprocessor. The processor 102 may be one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client functions and/or other desired functions of the embodiments of the present application described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like.
The output device 108 may output various information (e.g., images and/or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. Alternatively, the input device 106 and the output device 108 may be integrated together and implemented using the same interaction device (e.g., a touch screen).
The image acquisition device 110 may acquire images and store the acquired images in the storage device 104 for use by other components. The image acquisition device 110 may be a stand-alone camera, a camera in a mobile terminal, or the like. It should be understood that the image acquisition device 110 is merely an example, and the electronic device 100 may not include it; in that case, another device with image acquisition capability may be used to acquire images and transmit them to the electronic device 100.
Exemplary electronic devices for implementing the image processing method and apparatus according to embodiments of the present application may be implemented on devices such as a personal computer, a terminal device, an attendance machine, a panel machine, a camera, or a remote server. The terminal device includes, but is not limited to: a tablet computer, a mobile phone, a PDA (Personal Digital Assistant), a touch-screen all-in-one machine, a wearable device, and the like.
Next, an image processing method according to an embodiment of the present application will be described with reference to fig. 2. Fig. 2 shows a schematic flow chart of an image processing method 200 according to an embodiment of the application. As shown in fig. 2, the image processing method 200 includes the following steps S210 and S220.
Step S210, acquiring an image to be processed.
The image to be processed may be an image containing any target object, such as a landscape, a person image, or a road image. The target object may be any object including, but not limited to, a vehicle, a pedestrian, an animal, a building, and the like. In one or some embodiments of the present application, the image to be processed may be a road image, which may be a road image acquired by an image acquisition device provided on an object such as a running vehicle, a road, or a building. The image to be processed may be an original image acquired by an image acquisition device (for example, the image acquisition device 110 described above), or an image obtained after preprocessing the original image acquired by the image acquisition device. The preprocessing may include normalization, scaling, smoothing, etc. The preprocessing may further include an operation of extracting a partial image area including the target object from the original image acquired by the image acquisition device to obtain an image to be processed.
The image to be processed may be one or more images, which may be from an external device, transmitted by the external device to the electronic device 100 for image processing. In addition, the image to be processed may also be acquired by the electronic device 100 itself. For example, the electronic device 100 may utilize an image acquisition device 110 (e.g., a stand-alone camera) to acquire an image to be processed. The image acquisition device 110 may transmit the acquired image to be processed to the processor 102, and the processor 102 performs image processing.
Step S220, performing image processing on the image to be processed by using the trained student processing model to obtain a processing result of the image to be processed, wherein the image processing comprises three-dimensional target detection and/or three-dimensional instance segmentation, and the processing result comprises a target detection result and/or an instance segmentation result.
Illustratively, the acquired image to be processed is input into the trained student processing model, and the corresponding processing result is obtained. Image processing may include either or both of three-dimensional target detection and three-dimensional instance segmentation. For three-dimensional target detection, the target detection result may include, for example, position information of the three-dimensional detection frames of target objects (i.e., target detection frames) and a confidence corresponding to each three-dimensional detection frame. A three-dimensional detection frame is a bounding box containing a target object, which may optionally be a cuboid; the target detection frame may also take other suitable shapes, such as spherical or conical. The same target object may correspond to one or more target detection frames. The position information of a detection frame may include one or more of: corner coordinates of one or more corner points; center coordinates; length information; width information; and height information. Where the position information of the detection frame includes one or more of length, width and height information, it may further include corner coordinates of at least one corner and/or center coordinates. The confidence of a detection frame can be represented by any value, for example in the range between 0 and 1, where a confidence approaching 1 indicates that the target object detected by that detection frame is more reliable. For three-dimensional instance segmentation, the instance segmentation result may include mask information of the target object, which indicates the position of the three-dimensional mask of the corresponding target object. Mask information may be presented as a heat map, in which the pixels inside the mask of a target object are highlighted.
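For illustration, a processing result of this kind could be represented by a structure such as the following sketch; the field names and the cuboid parameterization are assumptions, not definitions from this application:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class Detection3D:
    center: np.ndarray   # (x, y, z) center coordinate of the 3D detection frame
    size: np.ndarray     # (length, width, height) of the cuboid
    confidence: float    # value in [0, 1]; closer to 1 means a more reliable detection

@dataclass
class ProcessingResult:
    detections: List[Detection3D] = field(default_factory=list)  # 3D target detection result
    mask_heatmap: Optional[np.ndarray] = None  # per-pixel mask scores (instance segmentation)
```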
The trained student processing model is obtained through a training operation. FIG. 3 shows a flow diagram of a training operation S300 according to an embodiment of the application. Referring to FIG. 3, the training operation S300 includes steps S310, S320, S330, S340, S350, S360, and S370.
In step S310, a first sample image and corresponding first labeling information are acquired, where the first labeling information is used to indicate a three-dimensional position of a target object included in the first sample image.
The first sample image and the first labeling information are used to train the teacher processing model. FIG. 4 illustrates a training schematic of a teacher processing model according to one embodiment of the application. Illustratively, the first sample image X_1 is acquired in a manner similar to the image to be processed, which has been described in detail in step S210 and is not repeated here for brevity. For an application scenario of three-dimensional object detection, the first labeling (ground truth) information may include position information of a real detection frame, indicating the three-dimensional position of the target object contained in the first sample image. For example, the three-dimensional position of the target object contained in the first sample image may be labeled with a cuboid detection frame. For an application scenario of three-dimensional instance segmentation, the first labeling information may include mask information of the real mask of the target object contained in the first sample image. For convenience of description and understanding, the following description mainly takes three-dimensional object detection as the application scenario of the image processing method; the implementation of three-dimensional instance segmentation is similar.
Step S320, inputting the first sample image into a teacher coding module of the initial teacher processing model for coding, and obtaining first teacher coding characteristics.
In one embodiment, the obtained first sample image X_1 is input into the teacher encoding module of the initial teacher processing model for encoding, after which a first teacher encoding feature M_T1 can be obtained. The initial teacher processing model may be any three-dimensional object detection or three-dimensional instance segmentation model, such as a Position Embedding TRansformation (PETR) model or a DEtection TRansformer (DETR) model, and the teacher encoding module may include the encoder of the PETR or DETR model.
In step S330, a first initial teacher query feature is obtained, where the first initial teacher query feature includes at least one feature vector corresponding one-to-one to at least one potential target object.
The first initial teacher query feature Q_1 may be a predefined feature sequence, which can be expressed as Q_1 ∈ R^{L_1×C}: it consists of L_1 learnable feature vectors of length C, and these L_1 feature vectors correspond one-to-one to L_1 potential target objects, where C denotes the number of channels. Illustratively, the first initial teacher query feature Q_1 may contain the original learnable feature vectors of the PETR framework.
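In a PETR/DETR-style implementation, such a query feature would typically be a learnable parameter; a minimal sketch, with all names hypothetical:

```python
import torch
import torch.nn as nn

class InitialTeacherQueries(nn.Module):
    """L1 learnable feature vectors of length C, one per potential target object."""
    def __init__(self, num_queries: int, channels: int):  # num_queries = L1, channels = C
        super().__init__()
        self.embed = nn.Parameter(torch.randn(num_queries, channels) * 0.02)

    def forward(self) -> torch.Tensor:
        return self.embed  # Q1, shape (L1, C)
```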
Step S340, combining the first initial teacher query feature and the first real query feature to obtain a first teacher query feature, wherein the first real query feature is obtained by performing position coding based on the first labeling information.
Based on the first labeling information of the first sample image, the position of each labeled real target object can be position-encoded to obtain the corresponding first real query feature Q_2. The first real query feature can be expressed as Q_2 ∈ R^{L_2×C}: it consists of L_2 learnable feature vectors of length C, and these L_2 feature vectors correspond one-to-one to L_2 real target objects. L_2 and L_1 may be equal or unequal. For example, the initial teacher processing model may be a three-dimensional target detection model, the first sample image may be a multi-view image, and the first labeling information may include position information of the three-dimensional target detection frames corresponding to the target objects under a bird's eye view (BEV). The three-dimensional position coordinates of the target objects may be passed through a position-encoding network (e.g., a trigonometric periodic function mapping followed by a fully connected layer) to obtain the first real query feature Q_2. The first initial teacher query feature Q_1 and the first real query feature Q_2 are combined to obtain the first teacher query feature Q_T1. The first teacher query feature Q_T1 ∈ R^{L×C} consists of L learnable feature vectors of length C, with L = L_1 + L_2. The channel number C is the same for Q_T1, Q_1 and Q_2.
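A hedged sketch of producing Q_2 and combining it with Q_1 is given below. The sinusoidal mapping followed by a fully connected layer mirrors the "trigonometric periodic function mapping plus fully connected layer" mentioned above, but the frequency schedule and dimensions are assumptions:

```python
import math
import torch
import torch.nn as nn

class RealQueryEncoder(nn.Module):
    """Maps L2 annotated 3D centers, shape (L2, 3), to the first real query feature (L2, C)."""
    def __init__(self, channels: int, num_freqs: int = 32):
        super().__init__()
        self.num_freqs = num_freqs
        self.fc = nn.Linear(3 * 2 * num_freqs, channels)  # fully connected mapping to C channels

    def forward(self, centers: torch.Tensor) -> torch.Tensor:
        freqs = (2.0 ** torch.arange(self.num_freqs, device=centers.device)) * math.pi
        x = centers.unsqueeze(-1) * freqs                      # (L2, 3, num_freqs)
        pe = torch.cat([x.sin(), x.cos()], dim=-1).flatten(1)  # trigonometric periodic mapping
        return self.fc(pe)                                     # Q2, shape (L2, C)

# Combination along the query dimension gives Q_T1 with L = L1 + L2 rows:
# q_teacher = torch.cat([q_init, real_query_encoder(gt_centers)], dim=0)
```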
In step S350, the first teacher coding feature and the first teacher query feature are input to the teacher decoding module of the initial teacher processing model to decode, so as to obtain a first teacher processing result of the first sample image, where the first teacher processing result includes a target detection result and/or an instance segmentation result.
Referring again to FIG. 4, for the teacher decoding module in the initial teacher processing model, the first teacher encoding feature M_T1 and the first teacher query feature Q_T1 may be used as the input of the teacher decoding module to obtain the corresponding first teacher processing result. For example, the first teacher processing result may include position information of the predicted target detection frame of the target object in the first sample image, or mask information of the predicted mask. Illustratively, the teacher decoding module may include the decoder of the PETR or DETR model.
Step S360, training the initial teacher processing model at least based on the first teacher processing result and the first labeling information to obtain a trained teacher processing model.
Illustratively, based on the obtained first teacher processing result and the first labeling information, the prediction loss of the initial teacher processing model can be calculated, and the parameters in the initial teacher processing model can be optimized through back propagation and a gradient descent algorithm, giving the trained teacher processing model. Illustratively, when optimizing the parameters of the initial teacher processing model, the first initial teacher query feature may also be optimized synchronously.
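A minimal sketch of this optimization step (all names are hypothetical placeholders):

```python
import torch

def teacher_train_step(teacher_model, prediction_loss, optimizer, image, annotation):
    # The first initial teacher query feature is assumed to be registered as an
    # nn.Parameter inside teacher_model, so optimizer.step() updates it
    # synchronously with the network weights.
    result = teacher_model(image)               # first teacher processing result
    loss = prediction_loss(result, annotation)  # prediction loss vs. first labeling information
    optimizer.zero_grad()
    loss.backward()                             # back propagation
    optimizer.step()                            # gradient descent update
    return loss.detach()
```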
In step S370, the first initial student processing model is distilled and trained by using the trained teacher processing model, and a trained student processing model is obtained.
Step S370 may use any existing or future distillation training method to perform distillation training on the first initial student processing model; the trained first initial student processing model is the trained student processing model. During distillation, the first initial student processing model may be distillation-trained by loading the trained teacher processing model and fixing its parameters.
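Loading the trained teacher and fixing its parameters might look like the following sketch; TeacherModel and the checkpoint path are placeholders, not names from this application:

```python
import torch

teacher = TeacherModel()  # hypothetical class
teacher.load_state_dict(torch.load("teacher_trained.pth"))  # placeholder path
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)  # fix the teacher's parameters; only the student is optimized
```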
In one or some embodiments, the image processing method 200 may be applied to the autonomous driving field, and the student processing model and the teacher processing model may optionally be three-dimensional object detection models. In this case, there may be multiple images to be processed, which may be multi-view images acquired from different angles.
According to the image processing method of the embodiments of the application, a trained student processing model is used for image processing. The student processing model is obtained by distillation training based on a teacher processing model. When the teacher processing model is trained, the first initial teacher query feature and the first real query feature are combined to obtain the first teacher query feature, so that the teacher processing model can be trained without additional multi-modal labeling data such as three-dimensional point clouds, greatly reducing the required training data. In addition, the position information contained in the first real query feature is highly accurate, so the performance of the teacher processing model trained with the first teacher query feature is improved. Distillation training of the first initial student processing model with the trained teacher processing model improves the feature learning capability and performance of the trained student processing model, giving it better generalization and practicability. Performing image processing with a student processing model trained in this way therefore improves the accuracy of the processing result.
The image processing method according to the embodiment of the present application may be implemented in an apparatus, device or system having a memory and a processor, for example.
The image processing method according to the embodiment of the application can be deployed at an image acquisition end, for example, at a personal terminal or a server end with an image acquisition function.
Alternatively, the image processing method according to the embodiment of the present application may be distributed and deployed at the server side (or cloud side) and the personal terminal. For example, an image may be acquired at a client, where the client transmits the acquired image to a server (or cloud) and the server (or cloud) performs image processing.
Illustratively, before the distillation training of the first initial student processing model with the trained teacher processing model to obtain the trained student processing model, the training operations may further include: acquiring a second sample image and corresponding second labeling information, wherein the second labeling information indicates the three-dimensional position of a target object contained in the second sample image; inputting the second sample image into a student encoding module of a second initial student processing model for encoding to obtain a first student encoding feature; acquiring a first student query feature, wherein the first student query feature includes at least one feature vector corresponding one-to-one to at least one potential target object; inputting the first student encoding feature and the first student query feature into a student decoding module of the second initial student processing model for decoding to obtain a first student processing result of the second sample image, wherein the first student processing result includes a target detection result and/or an instance segmentation result; and training the second initial student processing model based at least on the first student processing result and the second labeling information to obtain the first initial student processing model.
The first student query feature may be a predefined feature sequence. It can be expressed as a matrix in R^{L_3×C}, consisting of L_3 learnable feature vectors of length C; these L_3 feature vectors correspond one-to-one to L_3 potential target objects.
The second sample image may be the same image as the first sample image, or a different one. In one embodiment, before the distillation training of the first initial student processing model with the trained teacher processing model, the training operation may further include training a second initial student processing model to obtain the first initial student processing model. The training method can be understood from the description of the initial teacher processing model in the foregoing embodiments and is not repeated here for brevity. Training the second initial student processing model based on the first student processing result and the second labeling information may further include optimizing the acquired first student query feature. The network structures of the second initial student processing model and the first initial student processing model may be identical, except that their parameters (including weights and/or biases, etc.) may not be exactly the same. Similarly, the network structures of the first initial student processing model and the trained student processing model may be identical, apart from parameter values. The training process of the initial teacher processing model and that of the second initial student processing model can be executed synchronously or independently. Furthermore, obtaining the first initial student processing model by training the second initial student processing model is optional: the first initial student processing model may instead be a preset student processing model, that is, its parameters may be preset.
According to the technical scheme, the first student processing result is obtained through the second sample image and the corresponding second labeling information, and then the second initial student processing model is trained based on the first student processing result and the second labeling information to obtain the first initial student processing model.
Illustratively, the teacher encoding module may include a feature extraction module and a position encoder module, and inputting the first sample image into the teacher encoding module of the initial teacher processing model for encoding to obtain the first teacher encoding feature may include: inputting the first sample image into the feature extraction module of the initial teacher processing model for feature extraction to obtain a first teacher image feature; acquiring a first position embedding feature corresponding to the first sample image; and inputting the first teacher image feature and the first position embedding feature into the position encoder module of the initial teacher processing model for position encoding to obtain the first teacher encoding feature. Illustratively, the student encoding module may include a feature extraction module and a position encoder module, and inputting the second sample image into the student encoding module of the second initial student processing model for encoding to obtain the first student encoding feature may include: inputting the second sample image into the feature extraction module of the second initial student processing model for feature extraction to obtain a first student image feature; acquiring a second position embedding feature corresponding to the second sample image; and inputting the first student image feature and the second position embedding feature into the position encoder module of the second initial student processing model for position encoding to obtain the first student encoding feature. The feature extraction module of the initial teacher processing model and the feature extraction module of the second initial student processing model are the same shared feature extraction module.
In one embodiment, a first sample image X_1 can be expressed as X_1 ∈ R^(H_0×W_0×C_0), where H_0, W_0 and C_0 respectively represent the height, width and number of channels of the first sample image X_1. For example, in the case where the first sample image X_1 is an RGB image, the number of channels C_0 may be 3. The first sample image X_1 may be input into the feature extraction module of the initial teacher processing model, which performs feature extraction on X_1 to obtain a first teacher image feature F_T, F_T ∈ R^(H×W×C), where H, W and C respectively represent the height, width and number of channels of F_T. By way of example and not limitation, the feature extraction module may be implemented using a convolutional neural network backbone (Convolutional Neural Networks Backbone, CNN backbone). A first position embedding (positional embedding) feature P_T corresponding to the first sample image can be acquired. The first teacher image feature F_T and the first position embedding feature P_T are input together into the position encoder module of the initial teacher processing model for position encoding, yielding the first teacher coding feature M_T1.
Similarly, the first student image feature F_S and the second position embedding feature P_S corresponding to the second sample image may be obtained in a manner similar to the above embodiment. The first student image feature F_S and the second position embedding feature P_S are input into the position encoder module of the second initial student processing model for position encoding to obtain the first student coding feature M_S1. The feature extraction module of the initial teacher processing model and the feature extraction module of the second initial student processing model may be the same shared feature extraction module; that is, the initial teacher processing model and the second initial student processing model may share a feature extraction module. It will be appreciated that, since the network structure of the initial teacher processing model is consistent with that of the trained teacher processing model and the network structure of the first initial student processing model is consistent with that of the second initial student processing model, the trained teacher processing model and the first initial student processing model may also share the same feature extraction module.
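By way of example and not limitation, the following PyTorch-style sketch illustrates the encoding path described above: one shared feature extraction module feeds separate teacher and student position encoder modules. All class and variable names (PositionEncoder, shared_backbone, etc.) and layer choices are illustrative assumptions, not the implementation of the present application.

```python
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):
    """Fuses image features with position embedding features (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, image_feat: torch.Tensor, pos_embed: torch.Tensor) -> torch.Tensor:
        # Add the position embedding feature to the image feature, then project.
        return self.proj(image_feat + pos_embed)

# Shared feature extraction module: one backbone instance used by both models.
shared_backbone = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
    nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
)  # stands in for a CNN backbone, e.g. a ResNet

teacher_pos_encoder = PositionEncoder(256)   # teacher coding module
student_pos_encoder = PositionEncoder(256)   # student coding module

x1 = torch.randn(1, 3, 256, 256)   # first sample image X_1 (C_0 = 3)
f = shared_backbone(x1)            # F_T and F_S both come from the shared module
p = torch.randn_like(f)            # position embedding feature (placeholder)

m_t1 = teacher_pos_encoder(f, p)   # first teacher coding feature M_T1
m_s1 = student_pos_encoder(f, p)   # first student coding feature M_S1
```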
According to the above technical solution, the feature extraction module of the initial teacher processing model and the feature extraction module of the second initial student processing model are the same shared feature extraction module, so the resulting image processing model is lightweight: the number of parameters can be effectively reduced, the data training cost is lowered, and the efficiency of model training and processing is improved.
Illustratively, the teacher decoding module includes a decoder module and a processing head, where the processing head of the teacher decoding module includes a detection head for outputting a target detection result and/or a segmentation head for outputting an instance segmentation result. Inputting the first teacher coding feature and the first teacher query feature into the teacher decoding module of the initial teacher processing model for decoding to obtain the first teacher processing result of the first sample image may include: inputting the first teacher coding feature and the first teacher query feature into the decoder module of the initial teacher processing model to obtain a first teacher decoding feature; and inputting the first teacher decoding feature into the processing head of the initial teacher processing model to obtain the first teacher processing result. Illustratively, the student decoding module includes a decoder module and a processing head, where the processing head of the student decoding module includes a detection head for outputting a target detection result and/or a segmentation head for outputting an instance segmentation result. Inputting the first student coding feature and the first student query feature into the student decoding module of the second initial student processing model for decoding to obtain the first student processing result of the second sample image may include: inputting the first student coding feature and the first student query feature into the decoder module of the second initial student processing model to obtain a first student decoding feature; and inputting the first student decoding feature into the processing head of the second initial student processing model to obtain the first student processing result. The processing head of the initial teacher processing model and the processing head of the second initial student processing model are the same shared processing head.
In one embodiment, the teacher decoding module may include a decoder module and a processing head, the processing head including a detection head and/or a segmentation head. When the image processing method is applied to a three-dimensional object detection scene, the processing head is a detection head. When the image processing method of the embodiment of the application is applied to a three-dimensional instance segmentation scene, the processing head is a segmentation head. In addition, the processing head may include both a detection head and a segmentation head.
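By way of example and not limitation, a processing head containing both a detection head and a segmentation head might be organized as in the following sketch; the output dimensions (for instance, 10 box parameters per query) are assumptions for a 3D detection setting rather than values taken from this application.

```python
import torch
import torch.nn as nn

class ProcessingHead(nn.Module):
    """Shared processing head with a detection head and a segmentation head."""
    def __init__(self, channels: int = 256, num_classes: int = 10):
        super().__init__()
        # Detection head: per-query class scores and 3D box parameters
        # (center x/y/z, size w/l/h, yaw, velocity, ... -- 10 values assumed).
        self.cls_branch = nn.Linear(channels, num_classes)
        self.box_branch = nn.Linear(channels, 10)
        # Segmentation head: per-query mask embedding for instance segmentation.
        self.mask_branch = nn.Linear(channels, channels)

    def forward(self, decoding_feat: torch.Tensor) -> dict:
        # decoding_feat: (B, L, C) decoding features, e.g. Q_T or Q_S
        return {
            "scores": self.cls_branch(decoding_feat),
            "boxes": self.box_branch(decoding_feat),
            "mask_embed": self.mask_branch(decoding_feat),
        }

head = ProcessingHead()
out = head(torch.randn(1, 300, 256))
```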
The application of the image processing method according to the embodiment of the present application to a three-dimensional object detection scene is described below as an example. The first teacher coding feature M_T1 and the first teacher query feature Q_T1 are input into the decoder module of the initial teacher processing model to obtain the first teacher decoding feature Q_T. Illustratively, the decoder module of the initial teacher processing model may be implemented with the decoder module in the PETR framework.
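By way of example and not limitation, the following simplified sketch runs a stack of query-based attention layers over the coding feature in the spirit of a DETR/PETR decoder; it is a stand-in for illustration, not the PETR decoder module itself.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Simplified query-based decoder (DETR/PETR-style), for illustration only."""
    def __init__(self, channels: int = 256, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, queries: torch.Tensor, coding_feat: torch.Tensor) -> torch.Tensor:
        # queries:     (B, L, C)  -- e.g. first teacher query feature Q_T1
        # coding_feat: (B, HW, C) -- e.g. first teacher coding feature M_T1 flattened
        return self.decoder(tgt=queries, memory=coding_feat)

decoder = QueryDecoder()
q_t1 = torch.randn(1, 300, 256)       # query feature (300 potential targets assumed)
m_t1 = torch.randn(1, 32 * 32, 256)   # coding feature flattened over H*W
q_t = decoder(q_t1, m_t1)             # first teacher decoding feature Q_T
```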
Similarly, the manner of obtaining the first student processing result of the second sample image may be understood with reference to the above embodiment and is not repeated here for brevity. The processing head of the initial teacher processing model and the processing head of the second initial student processing model may be the same shared processing head; that is, the initial teacher processing model and the second initial student processing model may share a processing head. It will be appreciated that, since the network structure of the initial teacher processing model is consistent with that of the trained teacher processing model and the network structure of the first initial student processing model is consistent with that of the second initial student processing model, the trained teacher processing model and the first initial student processing model may also share the same processing head.
According to the above technical solution, the processing head of the initial teacher processing model and the processing head of the second initial student processing model are the same shared processing head, so the resulting image processing model is lightweight: the number of parameters can be effectively reduced, the data training cost is lowered, and the efficiency of model training and processing is improved.
Illustratively, the second sample image is identical to the first sample image. In this case, training the initial teacher processing model based on at least the first teacher processing result and the first labeling information includes determining a first prediction loss based on the first teacher processing result and the first labeling information; training the second initial student processing model based on at least the first student processing result and the second labeling information includes determining a second prediction loss based on the first student processing result and the second labeling information; and the initial teacher processing model and the second initial student processing model are trained synchronously based on a first total loss, where the first total loss is obtained based on the first prediction loss and the second prediction loss.
FIG. 5 illustrates a schematic diagram of the synchronous training of an initial teacher processing model and a second initial student processing model according to one embodiment of the present application. As shown in FIG. 5, the first sample image is input into the shared feature extraction module to obtain the first teacher image feature F_T and the first student image feature F_S. In addition, the first position embedding feature P_T and the second position embedding feature P_S can be acquired. The first teacher image feature F_T and the first position embedding feature P_T are input into the teacher coding module of the initial teacher processing model to obtain the corresponding first teacher coding feature M_T1. The first teacher coding feature M_T1 and the first teacher query feature Q_T1 are input into the teacher decoding module to obtain the corresponding first teacher decoding feature Q_T. The first teacher decoding feature Q_T is input into the shared processing head to obtain the first teacher processing result. Similarly, the first student processing result may be obtained with the second initial student processing model in a similar manner: the first student image feature F_S and the second position embedding feature P_S are input into the student coding module of the second initial student processing model to obtain the corresponding first student coding feature M_S1; the first student coding feature M_S1 and the first student query feature Q_S1 are input into the student decoding module to obtain the corresponding first student decoding feature Q_S; and the first student decoding feature Q_S is input into the shared processing head to obtain the first student processing result.
In one embodiment, the first sample image and the second sample image are the same. Training of the initial teacher processing model and training of the second initial student processing model may be performed simultaneously. For training of the initial teacher processing model, a first prediction loss LS_1 between the first teacher processing result and the first labeling information can be calculated from the two. For training of the second initial student processing model, a second prediction loss LS_2 between the first student processing result and the second labeling information can be calculated from the two. By summing or averaging the first prediction loss LS_1 and the second prediction loss LS_2, the first total loss L_1 can be obtained. Based on the first total loss L_1, the parameters in the initial teacher processing model and in the second initial student processing model can be synchronously optimized using back propagation and gradient descent, thereby obtaining the trained teacher processing model and the first initial student processing model.
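By way of example and not limitation, a minimal sketch of one synchronous training step is given below, with stand-in models and placeholder losses; the parameter deduplication reflects the shared feature extraction module, and the simple summation L_1 = LS_1 + LS_2 follows the description above.

```python
import torch
import torch.nn as nn

# Stand-in models: a shared trunk plus separate teacher/student branches,
# mirroring the shared feature extraction module described above.
shared = nn.Linear(16, 16)
teacher_model = nn.Sequential(shared, nn.ReLU(), nn.Linear(16, 4))
student_model = nn.Sequential(shared, nn.ReLU(), nn.Linear(16, 4))

# Deduplicate parameters so the shared module is registered only once.
params = {id(p): p for m in (teacher_model, student_model) for p in m.parameters()}
optimizer = torch.optim.AdamW(params.values(), lr=2e-4)

loss_fn = nn.MSELoss()  # placeholder for the actual task losses

def train_step(sample_image: torch.Tensor, labels: torch.Tensor) -> float:
    ls1 = loss_fn(teacher_model(sample_image), labels)  # first prediction loss LS_1
    ls2 = loss_fn(student_model(sample_image), labels)  # second prediction loss LS_2
    l1 = ls1 + ls2                                      # first total loss L_1 (sum)
    optimizer.zero_grad()
    l1.backward()      # back propagation
    optimizer.step()   # synchronous gradient descent update of both models
    return l1.item()

print(train_step(torch.randn(8, 16), torch.randn(8, 4)))
```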
According to the above technical solution, the first total loss is obtained based on the first prediction loss and the second prediction loss so as to train the initial teacher processing model and the second initial student processing model synchronously, which can prevent the shared feature extraction network or the shared processing head from overfitting to the teacher processing model.
Illustratively, training the initial teacher processing model based on at least the first teacher processing result and the first labeling information to obtain the trained teacher processing model may include: synchronously optimizing the parameters in the initial teacher processing model and the first initial teacher query feature based on at least the first teacher processing result and the first labeling information, to obtain the trained teacher processing model and the trained initial teacher query feature.
In one embodiment, the parameters in the initial teacher processing model may be optimized according to the first prediction loss calculated from the first teacher processing result and the first labeling information, or according to the first total loss, and the first initial teacher query feature may be optimized synchronously. In this way, the convergence efficiency of the initial teacher processing model can be improved.
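By way of example and not limitation, one natural realization of this joint optimization is to register the first initial teacher query feature as a learnable parameter tensor alongside the model parameters, as sketched below; the sizes L and C are assumed.

```python
import torch
import torch.nn as nn

L, C = 300, 256  # number of potential target objects and channel width (assumed)

teacher_model = nn.Linear(C, C)  # stand-in for the initial teacher processing model
# First initial teacher query feature: L learnable feature vectors of length C.
teacher_queries = nn.Parameter(torch.randn(L, C) * 0.02)

# One optimizer over both the model parameters and the query feature, so a
# single backward pass on the prediction loss updates them synchronously.
optimizer = torch.optim.AdamW(
    [{"params": teacher_model.parameters()}, {"params": [teacher_queries]}], lr=2e-4)
```

After training, teacher_queries then holds the trained initial teacher query feature, which the distillation stage may reuse.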
Illustratively, performing distillation training on the first initial student processing model with the trained teacher processing model to obtain the trained student processing model may include: acquiring a third sample image; inputting the third sample image into the teacher coding module of the trained teacher processing model for encoding to obtain a second teacher coding feature; acquiring a second initial teacher query feature, where the second initial teacher query feature includes at least one feature vector in one-to-one correspondence with at least one potential target object; combining the second initial teacher query feature with a second real query feature to obtain a second teacher query feature, where the second real query feature is obtained by performing position coding based on third labeling information, and the third labeling information is used for indicating the three-dimensional position of a target object contained in the third sample image; inputting the second teacher coding feature and the second teacher query feature into the teacher decoding module of the trained teacher processing model for decoding to obtain a second teacher processing result of the third sample image, the second teacher processing result including a target detection result and/or an instance segmentation result, where the teacher decoding module includes a decoder module and a processing head, and this decoding may include: inputting the second teacher coding feature and the second teacher query feature into the decoder module of the trained teacher processing model to obtain a second teacher decoding feature, and inputting the second teacher decoding feature into the processing head of the trained teacher processing model to obtain the second teacher processing result; inputting the third sample image into the student coding module of the first initial student processing model for encoding to obtain a second student coding feature; acquiring a second student query feature; inputting the second student coding feature and the second student query feature into the student decoding module of the first initial student processing model for decoding to obtain a second student processing result of the third sample image, the second student processing result including a target detection result and/or an instance segmentation result, where the student decoding module includes a decoder module and a processing head, and this decoding may include: inputting the second student coding feature and the second student query feature into the decoder module of the first initial student processing model to obtain a second student decoding feature, and inputting the second student decoding feature into the processing head of the first initial student processing model to obtain the second student processing result; determining a third prediction loss based on at least the second teacher decoding feature and the second student decoding feature; determining a second total loss based on at least the third prediction loss; and optimizing the parameters in the first initial student processing model based on the second total loss to obtain the trained student processing model.
In one embodiment, the manner of acquiring the third sample image is similar to the manner of acquiring the first sample image described in detail in step S210 and is not repeated here for brevity. One of ordinary skill in the art will understand the implementation of obtaining the second teacher processing result and the second student processing result from the third sample image by reading the relevant description in the previous embodiments. The second initial teacher query feature may be the first initial teacher query feature or the trained initial teacher query feature (i.e., the result obtained after training the first initial teacher query feature). The third sample image may be the same as or different from the first sample image or the second sample image.
The second initial teacher query feature Q_3 may be a predefined feature sequence. It may be expressed as Q_3 ∈ R^(L_4×C), i.e., it is composed of L_4 learnable feature vectors of length C, which correspond one-to-one to L_4 potential target objects, where C represents the number of channels. The second real query feature is obtained by performing position coding based on the third labeling information corresponding to the third sample image, in a manner similar to that of the first real query feature, which is not repeated. The manner of obtaining the second teacher query feature based on the second initial teacher query feature and the second real query feature is similar to the manner of obtaining the first teacher query feature based on the first initial teacher query feature and the first real query feature, and is not repeated either. In addition, the manner of obtaining the second teacher processing result, the second student coding feature, the second student query feature, the second student processing result and other information can be understood with reference to the manner of obtaining the first teacher processing result, the first student coding feature, the first student query feature, the first student processing result and other information described above, and is not repeated here.
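By way of example and not limitation, such a predefined learnable query sequence can be realized as an embedding table, as in the following sketch; the values of L_4 and C are assumed.

```python
import torch.nn as nn

L4, C = 300, 256  # number of potential target objects and channels (assumed values)

# Second initial teacher query feature Q_3: L_4 learnable vectors of length C.
query_embed = nn.Embedding(L4, C)
q3 = query_embed.weight  # shape (L_4, C); optimized together with the model
```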
Based on the obtained second teacher decoding feature and second student decoding feature, the third prediction loss may be calculated. By way of example and not limitation, the third prediction loss may use a mean squared error (Mean Square Error, MSE) loss function, i.e., an L2 loss function. The second total loss may be determined based on at least the third prediction loss. For example, the third prediction loss may be determined directly as the second total loss, or the second total loss may be determined further in combination with other losses. The parameters in the first initial student processing model are then optimized with the second total loss through back propagation and gradient descent, so that the trained student processing model can be obtained.
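By way of example and not limitation, the feature-level third prediction loss can be sketched as below, assuming the teacher and student decoding features share one shape; the teacher side is detached so that only the first initial student processing model receives gradients.

```python
import torch
import torch.nn.functional as F

def third_prediction_loss(teacher_decoding: torch.Tensor,
                          student_decoding: torch.Tensor) -> torch.Tensor:
    # MSE (L2) loss between decoding features; the teacher side is detached so
    # only the first initial student processing model is optimized.
    return F.mse_loss(student_decoding, teacher_decoding.detach())

q_t2 = torch.randn(1, 300, 256)  # second teacher decoding feature (shape assumed)
q_s2 = torch.randn(1, 300, 256, requires_grad=True)  # second student decoding feature
ls3 = third_prediction_loss(q_t2, q_s2)  # third prediction loss LS_3
```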
According to the above technical solution, the parameters in the first initial student processing model are optimized based on the second total loss to obtain the trained student processing model, and the trained student processing model obtained in this way can have better feature learning capability and performance.
Illustratively, the decoder module in the teacher decoding module and the decoder module in the student decoding module each include N attention layers, N being an integer greater than 1. Inputting the second teacher coding feature and the second teacher query feature into the decoder module of the trained teacher processing model to obtain the second teacher decoding feature may include: for each current attention layer in the decoder module of the trained teacher processing model, inputting input features into the current attention layer to obtain output features, where the input features include the second teacher coding feature and the second teacher query feature in the case that the current attention layer is the first of the N attention layers, and include the second teacher coding feature and the output features of the previous attention layer in the case that the current attention layer is any attention layer other than the first; the output features of the last attention layer are the second teacher decoding feature. Illustratively, inputting the second student coding feature and the second student query feature into the decoder module of the first initial student processing model to obtain the second student decoding feature may include: for each current attention layer in the decoder module of the first initial student processing model, inputting input features into the current attention layer to obtain output features, where the input features include the second student coding feature and the second student query feature in the case that the current attention layer is the first of the N attention layers, and include the second student coding feature and the output features of the previous attention layer in the case that the current attention layer is any attention layer other than the first; the output features of the last attention layer are the second student decoding feature. Illustratively, determining the third prediction loss based on at least the second teacher decoding feature and the second student decoding feature may include: calculating an i-th sub-prediction loss based on the output features of the i-th attention layer of the decoder module of the trained teacher processing model and the output features of the i-th attention layer of the decoder module of the first initial student processing model, where i = 1, 2, 3, …, N; and summing or averaging the calculated N sub-prediction losses to obtain the third prediction loss.
In one embodiment, the decoder module in the teacher decoding module and the decoder module in the student decoding module may each contain N attention layers, e.g., N may be equal to 6. The decoder module of the teacher decoding module is described below as an example.
FIG. 6 shows a schematic diagram of N attention layers according to one embodiment of the application. As shown in FIG. 6, for the first of the N attention layers (i.e., the 1st attention layer), the corresponding input features are the second teacher coding feature M_T2 and the second teacher query feature Q_T2. The second teacher coding feature M_T2 and the second teacher query feature Q_T2 are input into the 1st attention layer to obtain the output feature Q_T21 of that layer. The output feature Q_T21 and the second teacher coding feature M_T2 can then serve as the input features of the next attention layer (i.e., the 2nd attention layer) to obtain the output feature Q_T22 of the 2nd attention layer. Proceeding in a similar manner, the output features of the previous attention layer together with the second teacher coding feature M_T2 are used as the input features of the next attention layer, and finally the output feature Q_T26 of the 6th attention layer can be obtained. The output feature Q_T26 of the 6th attention layer can be used as the final decoding feature of the decoder module in the teacher decoding module, i.e., the second teacher decoding feature Q_T'.
Similarly, one of ordinary skill in the art can understand the implementation of the N attention layers of the decoder module in the student decoding module with reference to the N attention layers of the decoder module in the teacher decoding module, which is not repeated here for brevity. The output features of the 6 attention layers of the decoder module in the student decoding module are Q_S21, Q_S22, …, Q_S26, respectively.
The sub-prediction loss of a given attention layer may be calculated based on the output features of that attention layer in the decoder module of the trained teacher processing model and the output features of the corresponding attention layer in the decoder module of the first initial student processing model. For example, for the 2nd attention layer, based on the output feature Q_T22 of the 2nd attention layer in the decoder module of the trained teacher processing model and the output feature Q_S22 of the 2nd attention layer in the decoder module of the first initial student processing model, the sub-prediction loss l_2 of the 2nd attention layer can be calculated. Similarly, the sub-prediction losses l_1, l_2, …, l_6 corresponding to the attention layers can be obtained in turn. By summing or averaging the 1st through 6th sub-prediction losses, the third prediction loss LS_3 can be obtained.
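By way of example and not limitation, the following sketch collects the output features of every attention layer for both models and sums the per-layer MSE sub-prediction losses; the single-attention-layer structure is a simplified assumption rather than the PETR decoder layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeDecoder(nn.Module):
    """Decoder that returns the output features of every attention layer."""
    def __init__(self, channels: int = 256, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(channels, num_heads=8, batch_first=True)
            for _ in range(num_layers))

    def forward(self, queries: torch.Tensor, coding_feat: torch.Tensor) -> list:
        outputs = []
        q = queries
        for layer in self.layers:
            # Each layer attends from the running queries to the coding feature.
            q, _ = layer(q, coding_feat, coding_feat)
            outputs.append(q)
        return outputs  # outputs[-1] is the final decoding feature

teacher_dec, student_dec = IterativeDecoder(), IterativeDecoder()
m = torch.randn(1, 1024, 256)   # coding feature (teacher and student, simplified)
q = torch.randn(1, 300, 256)    # query feature

t_outs = teacher_dec(q, m)      # Q_T21 ... Q_T26
s_outs = student_dec(q, m)      # Q_S21 ... Q_S26

# Third prediction loss LS_3: sum of the per-layer sub-prediction losses l_i.
ls3 = sum(F.mse_loss(s, t.detach()) for s, t in zip(s_outs, t_outs))
```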
According to the above technical solution, the third prediction loss is calculated based on the sub-prediction losses corresponding to the attention layers in the decoder module of the teacher decoding module and the decoder module of the student decoding module, so that the output features of every attention layer are subject to loss supervision from the teacher processing model to the student processing model. The feature learning capability of the student processing model can thus be better improved.
Illustratively, acquiring the third sample image may include: acquiring the third sample image and corresponding third labeling information, where the third labeling information is used for indicating the three-dimensional position of a target object contained in the third sample image. Illustratively, before determining the second total loss based on at least the third prediction loss, performing distillation training on the first initial student processing model with the trained teacher processing model to obtain the trained student processing model may further include one or more of the following: determining a fourth prediction loss based on the second student processing result and the third labeling information; determining a fifth prediction loss based on the second teacher processing result and the third labeling information; and determining a sixth prediction loss based on the second student processing result and the second teacher processing result. Illustratively, determining the second total loss based on at least the third prediction loss may include: determining the second total loss based on the third prediction loss and on one or more of the fourth prediction loss, the fifth prediction loss and the sixth prediction loss.
In one embodiment, the third annotation information corresponding to the third sample image may be used to indicate the three-dimensional position of the target object contained in the third sample image. Similarly, in the application scenario of three-dimensional target detection, the third labeling information may include position information and confidence of a three-dimensional detection frame corresponding to the target object; in the application scenario of three-dimensional instance segmentation, the third labeling information may include three-dimensional mask information corresponding to the target object.
The fourth prediction loss LS_4 can be calculated from the second student processing result and the third labeling information. The fifth prediction loss LS_5 can be calculated from the second teacher processing result and the third labeling information. The sixth prediction loss LS_6 can be calculated from the second student processing result and the second teacher processing result. By way of example and not limitation, the second total loss L_2 may be obtained by summing or averaging one or more of the fourth prediction loss LS_4, the fifth prediction loss LS_5 and the sixth prediction loss LS_6 together with the third prediction loss LS_3. Illustratively, L_2 = LS_3 + LS_4 + LS_6, or L_2 = LS_3 + LS_4.
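By way of example and not limitation, one complete distillation step under the variant L_2 = LS_3 + LS_4 + LS_6 might look as follows; the stand-in models and MSE placeholders are assumptions used only to make the sketch runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-ins: each model maps an input vector to (decoding feature, result).
class Stub(nn.Module):
    def __init__(self):
        super().__init__()
        self.feat = nn.Linear(16, 16)
        self.head = nn.Linear(16, 4)
    def forward(self, x):
        f = self.feat(x)
        return f, self.head(f)

teacher_model, student_model = Stub(), Stub()
for p in teacher_model.parameters():          # teacher is frozen during distillation
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student_model.parameters(), lr=2e-4)

def distill_step(image: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():
        t_feat, t_result = teacher_model(image)
    s_feat, s_result = student_model(image)
    ls3 = F.mse_loss(s_feat, t_feat)       # third prediction loss (feature level)
    ls4 = F.mse_loss(s_result, labels)     # fourth prediction loss (task level)
    ls6 = F.mse_loss(s_result, t_result)   # sixth prediction loss (output level)
    l2 = ls3 + ls4 + ls6                   # second total loss L_2 = LS_3+LS_4+LS_6
    optimizer.zero_grad()
    l2.backward()
    optimizer.step()
    return l2.item()

print(distill_step(torch.randn(8, 16), torch.randn(8, 4)))
```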
According to the above technical solution, determining the second total loss based on the third prediction loss and one or more of the fourth prediction loss, the fifth prediction loss and the sixth prediction loss can further improve the performance of the resulting trained student processing model.
Illustratively, the second teacher processing result may include a target detection result and the second student processing result may include a target detection result, where the second teacher decoding feature includes M_1 groups of feature vectors in one-to-one correspondence with M_1 target detection frames, the second teacher processing result includes M_1 groups of detection frame information in one-to-one correspondence with the M_1 target detection frames, the second student decoding feature includes M_2 groups of feature vectors in one-to-one correspondence with M_2 target detection frames, the second student processing result includes M_2 groups of detection frame information in one-to-one correspondence with the M_2 target detection frames, and M_1 and M_2 are each an integer greater than or equal to 1. Illustratively, acquiring the third sample image may include: acquiring the third sample image and corresponding third labeling information, where the third labeling information is used for indicating the three-dimensional position of a target object contained in the third sample image. Before determining the third prediction loss based on at least the second teacher decoding feature and the second student decoding feature, performing distillation training on the first initial student processing model with the trained teacher processing model to obtain the trained student processing model may further include: matching the target detection frames in the second teacher processing result with the target detection frames in the third labeling information to obtain a first matching result, where a target detection frame is a bounding box containing a target object; matching the target detection frames in the second student processing result with the target detection frames in the third labeling information to obtain a second matching result; and determining a third matching result between the target detection frames in the second teacher processing result and the target detection frames in the second student processing result based on the first matching result and the second matching result, where the third matching result includes M_3 pairs of mutually matched detection frames, M_3 is an integer greater than or equal to 0 and less than or equal to the smaller of M_1 and M_2, and each pair of detection frames includes a first target detection frame belonging to the second teacher processing result and a second target detection frame belonging to the second student processing result. Illustratively, determining the third prediction loss based on at least the second teacher decoding feature and the second student decoding feature may include: determining a seventh prediction loss based on the feature vectors in the second teacher decoding feature that correspond one-to-one to the M_3 first target detection frames and the feature vectors in the second student decoding feature that correspond one-to-one to the M_3 second target detection frames; and determining the third prediction loss based on at least the seventh prediction loss.
In one embodiment, the second teacher processing result and the second student processing result may each include a target detection result. A target detection result includes a plurality of target detection frames in one-to-one correspondence with target objects and the confidence of each target detection frame. The second teacher decoding feature may include M_1 groups of feature vectors in one-to-one correspondence with M_1 target detection frames. For example, M_1 may be equal to 20, in which case the second teacher processing result may include 20 groups of detection frame information in one-to-one correspondence with the 20 target detection frames. The detection frame information may include the position information of the target detection frame and, optionally, its confidence. Similarly, the second student decoding feature may include M_2 groups of feature vectors in one-to-one correspondence with M_2 target detection frames. For example, M_2 may be equal to 15, in which case the second student processing result may include 15 groups of detection frame information in one-to-one correspondence with the 15 target detection frames. M_1 and M_2 are each an integer greater than or equal to 1 and may be equal or unequal; for example, M_1 and M_2 may both be equal to 20, or M_1 may be equal to 20 and M_2 equal to 15.
In one embodiment, before the third prediction loss is determined based on at least the second teacher decoding feature and the second student decoding feature, performing distillation training on the first initial student processing model with the trained teacher processing model to obtain the trained student processing model may further include the following operations.
The target detection frames in the second teacher processing result are matched with the target detection frames in the third labeling information, that is, target detection frames corresponding to the same target object are matched, to obtain the first matching result. Similarly, matching the target detection frames in the second student processing result with the target detection frames in the third labeling information yields the second matching result. Based on the obtained first matching result and second matching result, the target detection frames that contain the same target object in the two results are matched again to obtain the third matching result. The number M_3 of mutually matched detection frame pairs contained in the third matching result is an integer greater than or equal to 0 and less than or equal to the smaller of M_1 and M_2. For example, M_3 may be equal to 5. The seventh prediction loss is then calculated based on the feature vectors in the second teacher decoding feature that correspond one-to-one to the 5 first target detection frames and the feature vectors in the second student decoding feature that correspond one-to-one to the 5 second target detection frames. In the above embodiment employing multiple attention layers, the seventh prediction loss may be the sub-prediction loss corresponding to the last of the N attention layers included in the decoder modules of the trained teacher processing model and of the first initial student processing model, for example the 6th sub-prediction loss l_6 in the previous embodiment. The third prediction loss may be determined by summing this sub-prediction loss l_6 with the other sub-prediction losses and taking the result of the summation as the third prediction loss, for example LS_3 = l_1 + l_2 + l_3 + l_4 + l_5 + l_6.
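By way of example and not limitation, the foreground-matched seventh prediction loss can be sketched as below; the matching itself (for instance by IoU against the third labeling information) is abstracted into precomputed index lists, which is an assumption made for brevity.

```python
import torch
import torch.nn.functional as F

def seventh_prediction_loss(teacher_feats: torch.Tensor,   # (M_1, C)
                            student_feats: torch.Tensor,   # (M_2, C)
                            teacher_idx: torch.Tensor,     # (M_3,) matched teacher rows
                            student_idx: torch.Tensor      # (M_3,) matched student rows
                            ) -> torch.Tensor:
    # Only the M_3 mutually matched (foreground) pairs contribute; unmatched
    # (background) detection frames are excluded from the loss.
    t = teacher_feats[teacher_idx].detach()
    s = student_feats[student_idx]
    return F.mse_loss(s, t)

t_feats = torch.randn(20, 256)                      # M_1 = 20 teacher feature vectors
s_feats = torch.randn(15, 256, requires_grad=True)  # M_2 = 15 student feature vectors
t_idx = torch.tensor([0, 3, 7, 9, 12])              # M_3 = 5 matched pairs (assumed)
s_idx = torch.tensor([1, 2, 5, 8, 11])
l7 = seventh_prediction_loss(t_feats, s_feats, t_idx, s_idx)
```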
According to the above technical solution, the third matching result between the target detection frames in the second teacher processing result and the target detection frames in the second student processing result is determined based on the first matching result and the second matching result, so that during distillation training of the first initial student processing model, loss supervision can be performed mainly based on the information of the foreground (i.e., mutually matched) target detection frames while the interference of the information of the background (unmatched) target detection frames is excluded, which can improve the accuracy of the resulting trained student processing model.
Research shows that the above training operation can raise the performance upper limit of existing three-dimensional target detection frameworks. Experimental results on the NuScenes three-dimensional target detection task show that, compared with the baseline method, the student processing model obtained through the above training can improve the mean average precision (mAP) by 0.7% and the NuScenes detection score (NuScenes Detection Score, NDS) by 2.0%. A considerable performance gain can also be expected on actual service data sets.
According to another aspect of the present application, there is provided an image processing apparatus. Fig. 7 shows a schematic block diagram of an image processing apparatus 700 according to an embodiment of the application.
As shown in fig. 7, an image processing apparatus 700 according to an embodiment of the present application includes an acquisition module 710 and a processing module 720. The respective modules may perform the respective steps of the image processing method described in fig. 2 above, respectively. Only the main functions of the respective components of the image processing apparatus 700 will be described below, and details already described above will be omitted.
The acquisition module 710 is configured to acquire an image to be processed. The acquisition module 710 may be implemented by the processor 102 in the electronic device shown in fig. 1 running program instructions stored in the storage 104.
The processing module 720 is configured to perform image processing on the image to be processed by using the trained student processing model to obtain a processing result of the image to be processed, where the image processing includes three-dimensional target detection and/or three-dimensional instance segmentation and the processing result includes a target detection result and/or an instance segmentation result. The trained student processing model is trained by the following training operations: acquiring a first sample image and corresponding first labeling information, where the first labeling information is used for indicating the three-dimensional position of a target object contained in the first sample image; inputting the first sample image into a teacher coding module of an initial teacher processing model for encoding to obtain a first teacher coding feature; acquiring a first initial teacher query feature, where the first initial teacher query feature includes at least one feature vector in one-to-one correspondence with at least one potential target object; combining the first initial teacher query feature with a first real query feature to obtain a first teacher query feature, where the first real query feature is obtained by performing position coding based on the first labeling information; inputting the first teacher coding feature and the first teacher query feature into a teacher decoding module of the initial teacher processing model for decoding to obtain a first teacher processing result of the first sample image, the first teacher processing result including a target detection result and/or an instance segmentation result; training the initial teacher processing model based on at least the first teacher processing result and the first labeling information to obtain a trained teacher processing model; and performing distillation training on a first initial student processing model with the trained teacher processing model to obtain the trained student processing model. The processing module 720 may be implemented by the processor 102 in the electronic device shown in FIG. 1 running program instructions stored in the storage 104.
Fig. 8 shows a schematic block diagram of an electronic device 800 according to one embodiment of the application. The electronic device 800 includes a memory 810 and a processor 820.
The memory 810 stores computer program instructions for implementing corresponding steps in an image processing method according to an embodiment of the present application.
Processor 820 is operative to execute computer program instructions stored in memory 810 to perform corresponding steps of an image processing method in accordance with an embodiment of the present application.
In one embodiment, the computer program instructions, when executed by the processor 820, are configured to perform the following steps: acquiring an image to be processed; and performing image processing on the image to be processed by using a trained student processing model to obtain a processing result of the image to be processed, where the image processing includes three-dimensional target detection and/or three-dimensional instance segmentation and the processing result includes a target detection result and/or an instance segmentation result. The trained student processing model is trained by the following training operations: acquiring a first sample image and corresponding first labeling information, where the first labeling information is used for indicating the three-dimensional position of a target object contained in the first sample image; inputting the first sample image into a teacher coding module of an initial teacher processing model for encoding to obtain a first teacher coding feature; acquiring a first initial teacher query feature, where the first initial teacher query feature includes at least one feature vector in one-to-one correspondence with at least one potential target object; combining the first initial teacher query feature with a first real query feature to obtain a first teacher query feature, where the first real query feature is obtained by performing position coding based on the first labeling information; inputting the first teacher coding feature and the first teacher query feature into a teacher decoding module of the initial teacher processing model for decoding to obtain a first teacher processing result of the first sample image, the first teacher processing result including a target detection result and/or an instance segmentation result; training the initial teacher processing model based on at least the first teacher processing result and the first labeling information to obtain a trained teacher processing model; and performing distillation training on a first initial student processing model with the trained teacher processing model to obtain the trained student processing model.
Illustratively, the electronic device 800 may further include an image acquisition apparatus 830. The image acquisition device 830 is used for acquiring an image to be processed. The image acquisition device 830 is optional, and the electronic device 800 may not include the image acquisition device 830. Processor 820 may then obtain the image to be processed by other means, such as from an external device or from memory 810.
Furthermore, according to an embodiment of the present application, there is also provided a storage medium on which program instructions are stored for performing the respective steps of the image processing method of the embodiment of the present application when the program instructions are executed by a computer or a processor, and for realizing the respective modules in the image processing apparatus according to the embodiment of the present application. The storage medium may include, for example, a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media.
In one embodiment, the program instructions, when executed by a computer or processor, may cause the computer or processor to implement the respective functional modules of the image processing apparatus according to the embodiments of the present application, and/or may perform the image processing method according to the embodiments of the present application.
In one embodiment, the program instructions, when executed, are configured to perform the following steps: acquiring an image to be processed; and performing image processing on the image to be processed by using a trained student processing model to obtain a processing result of the image to be processed, where the image processing includes three-dimensional target detection and/or three-dimensional instance segmentation and the processing result includes a target detection result and/or an instance segmentation result. The trained student processing model is trained by the following training operations: acquiring a first sample image and corresponding first labeling information, where the first labeling information is used for indicating the three-dimensional position of a target object contained in the first sample image; inputting the first sample image into a teacher coding module of an initial teacher processing model for encoding to obtain a first teacher coding feature; acquiring a first initial teacher query feature, where the first initial teacher query feature includes at least one feature vector in one-to-one correspondence with at least one potential target object; combining the first initial teacher query feature with a first real query feature to obtain a first teacher query feature, where the first real query feature is obtained by performing position coding based on the first labeling information; inputting the first teacher coding feature and the first teacher query feature into a teacher decoding module of the initial teacher processing model for decoding to obtain a first teacher processing result of the first sample image, the first teacher processing result including a target detection result and/or an instance segmentation result; training the initial teacher processing model based on at least the first teacher processing result and the first labeling information to obtain a trained teacher processing model; and performing distillation training on a first initial student processing model with the trained teacher processing model to obtain the trained student processing model.
Furthermore, according to an embodiment of the present application, there is also provided a computer program product comprising a computer program for executing the above-mentioned image processing method 200 when the computer program is run.
The modules in the electronic device according to the embodiments of the present application may be implemented by a processor of an electronic device that performs image processing according to an embodiment of the present application running computer program instructions stored in a memory, or may be implemented when computer instructions stored in the computer-readable storage medium of a computer program product according to an embodiment of the present application are run by a computer.
Furthermore, according to an embodiment of the present application, there is also provided a computer program for executing the above-described image processing method 200 when running.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above illustrative embodiments are merely illustrative and are not intended to limit the scope of the present application thereto. Various changes and modifications may be made therein by one of ordinary skill in the art without departing from the scope and spirit of the application. All such changes and modifications are intended to be included within the scope of the present application as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of elements is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another device, or some features may be omitted, or not performed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in order to streamline the application and aid in understanding one or more of the various application aspects, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the application. However, the method of the present application should not be construed as reflecting the following intent: i.e., the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some of the blocks in an image processing apparatus according to an embodiment of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application can also be implemented as an apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
The above description is merely illustrative of the embodiments of the present application and the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered by the protection scope of the present application. The protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. An image processing method, comprising:
acquiring an image to be processed;
performing image processing on the image to be processed by using a trained student processing model to obtain a processing result of the image to be processed, wherein the image processing comprises three-dimensional target detection and/or three-dimensional instance segmentation, and the processing result comprises a target detection result and/or an instance segmentation result;
the trained student treatment model is trained by the following training operations:
acquiring a first sample image and corresponding first labeling information, wherein the first labeling information is used for indicating the three-dimensional position of a target object contained in the first sample image;
inputting the first sample image into a teacher coding module of an initial teacher processing model for encoding to obtain a first teacher coding feature;
acquiring first initial teacher query features, wherein the first initial teacher query features comprise at least one feature vector corresponding to at least one potential target object one by one;
combining the first initial teacher query feature with a first real query feature to obtain a first teacher query feature, wherein the first real query feature is obtained by performing position coding based on the first labeling information;
inputting the first teacher coding feature and the first teacher query feature into a teacher decoding module of the initial teacher processing model for decoding to obtain a first teacher processing result of the first sample image, wherein the first teacher processing result comprises a target detection result and/or an instance segmentation result;
training the initial teacher processing model based on at least the first teacher processing result and the first labeling information to obtain a trained teacher processing model;
and performing distillation training on a first initial student processing model with the trained teacher processing model to obtain the trained student processing model.
2. The method of claim 1, wherein, before the distillation training of a first initial student processing model with the trained teacher processing model to obtain the trained student processing model, the training operation further comprises:
acquiring a second sample image and corresponding second labeling information, wherein the second labeling information is used for indicating the three-dimensional position of a target object contained in the second sample image;
inputting the second sample image into a student coding module of a second initial student processing model for coding to obtain a first student coding characteristic;
acquiring a first student query feature, wherein the first student query feature comprises at least one feature vector corresponding to at least one potential target object one by one;
inputting the first student coding feature and the first student query feature into a student decoding module of the second initial student processing model for decoding to obtain a first student processing result of the second sample image, wherein the first student processing result comprises a target detection result and/or an instance segmentation result;
and training the second initial student processing model at least based on the first student processing result and the second labeling information to obtain the first initial student processing model.
3. The method of claim 2, wherein,
the teacher coding module comprises a feature extraction module and a position encoder module, and the inputting the first sample image into the teacher coding module of the initial teacher processing model for coding to obtain a first teacher coding feature comprises:
inputting the first sample image into a feature extraction module of the initial teacher processing model to perform feature extraction to obtain first teacher image features;
acquiring a first position embedded feature corresponding to the first sample image;
inputting the first teacher image feature and the first position embedded feature into the position encoder module of the initial teacher processing model for position encoding, to obtain the first teacher coding feature;
the student coding module comprises a feature extraction module and a position encoder module, and the inputting the second sample image into the student coding module of the second initial student processing model for coding to obtain a first student coding feature comprises:
inputting the second sample image into a feature extraction module of the second initial student processing model to perform feature extraction to obtain first student image features;
acquiring a second position embedded feature corresponding to the second sample image;
inputting the first student image feature and the second position embedded feature into the position encoder module of the second initial student processing model for position encoding, to obtain the first student coding feature; wherein the feature extraction module of the initial teacher processing model and the feature extraction module of the second initial student processing model are the same shared feature extraction module.
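Claim 3's sharing of a single feature extraction module between the two coding modules could look as follows in the running toy example (hypothetical names; the per-branch position encoders remain separate):

    # One shared backbone feeds both the teacher and student position encoders.
    shared_backbone = nn.Conv2d(3, DIM, kernel_size=16, stride=16)

    class ToyPositionEncoder(nn.Module):
        """Stand-in for a position encoder module: adds a learned position embedding."""
        def __init__(self, tokens=14 * 14):
            super().__init__()
            self.pos_embed = nn.Parameter(torch.randn(1, tokens, DIM))
        def forward(self, feat):                     # feat: (B, DIM, 14, 14)
            tokens = feat.flatten(2).transpose(1, 2) # (B, 196, DIM)
            return tokens + self.pos_embed

    teacher_pos_enc, student_pos_enc = ToyPositionEncoder(), ToyPositionEncoder()
    feat = shared_backbone(image)                    # extracted once, used by both branches
    teacher_encoding = teacher_pos_enc(feat)         # first teacher coding feature
    student_encoding = student_pos_enc(feat)         # first student coding feature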
4. The method of claim 2, wherein,
the teacher decoding module comprises a decoder module and a processing head, the processing head of the teacher decoding module comprises a detection head for outputting the target detection result and/or a segmentation head for outputting the instance segmentation result, and the inputting the first teacher coding feature and the first teacher query feature into the teacher decoding module of the initial teacher processing model for decoding to obtain a first teacher processing result of the first sample image comprises:
inputting the first teacher coding feature and the first teacher query feature into the decoder module of the initial teacher processing model to obtain a first teacher decoding feature;
inputting the first teacher decoding feature into the processing head of the initial teacher processing model to obtain the first teacher processing result;
the student decoding module comprises a decoder module and a processing head, the processing head of the student decoding module comprises a detection head for outputting the target detection result and/or a segmentation head for outputting the instance segmentation result, and the inputting the first student coding feature and the first student query feature into the student decoding module of the second initial student processing model for decoding to obtain a first student processing result of the second sample image comprises:
inputting the first student coding feature and the first student query feature into the decoder module of the second initial student processing model to obtain a first student decoding feature;
inputting the first student decoding feature into the processing head of the second initial student processing model to obtain the first student processing result;
the processing head of the initial teacher processing model and the processing head of the second initial student processing model are the same shared processing head.
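Similarly for claim 4, a single processing head can serve both decoder stacks. Continuing the toy example (shared_det_head is a hypothetical stand-in for the shared detection and/or segmentation head):

    # One shared head consumes both decoding features.
    shared_det_head = nn.Linear(DIM, 7)

    teacher_decoded = decoder.decoder(queries, teacher_encoding)                  # first teacher decoding feature
    student_decoded = student_decoder.decoder(student_queries, student_encoding)  # first student decoding feature
    teacher_result = shared_det_head(teacher_decoded)
    student_result = shared_det_head(student_decoded)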
5. The method of claim 3 or 4, wherein the second sample image is identical to the first sample image, wherein:
the training the initial teacher processing model based on at least the first teacher processing result and the first labeling information comprises determining a first prediction loss based on the first teacher processing result and the first labeling information; the training the second initial student processing model based on at least the first student processing result and the second labeling information comprises determining a second prediction loss based on the first student processing result and the second labeling information; and the initial teacher processing model and the second initial student processing model are trained synchronously based on a first total loss, the first total loss being obtained based on the first prediction loss and the second prediction loss.
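Under claim 5's setting (identical first and second sample images), one total loss drives both models in a single backward pass. A sketch continuing the toy example, with zero targets standing in for the matched ground truth (a real system would use e.g. Hungarian matching):

    # First total loss = teacher prediction loss + student prediction loss.
    teacher_loss = nn.functional.l1_loss(teacher_result, torch.zeros_like(teacher_result))
    student_loss = nn.functional.l1_loss(student_result, torch.zeros_like(student_result))
    first_total_loss = teacher_loss + student_loss
    first_total_loss.backward()   # one backward pass updates teacher and student together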
6. The method of any of claims 1-4, wherein the performing distillation training on the first initial student processing model by using the trained teacher processing model to obtain the trained student processing model comprises:
acquiring a third sample image;
inputting the third sample image into the teacher coding module of the trained teacher processing model for coding, to obtain a second teacher coding feature;
acquiring a second initial teacher query feature, wherein the second initial teacher query feature comprises at least one feature vector in one-to-one correspondence with at least one potential target object;
combining the second initial teacher query feature with a second real query feature to obtain a second teacher query feature, wherein the second real query feature is obtained by performing position encoding based on third labeling information, and the third labeling information is used for indicating the three-dimensional position of a target object contained in the third sample image;
inputting the second teacher coding feature and the second teacher query feature into a teacher decoding module of the trained teacher processing model for decoding to obtain a second teacher processing result of the third sample image, wherein the second teacher processing result comprises a target detection result and/or an instance segmentation result, the teacher decoding module comprises a decoder module and a processing head, and the inputting the second teacher coding feature and the second teacher query feature into the teacher decoding module of the trained teacher processing model for decoding to obtain the second teacher processing result of the third sample image comprises: inputting the second teacher coding feature and the second teacher query feature into the decoder module of the trained teacher processing model to obtain a second teacher decoding feature; and inputting the second teacher decoding feature into the processing head of the trained teacher processing model to obtain the second teacher processing result;
inputting the third sample image into a student coding module of the first initial student processing model for coding, to obtain a second student coding feature;
acquiring a second student query feature;
inputting the second student coding feature and the second student query feature into a student decoding module of the first initial student processing model for decoding to obtain a second student processing result of the third sample image, wherein the second student processing result comprises a target detection result and/or an instance segmentation result, the student decoding module comprises a decoder module and a processing head, and the inputting the second student coding feature and the second student query feature into the student decoding module of the first initial student processing model for decoding to obtain the second student processing result of the third sample image comprises: inputting the second student coding feature and the second student query feature into the decoder module of the first initial student processing model to obtain a second student decoding feature; and inputting the second student decoding feature into the processing head of the first initial student processing model to obtain the second student processing result;
determining a third prediction loss based at least on the second teacher decoding feature and the second student decoding feature;
determining a second total loss based at least on the third predicted loss;
and optimizing parameters of the first initial student processing model based on the second total loss to obtain the trained student processing model.
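Claim 6's distillation step aligns the student's decoding features with those of the now-frozen teacher. Continuing the toy example (the slicing away of the teacher's extra ground-truth-derived query slots is a simplification not fixed by the claims):

    # Distillation on a third sample image; the teacher runs without gradients.
    third_image = torch.randn(1, 3, 224, 224)

    with torch.no_grad():
        t_enc = teacher_pos_enc(shared_backbone(third_image))       # second teacher coding feature
        t_queries = torch.cat([initial_queries,
                               encode_gt_positions(gt_boxes)], dim=1)
        t_decoded = decoder.decoder(t_queries, t_enc)               # second teacher decoding feature

    s_enc = student_pos_enc(shared_backbone(third_image))           # second student coding feature
    s_decoded = student_decoder.decoder(student_queries, s_enc)     # second student decoding feature

    # Third prediction loss: feature-level distillation over the learnable query slots.
    third_prediction_loss = nn.functional.mse_loss(s_decoded, t_decoded[:, :NUM_QUERIES])
    second_total_loss = third_prediction_loss                       # plus any task losses
    second_total_loss.backward()                                    # gradients flow only through the student branch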
7. The method of claim 6, wherein the decoder module of the teacher decoding module and the decoder module of the student decoding module each comprise N attention layers, N being an integer greater than 1;
the inputting the second teacher coding feature and the second teacher query feature into the decoder module of the trained teacher processing model to obtain a second teacher decoding feature comprises:
for each current attention layer in the decoder module of the trained teacher processing model, inputting input features into the current attention layer to obtain output features, wherein, in the case that the current attention layer is the first of the N attention layers, the input features comprise the second teacher coding feature and the second teacher query feature; in the case that the current attention layer is any attention layer other than the first, the input features comprise the second teacher coding feature and the output features of the preceding attention layer; and the output features of the last attention layer are the second teacher decoding feature;
the inputting the second student coding feature and the second student query feature into the decoder module of the first initial student processing model to obtain a second student decoding feature comprises:
for each current attention layer in the decoder module of the first initial student processing model, inputting input features into the current attention layer to obtain output features, wherein, in the case that the current attention layer is the first of the N attention layers, the input features comprise the second student coding feature and the second student query feature; in the case that the current attention layer is any attention layer other than the first, the input features comprise the second student coding feature and the output features of the preceding attention layer; and the output features of the last attention layer are the second student decoding feature;
the determining a third prediction loss based at least on the second teacher decoding feature and the second student decoding feature comprises:
calculating an i-th sub-prediction loss based on the output features of the i-th attention layer of the decoder module of the trained teacher processing model and the output features of the i-th attention layer of the decoder module of the first initial student processing model, wherein i = 1, 2, 3, …, N;
summing or averaging the N calculated sub-prediction losses to obtain the third prediction loss.
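Claim 7's layer-wise variant compares teacher and student outputs at every attention layer and then sums or averages the N sub-prediction losses. A sketch over the toy decoders' layer lists (nn.TransformerDecoder exposes its layers, so intermediates can be collected explicitly; the query-slot slicing is the same simplification as above):

    def decode_with_intermediates(layers, queries, memory):
        """Run a stack of decoder layers, keeping every layer's output."""
        outputs, x = [], queries
        for layer in layers:              # each layer also attends to the coding feature
            x = layer(x, memory)
            outputs.append(x)
        return outputs

    with torch.no_grad():
        t_outs = decode_with_intermediates(decoder.decoder.layers, t_queries, t_enc)
    s_outs = decode_with_intermediates(student_decoder.decoder.layers, student_queries, s_enc)

    sub_losses = [nn.functional.mse_loss(s, t[:, :NUM_QUERIES])   # i-th sub-prediction loss
                  for s, t in zip(s_outs, t_outs)]
    third_prediction_loss = torch.stack(sub_losses).mean()        # or .sum()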
8. An electronic device comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are used to carry out the image processing method of any one of claims 1 to 7.
9. A storage medium having program instructions stored thereon, wherein the program instructions, when executed, are used to perform the image processing method of any one of claims 1 to 7.
10. A computer program product comprising a computer program, wherein the computer program, when run, is used to perform the image processing method of any one of claims 1 to 7.
CN202310300351.3A 2023-03-24 2023-03-24 Image processing method, electronic device, storage medium, and computer program product Pending CN116597260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310300351.3A CN116597260A (en) 2023-03-24 2023-03-24 Image processing method, electronic device, storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310300351.3A CN116597260A (en) 2023-03-24 2023-03-24 Image processing method, electronic device, storage medium, and computer program product

Publications (1)

Publication Number Publication Date
CN116597260A true CN116597260A (en) 2023-08-15

Family

ID=87594409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310300351.3A Pending CN116597260A (en) 2023-03-24 2023-03-24 Image processing method, electronic device, storage medium, and computer program product

Country Status (1)

Country Link
CN (1) CN116597260A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274575A (en) * 2023-09-28 2023-12-22 北京百度网讯科技有限公司 Training method of target detection model, target detection method, device and equipment


Similar Documents

Publication Publication Date Title
US11594006B2 (en) Self-supervised hierarchical motion learning for video action recognition
CN111797893B (en) Neural network training method, image classification system and related equipment
CN110796111B (en) Image processing method, device, equipment and storage medium
CN113792871A (en) Neural network training method, target identification method, device and electronic equipment
CN113326851B (en) Image feature extraction method and device, electronic equipment and storage medium
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN111950700A (en) Neural network optimization method and related equipment
CN116597336A (en) Video processing method, electronic device, storage medium, and computer program product
CN113781519A (en) Target tracking method and target tracking device
CN116434033A (en) Cross-modal contrast learning method and system for RGB-D image dense prediction task
CN116597260A (en) Image processing method, electronic device, storage medium, and computer program product
CN115577768A (en) Semi-supervised model training method and device
Zhao et al. SiUNet3+-CD: A full-scale connected Siamese network for change detection of VHR images
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN113762331A (en) Relational self-distillation method, apparatus and system, and storage medium
Jie et al. Photovoltaic power station identification using refined encoder–decoder network with channel attention and chained residual dilated convolutions
Zhao et al. Exploration of Vehicle Target Detection Method Based on Lightweight YOLOv5 Fusion Background Modeling
CN116977804A (en) Image fusion method, electronic device, storage medium and computer program product
CN115984093A (en) Depth estimation method based on infrared image, electronic device and storage medium
Chen et al. Autonomous Parking Space Detection for Electric Vehicles Based on Improved YOLOV5-OBB Algorithm
CN115641581A (en) Target detection method, electronic device, and storage medium
CN115223141A (en) Traffic light detection method, electronic device and storage medium
CN115374304A (en) Data processing method, electronic device, storage medium, and computer program product
CN114708143A (en) HDR image generation method, equipment, product and medium
Liu et al. Transformer Based Binocular Disparity Prediction with Occlusion Predict and Novel Full Connection Layers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination