CN111292331B - Image processing method and device - Google Patents


Info

Publication number
CN111292331B
Authority
CN
China
Legal status
Active
Application number
CN202010110152.2A
Other languages
Chinese (zh)
Other versions
CN111292331A (en)
Inventor
王涌壮
张晓鹏
谢凌曦
钮敏哲
张维
田奇
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to CN202010110152.2A
Publication of CN111292331A
Application granted
Publication of CN111292331B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The application provides an image processing method and device, relating to the field of artificial intelligence and in particular to the field of computer vision. The method comprises the following steps: acquiring first spatial feature information based on original feature data of a first image processing task; acquiring second feature data according to original feature data of a second image processing task and the first spatial feature information; and performing second image processing on the second feature data to obtain a processing result of the second image processing task. The first image processing task and the second image processing task are respectively one of a target detection task and an instance segmentation task and the other of the two. By having one of object detection and instance segmentation provide spatial feature information to the other, the feature data of object detection and/or instance segmentation can be corrected, and the prediction accuracy of the instance segmentation task can be improved.

Description

Image processing method and device
Technical Field
The present application relates to the field of image processing, and in particular, to a method and apparatus for image processing.
Background
In recent years, deep neural networks have achieved excellent performance in the automated understanding of visual signals such as images and videos. In order to understand the semantic information carried by each pixel in an image, object detection and semantic segmentation were proposed. Object detection or semantic segmentation alone can only roughly determine which rectangular detection frame or which semantic category a pixel belongs to. To achieve finer image understanding, instance segmentation was proposed. On the basis of object detection and semantic segmentation, instance segmentation can further judge which object of which semantic category each pixel in the image belongs to. Instance segmentation may be applied to tasks such as video surveillance or automatic driving.
In the prior art, instance segmentation is implemented with an instance segmentation task model based on a multi-task learning framework. The instance segmentation task model takes the output of a target detection task model as a prior, and then uses an additional segmentation mask prediction model to predict, pixel by pixel, whether each pixel within the target detection frame given by the target detection task model belongs to the target.
It should be understood that the target detection task and the instance segmentation task both make position judgments on the same target. However, when the two tasks are executed by the existing instance segmentation task model, their prediction results may be inconsistent, so that the prediction result of the instance segmentation is inaccurate.
Improving the prediction accuracy of the instance segmentation task is a problem to be solved.
Disclosure of Invention
The application provides an image processing method and device, which can effectively improve the prediction accuracy of an instance segmentation task.
In a first aspect, there is provided a method of image processing, the method comprising: acquiring first spatial feature information based on original feature data of a first image processing task; acquiring second characteristic data according to the original characteristic data of the second image processing task and the first spatial characteristic information; performing second image processing on the second characteristic data to obtain a processing result of the second image processing task; wherein the first image processing task and the second image processing task are respectively one of a target detection task and an instance segmentation task and the other of the target detection task and the instance segmentation task; the original characteristic data of the first image processing task and the original characteristic data of the second image processing task are acquired based on the image data to be processed.
When the target detection provides the spatial feature information to the instance segmentation, the feature data of the instance segmentation can be corrected by the spatial feature information of the target detection, so that the target detection becomes consistent with the prediction result of the instance segmentation to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
When the instance segmentation provides the spatial feature information to the target detection, the feature data of the target detection can be corrected by the spatial feature information of the instance segmentation, so that the target detection becomes consistent with the prediction result of the instance segmentation to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Therefore, in the present application, one of target detection and instance segmentation provides spatial feature information to the other, and the feature data of the receiving party can be corrected by the provided spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: acquiring second spatial feature information based on the original feature data of the second image processing task; acquiring first characteristic data according to the original characteristic data of the first image processing task and the second spatial characteristic information; and performing first image processing on the first characteristic data to obtain a processing result of the first image processing task.
When target detection and instance segmentation provide spatial feature information to each other, the feature data of each can be corrected by the spatial feature information of the other, so that the prediction results of target detection and instance segmentation become consistent to a greater extent, and the prediction accuracy of the instance segmentation task can be improved.
Therefore, the application can further improve the prediction accuracy of the instance segmentation task by mutually providing the spatial characteristic information through the target detection and the instance segmentation.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task; the obtaining the first spatial feature information based on the original feature data of the first image processing task includes: acquiring third spatial feature information based on the original feature data of the target detection task; respectively acquiring transverse characteristic information and longitudinal characteristic information according to the third spatial characteristic information; and recombining the transverse characteristic information and the longitudinal characteristic information to obtain the first spatial characteristic information.
Therefore, in the present application, lateral and longitudinal features are first extracted from the spatial feature information of the target detection, the lateral and longitudinal features are then recombined, and the recombined spatial feature information is provided to the instance segmentation; this improves the accuracy of the spatial feature information used for instance segmentation, and thereby improves the prediction accuracy of the instance segmentation task.
With reference to the first aspect, in one possible implementation manner of the first aspect, the first feature data and the second feature data are obtained by performing the following operations, where the initial value of i is 1 and N is a positive integer.
In step S1, spatial feature information X1 is acquired based on the feature data IF1_i.
In step S2, spatial feature information X2 is acquired based on the feature data IF2_i.
In step S3, feature data OF1_i is obtained according to the feature data IF1_i and the spatial feature information X2.
In step S4, feature data OF2_i is obtained according to the feature data IF2_i and the spatial feature information X1.
In step S5, it is judged whether the value of i is equal to N; if not, go to step S6, and if so, go to step S7.
In step S6, the value of i is increased by 1, the feature data OF1_(i-1) is taken as feature data IF1_i, the feature data OF2_(i-1) is taken as feature data IF2_i, and the process goes to step S1.
In step S7, the feature data OF1_i is taken as the first feature data, and the feature data OF2_i is taken as the second feature data.
When the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
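For illustration only, the following Python sketch summarizes the loop formed by steps S1 to S7. The callables get_spatial_info_det, get_spatial_info_seg, fuse_det, and fuse_seg are hypothetical stand-ins for the spatial-feature-information acquisition and feature-fusion operations described above; they are not names used by the application.

```python
# Illustrative sketch (not the patent's reference implementation) of the
# multi-round exchange in steps S1-S7; n_rounds is assumed to be >= 1.
def exchange_features(raw_det, raw_seg, fuse_det, fuse_seg,
                      get_spatial_info_det, get_spatial_info_seg, n_rounds):
    if_det, if_seg = raw_det, raw_seg          # IF1_1 / IF2_1: raw feature data
    for _ in range(n_rounds):                  # i = 1 .. N
        x1 = get_spatial_info_det(if_det)      # S1: spatial info from detection features
        x2 = get_spatial_info_seg(if_seg)      # S2: spatial info from segmentation features
        of_det = fuse_det(if_det, x2)          # S3: OF1_i from IF1_i and X2
        of_seg = fuse_seg(if_seg, x1)          # S4: OF2_i from IF2_i and X1
        if_det, if_seg = of_det, of_seg        # S6: outputs become next round's inputs
    return of_det, of_seg                      # S7: first / second feature data
```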
In the present application, multiple rounds of mutually providing spatial feature information between target detection and instance segmentation allow the feature data of both tasks to be better corrected, so that their prediction results become consistent to a greater extent and the prediction accuracy of the instance segmentation task is improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task. The performing of the first image processing on the first feature data includes: processing the first feature data by using a detection frame prediction model to obtain a target detection prediction result of the first feature data. The performing of the second image processing on the second feature data to obtain a processing result of the second image processing task includes: processing the second feature data by using a segmentation mask prediction model to obtain a segmentation mask prediction result of the second feature data, where the segmentation mask prediction model is trained by using a detection auxiliary loss function, the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection tag information, and the target detection tag information is used for training the detection frame prediction model.
In the present application, in training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by constraining the output of the segmentation mask prediction model using the target detection tag information.
With reference to the first aspect, in a possible implementation manner of the first aspect, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function, where the longitudinal detection auxiliary loss function constrains longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection tag information, and the lateral detection auxiliary loss function constrains lateral information of the prediction result output by the segmentation mask prediction model through the target detection tag information.
In the present application, the performance of the segmentation mask prediction model can be further improved by using the target detection tag information to separately constrain the lateral information and the longitudinal information of the segmentation mask prediction result output by the segmentation mask prediction model.
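The following PyTorch-style sketch illustrates one plausible form of such a detection auxiliary loss. Projecting the mask by max-pooling over rows and columns and comparing the projections against a binary map derived from the ground-truth detection box with binary cross-entropy are assumptions made here for illustration; the application does not prescribe a specific formula.

```python
import torch
import torch.nn.functional as F

def detection_auxiliary_loss(mask_logits, box_label_map):
    # mask_logits, box_label_map: (N, H, W); box_label_map is a float map that
    # is 1 inside the labelled detection box and 0 elsewhere (an assumption).
    mask_prob = torch.sigmoid(mask_logits)
    # Longitudinal term: per-row projection (max over columns).
    pred_rows = mask_prob.max(dim=2).values
    label_rows = box_label_map.max(dim=2).values
    loss_longitudinal = F.binary_cross_entropy(pred_rows, label_rows)
    # Lateral term: per-column projection (max over rows).
    pred_cols = mask_prob.max(dim=1).values
    label_cols = box_label_map.max(dim=1).values
    loss_lateral = F.binary_cross_entropy(pred_cols, label_cols)
    return loss_longitudinal + loss_lateral
```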
With reference to the first aspect, in a possible implementation manner of the first aspect, the obtaining, according to the original feature data of the second image processing task and the first spatial feature information, second feature data includes: and processing the original characteristic data of the second image processing task and the first spatial characteristic information by using a convolution layer to acquire the second characteristic data.
With reference to the first aspect, in a possible implementation manner of the first aspect, the acquiring third spatial feature information based on the original feature data of the target detection task includes: processing the original characteristic data of the target detection task by using a convolution layer to acquire the third spatial characteristic information; the step of respectively obtaining the transverse characteristic information and the longitudinal characteristic information according to the third spatial characteristic information includes: and processing the third spatial characteristic information by using a pooling layer to acquire the transverse characteristic information and the longitudinal characteristic information.
In a second aspect, there is provided a method of image processing, the method comprising: inputting image data to be processed into a segmentation mask prediction model; and obtaining a segmentation mask prediction result of the image data to be processed by using the segmentation mask prediction model, wherein the segmentation mask prediction model is trained by using a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection tag information.
In the present application, in training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by constraining the output of the segmentation mask prediction model using the target detection tag information.
With reference to the second aspect, in a possible implementation manner of the second aspect, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function, where the longitudinal detection auxiliary loss function constrains longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection tag information, and the lateral detection auxiliary loss function constrains lateral information of the prediction result output by the segmentation mask prediction model through the target detection tag information.
In the present application, the performance of the segmentation mask prediction model can be further improved by using the target detection tag information to separately constrain the lateral information and the longitudinal information of the segmentation mask prediction result output by the segmentation mask prediction model.
In a third aspect, there is provided a method of image processing, the method comprising: acquiring target detection tag information; and training by using a detection auxiliary loss function to obtain a segmentation mask prediction model, wherein the detection auxiliary loss function constrains the output of the segmentation mask prediction model through the target detection label information.
In the present application, in training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by constraining the output of the segmentation mask prediction model using the target detection tag information.
With reference to the third aspect, in a possible implementation manner of the third aspect, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function, where the longitudinal detection auxiliary loss function constrains longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection tag information, and the lateral detection auxiliary loss function constrains lateral information of the prediction result output by the segmentation mask prediction model through the target detection tag information.
It should be understood that the performance of the segmentation mask prediction model may be further improved by using the target detection tag information to separately constrain the lateral information and the longitudinal information of the segmentation mask prediction result output by the segmentation mask prediction model.
In a fourth aspect, an apparatus for image processing is provided, the apparatus comprising the following units.
And the first acquisition unit is used for acquiring first spatial characteristic information based on the original characteristic data of the first image processing task.
And the second acquisition unit is used for acquiring second characteristic data according to the original characteristic data of the second image processing task and the first spatial characteristic information.
And the first processing unit is used for performing second image processing on the second characteristic data to obtain a processing result of a second image processing task.
The first image processing task and the second image processing task are one of a target detection task and an instance segmentation task and the other of the target detection task and the instance segmentation task respectively.
The original characteristic data of the first image processing task and the original characteristic data of the second image processing task are acquired based on the image data to be processed.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the apparatus further includes the following unit.
And a third acquisition unit for acquiring second spatial feature information based on the original feature data of the second image processing task.
And a fourth acquiring unit, configured to acquire the first feature data according to the original feature data of the first image processing task and the second spatial feature information.
And the second processing unit is used for performing first image processing on the first characteristic data to obtain a processing result of the first image processing task.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task.
Wherein, this first acquisition unit is used for: acquiring third spatial feature information based on the original feature data of the target detection task; respectively acquiring transverse characteristic information and longitudinal characteristic information according to the third spatial characteristic information; and recombining the transverse characteristic information and the longitudinal characteristic information to obtain the first spatial characteristic information.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the apparatus obtains the first feature data and the second feature data by performing the following operations, where the initial value of i is 1 and N is a positive integer.
In step S1, spatial feature information X1 is acquired based on the feature data IF1_i.
In step S2, spatial feature information X2 is acquired based on the feature data IF2_i.
In step S3, feature data OF1_i is obtained according to the feature data IF1_i and the spatial feature information X2.
In step S4, feature data OF2_i is obtained according to the feature data IF2_i and the spatial feature information X1.
In step S5, it is judged whether the value of i is equal to N; if not, go to step S6, and if so, go to step S7.
In step S6, the value of i is increased by 1, the feature data OF1_(i-1) is taken as feature data IF1_i, the feature data OF2_(i-1) is taken as feature data IF2_i, and the process goes to step S1.
In step S7, the feature data OF1_i is taken as the first feature data, and the feature data OF2_i is taken as the second feature data.
When the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task; the second processing unit is used for processing the first characteristic data by using a detection frame prediction model to obtain a target detection prediction result of the first characteristic data; the first processing unit is used for processing the second characteristic data by using a segmentation mask prediction model to obtain a segmentation mask prediction result of the second characteristic data.
The segmentation mask prediction model is trained by using a detection auxiliary loss function, and the detection auxiliary loss function is used for restraining the output of the segmentation mask prediction model through target detection label information, wherein the target detection label information is used for training the detection frame prediction model.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function, where the longitudinal detection auxiliary loss function constrains longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection tag information, and the lateral detection auxiliary loss function constrains lateral information of the prediction result output by the segmentation mask prediction model through the target detection tag information.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the second obtaining unit is configured to obtain the second feature data by processing the original feature data of the second image processing task and the first spatial feature information using a convolution layer.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the first obtaining unit is configured to: processing the original characteristic data of the target detection task by using a convolution layer to acquire the third spatial characteristic information; and processing the third spatial characteristic information by using a pooling layer to acquire the transverse characteristic information and the longitudinal characteristic information.
In a fifth aspect, an apparatus for image processing is provided, the apparatus comprising the following units.
And an input unit for inputting the image data to be processed into the segmentation mask prediction model.
And the processing unit is used for obtaining a segmentation mask prediction result of the image data to be processed by using the segmentation mask prediction model.
The segmentation mask prediction model is trained by using a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection tag information.
With reference to the fifth aspect, in a possible implementation manner of the fifth aspect, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function.
The longitudinal detection auxiliary loss function is used for restraining longitudinal information of a prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function is used for restraining transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
In a sixth aspect, an apparatus for image processing is provided, the apparatus comprising the following units.
And the acquisition unit is used for acquiring the target detection tag information.
And the training unit is used for training to obtain a segmentation mask prediction model by using the detection auxiliary loss function, wherein the detection auxiliary loss function is used for restraining the output of the segmentation mask prediction model through the target detection label information.
With reference to the sixth aspect, in a possible implementation manner of the sixth aspect, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function.
The longitudinal detection auxiliary loss function is used for restraining longitudinal information of a prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function is used for restraining transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
In a seventh aspect, there is provided an apparatus for image processing, the apparatus comprising: a memory for storing a program; a processor for executing a memory-stored program, the processor being for performing the method of the first, second or third aspects described above when the memory-stored program is executed.
In an eighth aspect, there is provided a computer readable medium storing program code for execution by a device, the program code comprising instructions for performing the method of the first, second or third aspects described above.
A ninth aspect provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first, second or third aspects described above.
In a tenth aspect, there is provided a chip comprising a processor and a data interface, the processor reading instructions stored on a memory via the data interface, performing the method of the first, second or third aspects above.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, where the instructions, when executed, are configured to perform the method in the first aspect, the second aspect, or the third aspect.
An eleventh aspect provides an electronic device comprising the apparatus provided in the fourth, fifth, sixth or seventh aspects.
Based on the above description, in the present application one of object detection and instance segmentation provides spatial feature information to the other, and the feature data of the receiving party can be corrected by the spatial feature information of the providing party, so that the prediction results of object detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
In addition, in the process of training the segmentation mask prediction model, the method can improve the model accuracy of the segmentation mask prediction model by using the target detection label information to restrict the output of the segmentation mask prediction model.
Drawings
FIG. 1 is a conceptual diagram of image classification, object detection, semantic segmentation, and instance segmentation.
FIG. 2 is a schematic block diagram of an example segmentation task model based on a multi-task learning framework.
Fig. 3 is a schematic flow chart of a method of image processing provided by an embodiment of the present application.
Fig. 4 is another schematic flow chart of a method of image processing provided by an embodiment of the present application.
Fig. 5 is a further schematic flow chart of a method of image processing provided by an embodiment of the present application.
Fig. 6 is a further schematic flow chart of a method of image processing provided by an embodiment of the present application.
Fig. 7 is a schematic flow chart of a method of image processing provided by another embodiment of the present application.
Fig. 8 is a schematic flow chart of a method of image processing provided by a further embodiment of the present application.
Fig. 9 is a schematic block diagram of an apparatus for image processing provided by an embodiment of the present application.
Fig. 10 is a schematic block diagram of the unit 931 in fig. 9.
FIG. 11 is another schematic block diagram of an apparatus for image processing according to an embodiment of the present application.
Fig. 12 is a further schematic block diagram of an apparatus for image processing provided by an embodiment of the present application.
Fig. 13 is a schematic block diagram of a system for image processing provided by an embodiment of the present application.
Fig. 14 and 15 are schematic diagrams of application scenarios of the present application.
Fig. 16 is a further schematic block diagram of an apparatus for image processing provided by an embodiment of the present application.
FIG. 17 is a further schematic block diagram of an apparatus for image processing according to an embodiment of the present application.
Fig. 18 is a further schematic block diagram of an apparatus for image processing provided by an embodiment of the present application.
Fig. 19 is a further schematic block diagram of an apparatus for image processing provided by an embodiment of the present application.
Fig. 20 is a schematic diagram of a chip hardware structure according to an embodiment of the present application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
In order to facilitate an understanding of embodiments of the present application, several concepts related to the embodiments of the present application are described below.
In recent years, deep neural networks have achieved excellent performance in terms of automated understanding of visual signals such as images and videos. Currently, the tasks of computer vision include image classification (image classification), object detection (object detection), semantic segmentation (semantic segmentation), and instance segmentation (instance segmentation). These concepts are described below in conjunction with fig. 1. In the example of fig. 1, the picture contains 1 person, 2 sheep and 1 dog.
As shown in the upper left corner of fig. 1, image classification refers to judging which classifications an image belongs to. For example, suppose there are four categories, namely person, sheep, dog, and cat; the image classification task is to obtain (or output) which of these categories are contained in a given picture. In the example of fig. 1, the output of the image classification task is to label the classifications present in the picture: person, sheep, dog.
As shown in the upper right hand corner of fig. 1, object detection is simply to find out what objects are in the picture and where they are (e.g., the objects are framed with rectangular frames, which may be referred to as detection frames). For example, in the example of fig. 1, the output of the target detection task is to label the bounding boxes of 1 person, 2 sheep, and 1 dog in the picture (e.g., the rectangular boxes in the top right corner picture of fig. 1).
As shown in the lower left corner of fig. 1, semantic segmentation refers to distinguishing every pixel in the picture rather than just framing the target with a rectangular frame; however, different instances of the same object need not be separated. For example, in the example of fig. 1, the output of the semantic segmentation task is to label the person, sheep, and dog regions in the picture, but sheep 1 and sheep 2 need not be distinguished. Semantic segmentation is also what is commonly called target segmentation.
As shown in the lower right hand corner of FIG. 1, instance segmentation is a combination of object detection and semantic segmentation. Compared with the bounding boxes of object detection, instance segmentation is accurate to the edges of objects; compared with semantic segmentation, instance segmentation needs to label different instances of the same object in the picture. For example, in the example of FIG. 1, there is 1 instance of person, 2 instances of sheep, and 1 instance of dog, and the instance segmentation task is to label all of these instances.
The prediction result of an instance segmentation may be referred to as a segmentation mask. The segmentation mask quality may characterize the goodness of the prediction of the instance segmentation.
It should be understood that fig. 1 is by way of example only and not by way of limitation.
The present application relates generally to object detection and instance segmentation.
Existing mainstream instance segmentation task models are often based on a multi-task learning framework. A multi-task learning framework refers to a model that can perform a plurality of tasks simultaneously; the model is divided into a backbone network (e.g., backbone network 210 in fig. 2) and branch networks (e.g., branch networks 221, 222, 223 in fig. 2), where data is input into the backbone network to obtain a feature map, and different branch networks then produce different task outputs.
Fig. 2 is a schematic diagram of a conventional instance segmentation task model 200. The instance segmentation task model 200 includes a backbone network 210, a multi-classification branch network 221, a detection branch network 222, and a segmentation branch network 223. The instance segmentation task model 200 takes the output of the detection branch network 222 as a prior and then uses the additional segmentation branch network 223 to predict, pixel by pixel, whether each pixel within a given object detection frame belongs to the object. The segmentation branch network 223, the multi-classification branch network 221, and the detection branch network 222 all operate on the feature map acquired by the backbone network 210. The multi-classification branch network 221 and the detection branch network 222 use a shared fully connected layer for feature processing and task output, while the segmentation branch network 223 uses independent convolution layers for feature processing and task output.
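For readers who prefer a structural view, the following PyTorch-style sketch outlines only the branch networks 221, 222, 223 of the model 200 described above; the backbone 210 and the region-of-interest extraction are omitted, and all layer sizes are illustrative assumptions rather than values given in the application.

```python
import torch.nn as nn

class BranchHeads(nn.Module):
    """Sketch of branch networks 221, 222, 223 in Fig. 2 (backbone 210 and
    ROI extraction omitted). Layer sizes are illustrative assumptions."""
    def __init__(self, num_classes, feat_dim=256, roi_size=14):
        super().__init__()
        # Shared fully connected layer used by the multi-classification
        # branch 221 and the detection branch 222.
        self.shared_fc = nn.Sequential(
            nn.Flatten(), nn.Linear(feat_dim * roi_size * roi_size, 1024), nn.ReLU())
        self.cls_head = nn.Linear(1024, num_classes)   # multi-classification branch 221
        self.box_head = nn.Linear(1024, 4)             # detection branch 222
        # Independent convolution layers for the segmentation branch 223.
        self.mask_head = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, num_classes, 1))

    def forward(self, roi_features):  # roi_features: (N, feat_dim, roi_size, roi_size)
        shared = self.shared_fc(roi_features)
        return self.cls_head(shared), self.box_head(shared), self.mask_head(roi_features)
```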
It should be appreciated that both the object detection task and the instance segmentation task may make position decisions (coarse rectangular detection box positions and fine pixel positions, as shown in FIG. 1) for the object. However, performing the object detection task and the instance segmentation task using the instance segmentation task model 200 shown in fig. 2 may cause inconsistent prediction results for the two tasks, which indicates that at least one of the predictions for the two tasks is inaccurate, thereby reducing the prediction accuracy of the instance segmentation task.
In view of this problem, the application provides a method and device for image processing that can effectively improve the prediction accuracy of the instance segmentation task.
Fig. 3 is a schematic illustration of a method 300 of image processing provided by an embodiment of the present application. As shown in fig. 3, the method 300 includes the following steps S310, S320 and S330.
S310, acquiring first spatial feature information based on original feature data of a first image processing task.
S320, obtaining second characteristic data according to the original characteristic data and the first spatial characteristic information of the second image processing task.
The first image processing task and the second image processing task are one of a target detection task and an instance segmentation task and the other of the target detection task and the instance segmentation task, respectively. In other words, the first image processing task is one of the object detection task and the instance segmentation task, and the second image processing task is the other.
The original characteristic data of the first image processing task and the original characteristic data of the second image processing task are acquired based on the image data to be processed.
The image data to be processed represents an image to be subjected to detection frame prediction and segmentation mask prediction, such as the image input to the backbone network 210 shown in fig. 2.
For example, the original feature data of the target detection task represents data obtained after the image data to be processed is processed by the feature acquisition network of the target detection task.
As another example, the raw feature data of the object detection task may represent the data obtained after the image data to be processed has been processed by the backbone network 210 shown in fig. 2 and by the feature acquisition network of the object detection task.
As another example, the raw feature data of the object detection task may represent the data obtained after the image data to be processed has been processed by the backbone network 210 shown in fig. 2, by the region proposal network, and by the feature acquisition network of the object detection task.
The meaning of the original feature data of the instance segmentation task is similar to the description of the "original feature data of the object detection task" above, and will not be repeated here.
For example, the original feature data of the first image processing task is obtained by processing the image data to be processed through the feature acquisition network of the first image processing task. The original characteristic data of the second image processing task is obtained by processing the image data to be processed through a characteristic acquisition network of the second image processing task.
S330, performing second image processing on the second characteristic data to obtain a processing result of the second image processing task.
In the case where the first image processing task is a target detection task and the second image processing task is an instance segmentation task, the method 300 provided in this embodiment includes the following steps.
In step S310, first spatial feature information is acquired based on raw feature data of the object detection task. In step S320, second feature data is obtained according to the original feature data and the first spatial feature information of the instance segmentation task. At S330, an instance segmentation task is performed on the second feature data, and a segmentation mask prediction result of the second feature data is obtained.
In step S320, a plurality of methods may be used to obtain second feature data according to the original feature data of the instance segmentation task and the first spatial feature information. For example, the first spatial feature information is directly stitched with the original feature data of the instance segmentation task. For another example, the original feature data of the instance segmentation task is processed to a certain extent, and then the first spatial feature information is spliced with the processed original feature data.
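As an illustration of the first example above (direct stitching followed by fusion), the following sketch concatenates the first spatial feature information with the original feature data of the instance segmentation task along the channel dimension and fuses them with a convolution layer, consistent with the convolution-layer implementation mentioned later for step S320. The channel counts, kernel size, and (N, C, H, W) tensor layout are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class SpatialInfoFusion(nn.Module):
    """Minimal sketch of step S320 for the detection-to-segmentation case."""
    def __init__(self, seg_channels=256, spatial_channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(seg_channels + spatial_channels, seg_channels,
                              kernel_size=3, padding=1)

    def forward(self, raw_seg_features, first_spatial_info):
        # Both inputs are (N, C, H, W) maps over the same spatial grid.
        stacked = torch.cat([raw_seg_features, first_spatial_info], dim=1)
        return self.fuse(stacked)  # second feature data
```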
It should be appreciated that, by providing spatial feature information to the instance segmentation by object detection, for instance segmentation, the feature data may be corrected by the spatial feature information of the object detection, so that the object detection may be made consistent with the prediction result of the instance segmentation to some extent, and thus the prediction accuracy of the instance segmentation task may be improved.
Reference herein to "target detection consistent with the prediction of instance segmentation" means that the pixels within the target detection frame of target detection prediction all belong to this target.
Therefore, the embodiment provides the spatial feature information for the instance segmentation through the target detection, so that the accuracy of the spatial feature information of the instance segmentation can be improved, and the prediction accuracy of the instance segmentation task can be improved.
In the case where the first image processing task is an instance segmentation task and the second image processing task is a target detection task, the method 300 provided in this embodiment includes the following steps.
In step S310, first spatial feature information is acquired based on the original feature data of the instance segmentation task. In step S320, second feature data is obtained according to the original feature data and the first spatial feature information of the target detection task. At S330, a target detection task is performed on the second feature data, and a target detection prediction result of the second feature data is obtained.
In step S320, a plurality of methods may be used to obtain the second feature data according to the original feature data of the target detection task and the first spatial feature information. For example, the first spatial feature information is directly spliced with the original feature data of the object detection task. For another example, the original feature data of the target detection task is processed to a certain extent, and then the first spatial feature information is spliced with the processed original feature data.
It should be understood that, by having the instance segmentation provide spatial feature information to the object detection, the feature data of the object detection can be corrected by the spatial feature information of the instance segmentation, so that the object detection can be made consistent with the prediction result of the instance segmentation to some extent, and thus the prediction accuracy of the instance segmentation task can be improved.
Reference herein to "target detection consistent with the prediction of instance segmentation" means that the pixels within the target detection frame of target detection prediction all belong to this target.
Therefore, in this embodiment, the instance segmentation provides spatial feature information to the target detection, so that the accuracy of the feature data of the target detection can be improved, and the prediction accuracy of the instance segmentation task can in turn be improved.
As can be seen from the foregoing, in the embodiment of the present application, one of target detection and instance segmentation provides spatial feature information to the other, and the feature data of the receiving party can be corrected by the provided spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Optionally, as shown in fig. 4, the method 300 further includes steps S340, S350, and S360.
And S340, acquiring second spatial characteristic information based on the original characteristic data of the second image processing task.
S350, acquiring first characteristic data according to the original characteristic data and the second spatial characteristic information of the first image processing task.
S360, performing first image processing on the first characteristic data to obtain a processing result of the first image processing task.
Taking the first image processing task as the target detection task and the second image processing task as the instance segmentation task as an example, the method 300 provided in this embodiment includes the following steps.
In step S310, first spatial feature information is acquired based on raw feature data of the object detection task. In step S320, second feature data is obtained according to the original feature data and the first spatial feature information of the instance segmentation task. At S330, an instance segmentation task is performed on the second feature data, and a segmentation mask prediction result of the second feature data is obtained. In step S340, second spatial feature information is acquired based on the original feature data of the instance segmentation task. S350, acquiring first characteristic data according to the original characteristic data and the second spatial characteristic information of the target detection task. S360, executing a target detection task on the first characteristic data to obtain a target detection prediction result of the first characteristic data.
It should be understood that, by providing the spatial feature information for each other by the object detection and the instance segmentation, the feature data of each object detection and the instance segmentation can be corrected by the spatial feature information of the other object, so that the prediction results of the object detection and the instance segmentation can be more consistent, and the prediction accuracy of the instance segmentation task can be improved.
It should also be appreciated that by providing spatial feature information to each other for object detection and instance segmentation, mutual supervision of the object detection task and the instance segmentation task may be achieved, so that prediction accuracy of the instance segmentation task may be improved together.
Therefore, the embodiment of the application can further improve the prediction accuracy of the instance segmentation task by mutually providing the spatial characteristic information through the target detection and the instance segmentation.
For convenience of description and understanding, the following conventions are used hereinafter: spatial feature information acquired based on the original feature data of the target detection task is denoted first spatial feature information; spatial feature information acquired based on the original feature data of the instance segmentation task is denoted second spatial feature information; feature data acquired from the original feature data of the instance segmentation task and the first spatial feature information is denoted second feature data; and feature data acquired from the original feature data of the target detection task and the second spatial feature information is denoted first feature data.
Alternatively, in the embodiment shown in fig. 3 or fig. 4, in the case where the first image processing task is the object detection task and the second image processing task is the instance segmentation task, step S310 includes the following steps S311, S312, and S313, as shown in fig. 5.
S311, acquiring third spatial feature information based on the original feature data of the target detection task.
S312, according to the third spatial feature information, the transverse feature information and the longitudinal feature information are respectively acquired.
S313, recombining the transverse characteristic information and the longitudinal characteristic information to obtain first spatial characteristic information.
As an example, it is assumed that in step S311 the third spatial feature information acquired based on the original feature data of the target detection task is a feature map whose height, width, and number of channels are H×W×C. In step S312, global maximum pooling is performed on the feature map along the lateral direction and the longitudinal direction, respectively, to obtain a lateral feature of dimension W×C and a longitudinal feature of dimension H×C. In step S313, the lateral feature of dimension W×C and the longitudinal feature of dimension H×C are recombined into a feature map of dimension H×W×C, which is the first spatial feature information provided to the instance segmentation task. In the feature map obtained in step S313, the feature response at each position is the mean of the lateral and longitudinal feature responses of the corresponding row and column.
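The worked example above can be expressed compactly as follows. This sketch assumes an (N, C, H, W) tensor layout and reproduces the described behaviour: directional global max pooling followed by recombination in which each position takes the mean of its row and column responses.

```python
import torch

def recombine_spatial_info(third_spatial_info):
    # third_spatial_info: (N, C, H, W) feature map from the detection branch.
    # Global max pooling along the width gives the H x C longitudinal feature;
    # pooling along the height gives the W x C lateral feature.
    longitudinal = third_spatial_info.max(dim=3, keepdim=True).values  # (N, C, H, 1)
    lateral = third_spatial_info.max(dim=2, keepdim=True).values       # (N, C, 1, W)
    # Recombination: each position takes the mean of its row and column responses.
    first_spatial_info = 0.5 * (longitudinal + lateral)                # (N, C, H, W) by broadcasting
    return first_spatial_info
```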
It should be appreciated that the detection frame information predicted by target detection is relatively coarse compared with the mask, i.e., pixel-level, information of instance segmentation; in other words, the coarse detection frame information is error-prone relative to the pixel-level information. For example, target detection and instance segmentation each have a corresponding H×W×C feature map, but without special processing the detection frame branch ultimately only needs to output the upper-left and lower-right vertex coordinates of the frame, whereas the instance segmentation prediction outputs every pixel belonging to the detected object; therefore, with respect to instance segmentation, the feature map of the detection frame branch is coarse and error-prone compared with that of the instance segmentation branch.
In the embodiment of the application, lateral feature acquisition and longitudinal feature acquisition are first performed on the spatial feature information of the target detection, and the lateral features and the longitudinal features are then recombined to obtain recombined spatial feature information. This is equivalent to extracting lateral information and longitudinal information from the spatial feature information of the target detection to replace the original pixel information, where the original pixel information refers to the original spatial feature information of the target detection. Sharing the recombined spatial feature information with the instance segmentation reduces the error of the coarser detection frame relative to the more accurate segmentation mask, so the recombined spatial feature information is more helpful for improving the accuracy of the spatial feature information used for instance segmentation.
Therefore, in the embodiment of the application, lateral and longitudinal features are first extracted from the spatial feature information of the target detection, the lateral and longitudinal features are then recombined, and the recombined spatial feature information is provided to the instance segmentation; this improves the accuracy of the spatial feature information used for instance segmentation, and thus the prediction accuracy of the instance segmentation task.
In some embodiments described above, taking the first image processing task as the target detection task and the second image processing task as the instance segmentation task as an example, in step S330, the second feature data is input into the segmentation mask prediction model, and the segmentation mask prediction result of the second feature data is obtained using the segmentation mask prediction model.
For example, the segmentation mask prediction model is trained using a pixel-by-pixel classification loss function that constrains the output of the segmentation mask prediction model through segmentation mask tag information.
Alternatively, the segmentation mask predictive model may be trained by the method 800 of the following embodiment.
In some embodiments, taking the first image processing task as the target detection task and the second image processing task as the instance segmentation task as an example, in step S360, the target detection prediction result of the first feature data may be obtained using the detection frame prediction model.
For example, the detection frame prediction model is trained using a detection regression loss function that constrains the output of the detection frame prediction model through target detection tag information.
For example, in the embodiment shown in fig. 3 or fig. 4, in step 320, the second feature data is obtained by processing the original feature data of the second image processing task and the first spatial feature information using a convolution layer.
For example, in the embodiment shown in fig. 4, in step 350, the first feature data is obtained by processing the original feature data of the first image processing task and the second spatial feature information using a convolution layer.
As an example, take the first image processing task as the target detection task and the second image processing task as the instance segmentation task. In step 350, the first feature data may be obtained by using a convolution layer such as the one in block 910 in fig. 11 to process the second spatial feature information together with the original feature data of the object detection task. In step 320, a convolution layer such as the one in block 920 of fig. 11 may be used to process the first spatial feature information and the original feature data of the instance segmentation task to obtain the second feature data.
For example, in the embodiment shown in fig. 3 or fig. 4, in step 310, first spatial feature information is obtained by processing raw feature data of a first image processing task using a convolution layer.
For another example, in the embodiment shown in fig. 4, in step 340, second spatial feature information is obtained by processing raw feature data of a second image processing task using a convolution layer.
For another example, in the embodiment shown in fig. 5, in step S311, the original feature data of the target detection task is processed by using the convolution layer to obtain the third spatial feature information; in step S312, the lateral feature information and the longitudinal feature information are acquired by processing the third spatial feature information using the pooling layer; and in step S313, the lateral feature information and the longitudinal feature information are processed by the recombination layer to obtain the first spatial feature information.
As an example, take the first image processing task as the target detection task and the second image processing task as the instance segmentation task. In step S311, the original feature data of the target detection task may be processed using the convolution layer in the unit 931 in fig. 11 to obtain the third spatial feature information. In step S312, the third spatial feature information may be processed using the directional pooling layer in the unit 931 in fig. 11 to obtain the lateral feature information and the longitudinal feature information. In step S313, the lateral feature information and the longitudinal feature information may be recombined using the directional recombination layer in the unit 931 in fig. 11 to obtain the first spatial feature information. In step 340, the original feature data of the instance segmentation task may be processed using the convolution layer in the unit 932 in fig. 11 to obtain the second spatial feature information.
For example, the method 300 in the above embodiments may be performed by the apparatus 900, the apparatus 1100, or the apparatus 1200 in the following embodiments.
As an example, the method 300 is performed by the apparatus 900 in the following embodiments. Referring to fig. 4 and 9, step S310 may be performed by the first spatial feature information acquisition unit 931, step S320 may be performed by the instance segmentation task feature acquisition module 920, step S340 may be performed by the second spatial feature information acquisition unit 932, and step S350 may be performed by the target detection task feature acquisition module 910. Referring to fig. 5 and 10, step S311 may be performed by the subunit 1001 in the first spatial feature information acquisition unit 931, and steps S312 and S313 may be performed by the subunit 1002 in the first spatial feature information acquisition unit 931, where step S312 may be implemented by the directional pooling layer in the subunit 1002, and step S313 may be implemented by the directional recombination layer in the subunit 1002.
Therefore, in the embodiment of the application, one of target detection and instance segmentation provides spatial feature information to the other, and the receiving party can correct its feature data with the provided spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Further, when target detection and instance segmentation each provide spatial feature information to the other, each task can correct its feature data with the other task's spatial feature information, so that the prediction results of target detection and instance segmentation become more consistent, and the prediction accuracy of the instance segmentation task can be improved.
As shown in fig. 6, an embodiment of the present application further provides a method 600 for image processing. The method 600 includes the following steps S610 and S620.
S610, acquiring first characteristic data according to the original characteristic data of the target detection task, and acquiring second characteristic data according to the original characteristic data of the instance segmentation task.
S620, obtaining a target detection prediction result of the first characteristic data and obtaining an instance mask prediction result of the second characteristic data.
As shown in fig. 6, in step S610, first feature data and second feature data are acquired by performing the following operations. Wherein, the initial value of i is 1, and N is a positive integer.
S0, take the original feature data of the target detection task as feature data IF1_1, and take the original feature data of the instance segmentation task as feature data IF2_1. In other words, when the value of i is 1, the feature data IF1_i is the original feature data of the target detection task, and the feature data IF2_i is the original feature data of the instance segmentation task.
S1, spatial characteristic information X1 is acquired based on the characteristic data IF1_i.
S2, spatial characteristic information X2 is acquired based on the characteristic data IF2_i.
S3, acquiring the characteristic data OF1_i according to the characteristic data IF1_i and the spatial characteristic information X2.
S4, acquiring the characteristic data OF2_i according to the characteristic data IF2_i and the spatial characteristic information X1.
S5, judging whether the value of i is equal to N, if not, turning to step S6, and if so, turning to step S7.
S6, add 1 to the value of i, take the feature data OF1_(i-1) as the feature data IF1_i and the feature data OF2_(i-1) as the feature data IF2_i, and go to step S1.
In step S6, the following expressions can be used to take the feature data OF1_(i-1) as the feature data IF1_i and the feature data OF2_(i-1) as the feature data IF2_i.
IF1_i=OF1_(i-1)
IF2_i=OF2_(i-1)。
S7, the feature data OF1_i is used as the first feature data, and the feature data OF2_i is used as the second feature data.
Alternatively, step S1 in method 600 may employ an implementation of step 310 as shown in fig. 5. The related descriptions are detailed above, and are not repeated here.
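By way of example and not limitation, the loop of steps S0 to S7 can be sketched as follows. The four callables stand in for the convolution-based operations of steps S1 to S4 and are assumed interfaces rather than the exact layers of the application.

```python
def interleaved_features(raw_det, raw_seg, get_x1, get_x2, fuse_det, fuse_seg, N):
    """Sketch of steps S0 to S7 of method 600.

    get_x1 / get_x2 obtain spatial feature information X1 / X2 (steps S1, S2);
    fuse_det / fuse_seg combine one task's feature data with the other task's
    spatial information (steps S3, S4). All four callables are assumed
    interfaces, not the application's exact layers.
    """
    if1, if2 = raw_det, raw_seg            # step S0: IF1_1, IF2_1 (i = 1)
    for _ in range(N):                     # steps S5/S6: repeat until i == N
        x1 = get_x1(if1)                   # step S1: spatial feature information X1
        x2 = get_x2(if2)                   # step S2: spatial feature information X2
        of1 = fuse_det(if1, x2)            # step S3: feature data OF1_i
        of2 = fuse_seg(if2, x1)            # step S4: feature data OF2_i
        if1, if2 = of1, of2                # step S6: IF_(i+1) = OF_i
    return of1, of2                        # step S7: first and second feature data
```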
In the embodiment of the application, target detection and instance segmentation each provide spatial feature information to the other, and each task can correct its feature data with the other task's spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a greater extent, and the prediction accuracy of the instance segmentation task can be improved.
In addition, by performing multiple rounds in which target detection and instance segmentation provide spatial feature information to each other, the feature data of target detection and instance segmentation can be corrected better, so that their prediction results become consistent to a greater extent, and thus the prediction accuracy of the instance segmentation task can be improved.
For example, the method 600 may be performed by the apparatus 1200 of the following embodiments.
As shown in fig. 7, an embodiment of the present application further provides a method 700 for image processing, where the method 700 includes the following steps S710 and S720.
S710, inputting the image data to be processed into the segmentation mask prediction model.
S720, obtaining a segmentation mask prediction result of the image data to be processed by using the segmentation mask prediction model.
The segmentation mask prediction model is trained by using a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection tag information.
In other words, the detection assistance loss function uses the target detection tag information to constrain the segmentation mask prediction result of the segmentation mask prediction model.
It should be appreciated that the target detection tag information is typically used to train a detection frame prediction model, and the detection regression loss function as shown in FIG. 13 uses the target detection tag information to constrain the output of the detection frame prediction model.
For example, the segmentation mask prediction model may be obtained by the method 800 of the following embodiment.
It should be appreciated that in training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model may be improved by constraining the segmentation mask prediction result output by the segmentation mask prediction model using the target detection tag information. Therefore, the prediction accuracy of the instance segmentation task can be improved by adopting the segmentation mask prediction model to execute the instance segmentation task.
It should be appreciated that in addition to detecting the auxiliary penalty function, a pixel-by-pixel classification penalty function is used in training the segmentation mask prediction model, which uses the segmentation mask label information to constrain the output of the segmentation mask prediction model.
In addition, since the current instance segmentation task needs to use the rectangular detection frame region output by the target detection task as prior information (as shown in fig. 2), when the prediction result of the rectangular detection frame region is inaccurate, the prediction accuracy of the instance segmentation task may be reduced. In other words, when an inaccurate target detection prediction result is used as prior information for the instance segmentation task, the prediction result of the instance segmentation may be affected, for example, a segmentation mask of poorer quality may be predicted.
In the embodiment of the application, in the process of training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by using the target detection label information to restrict the output of the segmentation mask prediction model.
Optionally, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a transverse detection auxiliary loss function.
The longitudinal detection auxiliary loss function is used for constraining longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function is used for constraining transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
It should be understood that the performance of the division mask prediction model may be further improved by constraining the lateral information and the longitudinal information of the division mask prediction result output by the division mask prediction model, respectively, using the target detection tag information.
Alternatively, in the embodiment shown in fig. 7, the image data to be processed may be the second feature data obtained in the method 300 or the method 600 of the above-described embodiment.
It should be appreciated that, when the second feature data obtained in this way is used as the input, the embodiment of the application can improve the prediction accuracy of the instance segmentation task more effectively.
As shown in fig. 8, an embodiment of the present application further provides a method 800 for training a neural network, where the method 800 includes the following steps S810 and S820.
S810, acquiring target detection tag information.
S820, training to obtain a segmentation mask prediction model by using a detection auxiliary loss function, wherein the detection auxiliary loss function is used for restraining the output of the segmentation mask prediction model through target detection label information.
It should be appreciated that the target detection tag information is typically used to train a detection frame prediction model, and the detection regression loss function as shown in FIG. 13 uses the target detection tag information to constrain the output of the detection frame prediction model.
In other words, the detection assistance loss function uses the target detection tag information to constrain the segmentation mask prediction result of the segmentation mask prediction model.
In the process of training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by using the target detection tag information to constrain the segmentation mask prediction result output by the segmentation mask prediction model.
It should be appreciated that in addition to detecting the auxiliary penalty function, a pixel-by-pixel classification penalty function is used in training the segmentation mask prediction model, which uses the segmentation mask label information to constrain the output of the segmentation mask prediction model.
It should be appreciated that performing the instance segmentation task using the segmentation mask prediction model obtained by the method 800 of fig. 8 may improve the prediction accuracy of the instance segmentation task.
In addition, since the current instance segmentation task needs to use the rectangular detection frame region output by the target detection task as prior information (as shown in fig. 2), when the prediction result of the rectangular detection frame region is inaccurate, the prediction accuracy of the instance segmentation task may be reduced. In other words, when an inaccurate target detection prediction result is used as prior information for the instance segmentation task, the prediction result of the instance segmentation may be affected, for example, a segmentation mask of poorer quality may be predicted.
In the embodiment of the application, in the process of training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by using the target detection label information to restrict the output of the segmentation mask prediction model.
Optionally, the detection assistance loss function includes a longitudinal detection assistance loss function and a lateral detection assistance loss function. The longitudinal detection auxiliary loss function is used for constraining longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function is used for constraining transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
For example, for a segmentation mask prediction result of size w×h and the corresponding detection frame label information, both are first uniformly divided into n×n blocks, each of size (w/n)×(h/n). A lateral and a longitudinal global max pooling operation are then performed on each block to obtain the corresponding lateral and longitudinal masks. Finally, the lateral and longitudinal auxiliary loss functions are computed between the masks obtained from the detection frame label information and those obtained from the segmentation mask prediction, so as to constrain the output of the segmentation mask prediction model.
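By way of example and not limitation, this computation can be sketched in PyTorch-style code as follows. The use of binary cross-entropy to compare the directional projections, the block count n=4, and the (H, W) tensor layout are illustrative assumptions rather than the exact formulation of the application.

```python
import torch
import torch.nn.functional as F

def detection_auxiliary_loss(mask_pred: torch.Tensor,
                             box_mask: torch.Tensor,
                             n: int = 4) -> torch.Tensor:
    """Lateral + longitudinal detection auxiliary loss (a sketch).

    mask_pred: (H, W) predicted mask probabilities in [0, 1];
    box_mask: (H, W) binary mask rendered from the detection frame label.
    Both are split into n x n blocks; within each block a lateral and a
    longitudinal global max pooling produce directional masks, which are
    compared with binary cross-entropy (an assumed choice of comparison).
    """
    H, W = mask_pred.shape
    bh, bw = H // n, W // n

    def directional(m: torch.Tensor):
        blocks = m[: n * bh, : n * bw].reshape(n, bh, n, bw)   # (n, bh, n, bw) blocks
        lateral = blocks.amax(dim=3)       # max over block width  -> (n, bh, n)
        longitudinal = blocks.amax(dim=1)  # max over block height -> (n, n, bw)
        return lateral.reshape(-1), longitudinal.reshape(-1)

    p_lat, p_lon = directional(mask_pred)
    t_lat, t_lon = directional(box_mask.float())
    return (F.binary_cross_entropy(p_lat, t_lat)
            + F.binary_cross_entropy(p_lon, t_lon))
```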
It should be understood that the performance of the division mask prediction model may be further improved by constraining the lateral information and the longitudinal information of the division mask prediction result output by the division mask prediction model, respectively, using the target detection tag information.
Alternatively, the segmentation mask prediction model obtained in the embodiment shown in fig. 8 may be used to process the second feature data in the method 300 or the method 600 to obtain a segmentation mask prediction result of the second feature data.
It should be appreciated that, when combined in this way, the embodiment of the application can improve the prediction accuracy of the instance segmentation task more effectively.
The various embodiments described herein may be separate solutions or may be combined according to inherent logic, which fall within the scope of the present application.
The method embodiments provided by the present application are described above, and the device embodiments provided by the present application will be described below. It should be understood that the descriptions of the apparatus embodiments and the descriptions of the method embodiments correspond to each other, and thus, descriptions of details not described may be referred to the above method embodiments, which are not repeated herein for brevity.
As shown in fig. 9, the embodiment of the application further provides an apparatus 900 for image processing. The apparatus 900 includes a target detection task feature acquisition module 910, an instance segmentation task feature acquisition module 920, and a spatial feature information alignment module 930.
The target detection task feature acquisition module 910 is configured to acquire the first feature data based on the original feature data of the target detection task. The first feature data is used for performing the target detection task, that is, for detection frame prediction as shown in fig. 9.
The instance segmentation task feature acquisition module 920 is configured to acquire the second feature data based on the original feature data of the instance segmentation task. The second feature data is used for performing the instance segmentation task, that is, for segmentation mask prediction as shown in fig. 9.
The spatial feature information alignment module 930 is configured to align spatial feature information of the object detection task and the instance segmentation task.
Optionally, as shown in fig. 9, the spatial feature information alignment module 930 includes a first spatial feature information acquisition unit 931, configured to obtain the first spatial feature information from the original feature data of the target detection task and provide the first spatial feature information to the instance segmentation task feature acquisition module 920. Accordingly, the instance segmentation task feature acquisition module 920 is configured to fuse the first spatial feature information with the original feature data of the instance segmentation task and output the second feature data.
Optionally, as shown in fig. 9, the spatial feature information alignment module 930 further includes a second spatial feature information acquisition unit 932, configured to obtain the second spatial feature information from the original feature data of the instance segmentation task and provide the second spatial feature information to the target detection task feature acquisition module 910. Accordingly, the target detection task feature acquisition module 910 is configured to fuse the second spatial feature information with the original feature data of the target detection task and output the first feature data.
Alternatively, as shown in fig. 10, the first spatial feature information obtaining unit 931 includes a spatial feature information obtaining subunit 1001 and a spatial feature information processing subunit 1002.
The spatial feature information obtaining subunit 1001 is configured to obtain third spatial feature information based on the original feature data of the target detection task.
The spatial feature information processing subunit 1002 includes a directional pooling layer and a directional recombination layer. The directional pooling layer is used for acquiring lateral features and longitudinal features from the third spatial feature information. The directional recombination layer is used for recombining the lateral features and the longitudinal features and outputting the first spatial feature information.
For example, the apparatus 900 may be used to perform the method 300 in the above embodiments.
Referring to fig. 4 and 9, the object detection task feature acquisition module 910 is configured to perform step S350 in the above embodiment, the instance segmentation task feature acquisition module 920 is configured to perform step S320 in the above embodiment, the spatial feature information alignment module 930 is configured to perform steps S310 and S340 in the above embodiment, the first spatial feature information acquisition unit 931 is configured to perform step S310 in the above embodiment, and the second spatial feature information acquisition unit 932 is configured to perform step S340 in the above embodiment.
Referring to fig. 5 and 10, the subunit 1001 in the first spatial feature information acquisition unit 931 is configured to perform step S311 in the above embodiment, and the subunit 1002 in the first spatial feature information acquisition unit 931 is configured to perform steps S312 and S313 in the above embodiment, where the directional pooling layer in the subunit 1002 is configured to implement step S312, and the directional recombination layer in the subunit 1002 is configured to implement step S313.
The related descriptions are detailed above, and are not repeated here for brevity.
An example of the apparatus 900 is described below with reference to fig. 11.
By way of example, and not limitation, one example of apparatus 900 is as apparatus 1100 in fig. 11.
For example, the object detection task feature acquisition module 910 and the instance segmentation task feature acquisition module 920 are each formed by stacked convolution layers. It should be appreciated that the target detection task and the instance segmentation task are two different computer vision tasks, and thus, the stacked convolution layers used by the target detection task feature acquisition module 910 are different from the stacked convolution layers used by the instance segmentation task feature acquisition module 920.
As shown in fig. 11, the target detection task feature acquisition module 910 includes 2 convolution layers with a convolution kernel size of 1×1 and 1 convolution layer with a convolution kernel size of 3×3. The instance segmentation task feature acquisition module 920 includes 1 convolution layer with a convolution kernel size of 1×1 and 1 convolution layer with a convolution kernel size of 3×3.
By way of example and not limitation, the internal data flow by which the target detection task feature acquisition module 910 generates the output first feature data from its input is as follows. The input of the target detection task feature acquisition module 910 is denoted as original feature data A0. The original feature data A0 first passes through a convolution layer with a convolution kernel size of 1×1 to obtain a feature map A1 with 1024 channels; the feature map A1 passes through a convolution layer with a convolution kernel size of 3×3 to obtain a feature map A2 with 256 channels; the feature map A2 and the second spatial feature information (which may also be referred to as a spatial information feature map) C2 from the spatial feature information alignment module 930 are spliced together along the channel dimension (as in 910 in fig. 11), and then pass through a convolution layer with a convolution kernel size of 1×1 to obtain a feature map A3 with 1024 channels; the feature map A3 is added to the original feature data A0 to obtain the output O1. The output O1 is the first feature data output by the target detection task feature acquisition module 910.
By way of example and not limitation, the internal data flow by which the instance segmentation task feature acquisition module 920 generates the output second feature data from its input is as follows. The input of the instance segmentation task feature acquisition module 920 is denoted as original feature data B0. The original feature data B0 first passes through a convolution layer with a convolution kernel size of 3×3 to obtain a feature map B1 with 256 channels; the feature map B1 and the first spatial feature information (which may also be referred to as a spatial information feature map) C1 from the spatial feature information alignment module 930 are spliced together along the channel dimension (as in 920 in fig. 11), and then pass through a convolution layer with a convolution kernel size of 1×1 to obtain a feature map B2 with 256 channels as the output O2. The output O2 is the second feature data output by the instance segmentation task feature acquisition module 920.
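By way of example and not limitation, the two data flows above can be summarized in the following PyTorch-style sketch. The channel counts follow the description above, while the channel count of the spatial information feature maps C1/C2, the padding used to keep spatial sizes aligned, and all class names are illustrative assumptions rather than the exact construction of the application.

```python
import torch
import torch.nn as nn

class DetTaskFeatureModule(nn.Module):
    """Sketch of module 910: A0 -> A1 -> A2, splice with C2, -> A3, residual add to A0."""
    def __init__(self, in_ch: int = 1024, spatial_ch: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 1024, kernel_size=1)             # A0 -> A1 (1024 channels)
        self.conv2 = nn.Conv2d(1024, 256, kernel_size=3, padding=1)    # A1 -> A2 (256 channels)
        self.conv3 = nn.Conv2d(256 + spatial_ch, 1024, kernel_size=1)  # [A2, C2] -> A3 (1024 channels)

    def forward(self, a0: torch.Tensor, c2: torch.Tensor) -> torch.Tensor:
        a1 = self.conv1(a0)
        a2 = self.conv2(a1)
        a3 = self.conv3(torch.cat([a2, c2], dim=1))   # splice along the channel dimension
        return a3 + a0                                # output O1: the first feature data

class SegTaskFeatureModule(nn.Module):
    """Sketch of module 920: B0 -> B1, splice with C1, -> B2."""
    def __init__(self, in_ch: int = 256, spatial_ch: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 256, kernel_size=3, padding=1)   # B0 -> B1 (256 channels)
        self.conv2 = nn.Conv2d(256 + spatial_ch, 256, kernel_size=1)   # [B1, C1] -> B2 (256 channels)

    def forward(self, b0: torch.Tensor, c1: torch.Tensor) -> torch.Tensor:
        b1 = self.conv1(b0)
        return self.conv2(torch.cat([b1, c1], dim=1))  # output O2: the second feature data
```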
As shown in fig. 11, the spatial feature information alignment module 930 may be implemented by convolution layers, or by convolution layers and pooling layers. The subunit 1001 in the first spatial feature information acquisition unit 931 includes 1 convolution layer with a convolution kernel size of 1×1. The subunit 1002 in the first spatial feature information acquisition unit 931 includes a directional pooling layer and a directional recombination layer. The second spatial feature information acquisition unit 932 includes 1 convolution layer with a convolution kernel size of 1×1.
By way of example and not limitation, the internal data flow by which the first spatial feature information acquisition unit 931 acquires the first spatial feature information is as follows. The input of the first spatial feature information acquisition unit 931 is denoted as original feature data C10. The original feature data C10 passes through a convolution layer with a convolution kernel size of 1×1 to obtain the third spatial feature information C11; the third spatial feature information C11 passes through the directional pooling layer to obtain lateral features and longitudinal features; the lateral features and longitudinal features pass through the directional recombination layer to obtain the first spatial feature information (which may also be referred to as a spatial information feature map) C1. The first spatial feature information C1 is provided to the instance segmentation task feature acquisition module 920.
By way of example and not limitation, the internal data flow by which the second spatial feature information acquisition unit 932 acquires the second spatial feature information is as follows. The input of the second spatial feature information acquisition unit 932 is denoted as original feature data C20. The original feature data C20 passes through a convolution layer with a convolution kernel size of 1×1 to obtain the second spatial feature information (which may also be referred to as a spatial information feature map) C2. The second spatial feature information C2 is provided to the target detection task feature acquisition module 910.
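By way of example and not limitation, the two units above can be sketched in PyTorch-style code as follows. The exact recombination rule of the directional recombination layer is not spelled out in the text; broadcasting the lateral (per-row) and longitudinal (per-column) pooled features back to the full spatial grid and summing them is only one plausible reading, and the channel counts, the use of max pooling, and all class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FirstSpatialInfoUnit(nn.Module):
    """Sketch of unit 931: 1x1 convolution, directional pooling, directional recombination."""
    def __init__(self, in_ch: int = 1024, out_ch: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # C10 -> third spatial info C11

    def forward(self, c10: torch.Tensor) -> torch.Tensor:
        c11 = self.conv(c10)
        lateral = c11.amax(dim=3, keepdim=True)       # directional pooling along width  -> (N, C, H, 1)
        longitudinal = c11.amax(dim=2, keepdim=True)  # directional pooling along height -> (N, C, 1, W)
        # directional recombination: broadcast both projections back to H x W and sum
        # (an assumed recombination rule, used for illustration only)
        return lateral + longitudinal                 # first spatial feature information C1

class SecondSpatialInfoUnit(nn.Module):
    """Sketch of unit 932: a single 1x1 convolution from C20 to C2."""
    def __init__(self, in_ch: int = 256, out_ch: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, c20: torch.Tensor) -> torch.Tensor:
        return self.conv(c20)                         # second spatial feature information C2
```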
It should be noted that fig. 11 is only an example and not a limitation. That is, the apparatus 1100 illustrated in fig. 11 is merely one optional implementation of the apparatus 900.
The apparatus 900 may have a variety of possible structural variations that can implement the method 300 of the above embodiments.
For example, the target detection task feature acquisition module 910 may use stacked convolution layers different from those shown in fig. 11, the instance segmentation task feature acquisition module 920 may use stacked convolution layers different from those shown in fig. 11, and the spatial feature information alignment module 930 may use convolution layers and pooling layers different from those shown in fig. 11.
In fig. 11, the input (C10) of the first spatial feature information acquisition unit 931 is the same as the input (A0) of the target detection task feature acquisition module 910, and the input (C20) of the second spatial feature information acquisition unit 932 is the same as the input (B0) of the instance segmentation task feature acquisition module 920. But the application is not so limited.
Alternatively, the input (C10) of the first spatial feature information obtaining unit 931 may be different from the input (A0) of the target detection task feature obtaining module 910, and the input (C20) of the second spatial feature information obtaining unit 932 may be different from the input (B0) of the instance segmentation task feature obtaining module 920.
Referring to fig. 11, as an example, the input (C10) of the first spatial feature information obtaining unit 931 may be a feature map A1 obtained after the input A0 of the target detection task feature obtaining module 910 passes through a convolution layer having a convolution kernel size of 1×1, or may be a feature map A2 obtained after A0 passes through a convolution layer having a convolution kernel size of 1×1 and then passes through a convolution layer having a convolution kernel size of 3×3.
Referring to fig. 11, as an example, the input (C20) of the second spatial feature information obtaining unit 932 may be a feature map B1 obtained after the input B0 of the example segmentation task feature obtaining module 920 passes through a convolution layer having a convolution kernel size of 3×3.
The deployed product form of the apparatus 900 and the apparatus 1100 may be a multi-target accurate positioning service for customized scenarios. For example, the apparatus 900 or 1100 may be deployed in a computing node of an associated device.
As shown in fig. 12, an embodiment of the present application further provides an apparatus 1200 for image processing. The apparatus 1200 includes n apparatuses 900 in the above embodiments, such as the apparatus 900 (1), the apparatus 900 (2), …, and the apparatus 900 (n) shown in fig. 12.
n is a positive integer. Device 900 (i) represents an i-th device 900 in device 1200, i being 1, 2. In practical application, the value of n can be determined according to the application requirement, which is not limited by the application.
Each apparatus 900 included in the apparatus 1200 may be referred to as an interleaved branch sub-network 900, and the apparatus 1200 may be referred to as an interleaved branch network 1200.
In the apparatus 1200, the output of each interleaved branch sub-network 900 is used as the input of the next interleaved branch sub-network 900. That is, the output of the target detection task feature acquisition module 910 in each interleaved branch sub-network 900(i) (for i < n) serves as the input of the target detection task feature acquisition module 910 in the next interleaved branch sub-network 900(i+1), and the output of the instance segmentation task feature acquisition module 920 in each interleaved branch sub-network 900(i) serves as the input of the instance segmentation task feature acquisition module 920 in the next interleaved branch sub-network 900(i+1).
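By way of example and not limitation, the chaining of the n interleaved branch sub-networks can be sketched as follows. Here `make_subnet` is an assumed factory returning a module that maps the pair (detection features, segmentation features) to the next pair, for example a pairing of the feature-acquisition and spatial-alignment sketches given earlier; it is an illustrative assumption, not the exact construction of the application.

```python
import torch.nn as nn

class InterleavedBranchNetwork(nn.Module):
    """Sketch of apparatus 1200: n interleaved branch sub-networks connected in series."""
    def __init__(self, make_subnet, n: int):
        super().__init__()
        # make_subnet() is an assumed interface: each sub-network takes and
        # returns the pair (detection features, segmentation features)
        self.subnets = nn.ModuleList([make_subnet() for _ in range(n)])

    def forward(self, det_feat, seg_feat):
        for subnet in self.subnets:
            # the outputs of sub-network 900(i) are the inputs of 900(i+1)
            det_feat, seg_feat = subnet(det_feat, seg_feat)
        return det_feat, seg_feat   # first feature data, second feature data
```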
Optionally, the n interleaved branch sub-networks 900 in the apparatus 1200 have the same structure and parameters.
For example, each of the interleaved sub-networks 900 in the apparatus 1200 is the apparatus 1100 shown in fig. 11.
Optionally, the n interleaved branch sub-networks 900 in the apparatus 1200 do not have exactly the same structure and parameters.
For example, the architecture of each interleaved branch sub-network 900 in the apparatus 1200 is identical to that of the apparatus 900 shown in fig. 9, but some of the interleaved branch sub-networks 900 have the specific structure of the apparatus 1100 shown in fig. 11, while others have a specific structure different from that of fig. 11.
Apparatus 1200 may be used to perform method 600 of the above embodiments.
The deployed product form of the apparatus 1200 may be a multi-target accurate positioning service for customized scenarios. For example, the apparatus 1200 may be deployed in a computing node of an associated device.
As shown in fig. 13, an embodiment of the present application further provides a system 1300 for image processing. The system 1300 includes a backbone network 1310, a region proposal network 1320, a full connectivity layer 1330, a staggered branch network 1340, a multi-classification prediction model 1350, a detection-frame prediction model 1360, and a segmentation mask prediction model 1370. Wherein the interleaved branch network 1340 is the device 1200 in the above embodiments.
The system 1300 may be used to perform image classification tasks, object detection tasks, and instance segmentation tasks. For example, the image data to be processed is input to the system 1300, and the system 1300 may output the category, detection frame, and segmentation mask prediction result of each object.
As an example, the operational flow of the system 1300 to perform the image classification task, the object detection task, and the instance segmentation task includes the following steps.
Step 1), feature acquisition is performed on the image data to be processed by using the backbone network 1310, so as to obtain the image features of the whole image.
Step 2), candidate region positions of a plurality of targets are generated using the region proposal network 1320, and a feature map of each candidate region is acquired, that is, candidate region features as shown in fig. 13 are obtained.
Step 3), the full connection layer 1330 is used to process the candidate region features, that is, the feature map of each candidate region generated by the region proposal network 1320, to obtain classification feature data for inputting the multi-classification prediction model 1350.
Step 4), the classification characteristic data is processed by using the multi-classification prediction model 1350 to obtain multi-classification prediction results.
Step 5), the candidate region features, that is, the feature map of each candidate region generated by the region proposal network 1320, are processed using the interleaved branch network 1340 to obtain detection feature data (corresponding to the first feature data in the above embodiments) to be input into the detection frame prediction model 1360, and segmentation feature data (corresponding to the second feature data in the above embodiments) to be input into the segmentation mask prediction model 1370.
Step 6), perform target detection processing on the detection feature data using the detection frame prediction model 1360 to obtain a detection frame prediction result.
Step 7), the segmentation feature data is processed by using the segmentation mask prediction model 1370 to obtain a segmentation mask prediction result.
The execution sequence of steps 1) to 7) is determined by the internal logic relationship, and is not limited to the sequence of text presentation.
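By way of example and not limitation, the flow of steps 1) to 7) can be sketched as follows. Every callable name is an assumed placeholder: in particular, `rpn` is assumed to bundle candidate-region generation and per-region feature extraction (step 2), and both branches of the interleaved network are assumed to receive the same candidate region features.

```python
def run_system_1300(image, backbone, rpn, fc_layer, interleaved_net,
                    cls_head, box_head, mask_head):
    """Sketch of steps 1) to 7) performed by the system 1300."""
    img_feat = backbone(image)                                   # step 1: whole-image features
    roi_feats = rpn(img_feat)                                    # step 2: candidate region features
    cls_feat = fc_layer(roi_feats)                               # step 3: classification feature data
    cls_pred = cls_head(cls_feat)                                # step 4: multi-classification result
    det_feat, seg_feat = interleaved_net(roi_feats, roi_feats)   # step 5: interleaved branch network 1340
    box_pred = box_head(det_feat)                                # step 6: detection frame prediction
    mask_pred = mask_head(seg_feat)                              # step 7: segmentation mask prediction
    return cls_pred, box_pred, mask_pred
```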
For example, in system 1300, full connectivity layer 1330 and multi-class prediction model 1350 may be collectively referred to as a multi-class branch network; the interleaved branch network 1340 and the detection-box prediction model 1360 may be collectively referred to as a target detection branch network; the interleaved branch network 1340 and the segmentation mask prediction model 1370 may be collectively referred to as an example segmentation branch network.
In system 1300, the classification branch network uses a separate full connection layer to obtain feature maps and make class predictions. The target detection branch network and the instance segmentation branch network together use the interleaved branch network 1340 to obtain respective feature data.
It should be appreciated that in the system 1300, by using the interleaved branch network 1340 (that is, the apparatus 1200 shown in fig. 12) provided in the embodiment of the application to obtain feature data for the target detection task and the instance segmentation task, mutual supervision of the target detection task and the instance segmentation task can be implemented, so that the prediction accuracy of the two tasks can be improved jointly.
Optionally, the system 1300 can also be used to train and deploy an instance segmentation task model. As shown in fig. 13, the system 1300 can build an instance segmentation network model for a general scene by acquiring given image data from an image training data repository and acquiring given label information from a label data repository.
By way of example, the operational flow of training an instance segmentation task model by the system 1300 includes the following steps.
Step (1), given image data is acquired from the image training data repository and input into the backbone network 1310.
Step (2), the steps 1) to 7) are performed, and details are described above, and are not repeated here.
In step (2), the multi-classification prediction model 1350, the detection frame prediction model 1360, and the segmentation mask prediction model 1370 are trained by acquiring the target classification tag information, the target detection tag information, and the segmentation mask tag information from the tag data warehouse.
For example, the multi-class prediction model 1350 is trained with multi-class loss functions that use target class label information to constrain the output of the multi-class prediction model 1350.
For example, the detection box prediction model 1360 is trained by detecting a regression loss function that uses target detection tag information to constrain the output of the detection box prediction model 1360.
For example, the segmentation mask prediction model 1370 is trained with a pixel-by-pixel classification loss function that uses segmentation mask label information to constrain the output of the segmentation mask prediction model 1370.
Optionally, in step (2), the segmentation mask prediction model 1370 is trained by segmenting the mask tag information and the object detection tag information.
For example, the segmentation mask prediction model 1370 is trained using a pixel-by-pixel classification loss function, which uses the segmentation mask label information to constrain the output of the segmentation mask prediction model, and a detection auxiliary loss function, which constrains the output of the segmentation mask prediction model through the target detection label information.
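By way of example and not limitation, the combined training objective described above can be sketched as follows; the additive combination and the weight `aux_weight` are assumptions, since the application does not specify how the two losses are balanced.

```python
import torch.nn.functional as F

def mask_training_loss(mask_pred, mask_label, box_mask, aux_loss_fn, aux_weight=1.0):
    """Sketch of the training objective for the segmentation mask prediction model 1370.

    mask_pred / mask_label: predicted and labelled masks with values in [0, 1];
    box_mask: a binary mask rendered from the target detection label;
    aux_loss_fn: the detection auxiliary loss, e.g. the lateral/longitudinal
    sketch given earlier. The weighting aux_weight is an assumption.
    """
    pixel_loss = F.binary_cross_entropy(mask_pred, mask_label)  # pixel-by-pixel classification loss
    aux_loss = aux_loss_fn(mask_pred, box_mask)                 # detection auxiliary loss
    return pixel_loss + aux_weight * aux_loss
```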
As an example, segmentation mask prediction model 1370 is trained using method 800 provided by the embodiments above. The related descriptions are detailed above, and are not repeated here.
After training is completed through the system 1300, final parameters of the model are obtained, and then the model and the corresponding parameters are deployed into a test environment, so that an instance segmentation network model of the general scene can be obtained.
In the final deployment, the algorithm takes only image data as input and outputs the target category, detection frame, and segmentation mask prediction results; label information and the various loss functions are no longer needed.
The deployed product form of the system 1300 may be a multi-target accurate positioning service for customized scenarios. For example, the system 1300 may be deployed in a computing node of an associated device and can generate a pixel-level precision positioning solution for a specified class of targets for a customer by accessing the visual data input interface of the current scene (e.g., the scenes shown in fig. 14 and 15).
The application can be applied to the fields of automatic analysis and understanding of image data, including but not limited to automatic driving, video monitoring and the like, which need to accurately analyze the target position.
Application scenario one: a pedestrian vehicle segmentation system in an autopilot system.
In an autopilot task, the vehicle system collects image data through a camera, identifies various pedestrians, vehicles, and other targets on the road from the images, and judges their exact locations, which helps select the final vehicle control strategy. As shown in fig. 14, with the system 1300 provided by the present application and the pedestrian and vehicle data warehouse of a given autopilot scenario, a pedestrian and vehicle segmentation system suitable for the autopilot task can be trained and then deployed into the autopilot system, so as to improve the accuracy of the system.
And (2) an application scene II: a target segmentation system in video surveillance.
In the field of video surveillance, multiple targets in the surveillance video need to be attended to, while their accurate positions are automatically determined and tracking analysis is performed. As shown in fig. 15, the system 1300 provided by the application is trained on a data warehouse of the video surveillance scene and then deployed into the target scene, so that the position of each target can be located more accurately, and information such as the relevant attributes of each target can be further analyzed, thereby realizing automatic and accurate monitoring and behavior analysis.
Table 1 compares, under the same experimental setup, the instance segmentation task model provided by the present application with other existing models on the target detection and instance segmentation tasks of a public dataset. As can be seen from table 1, the present application achieves excellent performance in both the target detection and the instance segmentation tasks compared with the existing schemes.
Table 1: object detection and instance segmentation performance effect on a public data set MS COCO based on an instance segmentation task model and an existing model of the application
Table 2 shows the improvement brought by the interleaved branch network 1200 and the detection auxiliary loss function provided by the present application: the detection auxiliary loss function mainly improves the instance segmentation effect, while the interleaved branch network improves both the target detection effect and the instance segmentation effect.
Table 2: interleaving branch network 1200 and effect analysis of auxiliary detection loss function (MS COCO)
As shown in fig. 16, an embodiment of the present application further provides an apparatus 1600 for image processing. The apparatus 1600 includes the following elements.
The first obtaining unit 1610 is configured to obtain first spatial feature information based on raw feature data of a first image processing task.
The second obtaining unit 1620 is configured to obtain second feature data according to the original feature data of the second image processing task and the first spatial feature information.
The first processing unit 1630 is configured to perform a second image processing on the second feature data, to obtain a processing result of the second image processing task.
The first image processing task and the second image processing task are one of a target detection task and an instance segmentation task and the other of the target detection task and the instance segmentation task respectively.
The original characteristic data of the first image processing task and the original characteristic data of the second image processing task are acquired based on the image data to be processed.
When target detection provides spatial feature information to instance segmentation, the feature data of instance segmentation can be corrected with the spatial feature information of target detection, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
When instance segmentation provides spatial feature information to target detection, the feature data of target detection can be corrected with the spatial feature information of instance segmentation, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Therefore, in the embodiment of the application, one of target detection and instance segmentation provides spatial feature information to the other, and the receiving party can correct its feature data with the provided spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Optionally, the apparatus 1600 further comprises the following units.
A third acquiring unit 1640 is configured to acquire second spatial feature information based on the original feature data of the second image processing task.
The fourth obtaining unit 1650 is configured to obtain the first feature data according to the original feature data of the first image processing task and the second spatial feature information.
And a second processing unit 1660, configured to perform a first image processing on the first feature data to obtain a processing result of the first image processing task.
When target detection and instance segmentation each provide spatial feature information to the other, each task can correct its feature data with the other task's spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a greater extent, and the prediction accuracy of the instance segmentation task can be improved.
Therefore, the embodiment of the application can further improve the prediction accuracy of the instance segmentation task by mutually providing the spatial characteristic information through the target detection and the instance segmentation.
Optionally, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task; wherein, the first acquisition unit 1610 is configured to: acquiring third spatial feature information based on the original feature data of the target detection task; respectively acquiring transverse characteristic information and longitudinal characteristic information according to the third spatial characteristic information; and recombining the transverse characteristic information and the longitudinal characteristic information to obtain the first spatial characteristic information.
Therefore, in the embodiment of the application, lateral feature acquisition and longitudinal feature acquisition are first performed on the spatial feature information of target detection, the lateral features and longitudinal features are then recombined, and the recombined spatial feature information is provided to the instance segmentation task. This improves the accuracy of the spatial feature information used for instance segmentation, and therefore the prediction accuracy of the instance segmentation task can be improved.
Optionally, the apparatus 1600 obtains the first feature data and the second feature data by performing the following operations, where the initial value of i is 1 and N is a positive integer.
In step S1, spatial feature information X1 is acquired based on the feature data if1_i.
Step S2, spatial feature information X2 is acquired based on the feature data if2_i.
Step S3, obtaining feature data OF1_i according to the feature data OF if1_i and the spatial feature information X2.
Step S4, obtaining feature data OF2_i according to the feature data OF if2_i and the spatial feature information X1.
Step S5, judging whether the value of i is equal to N, if not, turning to step S6, and if so, turning to step S7.
Step S6, add 1 to the value of i, take the feature data OF1_(i-1) as the feature data IF1_i and the feature data OF2_(i-1) as the feature data IF2_i, and go to step S1.
In step S7, the feature data OF1_i is used as the first feature data, and the feature data OF2_i is used as the second feature data.
When the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
In the embodiment of the application, by performing multiple rounds in which target detection and instance segmentation provide spatial feature information to each other, the feature data of target detection and instance segmentation can be corrected better, so that their prediction results become consistent to a greater extent, and the prediction accuracy of the instance segmentation task can be improved.
Optionally, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task; wherein the second processing unit 1660 is configured to process the first feature data using a detection frame prediction model to obtain a target detection prediction result of the first feature data; the first processing unit 1630 is configured to process the second feature data using a segmentation mask prediction model to obtain a segmentation mask prediction result of the second feature data.
The segmentation mask prediction model is trained by using a detection auxiliary loss function, and the detection auxiliary loss function is used for restraining the output of the segmentation mask prediction model through target detection label information, wherein the target detection label information is used for training the detection frame prediction model.
In the embodiment of the application, in the process of training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by using the target detection label information to restrict the output of the segmentation mask prediction model.
Optionally, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function, where the longitudinal detection auxiliary loss function constrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection tag information, and the lateral detection auxiliary loss function constrains the lateral information of the prediction result output by the segmentation mask prediction model through the target detection tag information.
In the embodiment of the application, the performance of the segmentation mask prediction model can be further improved by respectively restricting the transverse information and the longitudinal information of the segmentation mask prediction result output by the segmentation mask prediction model by using the target detection label information.
Optionally, the second obtaining unit 1620 is configured to obtain the second feature data by processing the original feature data of the second image processing task and the first spatial feature information using a convolution layer.
Optionally, the first obtaining unit 1610 is configured to: processing the original characteristic data of the target detection task by using a convolution layer to acquire the third spatial characteristic information; and processing the third spatial characteristic information by using a pooling layer to acquire the transverse characteristic information and the longitudinal characteristic information.
The apparatus 1600 may be integrated on a terminal device, network device, or chip.
The apparatus 1600 may be deployed on a computing node of an associated device, capable of generating a pixel-level precision positioning solution for a specified class of targets for a customer by accessing a visual data input interface of the scene.
As shown in fig. 17, the embodiment of the application further provides an apparatus 1700 for image processing. The apparatus 1700 includes the following elements.
An input unit 1710 for inputting the image data to be processed into the segmentation mask prediction model.
A processing unit 1720 for obtaining a segmentation mask prediction result of the image data to be processed using the segmentation mask prediction model.
The segmentation mask prediction model is trained by using a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection tag information.
In the embodiment of the application, in the process of training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by using the target detection label information to restrict the output of the segmentation mask prediction model.
Optionally, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function, where the longitudinal detection auxiliary loss function constrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection tag information, and the lateral detection auxiliary loss function constrains the lateral information of the prediction result output by the segmentation mask prediction model through the target detection tag information.
In the embodiment of the application, the performance of the segmentation mask prediction model can be further improved by respectively restricting the transverse information and the longitudinal information of the segmentation mask prediction result output by the segmentation mask prediction model by using the target detection label information.
The apparatus 1700 may be integrated on a terminal device, network device, or chip.
The apparatus 1700 may be deployed on a computing node of an associated device, capable of generating a pixel-level precision positioning solution for a specified class of targets for a customer by accessing a visual data input interface of the scene.
As shown in fig. 18, the embodiment of the present application further provides an apparatus 1800 for image processing. The apparatus 1800 includes the following elements.
An acquisition unit 1810 for acquiring the target detection tag information.
A training unit 1820 is configured to train to obtain a segmentation mask prediction model using a detection auxiliary loss function, where the detection auxiliary loss function constrains an output of the segmentation mask prediction model by the target detection tag information.
In the embodiment of the application, in the process of training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by using the target detection label information to restrict the output of the segmentation mask prediction model.
Optionally, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function, where the longitudinal detection auxiliary loss function constrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection tag information, and the lateral detection auxiliary loss function constrains the lateral information of the prediction result output by the segmentation mask prediction model through the target detection tag information.
It should be understood that the performance of the division mask prediction model may be further improved by constraining the lateral information and the longitudinal information of the division mask prediction result output by the division mask prediction model, respectively, using the target detection tag information.
The apparatus 1800 may be integrated on a terminal device, a network device, or a chip.
As shown in fig. 19, the embodiment of the application further provides an apparatus 1900 for image processing. The apparatus 1900 comprises a processor 1910, the processor 1910 being coupled to a memory 1920, the memory 1920 being for storing computer programs or instructions, the processor 1910 being for executing the computer programs or instructions stored by the memory 1920 such that the methods in the above method embodiments are performed.
Optionally, as shown in fig. 19, the apparatus 1900 may further include a memory 1920.
Optionally, as shown in fig. 19, the apparatus 1900 may further include a data interface 1930, where the data interface 1930 is used for data transmission with the outside world.
Alternatively, the apparatus 1900 may be used to implement the method 300 of the above embodiment.
Alternatively, the apparatus 1900 may be used to implement the method 600 in the above embodiments.
Alternatively, the apparatus 1900 may be used to implement the method 700 in the above embodiments.
Alternatively, the apparatus 1900 may be used to implement the method 800 in the above embodiments.
The present application also provides a computer readable medium storing program code for execution by a device, the program code comprising instructions for performing the methods in the above embodiments.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the above embodiments.
The embodiment of the application also provides a chip, which comprises a processor and a data interface, where the processor reads, through the data interface, instructions stored in a memory and executes the methods in the above embodiments.
Optionally, as an implementation, the chip may further include a memory storing instructions, and the processor is configured to execute the instructions stored in the memory; when executed, the instructions cause the methods in the above embodiments to be performed.
Embodiments of the present application also provide an electronic device including any one or more of the apparatus 900, the apparatus 1100, the apparatus 1200, the system 1300, the apparatus 1500, the apparatus 1600, or the apparatus 1700 in the embodiments described above.
Fig. 20 is a chip hardware structure provided in an embodiment of the application, where the chip includes a neural network processor 2000. The chip may be provided in any one or more of the following devices or systems:
apparatus 900 as shown in fig. 9, apparatus 1100 as shown in fig. 11, apparatus 1200 as shown in fig. 12, system 1300 as shown in fig. 13, apparatus 1600 as shown in fig. 16, apparatus 1700 as shown in fig. 17, apparatus 1800 as shown in fig. 18, apparatus 1900 as shown in fig. 19.
The methods 300, 600, 700 or 800 in the above method embodiments may all be implemented in a chip as shown in fig. 20.
The neural network processor (NPU) 2000 is mounted as a coprocessor onto a main processor (host CPU), and the host CPU allocates tasks. The core part of the neural network processor 2000 is the operation circuit 2003; the controller 2004 controls the operation circuit 2003 to fetch data from a memory (the weight memory 2002 or the input memory 2001) and perform operations.
In some implementations, the operation circuit 2003 internally includes a plurality of processing elements (PEs). In some implementations, the operation circuit 2003 is a two-dimensional systolic array. The operation circuit 2003 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 2003 is a general-purpose matrix processor.
For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 2003 fetches the data corresponding to matrix B from the weight memory 2002 and buffers it on each PE in the operation circuit 2003. The operation circuit 2003 then fetches the matrix A data from the input memory 2001, performs the matrix operation on matrix A and matrix B, and stores the partial or final result of the matrix in the accumulator 2008.
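As an illustration only, the matrix operation described above can be modelled functionally with the following plain Python/NumPy sketch (it assumes an ordinary matrix product with partial-sum accumulation and does not model the systolic dataflow or the on-chip buffering):

```python
import numpy as np

def operation_circuit_matmul(A, B, partial=None):
    """Functional model of the matrix operation described above:
    C = A @ B, with partial results accumulated as in the accumulator 2008."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions of A and B must match"
    acc = np.zeros((M, N)) if partial is None else np.array(partial, dtype=float)
    for i in range(M):              # each (i, j) position plays the role of one PE
        for j in range(N):
            for k in range(K):
                acc[i, j] += A[i, k] * B[k, j]
    return acc
```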
The vector calculation unit 2007 may further process the output of the operation circuit 2003, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and the like. For example, the vector calculation unit 2007 may be used for network computations of non-convolutional/non-fully-connected (FC) layers in a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 2007 can store the processed output vector to a unified memory 2006 (which may also be referred to as a unified buffer). For example, the vector calculation unit 2007 may apply a nonlinear function, such as an activation function, to the output of the operation circuit 2003 (for example, a vector of accumulated values) to generate activation values. In some implementations, the vector calculation unit 2007 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 2003, for example for use in a subsequent layer of the neural network.
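Continuing the same illustrative model (the choice of ReLU and of 2x2 max pooling is an assumption made for the example, not a statement about the hardware), the role of the vector calculation unit can be sketched as post-processing the accumulated output before it is written to the unified memory or fed back as an activation input:

```python
import numpy as np

def vector_unit_postprocess(acc):
    """Sketch of the vector calculation unit 2007: apply a nonlinear
    activation to the accumulated output, then a non-convolutional
    operation (here 2x2 max pooling) before the result is stored."""
    activated = np.maximum(acc, 0.0)              # e.g. ReLU as the nonlinear function
    H, W = activated.shape
    h, w = (H // 2) * 2, (W // 2) * 2             # crop to an even size for pooling
    pooled = activated[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return pooled
```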
The methods 300, 600, 700 and 800 in the above method embodiments may be performed by the operation circuit 2003 or the vector calculation unit 2007.
The unified memory 2006 is used for storing input data and output data.
The storage unit access controller 2005 (direct memory access controller, DMAC) transfers input data in the external memory to the input memory 2001 and/or the unified memory 2006, transfers weight data in the external memory to the weight memory 2002, and stores data in the unified memory 2006 back to the external memory.
A bus interface unit (BIU) 2010 is used for interaction among the host CPU, the DMAC, and the instruction fetch buffer 2009 via a bus.
The instruction fetch buffer 2009, coupled to the controller 2004, is used to store instructions used by the controller 2004;
the controller 2004 is configured to invoke the instructions cached in the instruction fetch buffer 2009 to control the working process of the operation accelerator.
In an embodiment of the present application, the data herein may be image data to be processed.
Generally, the unified memory 2006, the input memory 2001, the weight memory 2002, and the instruction fetch buffer 2009 are on-chip memories, while the external memory is a memory outside the NPU. The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
It should be noted that, the first, second, third, fourth, etc. numbers are merely for convenience of description and are not intended to limit the scope of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division into units is merely a logical function division, and another division manner may be used in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes: a universal serial bus flash disk (UFD, which may also be referred to as a USB flash drive or a U-disk), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or another medium capable of storing program code.
The foregoing is merely a specific implementation of the present application, and the protection scope of the present application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of image processing, comprising:
acquiring first spatial feature information based on original feature data of a first image processing task;
processing the original feature data of the second image processing task and the first spatial feature information by using a convolution layer to acquire second feature data;
performing second image processing on the second feature data to obtain a processing result of the second image processing task;
wherein the first image processing task and the second image processing task are respectively one of a target detection task and an instance segmentation task and the other of the target detection task and the instance segmentation task;
and the original feature data of the first image processing task and the original feature data of the second image processing task are acquired based on the image data to be processed.
2. The method according to claim 1, wherein the method further comprises:
acquiring second spatial feature information based on the original feature data of the second image processing task;
acquiring first feature data according to the original feature data of the first image processing task and the second spatial feature information;
and performing first image processing on the first feature data to obtain a processing result of the first image processing task.
3. The method according to claim 1 or 2, wherein the first image processing task is a target detection task and the second image processing task is an instance segmentation task;
the obtaining the first spatial feature information based on the original feature data of the first image processing task includes:
acquiring third spatial feature information based on the original feature data of the target detection task;
respectively acquiring lateral feature information and longitudinal feature information according to the third spatial feature information;
and recombining the lateral feature information and the longitudinal feature information to obtain the first spatial feature information.
4. The method of claim 2, wherein the acquiring the second feature data and the acquiring the first feature data comprise:
obtaining the first feature data and the second feature data by performing the following operations, where the initial value of i is 1 and N is a positive integer:
step S1, acquiring spatial feature information X1 based on feature data IF1_i;
step S2, acquiring spatial feature information X2 based on feature data IF2_i;
step S3, obtaining feature data OF1_i according to the feature data IF1_i and the spatial feature information X2;
step S4, obtaining feature data OF2_i according to the feature data IF2_i and the spatial feature information X1;
step S5, judging whether the value of i is equal to N;
if not, increasing the value of i by 1, using the feature data OF1_(i-1) as the feature data IF1_i and the feature data OF2_(i-1) as the feature data IF2_i, and going to step S1;
if yes, using the feature data OF1_i as the first feature data and the feature data OF2_i as the second feature data;
wherein, when the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
5. The method according to claim 2 or 4, wherein the first image processing task is a target detection task and the second image processing task is an instance segmentation task;
wherein the performing first image processing on the first feature data includes:
processing the first feature data by using a detection frame prediction model to obtain a target detection prediction result of the first feature data;
and the performing second image processing on the second feature data to obtain a processing result of the second image processing task includes:
processing the second feature data using a segmentation mask prediction model to obtain a segmentation mask prediction result of the second feature data,
wherein the segmentation mask prediction model is trained using a detection auxiliary loss function, the detection auxiliary loss function is used for constraining the output of the segmentation mask prediction model through target detection label information, and the target detection label information is used for training the detection frame prediction model.
6. The method of claim 5, wherein the detection auxiliary loss function comprises a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function,
wherein the longitudinal detection auxiliary loss function is used for constraining longitudinal information of a prediction result output by the segmentation mask prediction model through the target detection label information, and the lateral detection auxiliary loss function is used for constraining lateral information of the prediction result output by the segmentation mask prediction model through the target detection label information.
7. The method according to claim 3, wherein the acquiring third spatial feature information based on the original feature data of the target detection task comprises:
processing the original feature data of the target detection task by using a convolution layer to acquire the third spatial feature information;
and the respectively acquiring lateral feature information and longitudinal feature information according to the third spatial feature information includes:
processing the third spatial feature information by using a pooling layer to acquire the lateral feature information and the longitudinal feature information.
8. A method of image processing, comprising:
inputting image data to be processed into a segmentation mask prediction model;
obtaining a segmentation mask prediction result of the image data to be processed using the segmentation mask prediction model,
wherein the segmentation mask prediction model is trained using a detection auxiliary loss function, the detection auxiliary loss function is used for constraining the output of the segmentation mask prediction model through target detection label information, and the target detection label information is used for training a detection frame prediction model.
9. The method of claim 8, wherein the detection auxiliary loss function comprises a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function,
wherein the longitudinal detection auxiliary loss function is used for constraining longitudinal information of a prediction result output by the segmentation mask prediction model through the target detection label information, and the lateral detection auxiliary loss function is used for constraining lateral information of the prediction result output by the segmentation mask prediction model through the target detection label information.
10. A method of image processing, comprising:
acquiring target detection tag information;
and training, by using a detection auxiliary loss function, a segmentation mask prediction model, wherein the detection auxiliary loss function is used for constraining the output of the segmentation mask prediction model through the target detection label information, and the target detection label information is used for training a detection frame prediction model.
11. The method of claim 10, wherein the detection auxiliary loss function comprises a longitudinal detection auxiliary loss function and a lateral detection auxiliary loss function,
wherein the longitudinal detection auxiliary loss function is used for constraining longitudinal information of a prediction result output by the segmentation mask prediction model through the target detection label information, and the lateral detection auxiliary loss function is used for constraining lateral information of the prediction result output by the segmentation mask prediction model through the target detection label information.
12. An apparatus for image processing, comprising:
a memory for storing a program;
a processor configured to execute the program stored in the memory, where the processor is configured to perform the method of any one of claims 1 to 11 when the program stored in the memory is executed.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code for execution by a device, and when the program code is executed, the device performs the method of any one of claims 1 to 11.
14. A chip comprising at least one processor and a data interface;
the at least one processor is configured to invoke and run a computer program stored on a memory via the data interface to cause the chip to perform the method of any of claims 1 to 11.
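For readability, the iterative procedure recited in claim 4 (steps S1 to S5) can be summarized by the following sketch; the callables spatial_info_1, spatial_info_2, fuse_1 and fuse_2 are placeholders for whatever modules (for example, the convolution layers mentioned in claims 1 and 7) compute the spatial feature information and combine it with the feature data, and are not names used in the claims:

```python
def iterative_feature_exchange(raw_feat_1, raw_feat_2,
                               spatial_info_1, spatial_info_2,
                               fuse_1, fuse_2, N):
    """Sketch of steps S1-S5 of claim 4: for N rounds, each task extracts
    spatial feature information from its own feature data, and the other
    task's feature data is corrected with it; the final outputs are the
    first and second feature data."""
    if1, if2 = raw_feat_1, raw_feat_2      # i = 1: the original feature data
    for _ in range(N):
        x1 = spatial_info_1(if1)           # step S1
        x2 = spatial_info_2(if2)           # step S2
        of1 = fuse_1(if1, x2)              # step S3
        of2 = fuse_2(if2, x1)              # step S4
        if1, if2 = of1, of2                # step S5: feed the outputs back as inputs
    return if1, if2                        # first feature data, second feature data
```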
CN202010110152.2A 2020-02-23 2020-02-23 Image processing method and device Active CN111292331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010110152.2A CN111292331B (en) 2020-02-23 2020-02-23 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010110152.2A CN111292331B (en) 2020-02-23 2020-02-23 Image processing method and device

Publications (2)

Publication Number Publication Date
CN111292331A CN111292331A (en) 2020-06-16
CN111292331B true CN111292331B (en) 2023-09-12

Family

ID=71025630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010110152.2A Active CN111292331B (en) 2020-02-23 2020-02-23 Image processing method and device

Country Status (1)

Country Link
CN (1) CN111292331B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023143625A1 (en) * 2022-01-31 2023-08-03 Conova Medical Technology Limited Process and system for three-dimensional modelling of tissue of a subject, and surgical planning process and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10713491B2 (en) * 2018-07-27 2020-07-14 Google Llc Object detection using spatio-temporal feature maps

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018121690A1 (en) * 2016-12-29 2018-07-05 北京市商汤科技开发有限公司 Object attribute detection method and device, neural network training method and device, and regional detection method and device
CN108256562A (en) * 2018-01-09 2018-07-06 深圳大学 Well-marked target detection method and system based on Weakly supervised space-time cascade neural network
CN108492301A (en) * 2018-03-21 2018-09-04 广东欧珀移动通信有限公司 A kind of Scene Segmentation, terminal and storage medium
WO2019233394A1 (en) * 2018-06-08 2019-12-12 Oppo广东移动通信有限公司 Image processing method and apparatus, storage medium and electronic device
US10304193B1 (en) * 2018-08-17 2019-05-28 12 Sigma Technologies Image segmentation and object detection using fully convolutional neural network
CN109829877A (en) * 2018-09-20 2019-05-31 中南大学 A kind of retinal fundus images cup disc ratio automatic evaluation method
CN109584248A (en) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface object instance dividing method based on Fusion Features and dense connection network
CN109934177A (en) * 2019-03-15 2019-06-25 艾特城信息科技有限公司 Pedestrian recognition methods, system and computer readable storage medium again
CN110276378A (en) * 2019-05-20 2019-09-24 杭州电子科技大学 The improved method that example is divided based on unmanned technology
CN110349138A (en) * 2019-06-28 2019-10-18 歌尔股份有限公司 The detection method and device of the target object of Case-based Reasoning segmentation framework
CN110349167A (en) * 2019-07-10 2019-10-18 北京悉见科技有限公司 A kind of image instance dividing method and device
CN110458172A (en) * 2019-08-16 2019-11-15 中国农业大学 A kind of Weakly supervised image, semantic dividing method based on region contrast detection
CN110532955A (en) * 2019-08-30 2019-12-03 中国科学院宁波材料技术与工程研究所 Example dividing method and device based on feature attention and son up-sampling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Object detection model fusing a deep dilated network and a lightweight network; Quan Yu et al.; Acta Electronica Sinica; 2020-02-28; Vol. 48, No. 2; pp. 390-397 *

Also Published As

Publication number Publication date
CN111292331A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN111160379B (en) Training method and device of image detection model, and target detection method and device
US10860837B2 (en) Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
CN107909026B (en) Small-scale convolutional neural network based age and/or gender assessment method and system
CN112990211B (en) Training method, image processing method and device for neural network
CN111402130B (en) Data processing method and data processing device
CN110598788B (en) Target detection method, target detection device, electronic equipment and storage medium
CN110084299B (en) Target detection method and device based on multi-head fusion attention
CN108876813B (en) Image processing method, device and equipment for detecting object in video
CN114359851A (en) Unmanned target detection method, device, equipment and medium
US20220230282A1 (en) Image processing method, image processing apparatus, electronic device and computer-readable storage medium
CN107944403B (en) Method and device for detecting pedestrian attribute in image
CN111524145A (en) Intelligent picture clipping method and system, computer equipment and storage medium
CN111931764B (en) Target detection method, target detection frame and related equipment
CN110991560A (en) Target detection method and system in combination with context information
CN112465909B (en) Class activation mapping target positioning method and system based on convolutional neural network
CN111985458B (en) Method for detecting multiple targets, electronic equipment and storage medium
Farag A lightweight vehicle detection and tracking technique for advanced driving assistance systems
CN112926461B (en) Neural network training and driving control method and device
CN114155365A (en) Model training method, image processing method and related device
CN111488930A (en) Training method of classification network, target detection method and device and electronic equipment
CN111292331B (en) Image processing method and device
CN117157679A (en) Perception network, training method of perception network, object recognition method and device
CN114972492A (en) Position and pose determination method and device based on aerial view and computer storage medium
CN112365513A (en) Model training method and device
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 2022-02-11
Address after: 550025 Huawei Cloud data center, Jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province
Applicant after: Huawei Cloud Computing Technology Co., Ltd.
Address before: 518129 Bantian, Huawei headquarters office building, Longgang District, Shenzhen, Guangdong
Applicant before: HUAWEI TECHNOLOGIES Co., Ltd.
GR01 Patent grant