CN117197208A - Image processing method, device, electronic equipment and computer readable storage medium - Google Patents
- Publication number
- CN117197208A (application CN202210582065.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application discloses an image processing method, an image processing device, electronic equipment and a computer readable storage medium. The method comprises the following steps: performing feature extraction processing on the acquired image to be processed through an encoding module of the monocular depth estimation model to obtain intermediate features; the monocular depth estimation model comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module; performing feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features; and carrying out depth estimation processing on the intermediate features according to the depth estimation module, the information interaction module and the initial main body segmentation features to obtain a depth information image corresponding to the image to be processed. By adopting the method, a single-frame image can be processed, and the main body segmentation feature is applied to depth estimation to obtain a high-precision depth information image.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method, an image processing device, an electronic device, and a computer readable storage medium.
Background
In some image processing scenes, the depth information of an image needs to be calculated, for example, when an intelligent terminal such as a mobile phone takes a picture in portrait mode. The conventional scheme is to calculate depth information through a binocular depth estimation algorithm, and then blur the background according to the depth information.
However, binocular depth estimation requires the multi-camera module of the mobile phone to be calibrated when the phone leaves the factory. During the use of the electronic device, the device may be dropped or bumped, or the cameras may age after long-term use; all of these make the calibration parameters of the multi-camera module inaccurate, so the result of the binocular depth algorithm becomes inaccurate and accurate depth information is difficult to obtain.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, electronic equipment and a computer readable storage medium, which can obtain accurate depth information through monocular images.
In a first aspect, an embodiment of the present application provides an image processing method, including:
performing feature extraction processing on the acquired image to be processed through an encoding module of the monocular depth estimation model to obtain intermediate features; the monocular depth estimation model comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module;
Performing feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features;
and carrying out depth estimation processing on the intermediate features according to the depth estimation module, the information interaction module and the initial main body segmentation features to obtain a depth information image corresponding to the image to be processed.
In a second aspect, an embodiment of the present application further provides an image processing apparatus, including:
the extraction module is used for carrying out feature extraction processing on the acquired image to be processed through the coding module of the monocular depth estimation model to obtain intermediate features; the monocular depth estimation model comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module;
the segmentation module is used for carrying out feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features;
and the estimation module is used for carrying out depth estimation processing on the intermediate features according to the depth estimation module, the information interaction module and the initial main body segmentation features to obtain a depth information image corresponding to the image to be processed.
In a third aspect, embodiments of the present application further provide a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements steps in an image processing method as described above.
In a fourth aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, and the steps in the image processing method are implemented by the processor when the processor executes the computer program.
According to the technical scheme provided by the embodiment of the application, a trained monocular depth estimation model is determined, which comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module; the coding module performs feature extraction processing on the image to be processed to obtain intermediate features; the main body segmentation module performs feature reduction processing on the intermediate features to obtain initial main body segmentation features; and then, according to the depth estimation module and the information interaction module, the main body segmentation features are applied to the depth estimation process, and depth estimation processing is performed on the intermediate features to obtain a depth information image corresponding to the image to be processed. According to this scheme, a single-frame image can be processed, and the main body segmentation feature is applied to depth estimation to obtain a high-precision depth information image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a first image processing method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of image processing performed by the monocular depth estimation model according to an embodiment of the present application.
Fig. 3 is another schematic diagram of image processing performed by the monocular depth estimation model according to an embodiment of the present application.
Fig. 4 is a further schematic diagram of image processing performed by the monocular depth estimation model according to an embodiment of the present application.
Fig. 5 is still another schematic diagram of image processing performed by the monocular depth estimation model according to an embodiment of the present application.
Fig. 6 is a schematic diagram illustrating an effect of the image processing method according to the embodiment of the application.
Fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
FIG. 9 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present application based on the embodiments of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The embodiment of the application provides an image processing method, and an execution subject of the image processing method can be electronic equipment. The electronic device may be a smart phone, a tablet computer, a palm computer, a notebook computer, or a desktop computer.
Referring to fig. 1, fig. 1 is a schematic flow chart of a first image processing method according to an embodiment of the application. The specific flow of the image processing method provided by the embodiment of the application can be as follows:
101. performing feature extraction processing on the acquired image to be processed through an encoding module of the monocular depth estimation model to obtain intermediate features; the monocular depth estimation model comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module.
The scheme of the embodiment of the application can be used for processing the image containing the main body. For example, camera programs of some electronic devices are provided with a portrait shooting mode. In the portrait shooting mode, the electronic device needs to extract depth information in the image, and perform blurring processing on a background portion except a portrait portion in the image according to the depth information. This scenario may use the scheme of the present application.
The above-mentioned portrait shooting mode is used not only for shooting portraits, but also for shooting other subjects, where a subject is the part that the photographer wants to highlight when shooting and may be any stationary or moving object, such as a flower, a building, an animal, or a person. In the portrait shooting mode, the photographer usually focuses on the subject to be highlighted. For example, when the photographer uses the portrait shooting mode to shoot a pet, the photographer focuses on the pet, so that a photo is obtained in which the area where the pet is located is in sharp focus and the background part is blurred.
It should be noted that the examples herein are only for the convenience of readers to understand the present application, and are not intended to limit the scheme of the present application, and in other scenes where depth information of an image needs to be extracted, the image may also be processed according to the scheme of the present application. In order to facilitate the reader to understand the present application, the present application will be described in detail below taking a scene of a portrait shooting mode of an electronic device as an example.
The image to be processed can be an image received by the electronic device from another terminal, or an image output by a camera of the electronic device. The electronic device is provided with a pre-trained monocular depth estimation model comprising a coding module, a main body segmentation module, a depth estimation module and an information interaction module; the input image can be subjected to feature extraction processing to obtain intermediate features, and depth estimation is then performed according to the intermediate features to obtain a depth information image corresponding to the image to be processed. The model is trained from sample images carrying main body labels and depth labels; the specific training process is described in detail below.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating image processing performed by the monocular depth estimation model according to an embodiment of the present application. After the image to be processed is obtained, the image to be processed is input into the monocular depth estimation model, and feature extraction processing is carried out on it through the encoding module, so that intermediate features are obtained. For example, the coding module may be a convolutional neural network, which extracts high-dimensional features of the image to be processed through convolution operations.
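As an illustration of such an encoding module, a minimal sketch in PyTorch is given below; the layer count and channel widths are assumptions chosen only to reproduce the (w/32, h/32, 256) intermediate feature size used in the embodiments, not details fixed by the present application.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Illustrative encoder: five stride-2 convolution stages reduce a (3, h, w)
    image to a (256, h/32, w/32) intermediate feature, matching the sizes
    described in the embodiments."""
    def __init__(self):
        super().__init__()
        channels = [3, 32, 64, 128, 256, 256]
        stages = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            stages += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(inplace=True)]
        self.stages = nn.Sequential(*stages)

    def forward(self, image):          # image: (batch, 3, h, w)
        return self.stages(image)      # intermediate feature: (batch, 256, h/32, w/32)
```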
102. And carrying out feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features.
As shown in fig. 2, after the intermediate feature is obtained, the intermediate feature is subjected to feature reduction processing by the main body segmentation module, so as to obtain an initial main body segmentation feature. The feature restoration process can be understood as the process opposite to feature extraction: feature extraction extracts deep features of an image through convolution operations, and after multi-layer convolution operations a deep feature is obtained whose size is far smaller than that of the original image but whose dimension is far higher. The feature restoration process is the opposite of the convolution operation and finally restores the high-dimensional deep feature to a feature with a lower dimension and a size close to or equal to that of the original image. Feature restoration may be achieved by a deconvolution operation, or by a convolution operation plus an upsampling operation.
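The two ways of realizing feature restoration mentioned above can be sketched as follows (a hypothetical illustration in PyTorch; the concrete channel numbers are assumptions):

```python
import torch.nn as nn

# Option 1: restore resolution with a transposed (de-)convolution.
deconv_restore = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)

# Option 2: restore resolution with a convolution followed by upsampling.
conv_upsample_restore = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=3, padding=1),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
)
# Either module doubles the spatial size of a (256, h/32, w/32) feature map while
# reducing its channel dimension, i.e. one step of the feature restoration process.
```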
103. And carrying out depth estimation processing on the intermediate features according to the depth estimation module, the information interaction module and the initial main body segmentation features to obtain a depth information image corresponding to the image to be processed.
As shown in fig. 2, after the intermediate feature is acquired, it is subjected to depth estimation processing by the depth estimation module. In this process, the acquired initial main body segmentation feature is combined into the depth estimation processing through the information interaction module of the monocular depth estimation model; that is, during depth estimation, the initial main body segmentation feature can be combined while performing feature restoration on the high-dimensional deep feature (i.e., the intermediate feature) to extract depth information more accurately, so that the accuracy of the finally output depth information image is improved.
The feature reduction processing process of the depth estimation module on the intermediate feature is similar to the feature reduction processing of the main body segmentation module on the intermediate feature. The difference is that: the depth estimation module and the main body segmentation module have different label data in the model training process, so that the weight parameters of the depth estimation module and the main body segmentation module in the model obtained by training are different, and the results obtained by carrying out feature reduction processing are also different. The feature recovery in the depth estimation module may also be implemented by deconvolution operation, or convolution operation+up-sampling operation.
In particular, the application is not limited by the order of execution of the steps described, as some of the steps may be performed in other orders or concurrently without conflict.
It can be seen from the above that, in the image processing method provided by the embodiment of the present application, a trained monocular depth estimation model is determined, which includes a coding module, a main body segmentation module, a depth estimation module and an information interaction module; feature extraction processing is performed on the image to be processed by the coding module to obtain an intermediate feature; feature restoration processing is performed on the intermediate feature by the main body segmentation module to obtain an initial main body segmentation feature; and then, according to the depth estimation module and the information interaction module, the main body segmentation feature is applied to the depth estimation process, and depth estimation processing is performed on the intermediate feature to obtain a depth information image corresponding to the image to be processed. According to this scheme, a single-frame image can be processed, and the main body segmentation feature is applied to depth estimation to obtain a high-precision depth information image.
Wherein, in an embodiment, the depth estimation module includes a first decoding network and a regression module. According to the depth estimation module, the information interaction module and the initial main body segmentation feature, performing depth estimation processing on the intermediate feature to obtain a depth information image corresponding to the image to be processed, wherein the method comprises the following steps: performing feature reduction processing on the intermediate features according to the first decoding network, the information interaction module and the initial main body segmentation features to obtain depth features; and carrying out convolution operation on the depth features in the channel direction according to the regression module to obtain a depth information image corresponding to the image to be processed.
Referring to fig. 3, fig. 3 is another schematic diagram illustrating image processing performed by the monocular depth estimation model according to an embodiment of the present application. The depth estimation module comprises a first decoding network and a regression module. The first decoding network is used for the feature restoration processing, restoring the high-dimensional, deep intermediate features into features of lower dimension whose size is close to or equal to that of the original image; these are the depth features of the image to be processed and represent its depth information. Then, the regression module operates on the depth features to obtain the depth information of each pixel point in the image to be processed, and the depth information of all pixel points forms the depth information image corresponding to the image to be processed.
For example, in one embodiment, the first decoding network includes a plurality of first residual modules, each including a first convolution layer, a first upsampling layer, and a second convolution layer; performing feature reduction processing on the intermediate features according to the first decoding network to obtain initial depth features includes: performing convolution operations and upsampling operations on the intermediate features according to the plurality of first residual modules to obtain initial depth features, wherein the output feature map of the previous first residual module is the input feature map of the next first residual module, and the size of the output feature map of the next first residual module is larger than that of the output feature map of the previous first residual module. For example, the output feature map of the next first residual module is 2^a times the size of the output feature map of the previous first residual module, where a ≥ 1.
In this embodiment, the first decoding network includes N first residual modules, N ≥ 1. For example, N = 5: the first decoding network includes 5 first residual modules, each including convolution layers and an upsampling layer, where the number of convolution layers in each residual module may be set according to actual needs. For example, in one embodiment, a first residual module includes a first convolution layer, a first upsampling layer, and a second convolution layer. The size of the image to be processed is (w, h) with three RGB channels, that is, the size of the input data of the model is (w, h, 3); after processing by the coding module, an intermediate feature with 256 channels and width and height of w/32 and h/32 respectively is obtained, that is, the size of the intermediate feature is (w/32, h/32, 256). This high-dimensional intermediate feature is input to the first first residual module of the first decoding network, whose output feature is feature_d0 (w/16, h/16, 256); this output is input to the second first residual module, whose output feature is feature_d1 (w/8, h/8, 128); and so on, the output features of the last three first residual modules are feature_d2 (w/4, h/4, 64), feature_d3 (w/2, h/2, 32) and feature_d4 (w, h, 16) in sequence, where the output feature map of each first residual module is larger than that of the previous one, and the output of the last first residual module is the output of the first decoding network. From the above data, the depth feature feature_d4 (w, h, 16) output by the first decoding network includes feature maps of 16 dimensions, each with a size equal to that of the original image to be processed; it can be understood that, for an image to be processed of size w×h, each pixel point has 16 values representing its depth feature. In the depth information image of the embodiment of the present application, the depth information of each pixel point is represented by a value in 0-255, so once the depth feature is obtained, calculating the depth information from the depth feature is in fact a regression problem. Therefore, a regression module is provided after the first decoding network, and the depth feature feature_d4 (w, h, 16) is input into the regression module. The regression module performs a convolution operation on the feature map in the channel direction, that is, for each pixel point, a convolution operation is performed on the corresponding 16-dimensional feature values to obtain the depth information of the pixel point. The regression module can be composed of a convolution and a sigmoid activation function, and its output is output_disp (w, h, 1). The third numerical value in each size is the dimension of the feature in the channel direction.
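A sketch of the first decoding network and the regression module under the sizes described above is given below; it assumes a PyTorch implementation, and the residual shortcut wiring is an assumption, since the embodiment only specifies the first convolution layer, the first upsampling layer and the second convolution layer.

```python
import torch
import torch.nn as nn

class FirstResidualModule(nn.Module):
    """One first residual module: first convolution layer, first upsampling layer,
    second convolution layer; the output map is twice the size of the input map."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv2 = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)
        self.skip = nn.Conv2d(c_in, c_out, kernel_size=1)   # residual shortcut (assumed)

    def forward(self, x):
        out = self.conv2(self.up(torch.relu(self.conv1(x))))
        return out + self.up(self.skip(x))

class DepthEstimationModule(nn.Module):
    """First decoding network (5 first residual modules) plus the regression module
    (1x1 convolution over the channel direction followed by a sigmoid)."""
    def __init__(self):
        super().__init__()
        dims = [256, 256, 128, 64, 32, 16]   # channel widths of feature_d0 .. feature_d4
        self.decoder = nn.ModuleList(
            FirstResidualModule(dims[i], dims[i + 1]) for i in range(5))
        self.regression = nn.Sequential(nn.Conv2d(16, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, intermediate):          # intermediate: (batch, 256, h/32, w/32)
        feat = intermediate
        for module in self.decoder:
            feat = module(feat)               # ends at feature_d4: (batch, 16, h, w)
        return self.regression(feat)          # output_disp: (batch, 1, h, w)
```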
Alternatively, in another embodiment, the first decoding network may include a plurality of deconvolution modules, through which upsampling of the intermediate features is achieved, resulting in the final depth feature feature_d4 (w, h, 16).
The network structures of the two first decoding networks are illustrated by way of example, and the scheme of the present application is not limited thereto. In other embodiments, this may be accomplished through other forms of networks, as long as it is possible to restore high-dimensional deep features to lower-dimensional deep features that match the original image size.
The above embodiments describe the processing principle of the first decoding network. In this process, the initial subject segmentation feature output by the subject segmentation module may be incorporated by the information interaction module.
For example, according to the first decoding network, the information interaction module and the initial main body segmentation feature, performing feature reduction processing on the intermediate feature to obtain a depth feature, including: performing feature reduction processing on the intermediate features according to the first decoding network to obtain initial depth features; calculating the initial depth feature and the initial main body segmentation feature according to an attention mechanism through an information interaction module to obtain a first interaction feature; and carrying out fusion processing on the first interaction feature and the initial depth feature in the channel direction to obtain the depth feature.
In this embodiment, the feature that the first decoding network processes the intermediate feature and directly outputs is denoted as the initial depth feature. And calculating the initial depth feature and the initial main body segmentation feature according to an attention mechanism through an information interaction module to obtain a first interaction feature, and then fusing the first interaction feature and the initial depth feature to serve as a finally output depth feature of a first decoding network. The information interaction module may use the initial depth feature as a key vector and a value vector of the attention mechanism, use the initial main body segmentation feature as a query vector to calculate, and finally output a calculation result of the attention mechanism as a first interaction feature.
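A compact sketch of this interaction is given below; it uses a generic single-head spatial attention as a stand-in for the attention mechanism (the embodiment described later uses a per-pixel multi-head formulation), and all names are illustrative.

```python
import torch
import torch.nn as nn

class InformationInteraction(nn.Module):
    """Attention between the two branches: the initial depth feature serves as
    key/value, the initial main body segmentation feature as query; the result
    is fused with the initial depth feature in the channel direction."""
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, depth_feat, seg_feat):
        b, c, h, w = depth_feat.shape
        q = self.to_q(seg_feat).flatten(2)           # (b, c, h*w)
        k = self.to_k(depth_feat).flatten(2)
        v = self.to_v(depth_feat).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)   # (b, hw, hw)
        interaction = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        # channel-direction fusion of the first interaction feature and the depth feature
        return self.fuse(torch.cat([interaction, depth_feat], dim=1))
```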
Further, in the solution of the above embodiment, the output of the monocular depth estimation model is a depth information image. In still other embodiments, a subject segmentation image of the image to be processed may be obtained in addition to the depth information image. For example, in one embodiment, after performing feature reduction processing on the intermediate feature according to the body segmentation module to obtain the initial body segmentation feature, the method further includes: and performing classification operation on the initial main body segmentation characteristics according to the main body segmentation module to obtain a main body segmentation image of the image to be processed.
In this embodiment, after obtaining the initial main body segmentation feature, the main body segmentation module may further perform a classification operation on the initial main body segmentation feature, so as to obtain a main body segmentation image of the image to be processed. Taking portrait segmentation as an example, portrait segmentation determines the portrait region in the entire image: for each pixel in the image to be processed, it is determined whether the pixel is located in the portrait region or the background region. The final output of the main body segmentation module is obtained by performing classification processing on each pixel point in the image to be processed, which gives the main body segmentation image. Referring to fig. 4, fig. 4 is a further schematic diagram illustrating image processing performed by the monocular depth estimation model according to an embodiment of the present application. It can be seen that the pixel value of each pixel point in the main body segmentation image is 0 or 255.
For example, in one embodiment, the body segmentation module includes a second decoding network and a classification module. Performing feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features, including: and performing feature reduction processing on the intermediate features according to the second decoding network to obtain initial main body segmentation features.
The body segmentation module in this embodiment includes a second decoding network and a classification module. And the second decoding network performs feature reduction processing on the intermediate features to obtain initial main body segmentation features. In this process, the initial depth features output by the depth estimation module may be incorporated by the information interaction module. For example, after performing feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features, the method further includes: calculating the initial depth feature and the initial main body segmentation feature according to an attention mechanism through an information interaction module to obtain a second interaction feature; fusing the second interaction feature and the initial main body segmentation feature in the channel direction to obtain a main body segmentation feature; and carrying out classification operation on the main body segmentation characteristics according to the classification module to obtain a main body segmentation image of the image to be processed.
The feature that the second decoding network directly outputs after restoring the intermediate feature is recorded as the initial main body segmentation feature. Through the information interaction module, the initial depth feature and the initial main body segmentation feature are operated according to an attention mechanism to obtain a second interaction feature, and the second interaction feature and the initial main body segmentation feature are then fused to serve as the main body segmentation feature finally output by the second decoding network. The information interaction module may use the initial main body segmentation feature as the key vector and value vector of the attention mechanism, use the initial depth feature as the query vector for the calculation, and finally output the calculation result of the attention mechanism as the second interaction feature.
For another example, in one embodiment, the second decoding network includes a plurality of second residual modules, each including a third convolution layer, a second upsampling layer, and a fourth convolution layer; performing feature reduction processing on the intermediate features according to the second decoding network to obtain initial main body segmentation features includes: performing convolution operations and upsampling operations on the intermediate features according to the plurality of second residual modules to obtain initial main body segmentation features, wherein the output feature map of the previous second residual module is the input feature map of the next second residual module, and the size of the output feature map of the next second residual module is larger than that of the output feature map of the previous second residual module. For example, the output feature map of the next second residual module is 2^b times the size of the output feature map of the previous second residual module, where b ≥ 1.
The second decoding network comprises N second residual modules, N ≥ 1. For example, N = 5: the second decoding network includes 5 second residual modules, each including convolution layers and an upsampling layer, where the number of convolution layers in each residual module may be set according to actual needs. For example, in one embodiment, one second residual module includes a third convolution layer, a second upsampling layer, and a fourth convolution layer. The high-dimensional intermediate feature of size (w/32, h/32, 256) is input to the first second residual module of the second decoding network, and the output feature of each second residual module is input to the next second residual module for processing; the output features of the five second residual modules have sizes (w/16, h/16, 256), (w/8, h/8, 128), (w/4, h/4, 64), (w/2, h/2, 32) and (w, h, 16) in sequence, where the output feature map of each second residual module is larger than that of the previous one, and the output of the last second residual module is the output of the second decoding network. From the above data, the main body segmentation feature feature_s4 (w, h, 16) output by the second decoding network includes feature maps of 16 dimensions, each with a size equal to that of the original image to be processed; it can be understood that, for an image to be processed of size w×h, each pixel point has 16 values representing its segmentation feature. Then, the main body segmentation feature feature_s4 (w, h, 16) is input into the classification module, which performs a convolution operation on the feature map in the channel direction, that is, for each pixel point, a convolution operation is performed on the corresponding 16-dimensional feature values to obtain the probability values of the pixel point on the two classes (main body class and background class). The final output of the classification module is output_seg (w, h, 2), which can be presented as the portrait segmentation map shown in fig. 4, where the white part is the human body region and the black part is the background region. The probability value of the pixel points in the human body region on the main body class is far higher than that on the background class.
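The classification step at the end of the main body segmentation module can be sketched as follows (illustrative; it assumes the 16-channel main body segmentation feature described above and that channel 0 of output_seg is the main body class):

```python
import torch
import torch.nn as nn

# Classification module: a 1x1 convolution over the channel direction maps the
# 16-dimensional per-pixel feature to two class scores (main body / background).
classifier = nn.Conv2d(16, 2, kernel_size=1)

def subject_segmentation_image(seg_feature):        # seg_feature: (batch, 16, h, w)
    logits = classifier(seg_feature)                 # output_seg: (batch, 2, h, w)
    probs = torch.softmax(logits, dim=1)
    # Pixels whose main-body probability dominates become 255, the rest 0,
    # giving the binary main body segmentation image shown in Fig. 4.
    return (probs[:, 0] > probs[:, 1]).to(torch.uint8) * 255
```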
As can be seen from the above solution, the second residual module in this embodiment has a similar structure to the first residual module in the above description, but it can be understood that in the model training process, the label data used by the two modules are different, so the weight parameters in the trained modules are also different, and the specific feature values in the output final feature map are different.
Alternatively, in another embodiment, the second decoding network may include a plurality of deconvolution modules, through which upsampling of the intermediate features is achieved, resulting in final subject segmentation features.
The network structures of the two second decoding networks are illustrated by way of example, and the scheme of the present application is not limited thereto. In other embodiments, this may be accomplished through other forms of networks, as long as it is possible to restore the deep features in high dimensions to body segmentation features in lower dimensions that match the original image dimensions.
Further, in the above embodiments, hyperparameters such as the convolution kernel sizes and strides in the first decoding network and the second decoding network may be set as needed; for example, in an embodiment, the convolution kernel size may be set to 3×3.
Through the scheme in the embodiment, the information interaction module combines the characteristics of the main body segmentation module and the depth estimation module to realize that the main body segmentation module and the depth estimation module can guide each other to learn, so that the output results of the two modules are more accurate. In the above scheme, the information interaction module combines the overall output of the first decoding network or the second decoding network when performing feature combination. Alternatively, in other embodiments, after each of the first residual modules or the second residual modules outputs the features, the features of the first residual module and the second residual module may be combined with each other through the information interaction module according to the attention mechanism.
For example, the first decoding network comprises N first residual modules and the second decoding network comprises N second residual modules. Through the information interaction module, the initial depth feature and the initial main body segmentation feature are operated according to an attention mechanism to obtain a first interaction feature, which includes: for the ith first residual module, acquiring the initial depth feature FDi output by the ith first residual module, and acquiring the initial main body segmentation feature FSi output by the ith second residual module, where i ∈ [1, N]; and, through the information interaction module, taking the initial depth feature FDi as the key vector and value vector and the initial main body segmentation feature FSi as the query vector, and carrying out the operation according to the attention mechanism to obtain the first interaction feature F1Mi.
Fusing the first interaction feature and the initial depth feature in the channel direction to obtain the depth feature includes: for the ith first residual module, carrying out fusion processing on the first interaction feature F1Mi and the initial depth feature FDi in the channel direction to obtain the depth feature FDnewi.
For example, the information interaction module is mainly implemented by an attention mechanism, such as a multi-head attention mechanism, where the size of the convolution kernel may be set to 1×1, the number of heads=4, and the ratio=2. One calculation of the attention mechanism is as follows:
first, for the ith first residual module, the initial depth feature FDi output by the ith first residual module is used as a key vector and a value vector, and the initial body segmentation feature FSi output by the ith second residual module is used as a query vector.
Then, the key vector is subjected to a convolution operation to obtain k_out (w/16, h/16, 256×4×2), and k_out is rearranged into (w/16, h/16, 4, 256×2); the value vector is subjected to a convolution operation to obtain v_out (w/16, h/16, 256×4×2), and v_out is rearranged into (w/16, h/16, 4, 256×2); the query vector is subjected to a convolution operation to obtain q_out (w/16, h/16, 256×4×2), and q_out is rearranged into (w/16, h/16, 4, 256×2).
The attention value is att = q_out * k_out. att (w/16, h/16, 4, 256×2) is summed over the fourth dimension and scaled, and a softmax is taken over the third dimension to obtain att (w/16, h/16, 4, 1); att is multiplied with v_out to obtain weighted_value (w/16, h/16, 4, 256×2); weighted_value is summed over the third dimension to obtain weighted_value (w/16, h/16, 256×2); and a convolution operation is performed on weighted_value to obtain the output of the attention mechanism, namely the first interaction feature F1Mi (w/16, h/16, 256).
Finally, the first interaction feature F1Mi and the initial depth feature FDi are fused in the channel direction, for example by splicing them together in the channel direction and then fusing them through a convolution operation, to obtain the new depth feature FDnewi. For one first residual module, the new depth feature is taken as the input data of the next first residual module.
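A sketch of one such attention calculation, following the shapes described above (heads = 4, ratio = 2, 1×1 convolutions), is given below; the normalization step is interpreted as a mean over the channel dimension, which is an assumption since the text does not fully specify it.

```python
import torch
import torch.nn as nn

class InteractionAttention(nn.Module):
    """Per-pixel multi-head attention following the shapes in this embodiment:
    heads = 4, ratio = 2, all projections are 1x1 convolutions. The query comes
    from one branch, the key/value from the other; the output is F1Mi (or F2Mi
    with the roles swapped)."""
    def __init__(self, channels=256, heads=4, ratio=2):
        super().__init__()
        self.heads, self.ratio = heads, ratio
        inner = channels * heads * ratio
        self.to_q = nn.Conv2d(channels, inner, kernel_size=1)
        self.to_k = nn.Conv2d(channels, inner, kernel_size=1)
        self.to_v = nn.Conv2d(channels, inner, kernel_size=1)
        self.out = nn.Conv2d(channels * ratio, channels, kernel_size=1)

    def forward(self, query_feat, key_value_feat):
        b, c, h, w = query_feat.shape
        # project and rearrange to (batch, h, w, heads, channels * ratio)
        def split(x):
            return x.view(b, self.heads, c * self.ratio, h, w).permute(0, 3, 4, 1, 2)
        q = split(self.to_q(query_feat))
        k = split(self.to_k(key_value_feat))
        v = split(self.to_v(key_value_feat))
        att = (q * k).sum(dim=-1) / (c * self.ratio)     # sum over the channel dimension
        att = torch.softmax(att, dim=-1).unsqueeze(-1)   # softmax over the heads dimension
        weighted = (att * v).sum(dim=3)                  # (b, h, w, channels * ratio)
        weighted = weighted.permute(0, 3, 1, 2)          # back to (b, channels * ratio, h, w)
        return self.out(weighted)                        # interaction feature (b, channels, h, w)
```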
For another example, by the information interaction module, computing the initial depth feature and the initial main body segmentation feature according to the attention mechanism to obtain the second interaction feature includes: for the ith second residual module, acquiring the initial depth feature FDi output by the ith first residual module, and acquiring the initial main body segmentation feature FSi output by the ith second residual module, where i ∈ [1, N]; and, through the information interaction module, taking the initial main body segmentation feature FSi as the key vector and value vector and the initial depth feature FDi as the query vector, and carrying out the operation according to the attention mechanism to obtain the second interaction feature F2Mi.
Fusing the second interaction feature and the initial main body segmentation feature in the channel direction to obtain the main body segmentation feature includes: for the ith second residual module, carrying out fusion processing on the second interaction feature F2Mi and the initial main body segmentation feature FSi in the channel direction to obtain the main body segmentation feature FSnewi.
For the second residual modules in the second decoding network, the new body segmentation feature of each second residual module can be calculated according to a similar principle as the first residual module. The specific process is not described here in detail.
Referring to fig. 5, fig. 5 is still another schematic diagram illustrating image processing performed by the monocular depth estimation model according to an embodiment of the present application. Through the scheme of the application, the information interaction module combines the features of the main body segmentation module and the depth estimation module, so that the two modules can guide each other to learn and the output results of both become more accurate. After the main body segmentation image and the depth information image are obtained, both can be used for downstream task processing, such as outputting a photograph in the portrait mode, cropping the portrait area, or the like.
In an embodiment, the output of the main body segmentation module and the output of the depth estimation module may be comprehensively represented in one image, so that not only the portrait region can be represented, but also the depth information in the image to be processed can be clearly expressed, and referring to fig. 6, fig. 6 is a schematic diagram of an effect of the image processing method according to the embodiment of the present application. Fig. 6 is a schematic diagram of an effect of the monocular depth estimation model according to the present application, which is obtained after processing an image to be processed and includes both image segmentation information and depth information.
Next, a detailed description of the training process of the model is provided for the convenience of the reader to understand the principles of the monocular depth estimation model hereinabove.
For example, in one embodiment, the model may be trained as follows:
a. A sample image is acquired, the sample image having a main body label and a depth label.
Data pair production. The data pairs are produced in two groups: one group is the data pairs for portrait segmentation, and the other group is the data pairs for depth estimation. Acquiring sample images: a large number of sample images, for example 20,000 sample images, are acquired by a mobile phone camera; 90% of the sample image scenes contain people and 10% do not contain people. An existing image segmentation model, such as a MODNet model, is used to segment the sample images to obtain portrait segmentation maps, and a matting model, such as BackgroundMattingV2, is then used to further refine the segmentation data to obtain the final portrait segmentation data, which serves as the main body label. A depth information extraction model, such as a dense-prediction vision model, is used to obtain the depth map corresponding to the sample image, and a drawing tool such as Photoshop is used to repair the places where the depth map is wrong, to obtain the final depth map data, which serves as the depth label of the sample image.
b. And carrying out main body segmentation processing on the sample image through an encoding module and a main body segmentation module of the initial deep learning network to obtain a main body prediction graph.
An initial deep learning network is constructed, and the structure of the network is shown in fig. 5. The initial deep learning network comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module.
And extracting intermediate features of the sample image through the coding module, and carrying out feature reduction processing on the intermediate features through the main body segmentation module to obtain a main body prediction graph. The process is similar to the process of the model in the practical application stage, and will not be described herein.
c. And carrying out depth estimation processing on the sample image through an encoding module and a depth estimation module of the initial deep learning network to obtain a depth prediction image.
And performing feature reduction processing on the intermediate features through a depth estimation module to obtain a depth prediction image. The process is similar to the process of the model in the practical application stage, and will not be described herein.
d. And calculating a first loss value between the main body prediction graph and the main body label according to the first loss function, and calculating a second loss value between the depth prediction graph and the depth label according to the second loss function.
e. And calculating training loss according to the first loss value and the second loss value, and updating the network weight parameter according to the training loss.
Wherein, in an embodiment, the training loss L includes a first loss L_d and a second loss L_s. The first loss function is the depth map L1 loss function and the second loss function is the segmentation L1 loss function. They are calculated as follows:

L_d = L1(d_pred, d_target);

L_s = L1(s_pred, s_target);

L = L_d + L_s.

L_d corresponds to the depth map L1 loss function, where d_pred is the depth map predicted by the model (the depth prediction map) and d_target is the true value of the depth map (the depth label). L_s corresponds to the portrait segmentation L1 loss function, where s_pred is the segmentation map predicted by the model (the main body prediction map) and s_target is the true value of the portrait segmentation (the main body label).
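The loss in steps d and e can be written directly; a minimal sketch assuming PyTorch tensors for the predictions and labels:

```python
import torch.nn.functional as F

def training_loss(depth_pred, depth_target, seg_pred, seg_target):
    """L = L_d + L_s, with both terms as L1 losses between prediction and label."""
    loss_d = F.l1_loss(depth_pred, depth_target)   # depth map L1 loss
    loss_s = F.l1_loss(seg_pred, seg_target)       # segmentation L1 loss
    return loss_d + loss_s
```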
f. And returning to the step of acquiring the sample image until the model converges to obtain the monocular depth estimation model.
The condition for model convergence may be that the number of iterations reaches a preset number of iterations, or that the training loss is less than a preset threshold.
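Steps b–f together amount to a standard training loop; a condensed sketch is given below, in which the optimizer choice, learning rate and convergence thresholds are assumptions, and the model is assumed to return the main body prediction map and the depth prediction map together.

```python
import torch
import torch.nn.functional as F

def train(model, dataloader, max_iterations=100_000, loss_threshold=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    iteration = 0
    for image, seg_label, depth_label in dataloader:       # step a: sample images with labels
        seg_pred, depth_pred = model(image)                 # steps b and c: prediction maps
        loss = (F.l1_loss(depth_pred, depth_label)
                + F.l1_loss(seg_pred, seg_label))           # steps d and e: L = L_d + L_s
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                    # update the network weight parameters
        iteration += 1
        # step f: stop once the model has converged
        if iteration >= max_iterations or loss.item() < loss_threshold:
            break
    return model
```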
Wherein in an embodiment, calculating the training loss from the first loss value and the second loss value comprises: calculating a third loss value between the depth prediction graph and the depth label according to the smooth loss function; and calculating training loss according to the first loss value, the second loss value and the third loss value.
In this embodiment, in order to improve the effect of image processing, a smoothing loss function L_sm may also be added in model training.
Here I is the input RGB image; the loss function is designed to smooth the depth map in the non-edge regions of the image.
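The text above refers to L_sm without reproducing its formula; as an assumption, a common edge-aware smoothness loss that matches the stated intent (smoothing the depth map in non-edge regions of the image I) is sketched below.

```python
import torch

def smoothness_loss(depth_pred, image):
    """Edge-aware smoothness: depth gradients are penalised less where the input
    RGB image I has strong gradients (edges). This concrete form is an assumption;
    the patent only states the intent of the loss."""
    dd_x = (depth_pred[:, :, :, 1:] - depth_pred[:, :, :, :-1]).abs()
    dd_y = (depth_pred[:, :, 1:, :] - depth_pred[:, :, :-1, :]).abs()
    di_x = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    di_y = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dd_x * torch.exp(-di_x)).mean() + (dd_y * torch.exp(-di_y)).mean()
```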
An image processing apparatus is also provided in an embodiment. Referring to fig. 7, fig. 7 is a schematic diagram of an image processing apparatus 300 according to an embodiment of the application. Wherein the image processing apparatus 300 comprises:
the extracting module 301 is configured to perform feature extraction processing on the acquired image to be processed by using an encoding module of the monocular depth estimation model, so as to obtain an intermediate feature; the monocular depth estimation model comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module;
the segmentation module 302 is configured to perform feature reduction processing on the intermediate feature according to the main body segmentation module to obtain an initial main body segmentation feature;
and the estimation module 303 is configured to perform depth estimation processing on the intermediate feature according to the depth estimation module, the information interaction module and the initial main body segmentation feature, so as to obtain a depth information image corresponding to the image to be processed.
In some embodiments, the depth estimation module includes a first decoding network and a regression module; the estimation module 303 is further configured to: performing feature reduction processing on the intermediate features according to the first decoding network, the information interaction module and the initial main body segmentation features to obtain depth features; and carrying out convolution operation on the depth features in the channel direction according to the regression module to obtain a depth information image of the image to be processed.
In some embodiments, the segmentation module 302 is further to: performing feature reduction processing on the intermediate features according to the first decoding network to obtain initial depth features; calculating the initial depth feature and the initial main body segmentation feature according to an attention mechanism through an information interaction module to obtain a first interaction feature; and fusing the first interaction feature and the initial depth feature in the channel direction to obtain the depth feature.
In some embodiments, the segmentation module 302 is further to: and performing classification operation on the initial main body segmentation characteristics according to the main body segmentation module to obtain a main body segmentation image of the image to be processed.
In some embodiments, the body segmentation module includes a second decoding network and a classification module; the segmentation module 302 is also configured to: performing feature reduction processing on the intermediate features according to the second decoding network to obtain initial main body segmentation features; calculating the initial depth feature and the initial main body segmentation feature according to an attention mechanism through an information interaction module to obtain a second interaction feature; fusing the second interaction feature and the initial main body segmentation feature in the channel direction to obtain a main body segmentation feature; and performing classification operation on the main body segmentation features according to the classification module to obtain a main body segmentation image of the image to be processed.
In some embodiments, the first decoding network comprises a plurality of first residual modules comprising a first convolutional layer, a first upsampling layer, and a second convolutional layer; the estimation module 303 is further configured to: and performing convolution operation and up-sampling operation on the intermediate features according to the plurality of first residual modules to obtain initial depth features, wherein the output feature map of the last first residual module is the input feature map of the next first residual module, and the size of the output feature map of the next first residual module is larger than that of the output feature map of the last first residual module.
In some embodiments, the second decoding network comprises a plurality of second residual modules comprising a third convolutional layer, a second upsampling layer, and a fourth convolutional layer; the segmentation module 302 is also configured to: and performing convolution operation and up-sampling operation on the intermediate features according to the plurality of second residual modules to obtain initial main body segmentation features, wherein the output feature map of the last second residual module is the input feature map of the next second residual module, and the size of the output feature map of the next second residual module is larger than that of the output feature map of the last second residual module.
In some embodiments, the first decoding network comprises N first residual modules and the second decoding network comprises N second residual modules; the estimation module 303 is further configured to: for the ith first residual module, acquire the initial depth feature FDi output by the ith first residual module, and acquire the initial main body segmentation feature FSi output by the ith second residual module, where i ∈ [1, N]; through the information interaction module, take the initial depth feature FDi as the key vector and value vector and the initial main body segmentation feature FSi as the query vector, and carry out the operation according to the attention mechanism to obtain the first interaction feature F1Mi; and, for the ith first residual module, carry out fusion processing on the first interaction feature F1Mi and the initial depth feature FDi in the channel direction to obtain the depth feature FDnewi.
In some embodiments, the segmentation module 302 is further to: for the ith second residual module, acquire the initial depth feature FDi output by the ith first residual module, and acquire the initial main body segmentation feature FSi output by the ith second residual module, where i ∈ [1, N]; through the information interaction module, take the initial main body segmentation feature FSi as the key vector and value vector and the initial depth feature FDi as the query vector, and carry out the operation according to the attention mechanism to obtain the second interaction feature F2Mi; and, for the ith second residual module, carry out fusion processing on the second interaction feature F2Mi and the initial main body segmentation feature FSi in the channel direction to obtain the main body segmentation feature FSnewi.
In some embodiments, the image processing apparatus 300 further comprises:
the model training module, configured to acquire a sample image carrying a main body label and a depth label; perform main body segmentation processing on the sample image through the encoding module and the main body segmentation module of an initial deep learning network to obtain a main body prediction map; perform depth estimation processing on the sample image through the encoding module and the depth estimation module of the initial deep learning network to obtain a depth prediction map; calculate a first loss value between the main body prediction map and the main body label according to a first loss function, and calculate a second loss value between the depth prediction map and the depth label according to a second loss function; calculate a training loss according to the first loss value and the second loss value, and update the network weight parameters according to the training loss; and return to the step of acquiring a sample image until the model converges, so as to obtain the monocular depth estimation model.
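A hedged sketch of one such training step follows. The concrete choices of cross-entropy for the first loss function, an L1 term for the second loss function, and equal weights are illustrative assumptions, as is the model interface returning both prediction maps.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample_image, subject_label, depth_label,
                  w_seg: float = 1.0, w_depth: float = 1.0) -> float:
    """One illustrative optimisation step for the initial deep learning network.
    The model is assumed to return (main body prediction map, depth prediction map)."""
    subject_pred, depth_pred = model(sample_image)
    loss_seg = F.cross_entropy(subject_pred, subject_label)   # first loss value
    loss_depth = F.l1_loss(depth_pred, depth_label)           # second loss value
    loss = w_seg * loss_seg + w_depth * loss_depth            # training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # update network weights
    return loss.item()
```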
In some embodiments, the model training module is further configured to: calculate a third loss value between the depth prediction map and the depth label according to a smoothness loss function; and calculate the training loss according to the first loss value, the second loss value and the third loss value, and update the network weight parameters according to the training loss.
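The passage only names a smoothness loss without giving its form; one common edge-aware variant, shown below purely as an assumed example, penalises depth gradients except where the input image itself has strong edges.

```python
import torch

def smoothness_loss(depth_pred: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Illustrative edge-aware smoothness term (an assumption; the text only names a
    'smoothness loss'). Expects depth_pred (B, 1, H, W) and image (B, 3, H, W)."""
    dzdx = torch.abs(depth_pred[:, :, :, 1:] - depth_pred[:, :, :, :-1])
    dzdy = torch.abs(depth_pred[:, :, 1:, :] - depth_pred[:, :, :-1, :])
    didx = torch.mean(torch.abs(image[:, :, :, 1:] - image[:, :, :, :-1]), 1, keepdim=True)
    didy = torch.mean(torch.abs(image[:, :, 1:, :] - image[:, :, :-1, :]), 1, keepdim=True)
    # down-weight the penalty where the image gradient is large (object boundaries)
    return (dzdx * torch.exp(-didx)).mean() + (dzdy * torch.exp(-didy)).mean()
```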
It can be seen from the above that, in the image processing apparatus provided by the embodiment of the present application, a trained monocular depth estimation model is determined, where the monocular depth estimation model includes an encoding module, a main body segmentation module, a depth estimation module and an information interaction module. The encoding module performs feature extraction processing on the image to be processed to obtain intermediate features, the main body segmentation module performs feature reduction processing on the intermediate features to obtain initial main body segmentation features, and then, according to the depth estimation module and the information interaction module, the main body segmentation features are applied to the depth estimation process and depth estimation processing is performed on the intermediate features to obtain a depth information image corresponding to the image to be processed. According to this scheme, a single frame image can be processed, and the main body segmentation features are applied to depth estimation to obtain a high-precision depth information image.
An embodiment of the present application also provides an electronic device, which may be a terminal such as a smart phone, a tablet computer, a notebook computer, a touch-screen device, a game console, a personal computer (PC), or a personal digital assistant (PDA). Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 400 includes a processor 401 with one or more processing cores, a memory 402 with one or more computer readable storage media, and a computer program stored in the memory 402 and executable on the processor. The processor 401 is electrically connected to the memory 402. Those skilled in the art will appreciate that the electronic device structure shown in the figure does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
The processor 401 is a control center of the electronic device 400, connects various parts of the entire electronic device 400 using various interfaces and lines, and performs various functions of the electronic device 400 and processes data by running or loading software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device 400.
In the embodiment of the present application, the processor 401 in the electronic device 400 loads instructions corresponding to the processes of one or more application programs into the memory 402, and executes the application programs stored in the memory 402, thereby implementing the following functions:
performing feature extraction processing on the acquired image to be processed through an encoding module of the monocular depth estimation model to obtain intermediate features; the monocular depth estimation model comprises a coding module, a main body segmentation module, a depth estimation module and an information interaction module;
performing feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features;
and carrying out depth estimation processing on the intermediate features according to the depth estimation module, the information interaction module and the initial main body segmentation features to obtain a depth information image corresponding to the image to be processed.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
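For orientation, the overall inference flow implemented by these functions might look as follows. The sub-module names (encoder, subject_decoder, depth_decoder, interaction, regression) are hypothetical handles introduced only for this sketch, not names defined in this application.

```python
import torch

@torch.no_grad()
def estimate_depth(model, image: torch.Tensor) -> torch.Tensor:
    """Illustrative end-to-end inference flow for the monocular depth estimation model,
    assuming the model exposes its sub-modules under these hypothetical names."""
    intermediate = model.encoder(image)                # feature extraction processing
    init_seg = model.subject_decoder(intermediate)     # initial main body segmentation features
    init_depth = model.depth_decoder(intermediate)     # initial depth features
    # cross-branch interaction injects segmentation cues into the depth branch
    depth_feat, _ = model.interaction(init_depth, init_seg)
    return model.regression(depth_feat)                # depth information image
```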
Optionally, as shown in fig. 8, the electronic device 400 further includes: a touch display 403, a radio frequency circuit 404, an audio circuit 405, an input unit 406, and a power supply 407. The processor 401 is electrically connected to the touch display 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power supply 407, respectively. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 8 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The touch display screen 403 may be used to display a graphical user interface and receive operation instructions generated by the user acting on the graphical user interface. The touch display screen 403 may include a display panel and a touch panel. The display panel may be used to display information entered by the user or provided to the user, as well as various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. The touch panel may be used to collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel using a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions that trigger the corresponding programs. Optionally, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 401, and can also receive and execute commands sent by the processor 401.
The radio frequency circuit 404 may be used to transmit and receive radio frequency signals so as to establish wireless communication with a network device or another electronic device.
The audio circuit 405 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone. The audio circuit 405 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; conversely, the microphone converts collected sound signals into electrical signals, which are received by the audio circuit 405 and converted into audio data. The audio data are processed by the processor 401 and then sent via the radio frequency circuit 404 to, for example, another electronic device, or output to the memory 402 for further processing. The audio circuit 405 may also include an earphone jack to provide communication between peripheral headphones and the electronic device.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to supply power to the various components of the electronic device 400. Optionally, the power supply 407 may be logically connected to the processor 401 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The power supply 407 may also include one or more of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
Although not shown in fig. 8, the electronic device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It can be seen from the foregoing that, in the electronic device provided in this embodiment, a trained monocular depth estimation model is determined, where the monocular depth estimation model includes an encoding module, a main body segmentation module, a depth estimation module and an information interaction module. Feature extraction processing is performed on the image to be processed by the encoding module to obtain intermediate features, feature reduction processing is performed on the intermediate features by the main body segmentation module to obtain initial main body segmentation features, and then, according to the depth estimation module and the information interaction module, the main body segmentation features are applied to the depth estimation process and depth estimation processing is performed on the intermediate features to obtain a depth information image corresponding to the image to be processed. According to this scheme, a single frame image can be processed, and the main body segmentation features are applied to depth estimation to obtain a high-precision depth information image.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer readable storage medium. Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer readable storage medium 500 according to an embodiment of the present application. The computer readable storage medium 500 stores a computer program 501 which, when executed by a processor, implements the steps of any one of the image processing methods provided by the embodiments of the present application.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Since the computer program stored in the storage medium can execute the steps of any image processing method provided by the embodiments of the present application, it can achieve the advantageous effects achievable by any of those methods; for details, refer to the previous embodiments, which are not repeated herein.
The image processing method, apparatus, electronic device and computer readable storage medium provided by the embodiments of the present application have been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art will appreciate that the present disclosure should not be construed as limited to the particular embodiments and application scopes disclosed herein.
Claims (14)
1. An image processing method, comprising:
performing feature extraction processing on the acquired image to be processed through an encoding module of the monocular depth estimation model to obtain intermediate features; the monocular depth estimation model comprises an encoding module, a main body segmentation module, a depth estimation module and an information interaction module;
performing feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features;
and carrying out depth estimation processing on the intermediate features according to the depth estimation module, the information interaction module and the initial main body segmentation features to obtain a depth information image corresponding to the image to be processed.
2. The method of claim 1, wherein the depth estimation module comprises a first decoding network and a regression module;
and performing depth estimation processing on the intermediate feature according to the depth estimation module, the information interaction module and the initial main body segmentation feature to obtain a depth information image corresponding to the image to be processed, wherein the depth information image comprises:
according to the first decoding network, the information interaction module and the initial main body segmentation feature, carrying out feature reduction processing on the intermediate feature to obtain a depth feature;
and carrying out convolution operation on the depth features in the channel direction according to the regression module to obtain a depth information image corresponding to the image to be processed.
3. The method of claim 2, wherein the performing feature reduction processing on the intermediate features according to the first decoding network, the information interaction module, and the initial subject segmentation feature to obtain depth features comprises:
performing feature reduction processing on the intermediate features according to the first decoding network to obtain initial depth features;
calculating the initial depth feature and the initial main body segmentation feature according to an attention mechanism through the information interaction module to obtain a first interaction feature;
and carrying out fusion processing on the first interaction feature and the initial depth feature in the channel direction to obtain the depth feature.
4. The method of claim 1, wherein after performing feature reduction processing on the intermediate feature according to the main body segmentation module to obtain an initial main body segmentation feature, the method further comprises:
performing a classification operation on the initial main body segmentation feature according to the main body segmentation module to obtain a main body segmentation image of the image to be processed.
5. The method of claim 3, wherein the main body segmentation module comprises a second decoding network and a classification module;
the feature reduction processing is performed on the intermediate feature according to the main body segmentation module to obtain an initial main body segmentation feature, which comprises the following steps:
performing feature reduction processing on the intermediate features according to the second decoding network to obtain initial main body segmentation features;
after the feature reduction processing is performed on the intermediate feature according to the main body segmentation module to obtain an initial main body segmentation feature, the method further comprises:
calculating the initial depth feature and the initial main body segmentation feature according to an attention mechanism through the information interaction module to obtain a second interaction feature;
carrying out fusion processing on the second interaction feature and the initial main body segmentation feature in the channel direction to obtain a main body segmentation feature;
and carrying out classification operation on the main body segmentation features according to the classification module to obtain a main body segmentation image corresponding to the image to be processed.
6. The method of claim 5, wherein the first decoding network comprises a plurality of first residual modules comprising a first convolutional layer, a first upsampling layer, and a second convolutional layer; the performing feature reduction processing on the intermediate feature according to the first decoding network to obtain an initial depth feature, including:
performing convolution operation and up-sampling operation on the intermediate features according to the plurality of first residual modules to obtain initial depth features; wherein the output feature map of the previous first residual module is an input feature map of the next first residual module, and the size of the output feature map of the next first residual module is larger than that of the output feature map of the previous first residual module.
7. The method of claim 6, wherein the second decoding network comprises a plurality of second residual modules comprising a third convolutional layer, a second upsampling layer, and a fourth convolutional layer; the performing feature reduction processing on the intermediate feature according to the second decoding network to obtain an initial main body segmentation feature, including:
performing convolution operation and up-sampling operation on the intermediate features according to the plurality of second residual modules to obtain initial main body segmentation features; wherein the output feature map of the previous second residual module is an input feature map of the next second residual module, and the size of the output feature map of the next second residual module is larger than that of the output feature map of the previous second residual module.
8. The method of claim 5, wherein the first decoding network comprises N first residual modules and the second decoding network comprises N second residual modules;
the calculating, by the information interaction module, the initial depth feature and the initial main body segmentation feature according to an attention mechanism to obtain a first interaction feature, including:
for an ith first residual module, acquiring an initial depth feature FDi output by the ith first residual module, and acquiring an initial main body segmentation feature FSi output by an ith second residual module, wherein i ∈ [1, N];
the information interaction module is used for taking the initial depth feature FDi as a key vector and a value vector, taking the initial main body segmentation feature FSi as a query vector, and carrying out operation according to an attention mechanism to obtain a first interaction feature F1Mi;
the fusing processing is performed on the first interaction feature and the initial depth feature in the channel direction to obtain a depth feature, including:
and for the ith first residual module, carrying out fusion processing on the first interaction feature F1Mi and the initial depth feature FDi in the channel direction to obtain a depth feature FDnewi.
9. The method of claim 8, wherein the calculating, by the information interaction module, the initial depth feature and the initial main body segmentation feature according to an attention mechanism to obtain a second interaction feature comprises:
for the ith second residual module, acquiring an initial depth feature FDi output by the ith first residual module, and acquiring an initial main body segmentation feature FSi output by the ith second residual module, wherein i ∈ [1, N];
the information interaction module is used for taking the initial main body segmentation feature FSi as a key vector and a value vector, taking the initial depth feature FDi as a query vector, and carrying out operation according to an attention mechanism to obtain a second interaction feature F2Mi;
the fusing processing is performed on the second interaction feature and the initial main body segmentation feature in the channel direction to obtain a main body segmentation feature, which comprises the following steps:
and for the ith second residual module, carrying out fusion processing on the second interaction feature F2Mi and the initial main body segmentation feature FSi in the channel direction to obtain a main body segmentation feature FSnewi.
10. The method of claim 1, wherein before extracting features of the acquired image to be processed by the encoding module of the monocular depth estimation model to obtain intermediate features, the method further comprises:
acquiring a sample image, wherein the sample image is provided with a main body label and a depth label;
performing main body segmentation processing on the sample image through an encoding module and a main body segmentation module of an initial deep learning network to obtain a main body prediction map;
performing depth estimation processing on the sample image through the encoding module and a depth estimation module of the initial deep learning network to obtain a depth prediction map;
calculating a first loss value between the main body prediction map and the main body label according to a first loss function, and calculating a second loss value between the depth prediction map and the depth label according to a second loss function;
calculating training loss according to the first loss value and the second loss value, and updating a network weight parameter according to the training loss;
and returning to the step of acquiring a sample image until the model converges, so as to obtain the monocular depth estimation model.
11. The method of claim 10, wherein the calculating training loss from the first loss value and the second loss value comprises:
calculating a third loss value between the depth prediction map and the depth label according to a smoothness loss function;
and calculating training loss according to the first loss value, the second loss value and the third loss value.
12. An image processing apparatus, comprising:
the extraction module is used for carrying out feature extraction processing on the acquired image to be processed through the encoding module of the monocular depth estimation model to obtain intermediate features; the monocular depth estimation model comprises an encoding module, a main body segmentation module, a depth estimation module and an information interaction module;
the segmentation module is used for carrying out feature reduction processing on the intermediate features according to the main body segmentation module to obtain initial main body segmentation features;
and the estimation module is used for carrying out depth estimation processing on the intermediate features according to the depth estimation module, the information interaction module and the initial main body segmentation features to obtain a depth information image corresponding to the image to be processed.
13. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the image processing method according to any one of claims 1 to 11.
14. An electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the image processing method according to any one of claims 1 to 11 when executing the computer program.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202210582065.6A (CN117197208A) | 2022-05-26 | 2022-05-26 | Image processing method, device, electronic equipment and computer readable storage medium |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN117197208A | 2023-12-08 |
Family

ID=89000238

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202210582065.6A (Pending) | Image processing method, device, electronic equipment and computer readable storage medium | 2022-05-26 | 2022-05-26 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN117197208A |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |