CN112070817B - Image depth estimation method, terminal equipment and computer readable storage medium

Info

Publication number
CN112070817B
Authority
CN
China
Prior art keywords
target image
feature
image
map
feature map
Prior art date
Legal status
Active
Application number
CN202010863390.0A
Other languages
Chinese (zh)
Other versions
CN112070817A (en
Inventor
张锲石
程俊
林典
高向阳
任子良
康宇航
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010863390.0A priority Critical patent/CN112070817B/en
Priority to PCT/CN2020/129897 priority patent/WO2022041506A1/en
Publication of CN112070817A publication Critical patent/CN112070817A/en
Application granted granted Critical
Publication of CN112070817B publication Critical patent/CN112070817B/en

Links

Classifications

    • G06T 7/00 Image analysis; G06T 7/50 Depth or shape recovery
    • G06T 2207/10004 Image acquisition modality: still image; photographic image
    • G06T 2207/10028 Image acquisition modality: range image; depth image; 3D point clouds
    • G06T 2207/20081 Special algorithmic details: training; learning
    • G06T 2207/20084 Special algorithmic details: artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application is applicable to the technical field of image processing, and provides an image depth estimation method, terminal equipment and a computer readable storage medium. The key features in the scheme describe the information of the pixel points contained in the non-blank area of the target image; that is, when determining the target depth map of the target image, the scheme does not consider the information of the pixel points contained in the blank area of the target image and determines the target depth map based only on the key features of the target image. The estimation accuracy of image depth estimation can therefore be improved, and the calculation cost of image depth estimation reduced.

Description

Image depth estimation method, terminal equipment and computer readable storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image depth estimation method, a terminal device, and a computer readable storage medium.
Background
Image depth estimation is the process of predicting depth information of a two-dimensional image. The depth information obtained through image depth estimation describes how far the photographed object corresponding to each pixel point in the image is from the camera, so image depth estimation is widely applied in fields such as three-dimensional modeling and autonomous obstacle avoidance of robots. Most existing image depth estimation methods are based on multiple images and perform depth estimation according to the inherent projective relationship between those images. For a single image, no such inter-image mapping relationship exists, so only limited information can be obtained from the features of the image and prior knowledge to complete image depth estimation, which makes the task technically more difficult.
In the prior art, several deep-learning-based image depth estimation methods for a single image can achieve good estimation results. However, the existing image depth estimation methods for a single image generally perform depth estimation based on the global information of the image, which reduces the accuracy of image depth estimation.
Disclosure of Invention
In view of the above, the embodiments of the present application provide an image depth estimation method, a terminal device, and a computer readable storage medium, so as to solve the problem of low estimation accuracy in the existing image depth estimation method for a single image.
In a first aspect, an embodiment of the present application provides an image depth estimation method, including:
Acquiring a target image;
Inputting the target image into a preset depth estimation model, extracting global features of the target image, extracting key features of the target image from the global features, and determining a target depth map of the target image according to the key features; the key features describe information of the pixel points contained in a non-blank area of the target image, where the non-blank area refers to an area formed by pixel points that are neither white nor black; the value of each pixel point in the target depth map describes how far the photographed object corresponding to that pixel point is from the camera.
In a second aspect, an embodiment of the present application provides a terminal device including:
A first acquisition unit configured to acquire a target image;
The first processing unit is configured to input the target image into a preset depth estimation model, extract global features of the target image, extract key features of the target image from the global features, and determine a target depth map of the target image according to the key features; the key features describe information of the pixel points contained in a non-blank area of the target image, where the non-blank area refers to an area formed by pixel points that are neither white nor black; the value of each pixel point in the target depth map describes how far the photographed object corresponding to that pixel point is from the camera.
In a third aspect, an embodiment of the present application provides a terminal device comprising a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the method according to the first aspect or any alternative of the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements a method as in the first aspect or any of the alternatives of the first aspect.
In a fifth aspect, an embodiment of the application provides a computer program product for causing a terminal device to carry out the method of the first aspect or any of the alternatives of the first aspect, when the computer program product is run on the terminal device.
The image depth estimation method, the terminal equipment, the computer readable storage medium and the computer program product provided by the embodiment of the application have the following beneficial effects:
According to the image depth estimation method provided by the embodiment of the application, the target image is acquired and input into a preset depth estimation model, global features of the target image are extracted, key features of the target image are extracted from the global features, and a target depth map of the target image is determined according to the key features. The key features describe information of the pixel points contained in a non-blank area of the target image, where the non-blank area refers to an area formed by pixel points that are neither white nor black, and the value of each pixel point in the target depth map describes how far the photographed object corresponding to that pixel point is from the camera. Because the photographed object corresponding to the pixel points contained in a blank area of the target image is usually infinitely far from the camera, using the information of those pixel points for image depth estimation would increase the calculation cost of image depth estimation and affect its estimation accuracy. The key features in this scheme describe only the information of the pixel points contained in the non-blank area of the target image; that is, when determining the target depth map of the target image, the information of the pixel points contained in the blank area is not considered, and the target depth map is determined based only on the key features of the target image. The estimation accuracy of image depth estimation can therefore be improved, and the calculation cost of image depth estimation reduced.
Drawings
FIG. 1 is a schematic flow chart of an image depth estimation method provided by an embodiment of the application;
FIG. 2 is a schematic structural diagram of a depth estimation model according to an embodiment of the present application;
FIG. 3 is a schematic view of a depth estimation model according to another embodiment of the present application;
FIG. 4 is a flowchart of a specific implementation of a step of extracting global features of a target image in an image depth estimation method according to an embodiment of the present application;
FIG. 5 is a flowchart of a specific implementation of a step of determining a target depth map of a target image in an image depth estimation method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a spatial attention network according to an embodiment of the present application;
FIG. 7 is a flowchart of a specific implementation of S51 in an image depth estimation method according to an embodiment of the present application;
FIG. 8 is a flowchart of a specific implementation of S71 in an image depth estimation method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a specific process of pooling operation involved in an image depth estimation method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a specific procedure of an equivalent convolution operation involved in an image depth estimation method according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a depth information processing network according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a terminal device according to another embodiment of the present application;
FIG. 14 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
It is noted that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Before explaining the image depth estimation method provided by the embodiment of the application, related concepts related to the embodiment of the application are explained.
In fields such as autonomous navigation of a robot or automatic driving of a vehicle, the robot or vehicle is usually required to avoid obstacles automatically. The robot or vehicle therefore needs to know, in real time while advancing, its distance from an obstacle ahead, and to perform an obstacle avoidance operation when it gets close to the obstacle, so as to realize an automatic obstacle avoidance function. Taking autonomous navigation of a robot as an example, a camera for photographing the road conditions in front of the robot is generally mounted on the robot, and when an obstacle exists in front of the robot, the camera captures an image containing the obstacle.
Since an image captured by a camera is generally a two-dimensional image, and a two-dimensional image does not contain depth information describing how far the photographed object is from the camera, image depth estimation needs to be performed on the image captured by the camera to obtain a depth map of the image. Because the value of each pixel point in the depth map describes how far the photographed object corresponding to that pixel point is from the camera, the robot can learn from the depth map corresponding to the image containing the obstacle how far the obstacle is from the camera, i.e., how far the robot is from the obstacle, and the automatic obstacle avoidance function can be realized on that basis.
Specifically, image depth estimation is the process of predicting depth information of a two-dimensional image. The depth information of an image can be represented by a depth map of the same size as the image, and the value of each pixel in the depth map describes how far the photographed object corresponding to the corresponding pixel point in the image is from the camera. The value of each pixel in the depth map lies in the range (0, 1). The closer the values of some pixel points in the depth map are to 0, the closer the photographed objects corresponding to those pixel points are to the camera; the closer the values are to 1, the farther the corresponding photographed objects are from the camera. It should be noted that a plurality of pixels in the depth map may correspond to the same photographed object, with different pixels among them generally corresponding to different parts of that object. When a plurality of pixels in the depth map correspond to the same photographed object, their values are generally close to one another.
For example, taking autonomous navigation of a robot as an example, when the robot detects that the depth map of an image captured by the camera contains pixels whose values are smaller than a preset depth value threshold, it is confirmed that the robot is currently close to the photographed object corresponding to those pixels, and the robot can then execute the obstacle avoidance operation. The preset depth value threshold may be set according to actual requirements; for example, it may be 0.05.
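As an illustration of this thresholding step only, a minimal sketch is given below; the function and array names are hypothetical and not part of the patent, and the depth map is assumed to be available as a NumPy array with values in (0, 1).

```python
import numpy as np

def should_avoid(depth_map: np.ndarray, threshold: float = 0.05) -> bool:
    """Return True if any pixel's normalized depth falls below the threshold,
    i.e. some photographed object is considered too close to the camera."""
    return bool((depth_map < threshold).any())

# Example: a 2x2 depth map in which one region is very close to the camera.
depth = np.array([[0.90, 0.62],
                  [0.04, 0.55]])
if should_avoid(depth):
    print("Obstacle close to the camera: trigger obstacle avoidance")
```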
In the embodiment of the application, an image on which image depth estimation needs to be performed can be defined as a target image, and a depth map that has the same size as the target image and represents the depth information of the target image is defined as the target depth map of the target image. Illustratively, if the size of the target image is M×N pixels, the size of the target depth map of the target image is also M×N pixels.
In practical applications, the target image may be a three-primary-color (i.e., red-green-blue, RGB) image, or may be another type of color image. For ease of understanding, the following embodiments of the present application take the target image being an RGB image as an example.
Since an RGB image includes three color channels, namely the R channel, the G channel and the B channel, the target image can be represented by three two-dimensional images, each corresponding to one color channel. The value of each pixel point in each two-dimensional image lies in the range [0, 255] and represents the component, on the color channel corresponding to that two-dimensional image, of the pixel value of the corresponding pixel point in the target image. For example, if the two-dimensional image corresponding to the R channel of the target image is image R, the value of the pixel point in the first row and first column of image R represents the component on the R channel of the pixel value of the pixel point in the first row and first column of the target image.
In the embodiment of the application, each target image can be subjected to operation on a space domain and operation on a color channel domain. The operation on the spatial domain refers to the operation on a two-dimensional plane of each two-dimensional image corresponding to the target image. By way of example, operations on the spatial domain may include scaling operations on the size of each two-dimensional image (e.g., height or width of the image, etc.), and so on. The operation on the color channel domain may include an operation such as a change in the number of color channels of the target image.
In general, a target image may include some blank areas, and the photographed object corresponding to a blank area (for example, the sky) is usually infinitely far from the camera, so depth estimation of the blank areas of the target image is generally unnecessary. If the information of the pixels contained in the blank areas were used for image depth estimation, the computing overhead of image depth estimation would increase and the estimation accuracy would be affected. Therefore, in the embodiment of the application, a spatial attention mechanism is introduced when image depth estimation is performed.
The spatial attention mechanism is an attention mechanism based on the spatial domain that focuses attention on a critical area of an image, for example a non-blank area, i.e., it focuses only on the information in the critical area of the image. A non-blank area is an area of the image other than the blank areas. A blank area includes an area formed by a plurality of continuous black pixel points and/or an area formed by a plurality of continuous white pixel points. For an RGB image, a black pixel point is a pixel point whose pixel value components on the three color channels are all 0, and a white pixel point is a pixel point whose pixel value components on the three color channels are all 255.
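The patent does not spell out how the blank area itself is computed; under the definition above (pure-black or pure-white pixel points), a simple per-pixel mask could be derived as in the following illustrative sketch (the check for contiguous regions is omitted, and all names are assumptions).

```python
import numpy as np

def non_blank_mask(rgb: np.ndarray) -> np.ndarray:
    """Mark pixels that are neither pure black (0, 0, 0) nor pure white
    (255, 255, 255); only these pixels carry useful depth cues here.

    rgb: H x W x 3 array with values in [0, 255].
    """
    black = np.all(rgb == 0, axis=-1)
    white = np.all(rgb == 255, axis=-1)
    return ~(black | white)
```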
Referring to fig. 1, fig. 1 is a schematic flowchart of an image depth estimation method according to an embodiment of the present application. The image depth estimation method provided by the embodiment of the application is executed by a terminal device; the terminal device may be a mobile terminal such as a smartphone, a tablet computer or a wearable device, or may be a camera or a robot in various application scenarios.
The image depth estimation method shown in fig. 1 may include S11 to S12, which are described in detail as follows:
S11: a target image is acquired.
In the embodiment of the application, the number of the target images can be one or a plurality of target images.
In one possible implementation, the terminal device may directly acquire one or more images captured by the camera as the target image. In another possible implementation, the terminal device may acquire one or more videos shot by the camera, split the videos into frames, and use one or more of the resulting video frames as the target image.
S12: and inputting the target image into a preset depth estimation model for processing to obtain a target depth map of the target image.
In the embodiment of the application, the depth estimation model is used to determine the target depth map of the target image; that is, the input of the depth estimation model is the target image and its output is the target depth map of the target image. It should be noted that the value of each pixel point in the target depth map of the target image describes how far the photographed object corresponding to that pixel point is from the camera.
When there is one target image, the terminal device may input the target image into the preset depth estimation model and determine the target depth map of the target image through the depth estimation model; when there are multiple target images, the terminal device may input the target images into the depth estimation model respectively and determine the target depth map of each target image through the depth estimation model.
The terminal device inputs the target image into a preset depth estimation model for processing, and specifically comprises the following steps:
Extracting global features of the target image, extracting key features of the target image from the global features, and determining a target depth map of the target image according to the key features.
In the embodiment of the application, the terminal device inputs the target image into a preset depth estimation model; the preset depth estimation model extracts global features of the target image, extracts key features of the target image from the global features, and determines a target depth map of the target image according to the key features.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a depth estimation model according to an embodiment of the application. As shown in fig. 2, in one embodiment of the application, the depth estimation model 20 may include a first feature extraction network 21 and a spatial attention mechanism based depth estimation network 22.
The first feature extraction network 21 is used to extract global features of the target image. The global feature is used for describing information of all pixel points contained in the target image.
In the embodiment of the application, the global features of the target image can be represented by a three-dimensional feature map. It should be noted that the dimension here refers to the dimension in the color channel domain of the image. For example, the three-dimensional feature map may be formed by three one-dimensional feature maps, one per color channel. In the embodiment of the present application, the size of the feature map (i.e., its width × height) representing the global features of the target image is smaller than the size of the target image. Illustratively, if the size of the target image is M×N pixels and the size of the feature map representing the global features of the target image is A×B pixels, then A < M and/or B < N.
The depth estimation network 22 is configured to extract key features of the target image from the global features based on the spatial attention mechanism, and determine a target depth map of the target image based on the key features. The key features are used for describing information of pixel points contained in a non-blank area in the target image, and the non-blank area refers to an area formed by non-white pixel points and non-black pixel points according to the previous description of the non-blank area.
In the embodiment of the application, the key features of the target image can be represented by a one-dimensional feature map. It should be noted that the dimension here also refers to the dimension in the color channel domain of the image. For example, the one-dimensional feature map may be obtained by performing dimension reduction processing on the three-dimensional feature map in the color channel domain. In the embodiment of the present application, the size of the feature map representing the key features of the target image is generally smaller than or equal to the size of the feature map representing the global features of the target image. Illustratively, if the feature map representing the global features of the target image has a size of A×B pixels and the feature map representing the key features of the target image has a size of K×J pixels, then K ≤ A and/or J ≤ B.
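For orientation, a minimal PyTorch sketch of this two-stage structure is given below. It is an illustrative assumption only: the kernel sizes, the absence of residual blocks, the single-convolution attention step and the sigmoid output are not the exact networks of FIG. 2 and FIG. 3.

```python
import torch
import torch.nn as nn

class DepthEstimationModel(nn.Module):
    """Illustrative skeleton: a first feature extraction network that produces
    global features, a channel-domain reduction that keeps key features, and a
    depth head that restores the spatial size of the target depth map."""

    def __init__(self):
        super().__init__()
        # First feature extraction network (convolution + spatial pooling).
        self.feature_extraction = nn.Sequential(
            nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3),  # per-channel conv
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),                          # spatial max pooling
        )
        # Spatial attention step producing a one-channel map of key features.
        self.attention_conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        # Depth information processing: restore the spatial resolution.
        self.depth_head = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, padding=1),
            nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2),    # spatial up-scaling
            nn.Sigmoid(),
        )

    def forward(self, target_image: torch.Tensor) -> torch.Tensor:
        global_features = self.feature_extraction(target_image)         # 3 channels
        # Channel-domain max pooling keeps the strongest response per pixel.
        key_features, _ = global_features.max(dim=1, keepdim=True)      # 1 channel
        first_depth = torch.sigmoid(self.attention_conv(key_features))  # first depth map
        return self.depth_head(first_depth)                             # target depth map


model = DepthEstimationModel()
depth = model(torch.rand(1, 3, 64, 64))   # depth values lie in (0, 1)
print(depth.shape)                        # torch.Size([1, 1, 64, 64])
```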
It should be noted that, the depth estimation model in the embodiment of the present application may be obtained by training a depth estimation model built in advance by adopting a deep learning manner based on a preset sample data set. The training process of the depth estimation model may specifically include the following steps:
(1) And constructing a depth estimation model. In the embodiment of the present application, a depth estimation model as shown in fig. 3 may be constructed, where an initial value of each network parameter (for example, each parameter of the convolution kernel) involved in the depth estimation model may be any value given randomly, and a final value of each network parameter involved in the depth estimation model may be learned during the training process of the depth estimation model.
(2) A sample dataset is acquired. In the embodiment of the application, two temporally consecutive images are taken as one sample image group, i.e., each sample image group comprises a target image I_t and the subsequent frame image I_{t+1}. When training the depth estimation model, a plurality of sample image groups may be acquired from an existing public image library, and these sample image groups constitute the sample dataset.
(3) A depth estimation model is trained.
When training the depth estimation model, the target image I_t in each sample image group is input into the pre-constructed depth estimation model for processing, so as to obtain the target depth map D̂_t of the target image I_t.
Meanwhile, the target image I_t and the subsequent frame image I_{t+1} in each sample image group are input into a preset pose estimation model for processing, so as to obtain the transformation relationship T̂_{t→t+1} between the target image I_t and the subsequent frame image I_{t+1}. The preset pose estimation model may be an existing neural-network-based pose estimation model.
Then, the predicted image Î_t corresponding to the target image I_t is determined by the following formula (in homogeneous pixel coordinates, up to a scale factor):
p_{t+1} ~ K · T̂_{t→t+1} · D̂_t(p_t) · K⁻¹ · p_t
where K denotes the built-in (intrinsic) parameters of the camera, T̂_{t→t+1} denotes the transformation relationship between the target image I_t and the subsequent frame image I_{t+1}, D̂_t denotes the target depth map of the target image I_t obtained through the depth estimation model, p_{t+1} denotes the coordinates of each pixel in the subsequent frame image I_{t+1}, and p_t denotes the coordinates of the corresponding pixel in the predicted image Î_t corresponding to the target image I_t.
Since only p_t is unknown in the above formula, the coordinates p_t of each pixel of the predicted image Î_t can be calculated from the formula, and the predicted image Î_t corresponding to the target image I_t can then be obtained. On this basis, the error ΔI between the predicted image Î_t and the target image I_t can be obtained. As can be seen from the formula, the error ΔI depends on the target depth map D̂_t of the target image I_t.
Therefore, by continuously adjusting the values of the network parameters of the depth estimation model, the target depth map D̂_t obtained through the depth estimation model changes continuously, so that the associated error ΔI also changes continuously, until the error ΔI falls below a preset error threshold. When the error ΔI is below the preset error threshold, training of the depth estimation model is complete, and the values of the network parameters of the depth estimation model at that point are their final values.
The terminal device may determine the trained depth estimation model as a preset depth estimation model, that is, the preset depth estimation model described in S12.
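The patent trains the model by synthesizing the target view from the next frame and minimizing the photometric error ΔI. A minimal PyTorch sketch of that view-synthesis step is shown below. It assumes a pinhole camera model and specific tensor shapes, and, following the common differentiable implementation, it iterates over the target pixels p_t and bilinearly samples I_{t+1} at the projected locations p_{t+1}, rather than solving the formula for p_t directly; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def synthesize_target_view(next_frame, depth_t, K, T_t_to_next):
    """Build the predicted image I_hat_t by projecting every target pixel p_t
    into frame t+1 with p_{t+1} ~ K * T_{t->t+1} * D_t(p_t) * K^-1 * p_t and
    bilinearly sampling I_{t+1} there.

    next_frame:  1 x 3 x H x W image I_{t+1}
    depth_t:     1 x 1 x H x W predicted depth map D_t
    K:           3 x 3 camera intrinsics
    T_t_to_next: 4 x 4 relative pose from the pose estimation model
    """
    _, _, H, W = next_frame.shape
    ys = torch.arange(H, dtype=torch.float32).view(H, 1).expand(H, W)
    xs = torch.arange(W, dtype=torch.float32).view(1, W).expand(H, W)
    ones = torch.ones(H, W)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(3, -1)        # homogeneous p_t
    cam = (K.inverse() @ pix) * depth_t.reshape(1, -1)             # back-project with D_t
    cam_h = torch.cat([cam, torch.ones(1, cam.shape[1])], dim=0)   # 4 x (H*W)
    proj = K @ (T_t_to_next @ cam_h)[:3]                           # project into frame t+1
    u = proj[0] / proj[2].clamp(min=1e-6)
    v = proj[1] / proj[2].clamp(min=1e-6)
    grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)
    return F.grid_sample(next_frame, grid.view(1, H, W, 2), align_corners=True)

# Photometric error ΔI that drives the updates of the network parameters:
# delta_I = (I_t - synthesize_target_view(I_t1, D_t, K, T)).abs().mean()
```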
Referring to fig. 3, fig. 3 is a schematic structural diagram of a depth estimation model according to another embodiment of the application. In another embodiment of the present application, as shown in fig. 3, the first feature extraction network 21 may include a first convolution network 211, a first residual network 212, and a first pooling network 213.
Based on this, extracting the global features of the target image may specifically include S41 to S43 as shown in fig. 4, as follows:
S41: and carrying out convolution operation on the target image by adopting a first convolution check to obtain a first feature map corresponding to each of the three color channels of the target image.
In the embodiment of the application, the terminal equipment can carry out convolution operation on the target image by adopting convolution check in the first convolution network to obtain the first feature images corresponding to the three color channels of the target image.
In one possible implementation, the first convolution network may be configured with only one first convolution kernel that may act on three color channels of the target image. Based on the above, the terminal device may perform convolution operation on the two-dimensional graphs corresponding to the three color channels of the target image respectively by using the first convolution kernel in the first convolution network, so as to obtain the first feature graphs corresponding to the three color channels of the target image respectively.
For example, if the first convolution network is configured with only the first convolution kernel C_11, and the two-dimensional images corresponding to the R channel, the G channel and the B channel of the target image are image R, image G and image B respectively, then the terminal device may perform convolution operations on image R, image G and image B using the first convolution kernel C_11 to obtain a first feature map f_r1, a first feature map f_g1 and a first feature map f_b1 corresponding to the R channel, the G channel and the B channel respectively.
In another possible implementation, the first convolution network may be configured with three different first convolution kernels that respectively act on different color channels of the target image. Based on the above, the terminal device may perform convolution operation on the two-dimensional graphs corresponding to the three color channels of the target image respectively by using the three different first convolution kernels in the first convolution network, so as to obtain the first feature graphs corresponding to the three color channels of the target image respectively.
For example, if the first convolution network is configured with a first convolution kernel C_11, a first convolution kernel C_12 and a first convolution kernel C_13 that act on the R channel, the G channel and the B channel respectively, and the two-dimensional images corresponding to the R channel, the G channel and the B channel of the target image are image R, image G and image B respectively, then the terminal device may perform a convolution operation on image R using the first convolution kernel C_11 to obtain a first feature map f_r1 corresponding to the R channel; perform a convolution operation on image G using the first convolution kernel C_12 to obtain a first feature map f_g1 corresponding to the G channel; and perform a convolution operation on image B using the first convolution kernel C_13 to obtain a first feature map f_b1 corresponding to the B channel.
It should be noted that, the dimension of the first convolution kernel in the color channel domain is one-dimensional.
In order to reduce the calculation overhead of the terminal device, in another possible implementation the terminal device may perform the convolution operation on the target image in a depthwise separable convolution manner using the first convolution kernel in the first convolution network, so as to obtain the first feature maps corresponding to the three color channels of the target image.
Since the specific operation principle of depthwise separable convolution is known in the art, reference may be made to the related description of its operation principle, which is not repeated here.
It should be noted that, the values of the parameters of all convolution kernels involved in the embodiment of the present application may be learned during the training process of the depth estimation model.
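As an illustration of the per-channel variant of the first convolution described above, the following PyTorch sketch uses a grouped convolution so that each first convolution kernel acts on one colour channel; the 3×3 kernel size and padding are assumptions (a full depthwise separable convolution would additionally apply a 1×1 pointwise convolution).

```python
import torch
import torch.nn as nn

# With groups=3, each first convolution kernel acts on a single colour channel
# of the target image, producing one first feature map per channel.
first_conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3,
                       padding=1, groups=3, bias=False)

target_image = torch.rand(1, 3, 128, 128)       # RGB target image
first_feature_maps = first_conv(target_image)   # one first feature map per channel
print(first_feature_maps.shape)                 # torch.Size([1, 3, 128, 128])
```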
S42: and processing each first characteristic map by adopting at least one preset residual block to obtain a second characteristic map corresponding to each first characteristic map.
In the embodiment of the present application, the first residual network includes at least one preset residual block. After obtaining the first feature images corresponding to the three color channels of the target image, the terminal device may process each first feature image in the first residual network by using the at least one preset residual block, so as to obtain a second feature image corresponding to each first feature image.
In one possible implementation manner, when only one preset residual block is included in the first residual network, the processing, by the terminal device, the first feature map by using the preset residual block may include: and processing the first feature map by adopting a network structure corresponding to the preset residual block to obtain a preset feature map corresponding to the first feature map, and carrying out feature fusion on the first feature map and the preset feature map to obtain a second feature map corresponding to the first feature map.
The step of the terminal device performing feature fusion on the first feature map and the preset feature map may specifically include: and carrying out summation operation on the value of each pixel point in the first feature map and the value of the corresponding pixel point in the preset feature map, and taking the result of the summation operation corresponding to each pixel point as the value of the corresponding pixel point in the second feature map to obtain the second feature map. It should be noted that, when the size of the preset feature map is smaller than the size of the first feature map, the zero-filling operation may be performed on the preset feature map, so that the size of the preset feature map is the same as the size of the first feature map, and then the feature fusion operation is performed on the first feature map and the preset feature map. The zero padding operation refers to increasing the number of rows and/or columns of the preset feature map, and setting the value of each pixel contained in the increased number of rows and/or columns to 0.
In another possible implementation manner, when a plurality of preset residual blocks are included in the first residual network, the terminal device processing the first feature map with the preset residual blocks includes: processing the first feature map by adopting a network structure corresponding to the first preset residual block to obtain a preset feature map corresponding to the first feature map; for each residual block after the first residual block, the terminal equipment adopts a network structure corresponding to the residual block to process a preset feature map obtained through the previous residual block to obtain a candidate feature map corresponding to the residual block, and performs feature fusion on the preset feature map obtained through the previous residual block and the candidate feature map corresponding to the residual block to obtain the preset feature map corresponding to the residual block; based on this, the preset feature map obtained by the last residual block is the second feature map.
It should be noted that the network structure corresponding to each preset residual block may be set according to actual requirements; for example, it may be an existing VGG structure (the convolutional neural network proposed by the Visual Geometry Group (VGG) of the University of Oxford). The values of the network parameters included in the network structure corresponding to a preset residual block may be learned during the training of the depth estimation model.
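A minimal sketch of one preset residual block with the feature-fusion step (element-wise summation of the input feature map and the preset feature map) might look as follows; the two-convolution inner structure is an assumption, not the VGG structure mentioned above, and padding is used so the sizes already match without zero-filling.

```python
import torch
import torch.nn as nn

class PresetResidualBlock(nn.Module):
    """Sketch of one preset residual block: an inner network structure produces
    a preset feature map, which is fused with the input feature map by
    element-wise summation."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, first_feature_map: torch.Tensor) -> torch.Tensor:
        preset_feature_map = self.body(first_feature_map)
        # Feature fusion: sum corresponding pixels of the input feature map and
        # the preset feature map (sizes match because padding=1 is used).
        return first_feature_map + preset_feature_map


second_feature_map = PresetResidualBlock()(torch.rand(1, 3, 64, 64))
```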
S43: carrying out maximum pooling operation on the space domain on each second feature map to obtain third feature maps corresponding to three color channels of the target image; wherein all the third feature maps are used for describing global features of the target image.
The terminal equipment can adopt a first preset pooling window to carry out maximum pooling operation on the space domain on the second characteristic image corresponding to each color channel of the target image in the first pooling network, so as to obtain a third characteristic image corresponding to each of the three color channels of the target image. The third feature maps corresponding to the three color channels respectively form the three-dimensional feature map for representing the global feature of the target image, that is, the three third feature maps corresponding to the three color channels are used for describing the global feature of the target image.
The size of the first preset pooling window may be learned during training of the depth estimation model. The size of the first preset pooling window may be denoted as H_1 × H_1, where H_1 is typically an integer greater than 1; on this basis, the size of the third feature map is smaller than the size of the second feature map.
For example, if the size of the second feature map corresponding to each color channel of the target image is 2H×2H pixels and the size of the first preset pooling window is H×H, then after the maximum pooling operation in the spatial domain is performed on each second feature map through the first preset pooling window, the size of each resulting third feature map is 2×2 pixels.
It should be noted that, since the operation principle of the max-pooling operation is the prior art, reference may be made to the related description of the max-pooling operation principle in the prior art, and no further description is given here.
By carrying out pooling operation on the second feature map of the target image in the spatial domain, the dimension of the second feature map in the spatial domain can be reduced while the global feature of the target image is maintained, so that the subsequent calculation cost of the terminal equipment is reduced, and the calculation capability of the terminal equipment is improved.
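A short sketch of the spatial-domain maximum pooling of S43, assuming a fixed 2×2 window for illustration (the patent states the window size may instead be learned during training):

```python
import torch
import torch.nn as nn

# Max pooling over the spatial domain with an H1 x H1 window (H1 = 2 here).
second_feature_maps = torch.rand(1, 3, 8, 8)     # one map per colour channel
pool = nn.MaxPool2d(kernel_size=2)
third_feature_maps = pool(second_feature_maps)   # global features of the target image
print(third_feature_maps.shape)                  # torch.Size([1, 3, 4, 4])
```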
After the terminal device obtains global features of the target image (namely, third feature maps corresponding to three color channels of the target image), key features of the target image can be determined in the depth estimation network based on the three third feature maps, and a target depth map of the target image can be determined based on the key features.
With continued reference to fig. 3, as shown in fig. 3, in yet another embodiment of the present application, the depth estimation network 22 may include a spatial attention network 221 and a depth information processing network 222.
The spatial attention network 221 is configured to extract key features of the target image from global features of the target image, and determine a first depth map of the target image based on the key features. The size of the first depth map is smaller than that of the target image.
The depth information processing network 222 is configured to perform up-scaling processing on the first depth map of the target image in a spatial domain, so as to obtain a target depth map of the target image.
Based on this, the key features of the target image are extracted from the global features, and the target depth map of the target image is determined according to the key features, which may specifically include S51 to S52 shown in fig. 5, and the details are as follows:
s51: the key features are determined based on all the third feature maps, and a first depth map of the target image is determined based on the key features.
The terminal device may determine key features of the target image in the spatial attention network based on all third feature maps and determine the first depth map of the target image based on the key features.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a spatial attention network according to an embodiment of the present application. As shown in fig. 6, the spatial attention network 221 may include a second pooling network 2211, a second convolution network 2212, and a logistic regression network 2213. Based on this, S51 may include S71 to S73 as shown in fig. 7, and the details are as follows:
S71: and carrying out pooling operation on the color channel domain on all the third feature images to obtain a fourth feature image of the target image.
In the embodiment of the application, the terminal equipment can carry out pooling operation on the color channel domain on all the third feature images in the second pooling network to obtain the fourth feature image of the target image. The fourth feature map is used for describing key features of the target image. It should be noted that, the fourth feature map is a one-dimensional feature map, and the dimension herein refers to the dimension in the color channel domain.
By carrying out pooling operation on the color channel domain on the three third feature images, the dimension reduction processing on the color channel domain of the three-dimensional feature images for representing the global features of the target image is realized, so that not only can the key features of the target image be obtained, but also the subsequent calculation cost of the terminal equipment can be reduced.
With continued reference to fig. 6, in one possible implementation, the second pooling network 2211 may include a maximum pooling layer, an average pooling layer, and a feature fusion layer. In view of this, S71 may specifically include S81 to S83 shown in fig. 8, and the details are as follows:
s81: and carrying out maximum pooling operation on the color channel domain on all the third feature images to obtain a sixth feature image of the target image.
S82: and carrying out average pooling operation on the color channel domain on all the third feature images to obtain a seventh feature image of the target image.
S83: determining the fourth feature map based on the sixth feature map and the seventh feature map; and the value of each pixel point in the fourth feature map is the sum of the value of the corresponding pixel point in the sixth feature map and the value of the corresponding pixel point in the seventh feature map.
In this embodiment, the terminal device may perform, at the maximum pooling layer, a maximum pooling operation on the color channel domain on all the third feature maps based on the second preset pooling window, to obtain a sixth feature map of the target image; and carrying out average pooling operation on the color channel domain on all the third feature images based on a second preset pooling window in an average pooling layer to obtain a seventh feature image of the target image. It should be noted that, the sixth feature map and the seventh feature map are both one-dimensional feature maps, and the dimension herein refers to the dimension in the color channel domain.
The size of the second preset pooling window may be expressed as H_2 × H_2, where H_2 is an integer greater than or equal to 1. The size of the second preset pooling window can be set according to actual requirements, or can be learned during the training of the depth estimation model.
Specifically, the terminal device performing, in the maximum pooling layer, the maximum pooling operation on the color channel domain on all the third feature maps based on the second preset pooling window may include: sliding the second preset pooling window synchronously over the three third feature maps. Each time the second preset pooling window slides, it covers an area of the same size as the window in each third feature map; the value of the pixel point with the largest value among all pixel points contained in the areas covered by the second preset pooling window in the three third feature maps is determined as the value of the pixel point, in the sixth feature map, corresponding to the current position of the second preset pooling window.
The terminal device performing, in the average pooling layer, the average pooling operation on the color channel domain on all the third feature maps based on the second preset pooling window may specifically include: sliding the second preset pooling window synchronously over the three third feature maps. Each time the second preset pooling window slides, the average of the values of all pixel points contained in the areas covered by the second preset pooling window in the three third feature maps is determined as the value of the pixel point, in the seventh feature map, corresponding to the current position of the second preset pooling window.
For example, as shown in fig. 9, assume that the third feature maps corresponding to the R channel, the G channel and the B channel of the target image are f_r3, f_g3 and f_b3 respectively, that the size of each of f_r3, f_g3 and f_b3 is 4×4 pixels, and that the size of the second preset pooling window is 2×2. When the second preset pooling window slides for the first time, the corresponding areas in f_r3, f_g3 and f_b3 are the first area 91; when it slides for the second time, the corresponding areas are the second area 92; when it slides for the third time, the corresponding areas are the third area 93; and when it slides for the fourth time, the corresponding areas are the fourth area 94. Taking the first slide of the second preset pooling window as an example: the first area 91 in f_r3 contains the pixels a1, a2, a5 and a6, the first area 91 in f_g3 contains the pixels b1, b2, b5 and b6, and the first area 91 in f_b3 contains the pixels c1, c2, c5 and c6. Therefore, the value of the pixel with the largest value among all the pixels a1, a2, a5, a6, b1, b2, b5, b6, c1, c2, c5 and c6 contained in the three first areas 91 (assume that this value is 1) is determined as the value of the pixel in the first row and first column of the sixth feature map, and the average of the values of those same pixels (assume that the average is 0) is determined as the value of the pixel in the first row and first column of the seventh feature map. The values of the other pixels in the sixth and seventh feature maps are obtained similarly, and are not described here again.
In the embodiment of the present application, after the terminal device obtains the sixth feature map and the seventh feature map, a summation operation may be performed on the value of each pixel point in the sixth feature map and the value of the corresponding pixel point in the seventh feature map at the feature fusion layer, and the value obtained by the summation operation is used as the value of the corresponding pixel point in the fourth feature map, that is, the value of each pixel point in the fourth feature map is the sum of the value of the corresponding pixel point in the sixth feature map and the value of the corresponding pixel point in the seventh feature map.
The fourth feature map is a one-dimensional feature map, where the dimension again refers to the dimension in the color channel domain. The fourth feature map, the sixth feature map and the seventh feature map all have the same size.
For example, if the size of the sixth feature map and the size of the seventh feature map are 2×2 pixels, the values of the first row and first column pixels, the values of the first row and second column pixels, the values of the second row and first column pixels, and the values of the second row and second column pixels in the sixth feature map are 1,3, 4, and 3, respectively, and the values of the first row and first column pixels, the values of the first row and second column pixels, the values of the second row and first column pixels, and the values of the second row and second column pixels in the seventh feature map are 0, 1,4, and 2, respectively, then the values of the first row and first column pixels, the values of the second row and first column pixels, and the values of the second row and second column pixels in the fourth feature map are 1,4, 8, and 5, respectively.
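The channel-domain pooling and fusion of S81 to S83 can be sketched as follows, reproducing the fig. 9 setting (three 4×4 third feature maps, a 2×2 second preset pooling window); the use of PyTorch and the tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

# A 2 x 2 second preset pooling window slides over the three third feature maps
# simultaneously; the max (resp. mean) of the 12 covered values gives one pixel
# of the sixth (resp. seventh) feature map, and their element-wise sum gives
# the fourth feature map describing the key features.
third_feature_maps = torch.rand(1, 3, 4, 4)   # f_r3, f_g3, f_b3 stacked as channels
window = 2                                    # H2 x H2 second preset pooling window

# Max over all channels and the window = channel-wise max followed by spatial max.
sixth_feature_map = F.max_pool2d(third_feature_maps.max(dim=1, keepdim=True).values,
                                 kernel_size=window)
# Mean over all channels and the window = channel-wise mean followed by spatial mean.
seventh_feature_map = F.avg_pool2d(third_feature_maps.mean(dim=1, keepdim=True),
                                   kernel_size=window)
fourth_feature_map = sixth_feature_map + seventh_feature_map   # feature fusion layer

print(fourth_feature_map.shape)   # torch.Size([1, 1, 2, 2])
```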
S72: and performing convolution operation on the fourth feature map based on the second convolution check to obtain a fifth feature map of the target image.
In the embodiment of the application, the terminal equipment can perform equivalent convolution operation on the fourth feature map based on the second convolution check in the second convolution network to obtain a fifth feature map with the same size as the fourth feature map, namely, the size of the fifth feature map is the same as the size of the fourth feature map. It should be noted that the equivalent convolution operation can obtain finer image features without changing the original feature map size.
The performing, by the terminal device, the equivalent convolution operation on the fourth feature map based on the second convolution kernel may specifically include: the terminal equipment firstly carries out space domain dimension-increasing processing on the fourth feature map, and then carries out convolution operation on the fourth feature map after dimension-increasing processing by adopting the second convolution check to obtain a fifth feature map. The dimension of the fourth feature map rising in the spatial domain may be determined according to the size of the fourth feature map, the size of the second convolution kernel, and the step size of the convolution operation.
By way of example and not limitation, the performing, by the terminal device, the dimension-up processing on the fourth feature map in the spatial domain may specifically include: the terminal device adds at least one column or row of pixels around the fourth feature map, and sets the added pixels to 0. For example, as shown in fig. 10, if the size of the fourth feature map is 2×2 pixels, the size of the second convolution kernel is 3×3 (not shown in the figure), and the step size of the convolution operation is 1, a column or a row of pixels may be added around the fourth feature map, and the added pixels may be set to 0, so as to obtain a feature map 101, and then the second convolution kernel feature map 101 is used to perform the convolution operation, so as to obtain a fifth feature map, where the size of the fifth feature map is also 2×2 pixels.
S73: and carrying out normalization processing on the fifth feature map to obtain the first depth map.
In the embodiment of the application, the terminal equipment can normalize the fifth feature map in the logistic regression network to obtain the first depth map of the target image.
The logistic regression network is configured with a preset activation function, which is used to map the value of each pixel point in the fifth feature map into the range [0, 1]. In a specific application, the expression of the preset activation function may be set according to actual requirements, which is not limited here.
Based on this, the normalization processing of the fifth feature map in the logistic regression network by the terminal device may specifically include: and the terminal equipment calculates a normalized value corresponding to the value of each pixel point in the fifth feature map by adopting a preset activation function in the logistic regression network, and takes the normalized value corresponding to the value of each pixel point in the fifth feature map as the value of the corresponding pixel point in the first depth map, so as to obtain the first depth map.
The size of the first depth map is the same as the size of the fifth feature map. The value range of the normalized value corresponding to the value of each pixel point in the fifth feature map is [0,1], namely the value range of the value of each pixel point in the first depth map is [0,1].
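A short sketch of S72 and S73, in which the equivalent convolution keeps the 2×2 size of the fourth feature map by zero-padding, and the preset activation function is assumed to be a sigmoid (the patent leaves its exact expression open); the 3×3 kernel is also an assumption.

```python
import torch
import torch.nn as nn

fourth_feature_map = torch.rand(1, 1, 2, 2)

# Equivalent convolution: zero-padding keeps the 2 x 2 spatial size unchanged.
second_conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
fifth_feature_map = second_conv(fourth_feature_map)

# Normalization: map every value of the fifth feature map into [0, 1].
first_depth_map = torch.sigmoid(fifth_feature_map)
print(first_depth_map.shape)   # torch.Size([1, 1, 2, 2])
```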
S52: and carrying out dimension ascending processing on the first depth map in a space domain to obtain a target depth map of the target image.
In the embodiment of the present application, since the size of the first depth map is smaller than the size of the target image, if the target depth map with the same size as the target image is to be obtained, the first depth map needs to be subjected to the dimension-up processing in the spatial domain. It should be noted that, the value of each pixel point in the target depth map is used to describe the scene depth corresponding to the corresponding pixel point in the target image.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a depth information processing network according to an embodiment of the application. As shown in fig. 11, in one possible implementation, the depth information processing network 222 may include at least one set (three sets are illustrated in fig. 11) of a third convolution layer 2221 and an upsampling convolution layer 2222.
Based on this, S52 may specifically include the steps of:
performing a convolution operation on the first depth map using a third convolution kernel to obtain a second depth map of the target image;
and performing an up-sampling convolution operation on the second depth map using a fourth convolution kernel to obtain the target depth map.
In the embodiment of the application, after the terminal device obtains the first depth map of the target image, it may use the third convolution kernel to perform a convolution operation on the first depth map in the third convolution layer to obtain the second depth map of the target image, and use the fourth convolution kernel to perform an up-sampling convolution operation on the second depth map in the up-sampling convolution layer to obtain the target depth map of the target image.
In one possible implementation, when the depth information processing network includes only one set of the third convolution layer and the up-sampling convolution layer, the terminal device may use the third convolution kernel in the third convolution layer to perform the equivalent convolution operation on the first depth map to obtain a second depth map of the target image, and use the fourth convolution kernel in the up-sampling convolution layer to perform the up-sampling convolution operation on the second depth map to obtain the target depth map. The size of the second depth map is the same as the size of the first depth map, i.e. the size of the second depth map is smaller than the size of the target depth map.
It should be noted that the specific process of the equivalent convolution operation may refer to the related description in the embodiment corresponding to fig. 10, and is not repeated here. Since the up-sampling convolution operation follows well-known principles in the prior art, reference may be made to existing descriptions of that operation, and it is not described further here.
In another possible implementation, when the depth information processing network includes at least two sets of third convolution layers and up-sampling convolution layers, the terminal device may use the third convolution kernel in the first set's third convolution layer to perform the equivalent convolution operation on the first depth map to obtain a second depth map of the target image, and use the fourth convolution kernel in the first set's up-sampling convolution layer to perform the up-sampling convolution operation on the second depth map to obtain a depth map larger than the second depth map. In each set after the first, the terminal device uses the current set's third convolution layer to perform the equivalent convolution operation on the depth map output by the previous set's up-sampling convolution layer, and then uses the fourth convolution kernel in the current set's up-sampling convolution layer to perform the up-sampling convolution operation on the result, obtaining a depth map larger than the one output by the previous set's up-sampling convolution layer. Proceeding in this way, the depth map output by the last set's up-sampling convolution layer is the target depth map of the target image.
It should be noted that the parameters of the third convolution kernels differ from set to set, and the parameters of the fourth convolution kernels likewise differ from set to set. A sketch of this stacked structure is given below.
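The sketch below illustrates one possible reading of this stacked structure. The use of PyTorch, the transposed convolution standing in for the up-sampling convolution, the single-channel maps, the kernel sizes, and the 28×28/224×224 sizes are all assumptions for illustration, not the patented implementation:

import torch
import torch.nn as nn

class DepthUpscaler(nn.Module):
    """Each set applies a size-preserving convolution (third kernel) followed by
    an up-sampling convolution (fourth kernel) that doubles the spatial size."""

    def __init__(self, num_sets: int = 3):
        super().__init__()
        layers = []
        for _ in range(num_sets):
            layers += [
                # Third convolution layer: 3x3 kernel with padding 1 keeps the size.
                nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1),
                # Up-sampling convolution layer: transposed convolution, doubles H and W.
                nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1),
            ]
        # Each set instantiates fresh layers, so its kernel parameters differ from the others.
        self.net = nn.Sequential(*layers)

    def forward(self, first_depth_map: torch.Tensor) -> torch.Tensor:
        return self.net(first_depth_map)

# Three sets upscale a 28x28 first depth map by 2^3 = 8x to 224x224.
model = DepthUpscaler(num_sets=3)
target_depth = model(torch.rand(1, 1, 28, 28))
print(target_depth.shape)  # torch.Size([1, 1, 224, 224])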
As can be seen from the foregoing, the image depth estimation method provided by the embodiment of the present application acquires a target image, inputs the target image into a preset depth estimation model, extracts global features of the target image, extracts key features of the target image from the global features, and determines a target depth map of the target image according to the key features. The key features describe the information of the pixel points contained in the non-blank area of the target image, where the non-blank area refers to the area formed by non-white and non-black pixel points; the value of each pixel point in the target depth map describes how far the photographed object corresponding to that pixel point is from the camera. Because the photographed objects corresponding to the pixel points in the blank area of the target image are usually infinitely far from the camera, using the information of those pixel points for image depth estimation increases the computational cost and degrades the estimation accuracy. Since the key features in this scheme describe only the information of the pixel points in the non-blank area, the information of the blank-area pixel points is not considered when determining the target depth map; the target depth map is determined solely from the key features of the target image, which improves the estimation accuracy of image depth estimation and reduces its computational cost.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Based on the image depth estimation method provided by the foregoing embodiments, an embodiment of the present application further provides a terminal device for implementing the above method embodiments.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a terminal device according to an embodiment of the present application. In this embodiment, each unit included in the terminal device is used to execute the steps in the embodiments corresponding to figs. 1 to 11; for details, please refer to figs. 1 to 11 and the related descriptions in their corresponding embodiments. For convenience of explanation, only the portions related to the present embodiment are shown. As shown in fig. 12, the terminal device 120 includes: a first acquisition unit 121 and a first processing unit 122. Wherein:
the first acquisition unit 121 is configured to acquire a target image.
The first processing unit 122 is configured to input the target image into a preset depth estimation model, extract global features of the target image, extract key features of the target image from the global features, and determine a target depth map of the target image according to the key features; the key features are used for describing information of the pixel points contained in a non-blank area of the target image, where the non-blank area refers to an area formed by non-white and non-black pixel points; the value of each pixel point in the target depth map is used for describing how far the photographed object corresponding to that pixel point is from the camera.
Optionally, the target image is a three primary color RGB image; the first processing unit 122 may specifically include: a first convolution unit, a first residual unit and a first pooling unit. Wherein:
The first convolution unit is configured to perform a convolution operation on the target image using a first convolution kernel to obtain first feature maps respectively corresponding to the three color channels of the target image.
The first residual unit is configured to process each first feature map using at least one preset residual block to obtain a second feature map corresponding to each first feature map.
The first pooling unit is configured to perform a maximum pooling operation in the spatial domain on each second feature map to obtain third feature maps corresponding to the three color channels of the target image; all the third feature maps together describe the global features of the target image. A sketch of this feature extraction path is given below.
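The sketch below gives one possible reading of this feature extraction path (first convolution, residual blocks, spatial max pooling). The residual block design, kernel sizes, pooling stride, and the PyTorch framework are assumptions, not the network prescribed by the patent:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """An assumed residual block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.body(x))

class GlobalFeatureExtractor(nn.Module):
    """First convolution -> residual blocks -> spatial max pooling."""
    def __init__(self, blocks: int = 2):
        super().__init__()
        self.first_conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)                  # first feature maps
        self.res_blocks = nn.Sequential(*[ResidualBlock(3) for _ in range(blocks)])  # second feature maps
        self.spatial_pool = nn.MaxPool2d(kernel_size=2, stride=2)                    # third feature maps

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.spatial_pool(self.res_blocks(self.first_conv(rgb)))

features = GlobalFeatureExtractor()(torch.rand(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 3, 112, 112]) -- one third feature map per color channel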
Optionally, the first processing unit 122 may further include: and a key feature determining unit and a depth information determining unit. Wherein:
The key feature determining unit is used for determining the key features based on all the third feature maps and determining a first depth map of the target image based on the key features; the size of the first depth map is smaller than that of the target image.
The depth information determining unit is used for carrying out dimension ascending processing on the first depth map in a space domain to obtain a target depth map of the target image; the size of the target depth map is the same as that of the target image.
Optionally, the key feature determining unit specifically includes: the device comprises a second pooling unit, a second convolution unit and a normalization unit. Wherein:
The second pooling unit is configured to perform a pooling operation in the color channel domain on all the third feature maps to obtain a fourth feature map of the target image; the fourth feature map is used for describing the key features of the target image.
The second convolution unit is configured to perform a convolution operation on the fourth feature map based on a second convolution kernel to obtain a fifth feature map of the target image; the size of the fifth feature map is the same as the size of the fourth feature map.
And the normalization unit is used for performing normalization processing on the fifth feature map to obtain the first depth map.
Optionally, the second pooling unit is specifically configured to:
performing a maximum pooling operation in the color channel domain on all the third feature maps to obtain a sixth feature map of the target image;
performing an average pooling operation in the color channel domain on all the third feature maps to obtain a seventh feature map of the target image;
determining the fourth feature map based on the sixth feature map and the seventh feature map; the value of each pixel point in the fourth feature map is the sum of the value of the corresponding pixel point in the sixth feature map and the value of the corresponding pixel point in the seventh feature map (a sketch of this step is given below).
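A minimal sketch of this channel-domain pooling and summation, assuming PyTorch tensors with the color channel as the second dimension; the 4×4 spatial size and random values are illustrative:

import torch

# All third feature maps stacked along the color channel dimension
# (batch of 1, 3 channels, 4x4 spatial size -- illustrative).
third_maps = torch.rand(1, 3, 4, 4)

# Maximum pooling over the color channel domain -> sixth feature map (single channel).
sixth = third_maps.amax(dim=1, keepdim=True)

# Average pooling over the color channel domain -> seventh feature map (single channel).
seventh = third_maps.mean(dim=1, keepdim=True)

# Fourth feature map: element-wise sum of the two pooled maps.
fourth = sixth + seventh
print(fourth.shape)  # torch.Size([1, 1, 4, 4])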
Optionally, the depth information determining unit may specifically include: a third convolution unit and a fourth convolution unit. Wherein:
The third convolution unit is configured to perform a convolution operation on the first depth map using a third convolution kernel to obtain a second depth map of the target image; the size of the second depth map is smaller than that of the target image.
The fourth convolution unit is configured to perform an up-sampling convolution operation on the second depth map using a fourth convolution kernel to obtain the target depth map.
Optionally, the first convolution unit is specifically configured to:
performing a convolution operation on the target image in a depthwise separable convolution manner using a first convolution kernel to obtain first feature maps respectively corresponding to the three color channels of the target image (see the sketch below).
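The sketch below shows a generic depthwise separable convolution applied to an RGB target image as one way to read this step; the kernel sizes, the optional 1×1 pointwise stage, and the PyTorch API are assumptions rather than the patent's exact first convolution kernel:

import torch
import torch.nn as nn

# Depthwise step (groups=3): each color channel of the target image is filtered
# independently, which keeps a one-to-one correspondence between the three color
# channels and the resulting feature maps.
depthwise = nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3)

# Pointwise 1x1 step usually completing a depthwise separable convolution; kept here
# with a matching channel count (illustrative).
pointwise = nn.Conv2d(3, 3, kernel_size=1)

rgb = torch.rand(1, 3, 224, 224)            # illustrative target image
first_feature_maps = pointwise(depthwise(rgb))
print(first_feature_maps.shape)             # torch.Size([1, 3, 224, 224])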
It should be noted that, because the information interaction and execution processes between the above modules are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.
Fig. 13 is a schematic structural diagram of a terminal device according to another embodiment of the present application. As shown in fig. 13, the terminal device 13 provided in this embodiment includes: a processor 130, a memory 131, and a computer program 132, such as an image depth estimation program, stored in the memory 131 and executable on the processor 130. When the processor 130 executes the computer program 132, the steps of the image depth estimation method embodiments described above are implemented, for example S11 to S14 shown in fig. 1. Alternatively, when the processor 130 executes the computer program 132, the functions of the modules/units in the above terminal device embodiments are implemented, for example the functions of the units 121 to 122 shown in fig. 12.
Illustratively, the computer program 132 may be divided into one or more modules/units, which are stored in the memory 131 and executed by the processor 130 to carry out the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution of the computer program 132 in the terminal device 13. For example, the computer program 132 may be divided into a first acquisition unit and a first processing unit; the specific functions of each unit are described in the embodiment corresponding to fig. 12 and are not repeated here.
The terminal device may include, but is not limited to, the processor 130 and the memory 131. It will be appreciated by those skilled in the art that fig. 13 is merely an example of the terminal device 13 and does not constitute a limitation of the terminal device 13, which may include more or fewer components than illustrated, combine certain components, or use different components; for example, the terminal device may further include an input-output device, a network access device, a bus, and the like.
The processor 130 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 131 may be an internal storage unit of the terminal device 13, for example a hard disk or memory of the terminal device 13. The memory 131 may also be an external storage device of the terminal device 13, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 13. Further, the memory 131 may include both an internal storage unit and an external storage device of the terminal device 13. The memory 131 is used for storing the computer program and other programs and data required by the terminal device. The memory 131 may also be used to temporarily store data that has been output or is to be output.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 14, fig. 14 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present application. As shown in fig. 14, a computer program 141 is stored in the computer readable storage medium 14, and the computer program 141, when executed by a processor, implements the image depth estimation method described above.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to implement the image depth estimation method described above.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated by way of example; in practical applications, the above functions may be allocated to different functional units and modules as required, that is, the internal structure of the terminal device may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for distinguishing them from each other and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (9)

1. An image depth estimation method, comprising:
Acquiring a target image; the target image is a three primary color RGB image;
Inputting the target image into a preset depth estimation model, and performing a convolution operation on the target image using a first convolution kernel to obtain first feature maps respectively corresponding to the three color channels of the target image; processing each first feature map using at least one preset residual block to obtain a second feature map corresponding to each first feature map; performing a maximum pooling operation in the spatial domain on each second feature map to obtain third feature maps corresponding to the three color channels of the target image, wherein all the third feature maps are used for describing global features of the target image; extracting key features of the target image from the global features; and determining a target depth map of the target image according to the key features;
The global features are represented by a three-dimensional feature map, the key features are represented by a one-dimensional feature map, and the dimensions of the three-dimensional feature map and the one-dimensional feature map refer to dimensions in the color channel domain; the key features are used for describing information of the pixel points contained in a non-blank area of the target image, wherein the non-blank area refers to an area formed by non-white and non-black pixel points; the value of each pixel point in the target depth map is used for describing how far the photographed object corresponding to that pixel point is from the camera.
2. The method of claim 1, wherein the step of extracting key features of the target image from the global features, and determining a target depth map of the target image from the key features comprises:
determining the key features based on all the third feature maps, and determining a first depth map of the target image based on the key features; the size of the first depth map is smaller than that of the target image;
performing dimension ascending processing on the first depth map in a spatial domain to obtain a target depth map of the target image; the size of the target depth map is the same as that of the target image.
3. The method of claim 2, wherein the step of determining the key features based on all of the third feature maps and determining the first depth map of the target image based on the key features comprises:
performing a pooling operation in the color channel domain on all the third feature maps to obtain a fourth feature map of the target image; wherein the fourth feature map is used for describing the key features of the target image;
performing a convolution operation on the fourth feature map based on a second convolution kernel to obtain a fifth feature map of the target image; wherein the size of the fifth feature map is the same as the size of the fourth feature map;
and carrying out normalization processing on the fifth feature map to obtain the first depth map.
4. A method according to claim 3, wherein the step of pooling all of the third feature maps over a color channel domain to obtain a fourth feature map of the target image comprises:
performing a maximum pooling operation in the color channel domain on all the third feature maps to obtain a sixth feature map of the target image;
performing an average pooling operation in the color channel domain on all the third feature maps to obtain a seventh feature map of the target image;
Determining the fourth feature map based on the sixth feature map and the seventh feature map; and the value of each pixel point in the fourth feature map is the sum of the value of the corresponding pixel point in the sixth feature map and the value of the corresponding pixel point in the seventh feature map.
5. The method of claim 2, wherein the step of performing spatial up-scaling on the first depth map to obtain the target depth map of the target image comprises:
performing a convolution operation on the first depth map using a third convolution kernel to obtain a second depth map of the target image; wherein the size of the second depth map is smaller than the size of the target image;
and performing an up-sampling convolution operation on the second depth map using a fourth convolution kernel to obtain the target depth map.
6. The method according to any one of claims 1 to 5, wherein the step of convolving the target image with a first convolution kernel to obtain first feature maps corresponding to three color channels of the target image respectively includes:
performing a convolution operation on the target image in a depthwise separable convolution manner using a first convolution kernel to obtain first feature maps respectively corresponding to the three color channels of the target image.
7. A terminal device, characterized by comprising:
a first acquisition unit configured to acquire a target image; the target image is a three primary color RGB image;
The first processing unit is configured to input the target image into a preset depth estimation model, and perform a convolution operation on the target image using a first convolution kernel to obtain first feature maps respectively corresponding to the three color channels of the target image; process each first feature map using at least one preset residual block to obtain a second feature map corresponding to each first feature map; perform a maximum pooling operation in the spatial domain on each second feature map to obtain third feature maps corresponding to the three color channels of the target image, wherein all the third feature maps are used for describing global features of the target image; extract key features of the target image from the global features; and determine a target depth map of the target image according to the key features;
The global features are represented by a three-dimensional feature map, the key features are represented by a one-dimensional feature map, and the dimensions of the three-dimensional feature map and the one-dimensional feature map refer to dimensions in the color channel domain; the key features are used for describing information of the pixel points contained in a non-blank area of the target image, wherein the non-blank area refers to an area formed by non-white and non-black pixel points; the value of each pixel point in the target depth map is used for describing how far the photographed object corresponding to that pixel point is from the camera.
8. A terminal device, characterized in that it comprises a processor, a memory and a computer program stored in the memory and executable on the processor, which processor, when executing the computer program, implements the method according to any of claims 1 to 6.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 6.
CN202010863390.0A 2020-08-25 2020-08-25 Image depth estimation method, terminal equipment and computer readable storage medium Active CN112070817B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010863390.0A CN112070817B (en) 2020-08-25 2020-08-25 Image depth estimation method, terminal equipment and computer readable storage medium
PCT/CN2020/129897 WO2022041506A1 (en) 2020-08-25 2020-11-18 Image depth estimation method, terminal device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010863390.0A CN112070817B (en) 2020-08-25 2020-08-25 Image depth estimation method, terminal equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112070817A CN112070817A (en) 2020-12-11
CN112070817B (en) 2024-05-28

Family

ID=73659360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010863390.0A Active CN112070817B (en) 2020-08-25 2020-08-25 Image depth estimation method, terminal equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112070817B (en)
WO (1) WO2022041506A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708569A (en) * 2012-05-15 2012-10-03 东华大学 Monocular infrared image depth estimating method on basis of SVM (Support Vector Machine) model
CN107562877A (en) * 2017-09-01 2018-01-09 北京搜狗科技发展有限公司 Display methods, device and the device shown for view data of view data
CN109035319A (en) * 2018-07-27 2018-12-18 深圳市商汤科技有限公司 Monocular image depth estimation method and device, equipment, program and storage medium
CN109087349A (en) * 2018-07-18 2018-12-25 亮风台(上海)信息科技有限公司 A kind of monocular depth estimation method, device, terminal and storage medium
CN109741394A (en) * 2018-12-10 2019-05-10 北京拓尔思信息技术股份有限公司 Image processing method, device, electronic equipment and storage medium
CN109889724A (en) * 2019-01-30 2019-06-14 北京达佳互联信息技术有限公司 Image weakening method, device, electronic equipment and readable storage medium storing program for executing

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509294B (en) * 2011-11-08 2013-09-25 清华大学深圳研究生院 Single-image-based global depth estimation method
KR102468897B1 (en) * 2017-10-16 2022-11-21 삼성전자주식회사 Method and apparatus of estimating depth value
US11164326B2 (en) * 2018-12-18 2021-11-02 Samsung Electronics Co., Ltd. Method and apparatus for calculating depth map
CN110348543B (en) * 2019-06-10 2023-01-06 腾讯医疗健康(深圳)有限公司 Fundus image recognition method and device, computer equipment and storage medium
CN110334628B (en) * 2019-06-26 2021-07-27 华中科技大学 Outdoor monocular image depth estimation method based on structured random forest

Also Published As

Publication number Publication date
CN112070817A (en) 2020-12-11
WO2022041506A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
US12008797B2 (en) Image segmentation method and image processing apparatus
US12031842B2 (en) Method and apparatus for binocular ranging
WO2019223382A1 (en) Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN110363817B (en) Target pose estimation method, electronic device, and medium
CN107735797B (en) Method for determining a movement between a first coordinate system and a second coordinate system
CN111508013B (en) Stereo matching method
US11887346B2 (en) Systems and methods for image feature extraction
Chen et al. 3D neighborhood convolution: Learning depth-aware features for RGB-D and RGB semantic segmentation
US11295159B2 (en) Method of extracting features from image, method of matching images using the same and method of processing images using the same
CN112435193B (en) Method and device for denoising point cloud data, storage medium and electronic equipment
CN112767478B (en) Appearance guidance-based six-degree-of-freedom pose estimation method
CN111753739B (en) Object detection method, device, equipment and storage medium
US20230020713A1 (en) Image processing system and method
CN110378250A (en) Training method, device and the terminal device of neural network for scene cognition
CN113298870A (en) Object posture tracking method and device, terminal equipment and storage medium
KR20190060679A (en) Apparatus and method for learning pose of a moving object
CN114170290A (en) Image processing method and related equipment
CN112070817B (en) Image depth estimation method, terminal equipment and computer readable storage medium
US20220335732A1 (en) Method and system for recognizing surrounding driving environment based on svm original image
CN115965531A (en) Model training method, image generation method, device, equipment and storage medium
KR101920159B1 (en) Stereo Matching Method and Device using Support point interpolation
CN113192060B (en) Image segmentation method and device, electronic equipment and storage medium
CN116917954A (en) Image detection method and device and electronic equipment
JP2018010359A (en) Information processor, information processing method, and program
CN112446230A (en) Method and device for recognizing lane line image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant