WO2023245321A1

WO2023245321A1 - Image depth prediction method and apparatus, device, and storage medium

Info

Publication number: WO2023245321A1
Application number: PCT/CN2022/099713
Authority: WO
Inventors: 倪鹏程; 张亚森; 苏海军; 陈凌颖
Original assignee: 北京小米移动软件有限公司; 北京小米松果电子有限公司
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2023-12-28
Also published as: CN117616457A

Abstract

The present invention relates to an image depth prediction method and apparatus, a device, and a storage medium. The method comprises: obtaining an image to be processed; and inputting the image to be processed into a deep network model to predict the image depth of the image to be processed, wherein the deep network model consists of multiple depthwise separable convolution layers, and the deep network model is trained by means of a target loss function determined on the basis of at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter. The deep network model used in the present invention has depthwise separable convolutions; thus, deployment of the deep network model on a terminal device is implemented, the running time is effectively reduced, and the situation where large models take a long time to run on terminal devices and do not adapt to terminal devices is avoided.

Description

An image depth prediction method, device, equipment and storage medium

Technical field

The present disclosure relates to the field of computer vision technology, and in particular, to an image depth prediction method, device, equipment and storage medium.

Background technique

With the rapid development of science and technology, terminal equipment has become an indispensable item in people's daily lives. In some cases, the terminal device needs to estimate the depth of a picture to complete subsequent tasks. For example, when people take pictures, they often want to achieve the blur produced by the camera when taking pictures. However, the lens on the mobile phone cannot be compared with the camera lens. At this time, the mobile phone needs to estimate the depth of the picture taken, so that the areas of different depths in the picture can be blurred accordingly. Or the vehicle-mounted device can realize autonomous driving situation perception by acquiring images and performing depth estimation.

In related technologies, the depth estimation scheme used usually uses a network with a large number of parameters and a large amount of calculation. Such solutions are often deployed on large-scale computing equipment, such as large-scale service equipment, service clusters, etc. Obviously, such a solution cannot be adapted to terminal equipment.

Contents of the invention

In order to overcome problems existing in related technologies, the present disclosure provides an image depth prediction method, device, equipment and storage medium.

According to a first aspect of an embodiment of the present disclosure, an image depth prediction method is provided. The method includes: acquiring an image to be processed; inputting the image to be processed into a deep network model, and predicting the image depth of the image to be processed. The deep network model is composed of multiple It consists of layer depth separable convolutions; among them, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of the error weight parameter, depth gradient error, and depth structure loss parameter; the error weight parameter is used to characterize the estimation The weight of the difference between depth and label depth. The depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth. The depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image; the estimated depth is the depth of the training stage. The network model is determined based on the training images, and the label depth corresponds to the training images.

In one implementation, the target loss function is determined in the following manner: the target loss function is determined based on at least one of the first loss function, the second loss function, and the third loss function; wherein the first loss function is based on the error weight The parameters are determined, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameters.

In one implementation, the first loss function is determined in the following manner: based on the estimated depth and the label depth, the absolute value of the error between the estimated depth and the label depth is determined; based on the absolute value of the error and the error weight parameter, the first loss function is determined .

In one embodiment, the depth gradient error is determined in the following manner: based on the estimated depth and a preset gradient function, the estimated depth gradient in at least two directions is determined; based on the label depth and the gradient function, the estimated depth gradient in at least two directions is determined. the label depth gradient; determine the depth gradient error in at least two directions according to the estimated depth gradient in at least two directions and the label depth gradient in at least two directions; the second loss function is determined in the following way: Depth gradient error, determine the second loss function.

In one implementation, the depth structure loss parameters are determined in the following manner: multiple training data groups are determined, wherein the training data group contains at least two pixels, at least two pixels are pixels of the training image, and the label depth includes Training labels corresponding to at least two pixels in the training image; for each training data group, determine the depth structure loss parameters based on the depth of the labels corresponding to at least two pixels; the third loss function is determined in the following way: for each training The data set determines the third loss function based on the depth structure loss parameter and the estimated depth corresponding to at least two pixel points.

In one implementation, determining multiple training data groups includes: using a preset gradient function to determine the gradient boundary of the training image; sampling the training image according to the gradient boundary to determine multiple training data groups.

In one implementation, the training images are determined in the following manner: determine at least one training image set in any scene, and the images in the training image set are images after mask processing for the set area; At least one training image set determines the training image.

According to a second aspect of the embodiment of the present disclosure, an image depth prediction device is provided. The device includes: an acquisition module for acquiring an image to be processed; a prediction module for inputting the image to be processed into a deep network model and predicting the image to be processed. The image depth of the image, the deep network model is composed of multi-layer depth separable convolutions; among them, the deep network model is trained using a target loss function, and the target loss function is based on at least one of the error weight parameter, the depth gradient error, and the depth structure loss parameter. The term is determined; the error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize the weight corresponding to different positions in the training image. Label depth difference; the estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.

According to a third aspect of an embodiment of the present disclosure, an image depth prediction device is provided, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to perform the first aspect or the first aspect method in any embodiment.

According to a fourth aspect of an embodiment of the present disclosure, a non-transitory computer-readable storage medium is provided, which when instructions in the storage medium are executed by a processor of a computer, enables the computer to execute the first aspect or any one of the first aspects. methods described in the embodiments.

The technical solution provided by the embodiments of the present disclosure can include the following beneficial effects: the adopted deep network model has depth-separable convolution, thereby realizing the deployment of the deep network model on the terminal device, effectively saving the running time, and avoiding the need for large-scale models. It takes a long time to run on the terminal device and is not suitable.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.

Description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

Figure 1 is a schematic diagram of a scene according to an exemplary embodiment.

Figure 2 is a schematic diagram of a blurred image based on depth prediction according to an exemplary embodiment.

Figure 3 is a schematic diagram of a fully supervised depth estimation process according to an exemplary embodiment.

Figure 4 is a schematic structural diagram of a deep network model according to an exemplary embodiment.

Figure 5 is a schematic structural diagram of a semantic perception unit according to an exemplary embodiment.

Figure 6 is a flow chart of an image depth prediction method according to an exemplary embodiment.

Figure 7 is a flow chart of a deep network model training method according to an exemplary embodiment.

Figure 8 is a flow chart of a method for determining a target loss function according to an exemplary embodiment.

Figure 9 is a flow chart of a method for determining a first loss function according to an exemplary embodiment.

Figure 10 is a flow chart of a method for determining a second loss function according to an exemplary embodiment.

Figure 11 is a flow chart of a method for determining a third loss function according to an exemplary embodiment.

Figure 12 is a schematic diagram of a depth map according to an exemplary embodiment.

Figure 13 is a schematic diagram of depth structure sampling according to an exemplary embodiment.

Figure 14 is a schematic diagram of random sampling according to an exemplary embodiment.

Figure 15 is a schematic diagram of an image depth prediction device according to an exemplary embodiment.

Figure 16 is a schematic diagram of an image depth prediction device according to an exemplary embodiment.

Figure 17 is a schematic diagram of a deep network model training device according to an exemplary embodiment.

Detailed ways

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure.

The methods involved in the present disclosure can be applied to scenes of depth prediction and estimation of images. For example, in the scene shown in Figure 1, users take selfies with their mobile phones and often hope that their characters in the images will be clearer, while the background can be Perform appropriate blurring. Obviously, what the user wants is to process areas with different depths in the image accordingly. For example, areas with smaller depth, that is, the character area, can remain clear, while areas with greater depth, such as the background area, can be blurred accordingly. deal with.

However, usually, the clarity of the image at the shooting location is related to the unique design of the lens. The focal plane of the captured image is clear but the sides are blurred. However, due to the limitations of the lens design on mobile phones, the blurring effect of images captured by mobile phone lenses cannot achieve the blurring effect of images captured by cameras. At this time, some mobile phones can perform depth prediction on different areas in the image and layer the image based on the predicted depth information as guidance information. The mobile phone can use different blur convolution kernels to blur different image layers, and finally fuse the blur results of different layers to obtain the final blurred image.

For example, FIG. 2 is a schematic diagram of a blurred image based on depth prediction according to an exemplary embodiment. It can be seen that, for example, users use mobile phones to take selfies in certain complex environments. At this time, it is necessary to perform depth prediction on the image captured by the mobile phone (ie, the leftmost image in Figure 2), and then the image can be layered based on the predicted depth information. For different layers, blur convolutions of different layers can be used for blurring, and finally the blurring results of different layers are fused to obtain the final blurred image (i.e., the rightmost image in Figure 2).

Under normal circumstances, it can be considered that users use mobile phones to take pictures and use monocular cameras to take pictures.

In some related solutions, depth prediction solutions based on monocular cameras can use self-supervised depth estimation. It can be understood that depth prediction can also be called depth estimation. For example, a depth estimation network and a relative pose estimation network can be constructed, and continuous video frames are used as network input. During the training process, the depth map and relative pose relationship of these frames can be estimated through calculation. Then, the mutual mapping relationship between the 3D (dimension) and 2D scenes can be used to minimize the photometric reconstruction error, thereby optimizing the depth map and relative pose relationship. In actual use, only the trained depth estimation network is used to predict the depth of the input video and obtain the corresponding depth map. For example, monocular depth (MonoDepth) can be used as a basis to develop prediction of depth information from video sequences.

Of course, in other related solutions, fully supervised depth estimation can be adopted, such as building a deep network that uses paired color pictures and depth pictures (i.e. labels) as input. By using the loss function to update the network parameters using the depth map predicted from the color image and the depth image as the label, the trained deep network is obtained. For example, taking Big to Small (BTS) as a representative, input paired color images and depth images (i.e. labels) to complete the network training task. During the training process, the method can be to input a color picture, and obtain the depth estimation result through the feature encoder, feature aggregation network, and feature decoder. Then the depth estimation result will be compared with the real data (ground truth, GT) depth to calculate the structural similarity index (SSIM) loss, and the parameters will be updated. During use, as shown in Figure 3, a color picture can be input, and then the feature map F of the input picture can be obtained after convolution. Based on this feature map F, we can perform convolution from different dimensions through operations such as full-image encoder, convolution, and atrous spatial pyramid pooling (ASPP) to obtain different dimensions. deep features. Then, the depth features obtained by convolution of different dimensions can be superimposed and convolved again, such as convolution x in Figure 3. After that, the feature y obtained by convolution x can be used to determine the depth map of each different level using ordinal regression. For example, the depth map of different depth intervals can be determined. For example, the depth interval can be divided into L">l ₀ , L”>l ₁ ,…, L”>l _k-1 and so on. In order to finally superpose the depth maps corresponding to different depth intervals to obtain the final output depth map.

In some related solutions, monocular cameras and structured light can also be used. Among them, structured light can be, for example, laser radar, millimeter wave radar, etc. The monocular camera and structured light can be fixed using a rigid fixation method. Then, the monocular camera and structured light can be calibrated in the early stage, and then the depth result of the corresponding pixel position can be obtained through the triangulation ranging principle.

Of course, the specific implementation processes of the above different related solutions can be implemented with reference to existing methods, and will not be described in detail in this disclosure.

However, in the related solutions mentioned above, in order to improve the effect of depth prediction as much as possible, networks with large parameters and calculations are usually selected, and operators that are not suitable for handheld terminal devices are used, which will make There are great difficulties in deploying and running deep networks on handheld terminal devices.

Therefore, the present disclosure provides an image depth prediction method that uses a deep network model composed of multi-layer depth-separable convolutions to perform depth prediction on input images. Among them, the deep network model is trained using a target loss function, which is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter. The error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth. The depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth. The depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image. . The estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image. The deep network model used in this disclosure has depth-separable convolution, thereby realizing the deployment of the deep network model on the terminal device, effectively saving the running time, and avoiding the long-running and unsuitable situation of large-scale models on the terminal device. .

Next, solutions involved in the present disclosure will be described in detail with reference to the accompanying drawings.

Figure 4 is a schematic structural diagram of a deep network model according to an exemplary embodiment. It can be seen that the structure of the deep network model may include at least one encoder (encoder), a semantic perception unit and at least one decoder (decoder). It can be understood that the network structure can also be a trained deep network model. Of course, it can also be an untrained deep network model. In other words, the structure of an untrained deep network model is the same as that of a trained deep network model, and training does not change the model structure. Among them, the semantic perception unit can include multiple layers of depth-separable convolutions. Therefore, compared with related technologies, the computing efficiency of ASPP will be higher and the redundancy of the network structure will be smaller. In some examples, the semantic perception unit of the present disclosure can also be considered as a more efficient ASPP, for example, it can be called efficient atrous spatial pyramid pooling (EASPP).

It is worth noting that in this disclosure, the trained deep network model can be called a deep network model, and the untrained deep network model can be called an initial deep model.

In some examples, the encoder and decoder can be composed of depthwise separable convolutions. Also, EASPP can also include multiple layers of depthwise separable convolutions. The number of encoders and decoders can be the same and correspond one to one. For example, in some cases there may be 5 encoders and therefore 5 decoders. For example, Encoder0, Encoder1, Encoder2, Encoder3, and Encoder4, and Decoder0, Decoder1, Decoder2, Decoder3, and Decoder4. Among them, encoder 0 corresponds to decoder 0, encoder 1 corresponds to decoder 1, encoder 2 corresponds to decoder 2, encoder 3 corresponds to decoder 3, and encoder 4 corresponds to decoder 4. correspond. In some examples, depth-separable convolutions may not be included for the first encoder and the first encoder's corresponding decoder, while the remaining encoders and corresponding decoders may include depth-separable convolutions.

In some examples, the image to be processed can be input into a deep network model, for example, the image to be processed can be input into at least one encoder to extract depth feature data of the image to be processed. In some examples, a deep network model may include 5 encoders and 5 decoders. In other examples, the number of encoders and decoders can be more or less according to the actual situation, and can be set arbitrarily according to the actual situation, which is not limited by this disclosure. It can be understood that when the number of encoders and decoders is more than 5, it may reduce the model running speed and increase the size of the model. When the number of encoders and decoders is less than 5, the depth of feature extraction from the input image is not enough, causing deep-level feature information to be ignored, resulting in poor model prediction effects. Therefore, this disclosure will be described using the example of a deep network model including 5 encoders and 5 decoders.

In some examples, each input may be a single image to be processed. The image to be processed may be a color image, for example. First, the training image can be input into encoder 0. Encoder 0 can include two convolutional layers. The convolution kernel size of each convolutional layer can be 3*3. When the training image passes through the two convolutional layers, After convolution, depth features with a channel dimension of 16 can be output. As can be seen from Figure 4, the depth feature data output by encoder 0 can be input to the next encoder, that is, encoder 1. Of course, it can be seen that the depth feature data output by encoder 0 can also be input to the corresponding decoder 0. It can be understood that when there are multiple encoders, the first encoder may not have depth-separable convolutions, while the remaining encoders have depth-separable convolutions. Of course, in some examples, all encoders may have depth-separable convolutions, and the details may be adjusted according to actual conditions, and are not specifically limited in this disclosure.

For encoder 1, downsampling layers and depthwise separable convolutional layers can be included. Among them, the downsampling layer can use maximum pooling (max polling, MaxPool) 2D for downsampling. Among them, 2D is represented by two dimensions: width and height during image pooling. Of course, in some examples, pooling can be performed by a 2×2 matrix size when performing pooling. In some examples, Encoder 1 takes the output of Encoder 0 as input, and after MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 32. It can be seen that the depth feature data output by encoder 1 can be input to the next encoder, that is, encoder 2. Of course, it can also be input to the corresponding decoder 1. Encoder 2, Encoder 3, and Encoder 4 are similar to Encoder 1. For example, Encoder 2 may include a downsampling layer and a depth-separable convolutional layer, and the downsampling layer may use MaxPool2D for downsampling. Encoder 2 takes the output of Encoder 1 as input. After MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 64. It can be seen that the depth feature data output by encoder 2 can be input to the next encoder, that is, encoder 3; it can also be input to the corresponding decoder 2. Encoder 3 may include a downsampling layer and a depth-separable convolutional layer, and the downsampling layer may use MaxPool2D for downsampling. Encoder 3 takes the output of Encoder 2 as input. After MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 128. It can be seen that the depth feature data output by encoder 3 can be input to the next encoder, that is, encoder 4; it can also be input to the corresponding decoder 3. Encoder 4 may include a downsampling layer and a depth-separable convolutional layer, and the downsampling layer may use MaxPool2D for downsampling. Encoder 4 takes the output of Encoder 3 as input. After MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 192. It can be seen that the depth feature data output by the encoder 4 can be input to the semantic perception unit, that is, EASPP; it can also be input to the corresponding decoder 4. It can be understood that the channel dimensions of the depth feature data output by each encoder can be obtained by dimensionally upgrading the traditional convolutional layer included in each encoder. The specific number of dimensions can be adjusted arbitrarily according to the actual situation, and is not limited in this disclosure.

It can be understood that in some examples, no matter how many encoders are used, the depth feature data output by the last encoder will be input to the semantic perception unit.

The semantic perception unit can perform semantic extraction based on the output of the last encoder and output deep semantic data containing rich semantic information. For example, as shown in Figure 5, a possible semantic perception unit structure is shown. It can be seen that at least one depthwise separable convolution and fusion layer can be included in the semantic perception unit. Among them, depthwise separable convolution has an expansion coefficient. The expansion coefficients corresponding to different depth-separable convolutions in the semantic sensing unit can be the same or different. The expansion coefficient of each depth-separable convolution can be adjusted arbitrarily according to the actual situation, and this disclosure does not limit it.

In some examples, 5 depthwise separable convolutions may be included. Among them, depth-separable convolution 1 takes the depth feature data output by the last encoder (ie, encoder 4 in Figure 4) as input. The expansion coefficient of depth-separable convolution 1 can be 3. After depth-separable convolution 1 performs semantic extraction on the depth feature data output by the last encoder, the obtained depth semantic data is transferred to the next depth-separable convolution. That is, depthwise separable convolution 2, and the resulting depth semantic data can also be transferred to the fusion layer. Depthwise separable convolution 2 fuses the output of depthwise separable convolution 1 with the depth feature data output by the last encoder as input. The expansion coefficient of depthwise separable convolution 2 can be 6, and the depthwise separable convolution 2 can have an expansion coefficient of 6. After Product 2 performs semantic extraction on the fused data, the obtained depth semantic data is transferred to the next depth-separable convolution, namely depth-separable convolution 3, and the obtained depth semantic data can also be transferred to the fusion layer. Depthwise separable convolution 3 uses the output of depthwise separable convolution 2 and the input of depthwise separable convolution 2 as input at the same time. The expansion coefficient of depthwise separable convolution 3 can be 12, and depthwise separable convolution 3 has a After the data is semantically extracted, the obtained depth semantic data is transferred to the next depthwise separable convolution, namely depthwise separable convolution 4, and the obtained depth semantic data can also be transferred to the fusion layer. Depthwise separable convolution 4 and depthwise separable convolution 5 are similar to depthwise separable convolution 3. Depthwise separable convolution 4 uses the output of depthwise separable convolution 3 and the input of depthwise separable convolution 2 as input at the same time. , the expansion coefficient of depthwise separable convolution 4 can be 18. After depthwise separable convolution 4 performs semantic extraction on the input data, the obtained depth semantic data is transferred to the next depthwise separable convolution, that is, depthwise separable convolution. Product 5, and the resulting deep semantic data can also be transferred to the fusion layer. Depthwise separable convolution 5 uses the output of depthwise separable convolution 4 and the input of depthwise separable convolution 2 as input at the same time. The expansion coefficient of depthwise separable convolution 5 can be 24, and the depthwise separable convolution 5 has After semantic extraction of the data, the obtained deep semantic data is input to the fusion layer. The fusion layer fuses the output of each depth-separable convolution and the depth feature data output by the last encoder (ie, encoder 4 in Figure 4) to determine the final depth semantic data and serve as the output of the semantic perception unit . The semantic sensing unit inputs the determined final depth semantic data into the decoder corresponding to the last encoder, which may be decoder 4 in Figure 4, for example. It can be understood that the input of other depth-separable convolutions in the semantic perception unit except the first depth-separable convolution not only includes the output of the previous depth-separable convolution, but also includes the first depth-separable convolution. The output of the convolution, or the output of the last encoder, ensures that richer semantic information can be extracted. The fusion layer in Figure 5 can represent a concat operation of the channel dimensions of each depthwise separable convolution output data. It can be understood that the concat represents a concat operation on the same channel dimension, that is to say, the data processing in the semantic perception unit does not change the output channel dimension.

Continuing back to Figure 4, decoder 4 will perform decoding based on the output of the semantic perception unit and the output of encoder 4. Among them, decoder 4 can set stride to 8, and stride is a hyperparameter of decoder 4. Decoder 4 will output the feature map with stride=8. Among them, the channel dimension of this feature map is 128. Decoder 4 transmits this feature map to the next decoder, decoder 3. Decoder 3 will decode based on the output of decoder 4 and the output of encoder 3. Among them, decoder 3 can set stride to 4. Decoder 3 will output the feature map with stride=4. Among them, the channel dimension of this feature map is 64. Decoder 3 transmits this feature map to the next decoder, Decoder 2. Decoder 2 will decode based on the output of Decoder 3 and the output of Encoder 2. Among them, decoder 2 can set stride to 2. Decoder 2 will output the feature map with stride=2. Among them, the channel dimension of this feature map is 32. Decoder 2 transmits this feature map to the next decoder, Decoder 1. Decoder 1 will decode based on the output of Decoder 2 and the output of Encoder 1. Among them, decoder 1 can set stride to 1. Decoder 1 will output the feature map with stride=1. Among them, the channel dimension of this feature map is 16. Decoder 1 transmits this feature map to the next decoder, decoder 0.

Decoder 0 will decode based on the output of decoder 1 and the output of encoder 0. Decoder 1 will output the depth estimation result of the input training image, that is, the depth estimation data corresponding to the training image. The depth estimation data may be a depth map with a channel dimension of 1.

It can be understood that depthwise separable convolution can also be used in the decoder, corresponding to the encoder. Therefore, in some examples, the last decoder (i.e., the decoder corresponding to the first encoder) may also not contain depthwise separable convolutions. The structure of the decoder can be similar to that of the encoder and can be used to perform the reverse operation of the encoder.

After that, the last decoder in the deep network model (ie, decoder 0 in Figure 4) can output the image depth of the image to be trained.

In some examples, for deep network models in the training stage, that is, the initial deep model. Its input can be training images and the estimated depth is obtained. It can be understood that the estimated depth represents the image depth of the training image output after the training image passes through the initial depth model. It can be understood that in the training stage, each training image can correspond to a label depth, and the label depth is used to represent the real image depth of the corresponding training image. Then during each training, the initial depth model can calculate the loss function based on the estimated depth and label depth, and adjust each parameter in the initial depth model based on the loss function. For example, stochastic gradient descent (SGD) can be used for gradient backpropagation to update the parameters in the initial depth model.

When the loss function converges, the trained deep network model can be obtained. It can be understood that the data processing process in the training phase is similar to that in the application phase. Therefore, for specific data processing in the training phase, reference can be made to the corresponding description of data processing in the application phase, which will not be described in detail in this disclosure.

The deep network model in this application uses depth-separable convolution, so it can adapt to the scene of the terminal device and be perfectly deployed on the terminal device, so that the terminal device can perform depth estimation of the image based on the trained deep network model.

Next, the above process will be described in more detail with reference to the accompanying drawings.

Figure 6 is a flow chart of an image depth prediction method according to an exemplary embodiment. As shown in Figure 6, this method can be run on the terminal device. Among them, terminal equipment can also be called terminal, user equipment (User Equipment, UE), mobile station (Mobile Station, MS), mobile terminal (Mobile Terminal, MT), etc. It is a type of device that provides voice and/or data connectivity to users. For example, the terminal device can be a handheld device with wireless connection function, a vehicle-mounted device, etc. Currently, some examples of terminal devices include smartphones (Mobile Phone), pocket computers (Pocket Personal Computer, PPC), PDAs, personal digital assistants (Personal Digital Assistant, PDA), notebook computers, tablet computers, wearable devices, or Vehicle equipment, etc. In addition, when it is a vehicle-to-everything (V2X) communication system, the terminal device may also be a vehicle-mounted device. It should be understood that the embodiments of the present disclosure do not limit the specific technology and specific equipment form used by the terminal equipment.

It can be understood that the deep network model involved in this method can adopt the network structure described in Figure 4 and Figure 5 above. In some examples, the present disclosure can use a monocular vision system to implement a 2D image depth estimation task, the input of which is only a single color picture and the output is a depth map represented by grayscale values. In some examples, this method can also be extended to tasks such as computational photography and autonomous driving situation awareness.

Therefore, the method may include the following steps:

In step S11, the image to be processed is obtained.

In some examples, the terminal device may obtain an image to be processed that requires depth prediction. The image to be processed can be obtained from other devices through the network, or can be obtained by photographing the terminal device, or can be pre-stored on the terminal device, which is not limited by this disclosure.

In some examples, the network may adopt Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time Division Multiple Access (TDMA), frequency Frequency Division Multiple Access (FDMA), Orthogonal Frequency-Division Multiple Access (OFDMA), Single Carrier Frequency Division Multiple Access (Single Carrier FDMA, SC-FDMA), Carrier Sense Multiple Access Access/conflict avoidance (Carrier Sense Multiple Access with Collision Avoidance) and other methods are implemented. According to different network capacity, speed, delay and other factors, the network can be divided into 2G (English: Generation) network, 3G network, 4G network or future evolution network, such as the fifth generation wireless communication system (The 5th Generation Wireless Communication System, 5G) network, 5G network can also be called New Radio (NR).

In step S12, the image to be processed is input to the deep network model, and the image depth of the image to be processed is predicted.

In some examples, the image to be processed obtained in S12 can be input into the deep network model to obtain the predicted image depth of the image to be processed.

Among them, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of the error weight parameter, the depth gradient error, and the depth structure loss parameter. The error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth; the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth; the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image .

The deep network model used in this disclosure has depth-separable convolution, thereby realizing the deployment of the deep network model on the terminal device, effectively saving the running time, and avoiding the long-running and unsuitable situation of large-scale models on the terminal device. .

In some embodiments, for the deep network model involved in Figures 4 to 6, the training process can be as shown in Figure 7, for example. Figure 7 is a deep network model training according to an exemplary embodiment. Method flowchart. This method can be run on the service device. In some examples, the service device may be a server or server cluster. Of course, in other examples, it can also be a server or server cluster running on a virtual machine.

Therefore, the method may include the following steps:

In step S21, a preconfigured training image set is obtained. The training image set includes training images and label depths corresponding to the training images.

In some examples, the service device may obtain a preconfigured set of training images. The training image set may be pre-stored on the service device. Or the training image set can be stored in a database, and the service device obtains the training image set by connecting with the corresponding data. Among them, the training image set includes training images and the label depth corresponding to the training images.

In some examples, a large amount of dense depth data can be obtained through network data collection, optical flow estimation, binocular stereo matching, and depth estimation teacher model prediction methods to generate training image sets. It can be understood that the dense representation can be obtained through the above method to determine the corresponding label depth for each pixel in the training image.

In some examples, the images in the training image set may be images that have been masked for a set area. For example, the setting area can be the background area of the image, such as the sky area, ocean area, etc. Taking the setting area as the sky area as an example, for training images involving outdoor scenes, such training images usually include the sky. The color of some parts of the sky, clouds, etc. may have a corresponding impact on depth estimation, such as incorrect estimation of the depth of clouds. Therefore, a pre-trained sky segmentation model can also be used to segment the sky area to obtain the sky mask S _mask . Afterwards, S _mask can be used to process the image containing the sky to mark the sky area with the corresponding label depth. For example, you can set the depth of the S _mask processing area in the image to the maximum value, indicating the furthest depth. By annotating the sky part in the training image, the accuracy of estimating the depth of the sky area during depth prediction is improved. For another example, when binocular stereo matching is used to obtain a training image set, there may be some invalid areas. Therefore, the effective area mask can be used to process the image obtained by binocular stereo matching to obtain a processed image. V _mask can be used to represent the effective area mask. After the image is processed through V _mask , the resulting images are all valid area images, thus effectively preventing pixels not in the area (ie, pixels in the invalid area) from participating in corresponding calculations in the subsequent training process, such as loss calculations. It can be understood that the image processed by V _mask can also be processed by S _mask . In other words, the effective area image may include annotated background areas. Afterwards, the masked image can be used as a training image for subsequent training.

In some cases, because current depth prediction solutions usually focus on a relatively fixed scene, the performance of the trained model will significantly decrease when leaving the scene, that is, the generalization ability is very poor. Therefore, the acquired training image set can include training images in any scene, so that the generalization ability of the trained deep network model is significantly improved.

Therefore, in some examples, different training image sets can be divided according to different times and different collection methods, such as {Data ₁ , Data ₂ ,..., Data _n }. Data _n represents the nth training image set. Afterwards, different sampling weights can be set for different training image sets based on the amount of data contained in each training image set. It can be, for example, as shown in Equation 1,

Among them, p _j represents the sampling weight of the j-th training image set during training. N() represents the counting statistics of samples in the training image set, which can be understood as the amount of data in the training image set.

In some examples, the training image and the label depth corresponding to the training image can be obtained from the corresponding training image set based on the sampling weight calculated by Formula 1. This ensures that the training process will not be biased towards a certain type of training images, making the trained deep network model extremely capable of generalization and can be applied to depth estimation of a variety of different scene images.

In some examples, the images in the training image set can also be changed through data augmentation methods such as horizontal flipping of images, random cropping, and color changes, thereby expanding the amount of data in the data set to meet training needs.

In step S22, the training image is input to the initial depth model, and the estimated depth corresponding to the training image is determined.

In some examples, the service device can input the training images in the training image set obtained in S21 into an initial depth model composed of depth-separable convolutions for training, and obtain the estimated depth corresponding to the training images. In some examples, the training images in each training image set obtained based on the sampling weight may be sequentially input into the initial depth model for training.

In some examples, the training images of the initial depth model are input, and the output results can be recorded as p. Afterwards, p can be range-cropped and then multiplied by 255 to obtain the estimated depth corresponding to the training image. Among them, the estimated depth can be recorded as D _pred =clip(p, 0, 1)*255. Among them, clip() means cropping. It can be understood that the output result can be recorded as p, which is usually a value between 0 and 1. In order to more conveniently observe the output depth, the relative depth result D of each pixel can be obtained by cropping and multiplying by 255. The relative depth result D can be used as the estimated depth. The specific calculation process can be implemented with reference to existing methods, and will not be described in detail in this disclosure.

In step S23, a loss function is used to adjust the initial depth model based on the estimated depth and the label depth.

In some examples, the service device may calculate the loss function based on the estimated depth obtained in S22 and the label depth corresponding to the corresponding training image. And adjust the initial depth model based on the calculated loss function.

In some examples, SGD can be used to adjust the initial depth model, that is, update the corresponding parameters in the initial depth model. It is understandable that the hyperparameters in the model will not be adjusted or updated during the training process.

In step S24, until the loss function converges, the trained deep network model is obtained.

In some examples, when the loss function in S23 converges, the trained deep network model can be obtained. It can be understood that since the deep network model is composed of depth-separable convolutions, it is suitable for deployment on terminal devices, so that the terminal devices can perform depth estimation of images based on the deep network model.

The deep network model obtained by training in this disclosure has depth separable convolution, thereby realizing the deployment of the deep network model on the terminal device, effectively saving the running time, and avoiding the long and unsuitable problems of large models running on the terminal device. Condition.

In some embodiments, such as shown in Figure 8, the loss functions involved in Figures 6 and 7 can be confirmed through the following steps:

In step S31, the target loss function is determined based on at least one of the first loss function, the second loss function, and the third loss function.

In some examples, the target loss function for adjusting the initial depth model may be determined based on at least one of the first loss function, the second loss function, and the third loss function. The first loss function can be determined based on the error weight parameter, the second loss function can be determined based on the depth gradient error, and the third loss function can be determined based on the depth structure loss parameter.

The present disclosure can be combined with multiple loss functions to train a deep network model, thereby ensuring that the trained deep network model can more accurately identify image depth.

In some embodiments, as shown in Figure 9, the first loss function involved in S31 can be determined through the following steps:

In step S41, the absolute value of the error is determined based on the estimated depth and the tag depth.

In some examples, the absolute value of the error can be determined based on the estimated depth predicted by the initial depth model and the label depth corresponding to the corresponding training image. For example, it can be recorded as abs(D _pred -D _target ). Among them, abs() represents the absolute value, D _pred represents the estimated depth predicted by the initial depth model, and D _target represents the label depth corresponding to the corresponding training image.

In step S42, the first loss function is determined based on the absolute value of the error and the error weight parameter.

In some examples, the first loss function may be determined based on the error absolute value determined in S21 and the error weight parameter W determined in S22.

Among them, the error weight parameter W can be determined based on the estimated depth and label depth. For example, the error weight parameter W can be determined based on the estimated depth predicted by the initial depth model and the label depth corresponding to the corresponding training image. For example, W can be calculated by formula 2.

W=pow(D _pred -D _target , 2)...Formula 2

Among them, pow(D _pred -D _target , 2) is expressed as the calculation of the 2nd power of D _pred -D _target . It can be understood that this weight value can impose greater loss weight on those areas where the prediction error deviation is relatively large. Therefore, formula 2 can also be considered to express a focal attribute.

Therefore, in some examples, after obtaining the absolute value of the error and the error weight parameter W, the first loss function can be calculated through Formula 3 and Formula 4.

L ₁ =V _mask ·abs(D _pred -D _target )...Formula 4

in,

Represented as the first loss function, which can also be called focal-L ₁ loss function, α and β are preset weight coefficients. It can be seen that the L ₁ loss function can be calculated using the absolute value of the error, and W·L ₁ represents the L ₁ loss function with focal attributes. In some examples, when calculating the L ₁ loss function, it can be calculated based on an effective area mask or sky mask. For example, V _mask indicates whether each pixel in the effective area is effective. For example, 0 and 1 can be used to distinguish effective and invalid. . abs(D _pred -D _target ) represents the absolute value of the error of each pixel in the same area. Of course, in some examples, calculation can also be performed pixel by pixel, which is not limited by this disclosure.

Among them, the specific calculation of the L ₁ loss function can be implemented by referring to existing methods, and will not be described in detail in this disclosure.

This disclosure introduces a weight attribute when determining the first loss function, so as to assign a higher loss weight to the location with a large prediction error deviation, so that the initial depth model can be better trained, making the recognition results of the trained deep network model more accurate. for accuracy.

In order to ensure that the depth results obtained in the depth estimation scheme have clearer outlines, reduce noise, improve performance capabilities, and ensure that they can meet scenes with high depth detail requirements in later tasks, such as computational photography and other scenes. Therefore, the present disclosure can also adjust the initial depth model through one or more of the following loss functions, such as the second loss function and the third loss function.

In some embodiments, the label depth may include label depth gradients in at least two directions. For example, as shown in Figure 10, the second loss function involved in S31 can be determined through the following steps:

In step S51, estimated depth gradients in at least two directions are determined based on the estimated depth and using a preset gradient function.

In some examples, the estimated depth predicted according to the initial depth model can be brought into a preset gradient function to determine the estimated depth gradient in at least two directions. In some examples, two different directions can usually be selected. Of course, in other examples, more directions or fewer directions can be selected, which is not limited by this disclosure.

In one example, the directions may be x and y directions. To determine the estimated depth gradient in the x and y directions, it can be calculated by Formula 5 and Formula 6.

It can be understood that D can be D _pred , then the D _x in the x direction calculated through Formula 5 can be expressed as

And D _y in the y direction calculated through formula 6 can be expressed as

in,

Represented as convolution on the data using the Sobel operator. in,

and

That is the matrix more commonly used by the sobel operator. Of course, the matrix can be adjusted according to the actual situation, so that gradient data in more directions can be obtained.

It can be understood that the gradients involved in Figure 10 have different meanings from the gradients used in SGD for training. The gradients involved in SGD refer to the gradient of the change of the loss function. The gradient involved in Figure 10 refers to the gradient of depth changes in different areas in the training image.

In step S52, label depth gradients in at least two directions are determined according to the label depth and the gradient function.

In some examples, the label depth may include label depth gradients in at least two directions. Wherein, at least two directions involved in the label depth gradient are the same direction as at least two directions involved in the estimated depth gradient. Of course, in other examples, the label depth gradient in the x direction can also be calculated based on Formula 5 and Formula 6 respectively.

And calculate the label depth gradient in the y direction

In step S53, depth gradient errors in at least two directions are determined based on estimated depth gradients in at least two directions and label depth gradients in at least two directions.

In some examples, depth gradient errors in at least two directions may be determined based on the estimated depth gradients in at least two directions determined in S51 and the tag depth gradients in at least two directions determined in S52. For example, based on the x direction, you can

and

The depth gradient error in the x direction is calculated, recorded as

In the same way, based on the y direction, it can be based on

and

The depth gradient error in the y direction is calculated, recorded as

In step S54, a second loss function is determined based on depth gradient errors in at least two directions.

In some examples, the second loss function may be determined based on the depth gradient errors in at least two directions determined in S53. Among them, the second loss function can also be called the gradient loss function.

In some examples, the second loss function can be calculated by formula 7,

Among them, L _grad represents the second loss function. In some examples, when calculating L _grad , it can be calculated based on an effective area mask or sky mask, that is, combining whether each pixel in the effective area mask or sky mask is valid, and the area mask or sky mask The absolute value of the gradient of all pixels in the x direction, and the absolute value of the gradient error of all pixels in the y direction. Of course, in some examples, calculation can also be performed pixel by pixel, which is not limited by this disclosure.

The present disclosure also introduces gradient loss, so that the trained deep network model can more accurately identify the gradient of the image, thereby better identifying the boundaries of different depths.

In some embodiments, such as shown in Figure 11, the third loss function involved in S31 can be determined through the following steps:

In step S61, multiple training data groups are determined for the training images.

In some examples, the service device may determine multiple training data groups for each training image. Each training data group may be composed of at least two pixels in the training image. Therefore, the label depth data of the training image may include depth label data corresponding to the corresponding pixel point.

In some embodiments, determining multiple training data groups may include: using a preset gradient function to determine the gradient boundary of the training image. Then, multiple training data groups are determined by sampling on the training images according to the gradient boundaries. It can be understood that the process of determining multiple training data groups may be referred to as pair sampling.

In some examples, pair sampling is performed based on gradient boundaries, which can also be called deep structure sampling. For example, the gradient of the training image can be calculated based on the sobel operator to determine which areas have large gradient changes. For example, the gradient threshold can be set in advance. When the gradient difference at different locations is greater than or equal to the gradient threshold, the depth gradient can be considered to have changed significantly, and the corresponding gradient boundary can be determined. It can be understood that for the second loss function described in Figure 10, the accuracy of this part of the gradient boundary calculation can be effectively adjusted. For example, as shown in Figure 11, assuming that the depth map corresponding to the training image is the depth map shown in Figure 12, the schematic diagram of pair sampling based on the gradient boundary can be as shown in Figure 13. It can be seen that each pixel point sampled by the depth structure is located around the gradient boundary. In some examples, each training data set can be composed of two pixels in the training image. Then you can collect a pixel on one side of the gradient boundary, and then collect a pixel on the gradient boundary to form a training data group. Or you can also collect a pixel on one side of the gradient boundary and then collect a pixel on the other side of the gradient boundary to form a training data group. In the above manner, multiple training data groups can be determined.

The present disclosure determines the training data group based on the gradient boundary, which can make the distinction between different depths more accurate in the subsequent training process, thereby making the recognition results of the trained deep network model more accurate.

In some embodiments, determining multiple training data groups may also include: randomly sampling on the training images to determine multiple training data groups.

In some examples, in order to avoid training more on the gradient boundary part during the training process, the slowly changing part of the gradient is ignored. Therefore, the present disclosure can also perform random pair sampling on the training image, that is, random sampling. Multiple training groups are thereby determined. For example, as shown in Figure 14, it can be seen that the multiple training data groups obtained by random sampling are evenly distributed, thus effectively retaining pixels corresponding to areas with slow gradient changes. The present disclosure determines the training data group based on the gradient boundary, which can make the distinction between different depths more accurate in the subsequent training process, thereby making the recognition results of the trained deep network model more accurate.

Return to Figure 11. After S41 is executed, the following steps can be continued.

In step S62, for each training data group, determine the depth structure loss parameter based on the label depth corresponding to at least two pixels.

In some examples, the service device can determine the depth structure loss parameter for each training data group based on the label depth corresponding to at least two pixels in the training data group. In some examples, if the training data set includes two pixels, the depth structure loss parameter can be determined based on the label depth corresponding to the two pixels in the training data set.

Among them, the depth structure loss parameter can be recorded as ρ. For example, ρ can be calculated using Equation 8.

Among them, a and b respectively represent two pixels in the training data group.

In step S63, for each training data group, a third loss function is determined based on the depth structure loss parameter and the estimated depth corresponding to at least two pixels.

In some examples, the service device may determine the third loss function based on the depth structure loss parameter ρ determined in S62 and the estimated depth corresponding to at least two pixels included in the training data group. Among them, the third loss function can also be called the depth map structure sampling loss function. For example, the third loss function can be calculated through Formula 9.

Among them, L _pair represents the third loss function,

expressed as e

Power. It can be seen that in some examples, the third loss function uses ranking loss (ranking loss) to calculate the loss function.

The present disclosure can also determine the third loss function through a training group of two pixels as a group, so that in the subsequent training process, different depth areas in the image can be sampled more accurately, so as to enable the recognition of the deep network model obtained by training. The results are more accurate.

In some examples, the target loss function can be determined based on at least one of the above-mentioned first loss function, second loss function, and third loss function. For example, Formula 10 provides a method for determining the target loss function.

Among them, L _total represents the target loss function, the parameter λ ₁ is the weighting coefficient of the first loss function, λ ₂ is the weighting coefficient of the second loss function, and λ ₃ is the weighting coefficient of the third loss function. It can be understood that the specific values of λ ₁ , λ ₂ and λ ₃ can be set according to actual conditions, and are not limited in this disclosure. Obviously, the specific values of λ ₁ , λ ₂ and λ ₃ can be adjusted to use one or more of the first loss function, the second loss function and the third loss function to obtain the target loss function to train the deep network Model.

It can be understood that the first loss function, the second loss function and the third loss function have different purposes for adjusting the model. The more loss functions used when determining the target loss function, the higher the recognition accuracy of the deep network model trained using the target loss function.

In some embodiments, for the input image input to the deep network model, for example, the input image may be an image to be processed or a training image, which may include: normalizing the input image to obtain a normalized input image. Then, the normalized input image is fed into the deep network model.

In some examples, for the normalization of the input image, the mean and variance of the color channels of the input image can be combined, for example, as shown in Equation 11.

I _norm =(Im)/v…Formula 11

Among them, m represents the mean, v represents the variance, I represents the input image, and I _norm represents the normalized input image. In some cases, the color channels of I can be arranged according to BGR. It can be understood that B represents the blue channel, G represents the green channel, and R represents the red channel. In some examples, m can take (0.485, 0.456, 0.506), and v can take (0.229, 0.224, 0.225). It can be understood that the above m and v are only an exemplary description of one value. In other examples, any value can be set according to the actual situation, and this disclosure does not limit it.

In some examples, during the training phase, the depth of the labels corresponding to the training images can also be normalized. Since training image sets come from a wide range of sources, the depth level (scale) and transformation (shift) of the label depth corresponding to the training images in different training image sets may be inconsistent. Therefore, the depth of the labels corresponding to the training images can be normalized. Of course, it is understandable that this method can also normalize the estimated depth predicted by the initial depth model.

For example, the median label depth can be calculated by Equation 12.

D _t =median(D)…Formula 12

Among them, D _t represents the median value of D. D represents the label depth corresponding to each pixel in the effective area.

Then, the mean value after removing the median value in the label depth can be determined by Equation 13.

Among them, M represents the number of effective pixels in the effective area V _mask . It is understood that V _mask can also be replaced by S _mask .

Afterwards, the normalized label depth can be determined based on D _t determined in Equation 12 and D _s in Equation 13. That is, as shown in Equation 14

The present disclosure can also normalize the images of the input model, ensuring that the training can be completed in the same dimension during the training process, thereby making the recognition results of the trained deep network model more accurate.

In some embodiments, during the actual training process, color pictures in jpg image format may be used as training pictures. At the same time, depth images in png image format can be used as depth label data corresponding to training images. For sky segmentation pictures, png picture format can be used. If training is performed based on pair training groups, in some examples, a training data group of about 800,000, that is, a pair group, can be used.

The deep network model used in this disclosure has depth separable convolution, thereby realizing the deployment of the deep network model on the terminal device, effectively saving the running time, and avoiding the long and unsuitable running time of large models on the terminal device. Condition.

It should be noted that those skilled in the art can understand that the various implementations/embodiments mentioned above in the embodiments of the present disclosure can be used in conjunction with the foregoing embodiments or can be used independently. Whether used alone or in conjunction with the foregoing embodiments, the implementation principles are similar. In the implementation of the present disclosure, some embodiments are described in terms of implementations used together. Of course, those skilled in the art can understand that such illustrations do not limit the embodiments of the present disclosure.

Based on the same concept, embodiments of the present disclosure also provide an image depth prediction device.

It can be understood that, in order to implement the above functions, the image depth prediction device provided by the embodiments of the present disclosure includes hardware structures and/or software modules corresponding to each function. Combined with the units and algorithm steps of each example disclosed in the embodiments of the present disclosure, the embodiments of the present disclosure can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered to go beyond the scope of the technical solutions of the embodiments of the present disclosure.

Figure 15 is a schematic diagram of a deep network model training device according to an exemplary embodiment. Referring to Figure 15, the device 100 may include: an acquisition module 101, used to acquire an image to be processed; a prediction module 102, used to input the image to be processed into a deep network model, and predict the image depth of the image to be processed. , the deep network model is composed of multi-layer depth separable convolutions; among them, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of the error weight parameter, depth gradient error, and depth structure loss parameter; the error The weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image; The estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.

In a possible implementation, the target loss function is determined in the following manner: the target loss function is determined based on at least one of the first loss function, the second loss function, and the third loss function; wherein the first loss function is based on The error weight parameter is determined, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.

The present disclosure can be combined with multiple loss functions to train a deep network model, thereby ensuring that the trained deep network model can more accurately identify the depth of the image.

In a possible implementation, the first loss function is determined in the following manner: based on the estimated depth and the label depth, the absolute value of the error between the estimated depth and the label depth is determined; based on the absolute value of the error and the error weight parameter, the first loss function is determined. loss function.

In a possible implementation, the depth gradient error is determined in the following manner: based on the estimated depth and a preset gradient function, determine the estimated depth gradient in at least two directions; based on the label depth and the gradient function, determine at least two direction; determine the depth gradient error in at least two directions according to the estimated depth gradient in at least two directions and the label depth gradient in at least two directions; the second loss function is determined in the following way: based on at least two directions The depth gradient error in each direction determines the second loss function.

In a possible implementation, the depth structure loss parameters are determined in the following manner: multiple training data groups are determined, wherein the training data groups include at least two pixels, and the at least two pixels are pixels of the training image, and the label The depth includes training labels corresponding to at least two pixels in the training image; for each training data group, the depth structure loss parameters are determined based on the depth of the labels corresponding to at least two pixels; the third loss function is determined in the following way: for each A training data group is formed, and a third loss function is determined based on the depth structure loss parameter and the estimated depth corresponding to at least two pixels.

In a possible implementation, determining multiple training data groups includes: using a preset gradient function to determine the gradient boundary of the training image; sampling the training image according to the gradient boundary to determine multiple training data groups.

In a possible implementation, the training images are determined in the following manner: determine at least one training image set in any scene, and the images in the training image set are images after mask processing for the set area; according to the preconfigured sampling The weights determine training images from at least one training image set.

By acquiring training images in multiple scenarios, the present disclosure enables the trained deep network model to have stronger generalization capabilities and can achieve depth prediction in any scenario. And by masking the set area, the processed image is more conducive to model training and improves the prediction accuracy after model training.

Regarding the device 100 in the above embodiments, the specific manner in which each module performs operations has been described in detail in the relevant method embodiments, and will not be described in detail here.

FIG. 16 is a block diagram of an image depth prediction device 200 according to an exemplary embodiment. For example, the device 200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, a RedCap terminal and other terminal devices.

Referring to Figure 16, device 200 may include one or more of the following components: processing component 202, memory 204, power component 206, multimedia component 208, audio component 210, input/output (I/O) interface 212, sensor component 214, and Communication component 216.

Processing component 202 generally controls the overall operations of device 200, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 202 may include one or more processors 220 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 202 may include one or more modules that facilitate interaction between processing component 202 and other components. For example, processing component 202 may include a multimedia module to facilitate interaction between multimedia component 208 and processing component 202.

Memory 204 is configured to store various types of data to support operations at device 200 . Examples of such data include instructions for any application or method operating on device 200, contact data, phonebook data, messages, pictures, videos, etc. Memory 204 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EEPROM), Programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

Power component 206 provides power to various components of device 200 . Power components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 200 .

Multimedia component 208 includes a screen that provides an output interface between the device 200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action. In some embodiments, multimedia component 208 includes a front-facing camera and/or a rear-facing camera. When the device 200 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front-facing camera and rear-facing camera can be a fixed optical lens system or have a focal length and optical zoom capabilities.

Audio component 210 is configured to output and/or input audio signals. For example, audio component 210 includes a microphone (MIC) configured to receive external audio signals when device 200 is in operating modes, such as call mode, recording mode, and speech recognition mode. The received audio signals may be further stored in memory 204 or sent via communications component 216 . In some embodiments, audio component 210 also includes a speaker for outputting audio signals.

The I/O interface 212 provides an interface between the processing component 202 and a peripheral interface module, which may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.

Sensor component 214 includes one or more sensors for providing various aspects of status assessment for device 200 . For example, the sensor component 214 can detect the open/closed state of the device 200, the relative positioning of components, such as the display and keypad of the device 200, and the sensor component 214 can also detect a change in position of the device 200 or a component of the device 200. , the presence or absence of user contact with device 200 , device 200 orientation or acceleration/deceleration and temperature changes of device 200 . Sensor assembly 214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 216 is configured to facilitate wired or wireless communications between device 200 and other devices. Device 200 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G or 5G, or a combination thereof. In one exemplary embodiment, the communication component 216 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 216 also includes a near field communications (NFC) module to facilitate short-range communications. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, device 200 may be configured by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable Gate array (FPGA), controller, microcontroller, microprocessor or other electronic components are implemented for executing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions, such as a memory 204 including instructions, which can be executed by the processor 220 of the device 200 to complete the above method is also provided. For example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

Figure 17 is a schematic diagram of a deep network model training device 300 according to an exemplary embodiment. For example, device 300 may be provided as a server. It can be understood that the device 300 can be used to implement training of a deep network model. Referring to FIG. 17 , device 300 includes processing component 322 , which further includes one or more processors, and memory resources, represented by memory 332 , for storing instructions, such as application programs, executable by processing component 322 . The application program stored in memory 332 may include one or more modules, each of which corresponds to a set of instructions. In addition, the processing component 322 is configured to execute instructions to perform the process method of training the deep network model in the above method.

Device 300 may also include a power supply component 326 configured to perform power management of device 300, a wired or wireless network interface 350 configured to connect device 300 to a network, and an input-output (I/O) interface 358. Device 300 may operate based on an operating system stored in memory 332, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

The depth estimation network comprehensively designed in this disclosure is lightweight and highly accurate, and can be deployed in handheld device scenarios.

Furthermore, a large number of deep data sets are collected, and different sampling weights are set for these different numbers of data sets to balance the data distribution. Using normalization operations on data of different scales and shifts allows training to be completed in the same dimension. At the same time, an efficient semantic information acquisition module EASPP is designed in the depth estimation network to capture semantic information with a smaller amount of parameters and calculations. The present disclosure also uses a multi-dimensional depth loss function to improve the quality of depth prediction at object edges.

The above-mentioned solution of the present disclosure solves the problem that the depth estimation model takes a long time in the handheld device scene, and can be deployed on the mobile phone. The floating-point model can reach 150ms on the Qualcomm 7325 platform. At the same time, the present disclosure can achieve depth estimation in any scenario and has stronger generalization capabilities. The present disclosure can also accurately predict the depth of portrait edges, green plant boundaries, etc., and meet the requirements for depth estimation results when shooting blurred scenes.

It can be further understood that “plurality” in this disclosure refers to two or more, and other quantifiers are similar. "And/or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and/or B can mean: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the related objects are in an "or" relationship. The singular forms "a", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It is further understood that the terms "first", "second", etc. are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other and do not imply a specific order or importance. In fact, expressions such as "first" and "second" can be used interchangeably. For example, without departing from the scope of the present disclosure, the first information may also be called second information, and similarly, the second information may also be called first information.

It will be further understood that although the operations are described in a specific order in the drawings in the embodiments of the present disclosure, this should not be understood as requiring that these operations be performed in the specific order shown or in a serial order, or that it is required that Perform all operations shown to obtain the desired results. In certain circumstances, multitasking and parallel processing may be advantageous.

Other embodiments of the disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include common knowledge or customary technical means in the technical field that are not disclosed in the disclosure. .

It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the disclosure is limited only by the appended rights.

Claims

An image depth prediction method, characterized in that the method includes:

Get the image to be processed;

Input the image to be processed into a deep network model and predict the image depth of the image to be processed. The deep network model is composed of multiple layers of depth-separable convolution;

Wherein, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter;

The error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize The difference in label depth corresponding to different positions in the training image; the estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.
The method according to claim 1, characterized in that the target loss function is determined in the following way:

The target loss function is determined according to at least one of the first loss function, the second loss function and the third loss function; wherein the first loss function is determined based on the error weight parameter, and the second loss function Based on the depth gradient error determination, the third loss function is determined based on the depth structure loss parameter.
The method according to claim 2, characterized in that the first loss function is determined in the following manner:

According to the estimated depth and the tag depth, determine the absolute value of the error between the estimated depth and the tag depth;

The first loss function is determined according to the error absolute value and the error weight parameter.
The method according to claim 2 or 3, characterized in that the depth gradient error is determined in the following manner:

Determine estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function;

Determine label depth gradients in the at least two directions according to the label depth and the gradient function;

determining the depth gradient error in the at least two directions based on the estimated depth gradient in the at least two directions and the tag depth gradient in the at least two directions;

The second loss function is determined in the following way:

The second loss function is determined based on the depth gradient error in the at least two directions.
The method according to any one of claims 2-4, characterized in that the depth structure loss parameter is determined in the following manner:

Determine multiple training data groups, wherein the training data group includes at least two pixels, the at least two pixels are pixels of the training image, and the label depth includes the at least one in the training image. The training labels corresponding to the two pixels;

For each training data group, determine the depth structure loss parameter according to the label depth corresponding to the at least two pixel points;

The third loss function is determined in the following way:

For each training data group, the third loss function is determined based on the depth structure loss parameter and the estimated depth corresponding to the at least two pixel points.
The method according to claim 5, characterized in that determining multiple training data groups includes:

Determine the gradient boundary of the training image using a preset gradient function;

Sampling is performed on the training image according to the gradient boundary to determine the plurality of training data groups.
The method according to any one of claims 1-6, characterized in that the training image is determined in the following manner:

Determine at least one training image set in any scene, where the images in the training image set are images that have been masked for the set area;

The training images are determined from the at least one training image set according to preconfigured sampling weights.
An image depth prediction device, characterized in that the device includes:

Acquisition module, used to obtain images to be processed;

A prediction module, configured to input the image to be processed into a deep network model and predict the image depth of the image to be processed. The deep network model is composed of multiple layers of depth-separable convolution;

Wherein, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter;

The error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize The difference in label depth corresponding to different positions in the training image; the estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.
The device according to claim 8, characterized in that the target loss function is determined in the following manner:

The target loss function is determined according to at least one of the first loss function, the second loss function and the third loss function; wherein the first loss function is determined based on the error weight parameter, and the second loss function Based on the depth gradient error determination, the third loss function is determined based on the depth structure loss parameter.
The device according to claim 9, characterized in that the first loss function is determined in the following manner:

According to the estimated depth and the tag depth, determine the absolute value of the error between the estimated depth and the tag depth;

The first loss function is determined according to the error absolute value and the error weight parameter.
The device according to claim 9 or 10, characterized in that the depth gradient error is determined in the following manner:

Determine estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function;

Determine label depth gradients in the at least two directions according to the label depth and the gradient function;

determining the depth gradient error in the at least two directions based on the estimated depth gradient in the at least two directions and the tag depth gradient in the at least two directions;

The second loss function is determined in the following way:

The second loss function is determined based on the depth gradient error in the at least two directions.
The device according to any one of claims 9-11, characterized in that the depth structure loss parameter is determined in the following manner:

Determine multiple training data groups, wherein the training data group includes at least two pixels, the at least two pixels are pixels of the training image, and the label depth includes the at least one in the training image. The training labels corresponding to the two pixels;

For each training data group, determine the depth structure loss parameter according to the label depth corresponding to the at least two pixel points;

The third loss function is determined in the following way:

For each training data group, the third loss function is determined based on the depth structure loss parameter and the estimated depth corresponding to the at least two pixel points.
The device according to claim 12, wherein the determining of multiple training data groups includes:

Determine the gradient boundary of the training image using a preset gradient function;

Sampling is performed on the training image according to the gradient boundary to determine the plurality of training data groups.
The device according to any one of claims 8-13, characterized in that the training image is determined in the following manner:

Determine at least one training image set in any scene, where the images in the training image set are images that have been masked for the set area;

The training images are determined from the at least one training image set according to preconfigured sampling weights.
An image depth prediction device, characterized by including:

processor;

Memory used to store instructions executable by the processor;

Wherein, the processor is configured to perform the method described in any one of claims 1 to 7.
A non-transitory computer-readable storage medium that, when instructions in the storage medium are executed by a processor of a computer, enables the computer to perform the method described in any one of claims 1 to 7.