CN117616457A - Image depth prediction method, device, equipment and storage medium


Info

Publication number
CN117616457A
CN117616457A
Authority
CN
China
Prior art keywords
depth
training
image
loss function
gradient
Prior art date
Legal status
Pending
Application number
CN202280004623.9A
Other languages
Chinese (zh)
Inventor
倪鹏程
张亚森
苏海军
陈凌颖
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd and Beijing Xiaomi Pinecone Electronic Co Ltd
Publication of CN117616457A


Classifications

    • G06N3/04 Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
    • G06N3/08 Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T7/50 Image analysis; Depth or shape recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to an image depth prediction method, apparatus, device and storage medium, comprising: acquiring an image to be processed; and inputting the image to be processed into a depth network model to predict the image depth of the image to be processed. The depth network model is composed of multiple layers of depth separable convolutions and is obtained by training with a target loss function determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter. Because the depth network model adopted by the method uses depth separable convolutions, it can be deployed on terminal equipment, which effectively saves running time and avoids the long running times and poor suitability of large models on terminal equipment.

Description

Image depth prediction method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of computer vision, and in particular relates to an image depth prediction method, an image depth prediction device, image depth prediction equipment and a storage medium.
Background
With the rapid development of technology, terminal equipment has become indispensable in people's daily lives. In some cases, the terminal equipment needs to perform depth estimation on a picture in order to complete subsequent tasks. For example, people often want to reproduce the background blurring produced when a camera takes a picture. However, the lens on a mobile phone cannot match a camera lens, so the mobile phone needs to perform corresponding depth estimation on the captured picture so that regions at different depths in the picture can be blurred accordingly. Likewise, vehicle-mounted equipment can achieve situational awareness for automatic driving by acquiring images and performing depth estimation.
In the related art, depth estimation schemes generally use networks with large numbers of parameters and heavy computation. Such schemes tend to be deployed on large computing devices, such as large service devices, service clusters, and the like. Obviously, such schemes cannot be adapted to terminal equipment.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an image depth prediction method, apparatus, device, and storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an image depth prediction method, the method including: acquiring an image to be processed; inputting an image to be processed into a depth network model, predicting the image depth of the image to be processed, wherein the depth network model is formed by multi-layer depth separable convolution; the depth network model is obtained by training a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter; the error weight parameter is used for representing the weight of the difference between the estimated depth and the label depth, the depth gradient error is used for representing the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used for representing the label depth difference corresponding to different positions in the training image; the estimated depth is determined by the training stage depth network model based on the training image, and the tag depth corresponds to the training image.
In one embodiment, the objective loss function is determined as follows: determining an objective loss function according to at least one of the first loss function, the second loss function and the third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
In one embodiment, the first loss function is determined as follows: determining an absolute value of an error between the estimated depth and the tag depth according to the estimated depth and the tag depth; and determining a first loss function according to the absolute value of the error and the error weight parameter.
In one embodiment, the depth gradient error is determined as follows: determining estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function; determining a tag depth gradient in at least two directions according to the tag depth and the gradient function; determining depth gradient errors in at least two directions according to the estimated depth gradients in at least two directions and the tag depth gradients in at least two directions; the second loss function is determined as follows: a second loss function is determined based on the depth gradient error in at least two directions.
In one embodiment, the depth structure loss parameter is determined as follows: determining a plurality of training data sets, wherein the training data sets comprise at least two pixel points, the at least two pixel points are the pixel points of a training image, and the tag depth comprises training tags corresponding to the at least two pixel points in the training image; determining a depth structure loss parameter according to the label depth corresponding to at least two pixel points for each training data set; the third loss function is determined as follows: and determining a third loss function according to the depth structure loss parameter and the estimated depth corresponding to at least two pixel points for each training data set.
In one embodiment, determining a plurality of training data sets includes: determining a gradient boundary of the training image by using a preset gradient function; a plurality of training data sets are determined by sampling on the training image according to the gradient boundaries.
In one embodiment, the training image is determined as follows: determining at least one training image set in any scene, wherein the images in the training image set are images subjected to mask processing aiming at a set area; a training image is determined from at least one training image set according to pre-configured sampling weights.
According to a second aspect of embodiments of the present disclosure, there is provided an image depth prediction apparatus, the apparatus comprising: the acquisition module is used for acquiring the image to be processed; the prediction module is used for inputting the image to be processed into a depth network model, predicting the image depth of the image to be processed, and the depth network model is formed by multi-layer depth separable convolution; the depth network model is obtained by training a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter; the error weight parameter is used for representing the weight of the difference between the estimated depth and the label depth, the depth gradient error is used for representing the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used for representing the label depth difference corresponding to different positions in the training image; the estimated depth is determined by the training stage depth network model based on the training image, and the tag depth corresponds to the training image.
In one embodiment, the objective loss function is determined as follows: determining an objective loss function according to at least one of the first loss function, the second loss function and the third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
In one embodiment, the first loss function is determined as follows: determining an absolute value of an error between the estimated depth and the tag depth according to the estimated depth and the tag depth; and determining a first loss function according to the absolute value of the error and the error weight parameter.
In one embodiment, the depth gradient error is determined as follows: determining estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function; determining a tag depth gradient in at least two directions according to the tag depth and the gradient function; determining depth gradient errors in at least two directions according to the estimated depth gradients in at least two directions and the tag depth gradients in at least two directions; the second loss function is determined as follows: a second loss function is determined based on the depth gradient error in at least two directions.
In one embodiment, the depth structure loss parameter is determined as follows: determining a plurality of training data sets, wherein the training data sets comprise at least two pixel points, the at least two pixel points are the pixel points of a training image, and the tag depth comprises training tags corresponding to the at least two pixel points in the training image; determining a depth structure loss parameter according to the label depth corresponding to at least two pixel points for each training data set; the third loss function is determined as follows: and determining a third loss function according to the depth structure loss parameter and the estimated depth corresponding to at least two pixel points for each training data set.
In one embodiment, determining a plurality of training data sets includes: determining a gradient boundary of the training image by using a preset gradient function; a plurality of training data sets are determined by sampling on the training image according to the gradient boundaries.
In one embodiment, the training image is determined as follows: determining at least one training image set in any scene, wherein the images in the training image set are images subjected to mask processing aiming at a set area; a training image is determined from at least one training image set according to pre-configured sampling weights.
According to a third aspect of embodiments of the present disclosure, there is provided an image depth prediction apparatus comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the method of the first aspect or any implementation of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium, which when executed by a processor of a computer, enables the computer to perform the method of the first aspect or any implementation of the first aspect.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: the adopted depth network model has depth separable convolutions, so that the depth network model can be deployed on terminal equipment, running time is effectively saved, and the long running times and poor suitability of large models on terminal equipment are avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a scenario illustrated according to an example embodiment.
Fig. 2 is a schematic diagram of a depth prediction based blurring image, which is shown according to an example embodiment.
Fig. 3 is a schematic diagram illustrating a fully supervised depth estimation flow diagram according to an exemplary embodiment.
Fig. 4 is a schematic diagram of a deep network model structure, according to an example embodiment.
Fig. 5 is a schematic diagram illustrating a semantic aware unit architecture according to an example embodiment.
Fig. 6 is a flowchart illustrating a method of image depth prediction according to an exemplary embodiment.
FIG. 7 is a flowchart illustrating a deep network model training method, according to an example embodiment.
Fig. 8 is a flowchart illustrating a method of determining a target loss function according to an exemplary embodiment.
FIG. 9 is a flowchart illustrating a method of determining a first loss function, according to an example embodiment.
FIG. 10 is a flowchart illustrating a method of determining a second loss function, according to an example embodiment.
FIG. 11 is a flowchart illustrating a method of determining a third loss function, according to an example embodiment.
FIG. 12 is a schematic diagram illustrating depth, according to an exemplary embodiment.
Fig. 13 is a schematic diagram illustrating a depth structure sampling according to an example embodiment.
Fig. 14 is a schematic diagram illustrating a random sampling according to an example embodiment.
Fig. 15 is a schematic view of an image depth prediction apparatus according to an exemplary embodiment.
Fig. 16 is a schematic diagram of an image depth prediction apparatus according to an exemplary embodiment.
FIG. 17 is a schematic diagram of a deep network model training apparatus, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure.
The method related to the disclosure can be applied to scenes in which depth prediction and estimation are performed on an image. For example, in the scene shown in fig. 1, a user takes a selfie with a mobile phone and often hopes that the person in the image is clear while the background is appropriately blurred. Clearly, the user desires that regions of different depths in the image be processed accordingly; for example, regions of lesser depth, i.e., the person region, may remain clear, while regions of greater depth, such as the background region, may be correspondingly blurred.
However, in general, whether the captured image is sharp is related to the design of the lens: the focal plane of the captured image is sharp, while regions in front of and behind the focal plane are out of focus. Due to the limitations of mobile phone lens design, the blurring effect of an image shot with a mobile phone lens cannot reach that of an image shot with a camera. At this point, some mobile phones may perform depth prediction on different regions of the image and layer the image using the predicted depth information as guidance. The mobile phone can blur different image layers with different blurring convolution kernels, and finally fuse the blurring results of the different layers to obtain the final blurred image.
For example, fig. 2 is a schematic diagram of depth-prediction-based image blurring, shown according to an exemplary embodiment. For instance, a user uses a mobile phone to take a selfie in a complex environment. In this case, depth prediction is performed on the image captured by the mobile phone (i.e., the leftmost image in fig. 2), and the image can then be layered according to the predicted depth information. Different blurring convolution kernels can be applied to the different layers, and finally the blurring results of the different layers are fused to obtain the final blurred image (i.e., the rightmost image in fig. 2).
In general, taking a picture with a mobile phone can be regarded as taking a picture with a monocular camera.
In some related schemes, the monocular-camera-based scheme for depth prediction may employ self-supervised depth estimation; it will be understood that depth prediction may also be referred to as depth estimation. For example, a depth estimation network and a relative pose estimation network may be constructed, with successive video frames employed as network inputs. In the training process, the depth map and the relative pose relationship between frames can be estimated through calculation. Then, the mutual mapping relationship of the scene between 3D and 2D can be used to minimize the photometric reconstruction error, thereby optimizing the depth map and the relative pose relationship. In actual use, only the trained depth estimation network is used to perform depth prediction on the input video, thereby obtaining the corresponding depth map. For example, depth information may be predicted from a video sequence based on monocular depth (monodepth).
Of course, in other related schemes, a fully supervised depth estimation approach may be employed, such as building a depth network that uses paired color pictures and depth pictures (i.e., labels) as inputs. The network parameters are updated with a loss function computed between the depth image predicted from the color image and the depth image serving as the label, thereby obtaining the trained depth network. For example, as represented by Big To Small (BTS), pairs of color pictures and depth pictures (i.e., labels) are input to complete the network training task. In the training process, a color picture is input, and a depth estimation result is obtained through a feature encoder, a feature aggregation network and a feature decoder. The structural similarity (structural similarity index, SSIM) loss between the depth estimation result and the ground-truth (GT) depth is then computed, and the parameters are updated. In use, as shown in fig. 3, a color picture is input and convolved to obtain a feature map F of the input picture. Based on the feature map F, depth features of different scales can be obtained through operations such as a full-image encoder, convolution, and atrous spatial pyramid pooling (ASPP). The depth features obtained from the different scales can then be concatenated and convolved again, as in convolution x in fig. 3. The feature y obtained from convolution x can then be used to determine the depth maps of the different layers by ordinal regression; for example, depth maps for different depth intervals can be determined, where the depth intervals may be divided as L > l_0, L > l_1, ..., L > l_(k-1), etc. Finally, the depth maps corresponding to the different depth intervals are superimposed to obtain the final output depth map.
In yet other related aspects, monocular cameras and structured light approaches may also be utilized. The structured light may be, for example, a laser radar, a millimeter wave radar, or the like. The monocular camera and the structured light may be fixed by using a rigid fixation. Then, the monocular camera and the structured light can be calibrated in the early stage, and then the depth result of the corresponding pixel position can be obtained through the triangle ranging principle.
Of course, the specific implementation process of the above-mentioned different related schemes may be implemented by referring to the existing manner, which is not repeated in this disclosure.
However, in the above-mentioned related solutions, in order to enhance the effect of depth prediction as much as possible, a network with a large number of parameters and a large amount of computation is generally selected, and operators not adapted to handheld terminal equipment are used, which makes deployment and operation of the depth network on handheld terminal equipment difficult.
Accordingly, the present disclosure provides an image depth prediction method for performing depth prediction on an input image using a depth network model composed of multiple layers of depth separable convolutions. The depth network model is trained with a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter. The error weight parameter is used for representing the weight of the difference between the estimated depth and the label depth, the depth gradient error is used for representing the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used for representing the label depth difference corresponding to different positions in the training image. The estimated depth is determined by the depth network model in the training stage based on the training image, and the label depth corresponds to the training image. Because the depth network model adopted by the method has depth separable convolutions, it can be deployed on terminal equipment, which effectively saves running time and avoids the long running times and poor suitability of large models on terminal equipment.
The aspects related to the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 4 is a schematic diagram of a deep network model structure, according to an example embodiment. It can be seen that the depth network model may include at least one encoder (encoder), a semantic aware unit and at least one decoder (decoder) in its structure. It will be appreciated that the network architecture may also be a trained deep network model. Of course, an untrained deep network model is also possible. That is, the untrained depth network model is the same structure as the trained depth network model, and training does not change the model structure. The semantic perception unit may include multiple layers of depth separable convolutions, so that compared with ASPP in the related art, the computing efficiency is higher, and the redundancy of the network structure is smaller. In some examples, the semantic aware unit of the present disclosure may also be considered a more efficient ASPP, such as may be referred to as efficient hole pyramid pooling (efficient atrous spatial pyramid pooling, EASPP).
Notably, the trained depth network model may be referred to as a depth network model in this disclosure, and the untrained depth network model may be referred to as an initial depth model.
In some examples, the encoder and decoder may be comprised of depth separable convolutions. And, EASPP may also include multi-layer depth separable convolutions. The number of encoders and decoders may be the same and correspond one to one. For example, in some examples 5 encoders may be included, so there may be 5 decoders corresponding. For example, encoder 0, encoder 1, encoder 2, encoder 3, and encoder 4, and decoder 0, decoder 1, decoder 2, decoder 3, and decoder 4. Wherein, encoder 0 corresponds to decoder 0, encoder 1 corresponds to decoder 1, encoder 2 corresponds to decoder 2, encoder 3 corresponds to decoder 3, and encoder 4 corresponds to decoder 4. In some examples, the first encoder and the corresponding decoder of the first encoder may not include a depth separable convolution, while the remaining encoders and corresponding decoders may include a depth separable convolution.
In some examples, the image to be processed may be input into a depth network model, e.g., depth characteristic data of the image to be processed may be extracted into at least one encoder. In some examples, the depth network model may include 5 encoders and 5 decoders. In other examples, the number of encoders and decoders may be more or less according to actual situations, and specifically, may be arbitrarily set according to actual situations, which is not limited in this disclosure. It will be appreciated that when the number of encoders and decoders is more than 5, the model running speed may be reduced while the model size is increased. When the number of encoders and decoders is less than 5, the depth of the input image feature extraction is insufficient, so that the deep feature information is ignored, and the model prediction effect is deteriorated. Thus, the present disclosure will be described taking the example of including 5 encoders and 5 decoders in a depth network model.
In some examples, each input may be a single image to be processed. The image to be processed may be, for example, a color image. The training image may first be input to the encoder 0, where the encoder 0 may include two convolution layers, each of which may have a convolution kernel size of 3*3; when the training image is convolved by the two convolution layers, a depth feature with a channel dimension of 16 may be output. As can be seen from fig. 4, the depth characteristic data output from the encoder 0 can be input to the next encoder, i.e., the encoder 1. Of course, it can be seen that the depth characteristic data output from the encoder 0 may also be input to the decoder 0 corresponding thereto. It will be appreciated that when there are multiple encoders, the first encoder may not have a depth separable convolution, while the remaining encoders have depth separable convolutions. Of course, in some examples, all encoders may have depth separable convolutions, which may be specifically adjusted according to the actual situation, and the disclosure is not particularly limited.
A downsampling layer and a depth separable convolution layer may be included for the encoder 1. The downsampling layer may downsample with max pooling (MaxPool) 2D. Wherein 2D is represented as two dimensions of width and height when the image is pooled. Of course, in some examples, pooling may be performed in a 2 x 2 matrix size. In some examples, encoder 1 takes the output of encoder 0 as input, and after the input is subjected to MaxPool2D downsampling, and then is subjected to feature extraction by a depth separable convolution layer, a depth feature with a channel dimension of 32 can be output. It can be seen that the depth characteristic data output by the encoder 1 can be input to the next encoder, i.e. encoder 2. Of course, it may be input to the corresponding decoder 1. Encoder 2, encoder 3, encoder 4 are similar to encoder 1, for example, encoder 2 may include a downsampling layer that may be downsampled with MaxPool2D, and a depth separable convolutional layer. The encoder 2 takes the output of the encoder 1 as input, performs the feature extraction after the down-sampling by using the MaxPool2D, and then can output the depth feature with the channel dimension of 64. It can be seen that the depth characteristic data output from the encoder 2 can be input to the next encoder, namely, the encoder 3; but also to the corresponding decoder 2. The encoder 3 may include a downsampling layer and a depth separable convolution layer, and the downsampling layer may downsample with MaxPool 2D. The encoder 3 takes the output of the encoder 2 as input, performs the feature extraction after the down-sampling by using the MaxPool2D, and then performs the feature extraction by using the depth separable convolution layer, so as to output the depth feature with the channel dimension of 128. It can be seen that the depth characteristic data output from the encoder 3 can be input to the next encoder, namely, the encoder 4; and can be input to the corresponding decoder 3. Encoder 4 may include a downsampling layer that may downsample with MaxPool2D and a depth separable convolution layer. The encoder 4 takes the output of the encoder 3 as input, performs the feature extraction after the down-sampling by using the MaxPool2D, and then performs the feature extraction by using the depth separable convolution layer, so as to output the depth feature with the channel dimension of 192. It can be seen that the depth characteristic data output by the encoder 4 can be input to a semantic perception unit, i.e. EASPP; and can also be input to the corresponding decoder 4. It will be appreciated that the channel dimension number of the depth feature data output by each encoder may be dimension enhanced via conventional convolutional layers included in each encoder. The specific number of dimensions can be arbitrarily adjusted according to practical situations, and the disclosure is not limited.
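To make the encoder structure above concrete, the following is a minimal PyTorch-style sketch of a depth separable convolution and of one downsampling encoder block (MaxPool2D followed by a depth separable convolution), as described for encoders 1 to 4. The exact kernel sizes, activation and the absence of normalization layers are assumptions, not details stated in this disclosure.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth separable convolution: depthwise conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        # Depthwise convolution: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        # Pointwise 1x1 convolution mixes channels and sets the output dimension.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class EncoderBlock(nn.Module):
    """Encoders 1-4: MaxPool2D downsampling plus a depth separable convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.MaxPool2d(kernel_size=2)  # 2x2 pooling over width and height
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)

    def forward(self, x):
        return self.conv(self.down(x))

# Channel dimensions quoted above: encoder 0 outputs 16, then 32, 64, 128, 192.
```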
It will be appreciated that in some examples, the depth characteristic data output by the last encoder will be input into the semantic perception unit no matter how many encoders are employed.
The semantic perception unit can perform semantic extraction according to the output of the last encoder and output depth semantic data containing rich semantic information. For example, as shown in fig. 5, one possible semantic aware unit structure is shown. It can be seen that the semantic aware unit may include at least one depth separable convolution and fusion layer therein. Wherein the depth separable convolution has a coefficient of expansion. The expansion coefficients corresponding to different depth-separable convolutions in the semantic perception unit can be the same or different, and specifically, the expansion coefficient of each depth-separable convolution can be adjusted at will according to actual conditions, and the invention is not limited.
In some examples, 5 depth separable convolutions may be included. Wherein the depth separable convolution 1 takes as input the depth characteristic data output by the last encoder, i.e. encoder 4 in fig. 4. The expansion coefficient of the depth separable convolution 1 may be 3, after the depth separable convolution 1 performs semantic extraction on the depth feature data output by the last encoder, the obtained depth semantic data is transmitted to the next depth separable convolution, namely the depth separable convolution 2, and the obtained depth semantic data may also be transmitted to the fusion layer. The depth separable convolution 2 takes the data obtained by fusing the output of the depth separable convolution 1 and the depth characteristic data output by the last encoder as input, the expansion coefficient of the depth separable convolution 2 can be 6, after the depth separable convolution 2 carries out semantic extraction on the fused data, the obtained depth semantic data is transmitted to the next depth separable convolution, namely the depth separable convolution 3, and the obtained depth semantic data can be transmitted to a fusion layer. The depth separable convolution 3 takes the output of the depth separable convolution 2 and the input of the depth separable convolution 2 as inputs, the expansion coefficient of the depth separable convolution 3 can be 12, after the depth separable convolution 3 performs semantic extraction on the input data, the obtained depth semantic data is transmitted to the next depth separable convolution, namely the depth separable convolution 4, and the obtained depth semantic data can be transmitted to a fusion layer. The depth separable convolution 4 and the depth separable convolution 5 are similar to the depth separable convolution 3, the depth separable convolution 4 takes the output of the depth separable convolution 3 and the input of the depth separable convolution 2 as inputs, the expansion coefficient of the depth separable convolution 4 can be 18, after the depth separable convolution 4 performs semantic extraction on the input data, the obtained depth semantic data is transmitted to the next depth separable convolution, namely the depth separable convolution 5, and the obtained depth semantic data can be transmitted to a fusion layer. The depth separable convolution 5 takes the output of the depth separable convolution 4 and the input of the depth separable convolution 2 as inputs, the expansion coefficient of the depth separable convolution 5 can be 24, and the depth separable convolution 5 carries out semantic extraction on the input data and then inputs the obtained depth semantic data into a fusion layer. The fusion layer fuses the output of each depth separable convolution and the depth characteristic data output by the last encoder (namely, encoder 4 in fig. 4) to determine final depth semantic data, and the final depth semantic data is used as the output of the semantic perception unit. The semantic perception unit inputs the determined final depth semantic data into the decoder corresponding to the last encoder, which may be, for example, decoder 4 in fig. 4. 
It will be appreciated that, for the depth separable convolutions in the semantic perception unit other than the first one, the input includes not only the output of the previous depth separable convolution but also the output of the first depth separable convolution or the output of the last encoder, thereby ensuring that richer semantic information can be extracted. The fusion layer in fig. 5 may represent a merge (concat) operation in the channel dimension on the data output by the respective depth separable convolutions. It will be appreciated that the concat represents a concatenation along the same channel dimension; that is, the processing of data in the semantic perception unit does not change the channel dimension of the output.
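The following sketch, continuing the encoder sketch above and reusing DepthwiseSeparableConv, illustrates how the semantic perception unit (EASPP) could be assembled: five depth separable convolutions with expansion coefficients 3, 6, 12, 18 and 24, whose outputs are fused by channel-dimension concatenation together with the encoder feature. The final 1x1 convolution used here to restore the channel dimension after the concat is an assumption.

```python
import torch
import torch.nn as nn

class EASPP(nn.Module):
    def __init__(self, channels=192, dilations=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branch1 = DepthwiseSeparableConv(channels, channels, dilation=dilations[0])
        # Branch 2 consumes the fusion of branch 1's output with the encoder feature.
        self.branch2 = DepthwiseSeparableConv(2 * channels, channels, dilation=dilations[1])
        # Branches 3-5 consume the previous branch output fused with branch 2's input.
        self.branch3 = DepthwiseSeparableConv(3 * channels, channels, dilation=dilations[2])
        self.branch4 = DepthwiseSeparableConv(3 * channels, channels, dilation=dilations[3])
        self.branch5 = DepthwiseSeparableConv(3 * channels, channels, dilation=dilations[4])
        self.fuse = nn.Conv2d(6 * channels, channels, kernel_size=1, bias=False)

    def forward(self, x):                          # x: output of the last encoder
        y1 = self.branch1(x)
        in2 = torch.cat([y1, x], dim=1)            # input of branch 2
        y2 = self.branch2(in2)
        y3 = self.branch3(torch.cat([y2, in2], dim=1))
        y4 = self.branch4(torch.cat([y3, in2], dim=1))
        y5 = self.branch5(torch.cat([y4, in2], dim=1))
        # Fusion layer: concatenate every branch output with the encoder feature.
        return self.fuse(torch.cat([y1, y2, y3, y4, y5, x], dim=1))
```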
Continuing back to fig. 4, the decoder 4 decodes from the output of the semantic perception unit and the output of the encoder 4. The decoder 4 may be set with a stride of 8, where the stride is a hyper-parameter of the decoder 4. The decoder 4 outputs a feature map with stride=8, whose channel dimension is 128, and transmits this feature map to the next decoder, decoder 3. The decoder 3 decodes from the output of the decoder 4 and the output of the encoder 3; the decoder 3 may be set with a stride of 4 and outputs a feature map with stride=4, whose channel dimension is 64, and transmits this feature map to the next decoder, decoder 2. The decoder 2 decodes from the output of the decoder 3 and the output of the encoder 2; the decoder 2 may be set with a stride of 2 and outputs a feature map with stride=2, whose channel dimension is 32, and transmits this feature map to the next decoder, decoder 1. The decoder 1 decodes from the output of the decoder 2 and the output of the encoder 1; the decoder 1 may be set with a stride of 1 and outputs a feature map with stride=1, whose channel dimension is 16, and transmits this feature map to the next decoder, decoder 0.
The decoder 0 decodes from the output of the decoder 1 and the output of the encoder 0, and outputs the depth estimation result for the input training image, i.e., the depth estimation data corresponding to the training image. The depth estimation data may be a depth map with a channel dimension of 1.
It will be appreciated that a depth separable convolution may also be employed in the decoder, corresponding to the encoder. Thus, in some examples, the last decoder (i.e., the decoder corresponding to the first encoder) may also not contain a depth separable convolution. The decoder may be configured similar to the encoder and may be used to perform the inverse operation of the encoder.
The last decoder in the depth network model (i.e., decoder 0 in fig. 4) may then output the image depth of the input image.
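As a companion to the encoder sketch, the block below shows one way a decoder could be built, under the assumptions that each decoder upsamples its input, fuses it with the skip connection from the matching encoder, and applies a depth separable convolution (reusing DepthwiseSeparableConv from the earlier sketch). Bilinear upsampling is an assumption; the disclosure only specifies the strides and channel dimensions of the decoder outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        # Upsample to the spatial size of the encoder feature, then fuse and convolve.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

# Channel dimensions quoted above for decoders 4..1: 128, 64, 32, 16; decoder 0
# finally produces a single-channel depth map.
```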
In some examples, the depth network model in the training phase is referred to as the initial depth model. Its input may be a training image, from which an estimated depth is obtained. It will be appreciated that the estimated depth represents the image depth output for the training image after the training image has passed through the initial depth model. It will also be appreciated that during the training phase, each training image may correspond to a label depth that represents the true image depth of that training image. Each time it is trained, the initial depth model may calculate a loss function from the estimated depth and the label depth, and parameters in the initial depth model are adjusted based on the loss function. For example, stochastic gradient descent (SGD) may be employed for gradient back propagation to update the parameters in the initial depth model.
After the loss function converges, a trained deep network model can be obtained. It will be appreciated that the data processing procedure of the training phase is similar to that of the application phase, and thus, the training phase data processing may refer to the corresponding description of the application phase data processing specifically, which is not repeated in this disclosure.
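A minimal training-loop sketch consistent with the description above is given below: the initial depth model predicts an estimated depth for each training image, a target loss is computed against the label depth, and parameters are updated by stochastic gradient descent. The optimizer settings, number of epochs and loss implementation are assumptions.

```python
import torch

def train_depth_model(initial_depth_model, loader, target_loss, epochs=30, lr=1e-2):
    optimizer = torch.optim.SGD(initial_depth_model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):                      # in practice, run until the loss converges
        for image, label_depth in loader:        # training image and its label depth
            estimated_depth = initial_depth_model(image)
            loss = target_loss(estimated_depth, label_depth)
            optimizer.zero_grad()
            loss.backward()                      # gradient back propagation
            optimizer.step()                     # update the model parameters
    return initial_depth_model                   # trained depth network model
```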
The depth network model in the application adopts the depth separable convolution, so that the depth network model can adapt to the scene of the terminal equipment and is perfectly deployed on the terminal equipment, and the terminal equipment can perform depth estimation on the image based on the trained depth network model.
The above-described process will be described in more detail with reference to the accompanying drawings.
Fig. 6 is a flowchart illustrating a method of image depth prediction according to an exemplary embodiment. As shown in fig. 6, the method may be run on a terminal device. The Terminal device may also be referred to as a Terminal, a User Equipment (UE), a Mobile Station (MS), a Mobile Terminal (MT), etc., and is a device that provides voice and/or data connectivity to a User, for example, the Terminal device may be a handheld device, an in-vehicle device, etc. with a wireless connection function. Currently, examples of some terminal devices include smart phones (Mobile Phone), pocket computers (Pocket Personal Computer, PPC), palm computers, personal digital assistants (Personal Digital Assistant, PDA), notebook computers, tablet computers, wearable devices, or in-vehicle devices, etc. In addition, in the case of a vehicle networking (V2X) communication system, the terminal device may also be an in-vehicle device. It should be understood that the embodiments of the present disclosure do not limit the specific technology and specific device configuration adopted by the terminal device.
It will be appreciated that the deep network model involved in this approach may employ the network architecture described above with respect to fig. 4 and 5. In some examples, the present disclosure may implement a 2D image depth estimation task using a monocular vision system, where the input is only a single color picture and the output is a depth map represented by gray values. In some examples, the method may also be extended to tasks such as computational photography and autopilot situational awareness.
Thus, the method may comprise the steps of:
in step S11, an image to be processed is acquired.
In some examples, the terminal device may obtain an image to be processed whose depth needs to be predicted. The image to be processed may be obtained from another device through a network, obtained by shooting with the terminal device, or stored in advance on the terminal device, which is not limited by the disclosure.
In some examples, the network may be implemented, for example, using code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), time division multiple access (Time Division Multiple Access, TDMA), frequency division multiple access (Frequency Division Multiple Access, FDMA), orthogonal frequency division multiple access (Orthogonal Frequency-Division Multiple Access, OFDMA), single Carrier frequency division multiple access (SC-FDMA), carrier sense multiple access/collision avoidance (Carrier Sense Multiple Access with Collision Avoidance), and the like. Networks may be classified into 2G (english: generation) networks, 3G networks, 4G networks, or future evolution networks, such as a fifth Generation wireless communication system (The 5th Generation Wireless Communication System,5G) network, and 5G networks may also be referred to as New Radio (NR) networks, according to factors such as capacity, rate, delay, etc. of different networks.
In step S12, the image to be processed is input to the depth network model, and the image depth of the image to be processed is predicted.
In some examples, the image to be processed acquired in S11 may be input into the depth network model to obtain the image depth predicted for the image to be processed.
The depth network model is obtained by training a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter. The error weight parameter is used for representing the weight of the difference between the estimated depth and the label depth; the depth gradient error is used for representing gradient difference between the estimated depth and the label depth; the depth structure loss parameter is used for representing the difference of the label depth corresponding to different positions in the training image.
The depth network model adopted by the method has depth separable convolutions, so that the depth network model can be deployed on terminal equipment, running time is effectively saved, and the long running times and poor suitability of large models on terminal equipment are avoided.
In some embodiments, for the deep network model referred to in fig. 4-6, the training process thereof may be, for example, as shown in fig. 7, and fig. 7 is a flowchart of a deep network model training method according to an exemplary embodiment. The method may run on a service device. In some examples, the service device may be a server or a cluster of servers. Of course, in other examples, it may be a server or a cluster of servers running on a virtual machine.
Thus, the method may comprise the steps of:
in step S21, a preconfigured training image set is obtained, where the training image set includes a training image and a label depth corresponding to the training image.
In some examples, the service device may obtain a preconfigured training image set. The training image set may be pre-stored on the service device, or the training image set may be stored in a database and the service device connects to the corresponding database to acquire the training image set. The training image set comprises training images and the label depths corresponding to the training images.
In some examples, the training image set may be generated by obtaining a large amount of dense depth data through network data acquisition, optical flow estimation, binocular stereo matching, and depth estimation teacher model prediction. It will be appreciated that "dense" means that a corresponding label depth can be determined for each pixel in a training image obtained in the manner described above.
In some examples, the images in the training image set may be images masked for a set region. For example, the set region may be a background region of the image, such as a sky region or a sea region. Taking the sky region as an example, training images of outdoor scenes typically contain sky. The color of the sky, clouds, etc. may affect depth estimation accordingly, e.g., the depth of clouds may be misestimated. Therefore, the sky region can be segmented with a pre-trained sky segmentation model to obtain a sky mask S_mask. Thereafter, S_mask can be used to process the images containing sky so that the sky region is labeled with a corresponding label depth. For example, the depth of the region covered by S_mask in the image can be set to the maximum value, indicating the farthest depth. Marking the sky portion of the training image improves the accuracy of sky-region depth estimation during depth prediction. For another example, when the training image set is acquired by binocular stereo matching, part of the image may be an invalid region, so the images acquired by binocular stereo matching may be processed with a valid-area mask, denoted V_mask, to obtain processed images. After the images are processed with V_mask, the resulting images contain only the valid region, which effectively prevents pixels outside that region (i.e., pixels in the invalid region) from participating in subsequent computations during training, such as loss calculation. It will be appreciated that an image processed with V_mask may also be processed with S_mask; that is, the valid-area image may also contain a background region to be labeled. The mask-processed images can then be used as training images for subsequent training.
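The following is a small sketch of the mask processing just described: the sky mask S_mask relabels the sky region with the farthest depth, and the valid-area mask V_mask is kept so that invalid pixels can be excluded from later loss calculations. Representing the masks as boolean arrays and using 255 as the farthest depth value are assumptions.

```python
import numpy as np

def apply_masks(label_depth, sky_mask, valid_mask, max_depth=255.0):
    # label_depth: (H, W) array; sky_mask / valid_mask: (H, W) boolean arrays.
    label_depth = label_depth.copy()
    label_depth[sky_mask] = max_depth            # sky region gets the farthest depth
    # valid_mask is returned alongside the label and multiplied into the loss later,
    # so pixels in the invalid region never participate in training.
    return label_depth, valid_mask.astype(np.float32)
```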
In still other examples, since current depth prediction schemes typically focus on a relatively fixed scene, the trained model may suffer from significant degradation, i.e., poor generalization, when it comes out of the scene. Therefore, the acquired training image set can comprise training images in any scene, so that the generalization capability of the depth network model obtained by training is remarkably improved.
Thus, in some examples, different training image sets, e.g., {Data_1, Data_2, ..., Data_n}, may be partitioned according to different times and different acquisition modes, where Data_n represents the nth training image set. Thereafter, different sampling weights may be set for the different training image sets based on the amount of data each training image set contains, for example as shown in equation 1, where p_j represents the sampling weight for the jth training image set during training, and N() represents the count of samples in a training image set, i.e., the amount of data in the training image set.
In some examples, the training images and the corresponding tag depths of the training images may be obtained from respective training image sets based on the sampling weights calculated in equation 1. Therefore, the training process is not biased to a certain class of training images, so that the generalization capability of the depth network model obtained by training is extremely strong, and the method is applicable to depth estimation of various different scene images.
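Equation 1 itself is not legible in the text above, so the sketch below shows only one plausible choice, purely as an assumption: inverse-frequency weights computed from the sample counts N(Data_j), which keep training from being dominated by the largest training image set.

```python
def sampling_weights(dataset_sizes):
    """dataset_sizes[j] = N(Data_j), the number of samples in the j-th training image set."""
    inv = [1.0 / n for n in dataset_sizes]       # smaller sets get larger raw weights
    total = sum(inv)
    return [w / total for w in inv]              # p_j, normalized to sum to 1

# Example: three training image sets with 10000, 4000 and 1000 images
# give sampling weights of roughly [0.074, 0.185, 0.741].
```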
In some examples, the images in the training image set can be changed in a data augmentation mode such as horizontal overturn, random clipping, color change and the like of the images, so that the data volume in the data set is expanded to meet the training requirement.
In step S22, the training image is input to the initial depth model, and an estimated depth corresponding to the training image is determined.
In some examples, the service device may input the training images in the training image set acquired in S21 into an initial depth model formed by depth separable convolution for training, to obtain the estimated depth corresponding to the training images. In some examples, training images in each training image set acquired based on sampling weights may be sequentially input into the initial depth model for training.
In some examples, a training image is input to the initial depth model, and its output may be denoted p. Then p can be range-clipped and multiplied by 255 to obtain the estimated depth corresponding to the training image, which can be written as D_pred = clip(p, 0, 1) * 255, where clip() denotes clipping. It can be understood that the raw output p takes values between 0 and 1, and to observe the output depth more conveniently, the relative depth result D of each pixel can be obtained by clipping and multiplying by 255. The relative depth result D can then be used as the estimated depth. The specific calculation process may be implemented by referring to existing manners, which is not described in detail in this disclosure.
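A one-line sketch of this post-processing step: the raw network output p is clipped to [0, 1] and scaled by 255 to give the estimated depth D_pred.

```python
import torch

def estimated_depth(p):
    return torch.clamp(p, 0.0, 1.0) * 255.0      # D_pred = clip(p, 0, 1) * 255
```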
In step S23, the initial depth model is adjusted using a loss function based on the estimated depth and the tag depth.
In some examples, the service device may calculate the loss function based on the estimated depth obtained in S22 and the tag depth corresponding to the corresponding training image. And adjusting the initial depth model based on the calculated loss function.
In some examples, the adjustment of the initial depth model may be accomplished by SGD, i.e., updating corresponding parameters in the initial depth model. It will be appreciated that the hyper-parameters in the model are not adjusted and updated during the training process.
In step S24, a trained deep network model is obtained until the loss function converges.
In some examples, a trained deep network model may be obtained when the loss function in S23 converges. It will be appreciated that the depth network model is adapted for deployment on the terminal device due to the use of depth separable convolution, so that the terminal device performs depth estimation on the image based on the depth network model.
The depth network model obtained through training has depth separable convolutions, so that the depth network model can be deployed on terminal equipment, running time is effectively saved, and the long running times and poor suitability of large models on terminal equipment are avoided.
In some embodiments, such as shown in fig. 8, the loss functions involved in fig. 6, 7 may be confirmed by the following steps:
in step S31, a target loss function is determined from at least one of the first, second and third loss functions.
In some examples, the target loss function for adjusting the initial depth model may be determined based on at least one of the first loss function, the second loss function, and the third loss function. Wherein the first loss function may be determined based on the error weight parameter, the second loss function may be determined based on the depth gradient error, and the third loss function may be determined based on the depth structure loss parameter.
In the present disclosure, a plurality of loss functions can be combined to train the depth network model, so that the trained depth network model can accurately identify image depth.
In some embodiments, as shown in fig. 9, the first loss function involved in S31 may be determined by:
in step S41, an absolute value of the error is determined from the estimated depth and the tag depth.
In some examples, the absolute value of the error may be determined based on the estimated depth predicted by the initial depth model and the tag depth corresponding to the training image. For example, it can be denoted as abs(D_pred - D_target), where abs() denotes the absolute value, D_pred denotes the estimated depth predicted by the initial depth model, and D_target denotes the tag depth corresponding to the training image.
In step S42, a first loss function is determined based on the absolute value of the error and the error weight parameter.
In some examples, the first loss function may be determined based on the absolute value of the error determined in S41 and the error weight parameter W.
The error weight parameter W may be determined according to the estimated depth and the tag depth. For example, the error weight parameter W may be determined according to the estimated depth predicted by the initial depth model and the tag depth corresponding to the training image, and can be calculated by equation 2.
W = pow(D_pred - D_target, 2) … … equation 2
where pow(D_pred - D_target, 2) denotes raising D_pred - D_target to the power of 2. It will be appreciated that this weight applies a greater loss weight to those regions where the prediction error deviation is greater. Therefore, equation 2 can also be considered to exhibit a focal property.
Thus, in some examples, after deriving the absolute value of the error and the error weight parameter W, a first loss function may be calculated by equations 3 and 4.
L 1 =V mask ·abs(D pred -D target ) … … equation 4
Wherein,represented as a first loss function, also known as focal-L 1 The loss function, alpha and beta are preset weight coefficients. It can be seen that L can be calculated using the absolute value of the error 1 Loss function W.L 1 Then represent L with focal property 1 A loss function. In some examples, L is calculated 1 The loss function may be calculated based on an active area mask or sky mask, e.g., V mask Indicating whether each pixel point in the effective area is effective or not, for example, 0 and 1 can be used for distinguishing effective from ineffective. abs (D) pred -D target ) The absolute value of the error of each pixel point in the same region is represented. Of course, in some examples, the calculation may also be performed pixel by pixel, and the disclosure is not limited.
The specific calculation of the L_1 loss function may be implemented with reference to existing approaches, and is not repeated in this disclosure.
The present disclosure introduces a weight attribute when determining the first loss function, so as to assign a higher loss weight to positions with larger prediction error deviation, thereby training the initial depth model better and making the recognition result of the trained depth network model more accurate.
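For illustration only, the following NumPy sketch shows one way such a focal-weighted L_1 loss could be computed. The function name, the default values of alpha and beta, and the exact way the plain term and the focal term W·L_1 are combined (equation 3 is not reproduced above) are assumptions, not details taken from the disclosure.

```python
import numpy as np

def focal_l1_loss(d_pred, d_target, v_mask, alpha=1.0, beta=1.0):
    """Sketch of the first (focal-L1) loss.

    d_pred, d_target: HxW arrays holding the estimated depth and the label depth.
    v_mask: HxW array of 0/1 flags marking effective pixels (effective area / sky mask).
    alpha, beta: preset weight coefficients (the values here are placeholders).
    """
    err = np.abs(d_pred - d_target)          # abs(D_pred - D_target)
    w = np.power(d_pred - d_target, 2)       # equation 2: W = pow(D_pred - D_target, 2)
    l1 = v_mask * err                        # equation 4: L_1 = V_mask * abs(D_pred - D_target)
    # Assumed combination of the plain L_1 term and the focal term W * L_1;
    # the exact form of equation 3 is not reproduced in the text.
    loss = alpha * l1 + beta * w * l1
    m = max(float(v_mask.sum()), 1.0)        # number of effective pixels
    return float(loss.sum() / m)
```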
In order to make the contours in the depth result obtained by the depth estimation scheme clearer, reduce noise points, improve expressiveness, and satisfy later tasks with higher requirements on depth detail, such as computational photography scenes, the present disclosure may also adjust the initial depth model by one or more of the following loss functions, for example the second loss function and the third loss function.
In some embodiments, the tag depth may include a tag depth gradient in at least two directions. For example, as shown in fig. 10, the second loss function involved in S31 may be determined by:
in step S51, an estimated depth gradient in at least two directions is determined based on the estimated depth and using a gradient function set in advance.
In some examples, the estimated depth predicted by the initial depth model may be substituted into a preset gradient function to determine estimated depth gradients in at least two directions. In some examples, two different directions may typically be selected, although in other examples more directions may be selected, and the disclosure is not limited.
In one example, the directions may be the x and y directions, and the estimated depth gradients in the x and y directions can be calculated by equations 5 and 6.
It will be appreciated that D may be taken as D_pred. The gradient D_x in the x direction is then calculated by equation 5 as the convolution of the data with the sobel operator in the x direction, and the gradient D_y in the y direction is calculated by equation 6 as the convolution of the data with the sobel operator in the y direction. The matrices commonly used for the sobel operator in the x and y directions are [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]] and [[-1, -2, -1], [0, 0, 0], [1, 2, 1]], respectively. Of course, the matrices may be adjusted according to the actual situation, so that gradient data in more directions may be obtained.
It will be appreciated that the gradient referred to in fig. 10 is different from the gradient used in training: the gradient involved in SGD is the gradient of the loss function, whereas the gradient referred to in fig. 10 is the gradient of the depth variation across different regions of the training image.
In step S52, a tag depth gradient in at least two directions is determined from the tag depth and gradient functions.
In some examples, the label depth may comprise label depth gradients in at least two directions, where the at least two directions involved in the label depth gradient are the same as the at least two directions involved in the estimated depth gradient. Of course, in other examples, the label depth gradient in the x direction and the label depth gradient in the y direction can also be calculated based on equations 5 and 6, respectively.
In step S53, a depth gradient error in at least two directions is determined from the estimated depth gradient in at least two directions and the tag depth gradient in at least two directions.
In some examples, the depth gradient errors in at least two directions may be determined from the estimated depth gradients in at least two directions determined in S51 and the label depth gradients in at least two directions determined in S52. For example, for the x direction, the depth gradient error in the x direction can be calculated from the estimated depth gradient in the x direction and the label depth gradient in the x direction. Similarly, for the y direction, the depth gradient error in the y direction can be calculated from the estimated depth gradient in the y direction and the label depth gradient in the y direction.
In step S54, a second loss function is determined based on the depth gradient error in at least two directions.
In some examples, the second loss function may be determined from the depth gradient error in at least two directions determined in S53. Wherein the second loss function may also be referred to as gradient loss function.
In some examples, the second loss function may be calculated by equation 7, where L_grad represents the second loss function. In some examples, when calculating L_grad, the calculation may be based on an effective area mask or sky mask, that is, on whether each pixel point in the mask is effective, and on the absolute gradient error values in the x direction and in the y direction of all pixel points within the mask. Of course, in some examples, the calculation may also be performed pixel by pixel, and the disclosure is not limited.
The method and the device also introduce gradient loss, so that the depth network model obtained through training can more accurately identify the gradient of the image, and accordingly boundaries with different depths can be better identified.
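For illustration, the following NumPy/SciPy sketch computes sobel-based depth gradients and a masked gradient loss. The function names, the boundary handling of the convolution, and the averaging over effective pixels are assumptions, since equation 7 itself is not reproduced above.

```python
import numpy as np
from scipy.signal import convolve2d

# Matrices commonly used for the sobel operator in the x and y directions.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float64)

def sobel_gradients(d):
    """Depth gradients in the x and y directions (cf. equations 5 and 6)."""
    dx = convolve2d(d, SOBEL_X, mode='same', boundary='symm')
    dy = convolve2d(d, SOBEL_Y, mode='same', boundary='symm')
    return dx, dy

def gradient_loss(d_pred, d_target, v_mask):
    """Sketch of the second (gradient) loss, cf. equation 7: masked mean of the
    absolute gradient errors in the x and y directions."""
    px, py = sobel_gradients(d_pred)
    tx, ty = sobel_gradients(d_target)
    err = np.abs(px - tx) + np.abs(py - ty)
    m = max(float(v_mask.sum()), 1.0)
    return float((v_mask * err).sum() / m)
```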
In some embodiments, such as that shown in fig. 11, the third loss function involved in S31 may be determined by:
in step S61, a plurality of training data sets are determined for the training image.
In some examples, the service device may determine a plurality of training data sets for each training image. Wherein each training data set may be composed of at least two pixels in the training image. Thus, the label depth data of the training image may include depth label data corresponding to the respective pixel points.
In some embodiments, determining the plurality of training data sets may include: and determining gradient boundaries of the training image by using a preset gradient function. Then, a plurality of training data sets are determined by sampling on the training image according to the gradient boundaries. It will be appreciated that the process of determining a plurality of training data sets may be referred to as data set (pair) sampling.
In some examples, pair sampling is performed based on gradient boundaries, which may also be referred to as depth structure sampling. For example, gradient calculations may be performed on the training image based on sobel operators to determine which regions have a large gradient change. For example, a gradient threshold may be preset, and when the gradient difference at different positions is greater than or equal to the gradient threshold, the depth gradient may be considered to have a larger change, so that a corresponding gradient boundary may be determined. It will be appreciated that for the second loss function depicted in fig. 10, the accuracy of this partial gradient boundary calculation can be effectively adjusted. For example, as shown in fig. 11, assuming that the depth map corresponding to the training image is the depth map shown in fig. 12, a schematic diagram of pair sampling based on gradient boundaries may be as shown in fig. 13. It can be seen that each pixel point of the depth structure sample is located around the gradient boundary. In some examples, each training data set may be composed of two pixels in a training image. One pixel point can be acquired at one side of the gradient boundary, then one pixel point is acquired on the gradient boundary, and a training data set is formed. Or a training data set can be formed by collecting a pixel point on one side of the gradient boundary and then collecting a pixel point on the other side of the gradient boundary. In this way, a plurality of training data sets can be determined.
According to the method and the device, the training data set is determined according to the gradient boundary, so that different depths can be distinguished more accurately in the subsequent training process, and further the recognition result of the depth network model obtained through training is more accurate.
In some embodiments, determining the plurality of training data sets may further comprise: random sampling is performed on the training image to determine a plurality of training data sets.
In some examples, in order to avoid that training focuses mainly on the gradient boundary portions and thereby neglects portions where the gradient changes slowly, the present disclosure may also perform random pair sampling, i.e., random sampling, on the training image, thereby determining a plurality of training data sets. For example, as shown in fig. 14, the plurality of training data sets obtained by random sampling are distributed uniformly, so that pixel points in regions where the gradient changes slowly are effectively retained. By additionally determining training data sets through random sampling, different depths can be distinguished more accurately in the subsequent training process, and the recognition result of the trained depth network model is more accurate.
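One possible implementation of both sampling strategies is sketched below in NumPy. The gradient threshold, the neighborhood size around boundary points, the pair counts, and the use of numpy.gradient in place of the sobel-based gradients are all illustrative assumptions.

```python
import numpy as np

def sample_pairs(d_target, grad_threshold=0.05, n_boundary=1000, n_random=1000, seed=0):
    """Sketch of pair sampling: gradient-boundary-based pairs plus random pairs."""
    rng = np.random.default_rng(seed)
    h, w = d_target.shape
    gy, gx = np.gradient(d_target)                  # stand-in for sobel gradients
    ys, xs = np.nonzero(np.abs(gx) + np.abs(gy) >= grad_threshold)

    pairs = []
    # Pairs around gradient boundaries: one point on the boundary, one nearby point.
    for _ in range(min(n_boundary, len(ys))):
        i = int(rng.integers(len(ys)))
        y, x = int(ys[i]), int(xs[i])
        dy, dx = rng.integers(-3, 4, size=2)
        y2 = int(np.clip(y + dy, 0, h - 1))
        x2 = int(np.clip(x + dx, 0, w - 1))
        pairs.append(((y, x), (y2, x2)))

    # Uniformly distributed random pairs, so slowly varying regions are kept.
    for _ in range(n_random):
        y1, y2 = (int(v) for v in rng.integers(0, h, size=2))
        x1, x2 = (int(v) for v in rng.integers(0, w, size=2))
        pairs.append(((y1, x1), (y2, x2)))
    return pairs
```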
Continuing back to fig. 11, after S61 is performed, the following steps may continue to be performed.
In step S62, for each training data set, a depth structure loss parameter is determined according to the label depths corresponding to at least two pixels.
In some examples, the service device may determine, for each training data set, a depth structure loss parameter according to tag depths corresponding to at least two pixels within the training data set. In some examples, if two pixels are included in the training data set, the depth structure loss parameter may be determined based on tag depths corresponding to the two pixels in the training data set.
Wherein the depth structure loss parameter may be denoted as ρ. For example, ρ can be calculated by equation 8.
Wherein a and b respectively represent two pixel points in the training data set.
In step S63, for each training data set, a third loss function is determined according to the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
In some examples, the service device may determine the third loss function according to the depth structure loss parameter ρ determined in S62 and the estimated depths corresponding to the at least two pixels included in the training data set. Wherein the third loss function may also be referred to as a depth map structure sampling loss function. For example, the third loss function may be calculated by equation 9.
Wherein L is pair Represented as a third loss function, Denoted as e To the power. It can be seen that in some examples, the third loss function uses a ranking loss (ranking loss) for the calculation of the loss function.
The third loss function can be determined by the training group with the pixels as a group, so that in the subsequent training process, different depth areas in the image can be sampled more accurately, and the recognition result of the depth network model obtained by training is more accurate.
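Because equations 8 and 9 are not reproduced above, the sketch below uses a common form of pair-wise ranking loss as a stand-in: rho encodes the ordering of the two label depths, and the exponential term penalizes predictions that contradict that ordering. The threshold tau and the treatment of approximately equal label depths are assumptions.

```python
import numpy as np

def pairwise_ranking_loss(d_pred, d_target, pairs, tau=0.02):
    """Sketch of the third (depth structure) loss over sampled pixel pairs.

    pairs: list of ((y_a, x_a), (y_b, x_b)) pixel pairs from pair sampling.
    tau: threshold under which two label depths are treated as equal (assumed).
    """
    losses = []
    for (ya, xa), (yb, xb) in pairs:
        ta, tb = d_target[ya, xa], d_target[yb, xb]
        pa, pb = d_pred[ya, xa], d_pred[yb, xb]
        if abs(ta - tb) < tau:
            rho = 0.0          # approximately equal label depths
        elif ta > tb:
            rho = 1.0          # pixel a is labelled deeper than pixel b
        else:
            rho = -1.0
        if rho == 0.0:
            # equal label depths: penalize any predicted difference
            losses.append(abs(pa - pb))
        else:
            # ordered label depths: log(1 + e^(-rho * (pa - pb)))
            losses.append(np.log1p(np.exp(-rho * (pa - pb))))
    return float(np.mean(losses)) if losses else 0.0
```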
In some examples, the target loss function may be determined according to at least one of the first loss function, the second loss function and the third loss function. For example, equation 10 gives one manner of determining the target loss function, in which the three loss functions are combined with weighting coefficients, where L_total represents the target loss function, λ1 is the weighting coefficient of the first loss function, λ2 is the weighting coefficient of the second loss function, and λ3 is the weighting coefficient of the third loss function. It can be appreciated that the specific values of λ1, λ2 and λ3 may be set according to the actual situation, and the disclosure is not limited. Clearly, by adjusting λ1, λ2 and λ3, a target loss function using one or more of the first, second and third loss functions can be obtained for training the depth network model.
It will be appreciated that the first, second and third loss functions adjust the model for different purposes. The more loss functions are used when determining the target loss function, the higher the recognition accuracy of the depth network model trained with that target loss function.
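Assuming equation 10 takes the form of a weighted sum of the three loss functions, the combination could be sketched as follows (the lambda values below are placeholders, not values from the disclosure):

```python
def target_loss(l_first, l_second, l_third, lam1=1.0, lam2=0.5, lam3=0.5):
    """Sketch of the target loss, assuming equation 10 is a weighted sum of the
    first, second and third loss functions (lambda values are placeholders)."""
    return lam1 * l_first + lam2 * l_second + lam3 * l_third
```

Setting one of the coefficients to zero would then correspond to training with only a subset of the loss functions, as described above.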
In some embodiments, processing an input image to be input into the depth network model, where the input image may be, for example, the image to be processed or a training image, may include: normalizing the input image to obtain a normalized input image, and then inputting the normalized input image into the depth network model.
In some examples, the normalization of the input image may be performed, for example, based on the mean and variance of the color channels of the input image, as shown in equation 11.
I_norm = (I - m)/v … … equation 11
where m represents the mean, v represents the variance, I represents the input image, and I_norm represents the normalized input image. In some examples, the color channels of I may be arranged in BGR order, where B represents the blue channel, G the green channel, and R the red channel. In some examples, m may be taken as (0.485, 0.456, 0.506) and v as (0.229, 0.224, 0.225). It will be appreciated that these values of m and v are only exemplary; in other examples, other values may be used according to the actual situation, and the disclosure is not limited.
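As a small illustration, the per-channel normalization of equation 11 could be implemented as follows; the assumption that the image has already been scaled to [0, 1] before normalization is not stated in the disclosure.

```python
import numpy as np

# Channel-wise mean and variance as given in the text, for channels in BGR order.
M = np.array([0.485, 0.456, 0.506], dtype=np.float32)
V = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_input(img_bgr):
    """Equation 11 applied per color channel: I_norm = (I - m) / v.

    img_bgr: HxWx3 image, assumed here to be scaled to [0, 1] beforehand.
    """
    return (img_bgr.astype(np.float32) - M) / V
```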
In some examples, during the training phase, the label depth corresponding to the training image may also be normalized. Because the training image sets come from a wide range of sources, the level (scale) and shift of the label depths corresponding to training images in different training image sets may be inconsistent, so the label depth corresponding to the training image can be normalized. Of course, it will be appreciated that this approach may also be used to normalize the estimated depth predicted by the initial depth model.
For example, the median value of the tag depth can be calculated by equation 12.
D_t = median(D) … … equation 12
where D_t denotes the median value of D, and D represents the label depth corresponding to each pixel in the effective area.
The mean value D_s of the label depth after removing the median can then be determined by equation 13, where M represents the number of effective pixels in the effective region V_mask. It will be appreciated that V_mask can also be replaced by S_mask.
Thereafter, the normalized label depth can be determined based on D_t obtained from equation 12 and D_s obtained from equation 13, i.e., as shown in equation 14.
The method and the device can normalize the images of the input model, ensure that training can be completed under the same dimension in the training process, and further enable the recognition result of the depth network model obtained by training to be more accurate.
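Putting equations 12 to 14 together, a sketch could look like the following; using the mean absolute deviation from the median for D_s and the final form (D - D_t)/D_s are assumptions consistent with the description above, since the equation bodies themselves are not reproduced.

```python
import numpy as np

def normalize_label_depth(d, v_mask):
    """Sketch of label-depth normalization following equations 12-14.

    d: HxW label depth; v_mask: HxW 0/1 effective-area mask.
    """
    valid = d[v_mask > 0]
    if valid.size == 0:
        return d                                  # no effective pixels: leave unchanged
    d_t = float(np.median(valid))                 # equation 12: D_t = median(D)
    d_s = float(np.abs(valid - d_t).sum()) / valid.size   # equation 13: mean after removing the median
    d_s = max(d_s, 1e-6)                          # guard against division by zero
    return (d - d_t) / d_s                        # equation 14 (assumed form)
```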
In some embodiments, during the actual training process, color pictures in jpg image format may be employed as training pictures, and depth pictures in png image format may be used as the depth label data corresponding to the training images. The png picture format may also be used for sky-segmentation pictures. If training is based on a pair training set, in some examples, about 800,000 training data sets, i.e., pair sets, may be used.
The depth network model adopted by the present disclosure uses depth separable convolutions, so that the depth network model can be deployed on terminal devices, the running time is effectively saved, and the problems of long running time and unsuitability of large models on terminal devices are avoided.
It should be understood by those skilled in the art that the various implementations/embodiments of the present disclosure may be used in combination with the foregoing embodiments or may be used independently. Whether used alone or in combination with the previous embodiments, the principles of implementation are similar. In the practice of the present disclosure, some of the examples are described in terms of implementations that are used together. Of course, those skilled in the art will appreciate that such illustration is not limiting of the disclosed embodiments.
Based on the same conception, the embodiment of the disclosure also provides an image depth prediction device.
It can be appreciated that, in order to implement the above-mentioned functions, the image depth prediction apparatus provided in the embodiments of the present disclosure includes a hardware structure and/or a software module that perform respective functions. The disclosed embodiments may be implemented in hardware or a combination of hardware and computer software, in combination with the various example elements and algorithm steps disclosed in the embodiments of the disclosure. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the embodiments of the present disclosure.
FIG. 15 is a schematic diagram of an image depth prediction apparatus 100, according to an exemplary embodiment. Referring to fig. 15, the apparatus 100 may include: an acquisition module 101, configured to acquire an image to be processed; and a prediction module 102, configured to input the image to be processed into a depth network model and predict the image depth of the image to be processed, where the depth network model is formed by multi-layer depth separable convolution. The depth network model is obtained by training with a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter. The error weight parameter is used for representing the weight of the difference between the estimated depth and the label depth, the depth gradient error is used for representing the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used for representing the label depth difference corresponding to different positions in the training image; the estimated depth is determined by the depth network model based on the training image in the training stage, and the label depth corresponds to the training image.
The depth network model adopted by the present disclosure uses depth separable convolutions, so that the depth network model can be deployed on terminal devices, the running time is effectively saved, and the problems of long running time and unsuitability of large models on terminal devices are avoided.
In one possible implementation, the objective loss function is determined as follows: determining an objective loss function according to at least one of the first loss function, the second loss function and the third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
In the present disclosure, a plurality of loss functions can be combined for training the depth network model, so as to ensure that the trained depth network model can identify the image depth more accurately.
In one possible embodiment, the first loss function is determined in the following way: determining an absolute value of an error between the estimated depth and the tag depth according to the estimated depth and the tag depth; and determining a first loss function according to the absolute value of the error and the error weight parameter.
The present disclosure introduces a weight attribute when determining the first loss function, so as to assign a higher loss weight to positions with larger prediction error deviation, thereby training the initial depth model better and making the recognition result of the trained depth network model more accurate.
In one possible implementation, the depth gradient error is determined as follows: determining estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function; determining a tag depth gradient in at least two directions according to the tag depth and the gradient function; determining depth gradient errors in at least two directions according to the estimated depth gradients in at least two directions and the tag depth gradients in at least two directions; the second loss function is determined as follows: a second loss function is determined based on the depth gradient error in at least two directions.
The method and the device also introduce gradient loss, so that the depth network model obtained through training can more accurately identify the gradient of the image, and accordingly boundaries with different depths can be better identified.
In one possible implementation, the depth structure loss parameter is determined as follows: determining a plurality of training data sets, wherein the training data sets comprise at least two pixel points, the at least two pixel points are the pixel points of a training image, and the tag depth comprises training tags corresponding to the at least two pixel points in the training image; determining a depth structure loss parameter according to the label depth corresponding to at least two pixel points for each training data set; the third loss function is determined as follows: and determining a third loss function according to the depth structure loss parameter and the estimated depth corresponding to at least two pixel points for each training data set.
The third loss function can be determined by the training group with the pixels as a group, so that in the subsequent training process, different depth areas in the image can be sampled more accurately, and the recognition result of the depth network model obtained by training is more accurate.
In one possible implementation, determining a plurality of training data sets includes: determining a gradient boundary of the training image by using a preset gradient function; a plurality of training data sets are determined by sampling on the training image according to the gradient boundaries.
According to the method and the device, the training data set is determined according to the gradient boundary, so that different depths can be distinguished more accurately in the subsequent training process, and further the recognition result of the depth network model obtained through training is more accurate.
In one possible implementation, the training image is determined as follows: determining at least one training image set in any scene, wherein the images in the training image set are images subjected to mask processing aiming at a set area; a training image is determined from at least one training image set according to pre-configured sampling weights.
According to the depth prediction method and the depth prediction device, the training images under a plurality of scenes are obtained, so that the trained depth network model has stronger generalization capability, and depth prediction under any scene can be achieved. And the mask processing is carried out on the set region, so that the processed image is more beneficial to training of a model, and the prediction precision of the model after training is improved.
With respect to the apparatus 100 of the above embodiment, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment of the related method, and will not be described in detail herein.
Fig. 16 is a block diagram illustrating an image depth prediction apparatus 200 according to an exemplary embodiment. For example, device 200 may be a terminal device such as a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, redCap terminal, or the like.
Referring to fig. 16, the device 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and a communication component 216.
The processing component 202 generally controls overall operation of the device 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 202 may include one or more processors 220 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 202 can include one or more modules that facilitate interactions between the processing component 202 and other components. For example, the processing component 202 may include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operations at the device 200. Examples of such data include instructions for any application or method operating on device 200, contact data, phonebook data, messages, pictures, video, and the like. The memory 204 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 206 provides power to the various components of the device 200. The power components 206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 200.
The multimedia component 208 includes a screen between the device 200 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 208 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 200 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 includes a Microphone (MIC) configured to receive external audio signals when the device 200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 further includes a speaker for outputting audio signals.
The I/O interface 212 provides an interface between the processing assembly 202 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 214 includes one or more sensors for providing status assessment of various aspects of the device 200. For example, the sensor assembly 214 may detect an on/off state of the device 200, a relative positioning of the components, such as a display and keypad of the device 200, a change in position of the device 200 or a component of the device 200, the presence or absence of user contact with the device 200, an orientation or acceleration/deceleration of the device 200, and a change in temperature of the device 200. The sensor assembly 214 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 214 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate communication between the device 200 and other devices in a wired or wireless manner. The device 200 may access a wireless network based on a communication standard, such as WiFi,2G, 3G, 4G, or 5G, or a combination thereof. In one exemplary embodiment, the communication component 216 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 204, including instructions executable by processor 220 of device 200 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
FIG. 17 is a schematic diagram of a deep network model training apparatus 300, according to an example embodiment. For example, the device 300 may be provided as a server. It is understood that the apparatus 300 may be used to enable training of the deep network model. Referring to fig. 17, the apparatus 300 includes a processing component 322, which further includes one or more processors, and memory resources represented by a memory 332 for storing instructions executable by the processing component 322, such as application programs. The application program stored in memory 332 may include one or more modules each corresponding to a set of instructions. Further, the processing component 322 is configured to execute the instructions to perform the deep network model training process in the methods described above.
The device 300 may also include a power component 326 configured to perform power management of the device 300, a wired or wireless network interface 350 configured to connect the device 300 to a network, and an input output (I/O) interface 358. The device 300 may operate based on an operating system stored in memory 332, such as Windows Server, mac OS XTM, unixTM, linuxTM, freeBSDTM, or the like.
The depth estimation network comprehensively designed by the method has the characteristics of light weight and high accuracy, and can be deployed in a handheld device scene.
Further, a large number of depth data sets are acquired, and different sampling weights are set for these data sets of different sizes to balance the data distribution. Normalization operations on data with different scales and shifts allow training to be completed in the same dimension. Meanwhile, an efficient semantic information acquisition module, EASPP, is designed in the depth estimation network, which captures semantic information with a small number of parameters and a small amount of computation. The present disclosure also uses a multi-dimensional depth loss function to improve the quality of the predicted depth at object edges.
The scheme solves the problem that depth estimation models are time-consuming in handheld device scenarios and achieves deployment on the mobile phone side; the floating-point model can reach 150 ms on a Qualcomm 7325 platform. Meanwhile, the present disclosure can realize depth estimation in arbitrary scenes and has strong generalization capability. The present disclosure can accurately predict the depth of portrait edges, vegetation boundaries and the like, meeting the requirements of photographing bokeh scenes on the depth estimation result.
It is further understood that the term "plurality" in this disclosure means two or more, and other adjectives are similar thereto. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It is further understood that the terms "first," "second," and the like are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the expressions "first", "second", etc. may be used entirely interchangeably. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the scope of the appended claims.

Claims (16)

  1. A method of image depth prediction, the method comprising:
    acquiring an image to be processed;
    inputting the image to be processed into a depth network model, and predicting the image depth of the image to be processed, wherein the depth network model is formed by multi-layer depth separable convolution;
    the depth network model is obtained by training a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter;
    the error weight parameter is used for representing the weight of the difference between the estimated depth and the label depth, the depth gradient error is used for representing the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used for representing the label depth difference corresponding to different positions in the training image; the estimated depth is determined by the depth network model based on the training image in a training stage, and the label depth corresponds to the training image.
  2. The method of claim 1, wherein the objective loss function is determined by:
    determining the target loss function according to at least one of a first loss function, a second loss function and a third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
  3. The method of claim 2, wherein the first loss function is determined by:
    determining an absolute value of an error between the estimated depth and the tag depth according to the estimated depth and the tag depth;
    and determining the first loss function according to the absolute value of the error and the error weight parameter.
  4. A method according to claim 2 or 3, wherein the depth gradient error is determined by:
    determining estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function;
    determining a tag depth gradient in the at least two directions according to the tag depth and the gradient function;
    Determining the depth gradient error for the at least two directions from the estimated depth gradient in the at least two directions and the tag depth gradient in the at least two directions;
    the second loss function is determined as follows:
    determining the second loss function based on the depth gradient error for the at least two directions.
  5. The method according to any of claims 2-4, wherein the depth structure loss parameter is determined by:
    determining a plurality of training data sets, wherein the training data sets comprise at least two pixel points, the at least two pixel points are the pixel points of the training image, and the label depth comprises training labels corresponding to the at least two pixel points in the training image;
    determining the depth structure loss parameters according to the label depths corresponding to the at least two pixel points for each training data set;
    the third loss function is determined as follows:
    and determining the third loss function according to the depth structure loss parameter and the estimated depth corresponding to the at least two pixel points for each training data set.
  6. The method of claim 5, wherein said determining a plurality of training data sets comprises:
    determining gradient boundaries of the training image by using a preset gradient function;
    and sampling on the training image according to the gradient boundary, and determining the plurality of training data sets.
  7. The method according to any one of claims 1-6, wherein the training image is determined by:
    determining at least one training image set in any scene, wherein the images in the training image set are images subjected to mask processing aiming at a set area;
    the training images are determined from the at least one training image set according to pre-configured sampling weights.
  8. An image depth prediction apparatus, the apparatus comprising:
    the acquisition module is used for acquiring the image to be processed;
    the prediction module is used for inputting the image to be processed into a depth network model, predicting the image depth of the image to be processed, and the depth network model is formed by multi-layer depth separable convolution;
    the depth network model is obtained by training a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter;
    The error weight parameter is used for representing the weight of the difference between the estimated depth and the label depth, the depth gradient error is used for representing the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used for representing the label depth difference corresponding to different positions in the training image; the estimated depth is determined by the depth network model based on the training image in a training stage, and the label depth corresponds to the training image.
  9. The apparatus of claim 8, wherein the objective loss function is determined by:
    determining the target loss function according to at least one of a first loss function, a second loss function and a third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
  10. The apparatus of claim 9, wherein the first loss function is determined by:
    determining an absolute value of an error between the estimated depth and the tag depth according to the estimated depth and the tag depth;
    And determining the first loss function according to the absolute value of the error and the error weight parameter.
  11. The apparatus of claim 9 or 10, wherein the depth gradient error is determined by:
    determining estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function;
    determining a tag depth gradient in the at least two directions according to the tag depth and the gradient function;
    determining the depth gradient error for the at least two directions from the estimated depth gradient in the at least two directions and the tag depth gradient in the at least two directions;
    the second loss function is determined as follows:
    determining the second loss function based on the depth gradient error for the at least two directions.
  12. The apparatus according to any of claims 9-11, wherein the depth structure loss parameter is determined by:
    determining a plurality of training data sets, wherein the training data sets comprise at least two pixel points, the at least two pixel points are the pixel points of the training image, and the label depth comprises training labels corresponding to the at least two pixel points in the training image;
    Determining the depth structure loss parameters according to the label depths corresponding to the at least two pixel points for each training data set;
    the third loss function is determined as follows:
    and determining the third loss function according to the depth structure loss parameter and the estimated depth corresponding to the at least two pixel points for each training data set.
  13. The apparatus of claim 12, wherein the determining a plurality of training data sets comprises:
    determining gradient boundaries of the training image by using a preset gradient function;
    and sampling on the training image according to the gradient boundary, and determining the plurality of training data sets.
  14. The apparatus according to any one of claims 8-13, wherein the training image is determined by:
    determining at least one training image set in any scene, wherein the images in the training image set are images subjected to mask processing aiming at a set area;
    the training images are determined from the at least one training image set according to pre-configured sampling weights.
  15. An image depth prediction apparatus, comprising:
    A processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to: performing the method of any one of claims 1 to 7.
  16. A non-transitory computer readable storage medium having stored thereon instructions which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202280004623.9A 2022-06-20 2022-06-20 Image depth prediction method, device, equipment and storage medium Pending CN117616457A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/099713 WO2023245321A1 (en) 2022-06-20 2022-06-20 Image depth prediction method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
CN117616457A true CN117616457A (en) 2024-02-27

Family

ID=89378959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280004623.9A Pending CN117616457A (en) 2022-06-20 2022-06-20 Image depth prediction method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN117616457A (en)
WO (1) WO2023245321A1 (en)

Also Published As

Publication number Publication date
WO2023245321A1 (en) 2023-12-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination