CN113610911A - Training method and device of depth prediction model, medium and electronic equipment - Google Patents
- Publication number
- CN113610911A (application CN202110852004.2A)
- Authority
- CN
- China
- Prior art keywords
- image
- depth prediction
- loss function
- depth
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/55 — Depth or shape recovery from multiple images (G06T7/00 Image analysis; G06T7/50 Depth or shape recovery)
- G06N3/08 — Learning methods (G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)
- G06T2207/20081 — Training; Learning (G06T2207/00 Indexing scheme for image analysis or image enhancement; G06T2207/20 Special algorithmic details)
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The disclosure provides a training method of a depth prediction model, a training device of a depth prediction model, a computer-readable medium and an electronic device, and relates to the technical field of image processing. The method comprises the following steps: inputting an image sequence into a model to be trained to obtain a depth prediction image set corresponding to the image sequence; acquiring, from the depth prediction image set, a first depth prediction image corresponding to a current image and a second depth prediction image corresponding to a previous image of the current image in the image sequence; calculating a target loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image; and updating the weights of the model to be trained based on the target loss function to obtain the depth prediction model. The method and device can optimize the continuity and consistency of the prediction results of the depth prediction model, so that the depth prediction results for stationary objects in the image sequence remain consistent while the prediction results for moving objects in the image sequence transition smoothly.
Description
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a depth prediction model training method, a depth prediction model training apparatus, a computer-readable medium, and an electronic device.
Background
In the related art, there are generally two methods for constructing a deep learning model for image depth prediction: training a convolutional neural network based on binocular images, and training a network based on monocular images. The first method needs to capture the scene with two cameras at fixed relative positions to obtain binocular images, and requires a pair of images as input both during network training and during network-based prediction; the second method needs only a camera at a single view angle to capture the scene, and requires only a single image as input during network training and network-based prediction.
For monocular depth prediction models, two approaches are generally adopted for optimization: first, optimizing the monocular depth prediction model by a purely supervised method, i.e., single-frame annotation; second, optimizing the monocular depth prediction model based on constraints on the spatial consistency of image features between binocular images.
However, when predicting depth for a video with a complex background or uneven illumination, a monocular depth prediction model obtained by the above optimization approaches is prone to producing prediction results that are discontinuous from frame to frame, which may manifest as flickering, jumping and similar artifacts in the video.
Disclosure of Invention
The purpose of the present disclosure is to provide a training method for a depth prediction model, a training device for a depth prediction model, a computer-readable medium, and an electronic device, so as to improve, at least to a certain extent, the continuity and consistency of monocular depth prediction results, such that the model's depth prediction results for stationary objects in a sequence remain unchanged while its depth prediction results for moving objects transition smoothly.
According to a first aspect of the present disclosure, there is provided a training method of a depth prediction model, including: inputting the image sequence into a model to be trained to obtain a depth prediction image set corresponding to the image sequence; the image sequence comprises at least two images corresponding to a target scene acquired from the same visual angle; acquiring a first depth prediction image corresponding to a current image and a second depth prediction image corresponding to a previous image of the current image in an image sequence from a depth prediction image set; calculating a target loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image; and updating the weight of the model to be trained based on the target loss function to obtain a depth prediction model.
According to a second aspect of the present disclosure, there is provided a training apparatus for a depth prediction model, including: the sequence processing module is used for inputting the image sequence into a model to be trained to obtain a depth prediction image set corresponding to the image sequence; the image sequence comprises at least two images corresponding to a target scene acquired from the same visual angle; the image acquisition module is used for acquiring a first depth prediction image corresponding to a current image and a second depth prediction image corresponding to a previous image of the current image in an image sequence from a depth prediction image set; the loss calculation module is used for calculating a target loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image; and the weight updating module is used for updating the weight of the model to be trained based on the target loss function so as to obtain the depth prediction model.
According to a third aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the above-mentioned method.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus, comprising: a processor; and memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described method.
According to the training method of the depth prediction model provided by an embodiment of the disclosure, an image sequence is input into a model to be trained to obtain a depth prediction image set corresponding to the image sequence; then, for a current image, a first depth prediction image corresponding to the current image and a second depth prediction image corresponding to the previous image of the current image in the image sequence are obtained from the depth prediction image set; a target loss function corresponding to the current image is calculated based on the first depth prediction image and the second depth prediction image; and the weights of the model to be trained are then updated based on the target loss function to obtain the depth prediction model. By inputting the image sequence into the model to be trained to obtain the depth prediction image set corresponding to the image sequence, the method can exploit the consistency of features and structures across the images in the sequence to optimize the continuity and consistency of the prediction results of the depth prediction model, so that the depth prediction results for stationary objects in the image sequence remain consistent while the prediction results for moving objects in the image sequence transition smoothly.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 3 schematically illustrates a flow chart of a method of training a depth prediction model in an exemplary embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a method of calculating an objective loss function in an exemplary embodiment of the disclosure;
FIG. 5 is a schematic diagram illustrating the structure of a model to be trained in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of another method of training a depth prediction model in an exemplary embodiment of the disclosure;
FIG. 7 shows an image corresponding to a portrait scene containing a portrait in a sequence of images;
fig. 8 illustrates a depth prediction image obtained after the image shown in fig. 7 is processed by a depth prediction model obtained based on the training method of the depth prediction model of the present disclosure;
FIG. 9 illustrates a depth prediction image obtained by processing the image shown in FIG. 7 based on a depth prediction model obtained by a conventional training method;
fig. 10 schematically illustrates a composition diagram of a training apparatus for a depth prediction model in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a method and apparatus for training a depth prediction model according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having an image processing function, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The method for training the depth prediction model provided by the embodiment of the present disclosure is generally performed by the server 105, and accordingly, the device for training the depth prediction model is generally disposed in the server 105. However, it is easily understood by those skilled in the art that the training method of the depth prediction model provided in the embodiment of the present disclosure may also be executed by the terminal devices 101, 102, and 103, and accordingly, the training device of the depth prediction model may also be disposed in the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment. For example, in an exemplary embodiment, a user may acquire an image sequence through a camera module included in the terminal devices 101, 102, and 103, and then send the image sequence to the server 105, and the server 105 trains a model to be trained through a training method of a depth prediction model provided in the embodiment of the present disclosure, so as to obtain the depth prediction model.
The exemplary embodiment of the present disclosure provides an electronic device for implementing a training method of a depth prediction model, which may be a terminal device 101, 102, 103 or a server 105 in fig. 1. The electronic device includes at least a processor and a memory for storing executable instructions of the processor, the processor being configured to perform a method of training a depth prediction model via execution of the executable instructions.
The following takes the mobile terminal 200 in fig. 2 as an example, and exemplifies the configuration of the electronic device. It will be appreciated by those skilled in the art that the configuration of figure 2 can also be applied to fixed type devices, in addition to components specifically intended for mobile purposes. In other embodiments, mobile terminal 200 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interfacing relationship between the components is only schematically illustrated and does not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also interface differently than shown in fig. 2, or a combination of multiple interfaces.
As shown in fig. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display 290, a camera module 291, an indicator 292, a motor 293, a button 294, and a Subscriber Identity Module (SIM) card interface 295. Wherein the sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyroscope sensor 2803, and the like.
The NPU is a Neural-Network (NN) computing processor that processes input information quickly by drawing on the structure of biological neural networks, for example the transfer mode between neurons of the human brain, and can also learn continuously by itself. The NPU can implement applications such as intelligent recognition of the mobile terminal 200, for example: image recognition, face recognition, speech recognition, text understanding, and the like. In an exemplary embodiment, the method for training the depth prediction model may be performed based on the NPU; for example, a depth prediction image may be obtained by predicting the image sequence on the NPU.
A memory is provided in the processor 210. The memory may store instructions for implementing six modular functions: detection instructions, connection instructions, information management instructions, analysis instructions, data transmission instructions, and notification instructions, and execution is controlled by processor 210.
The wireless communication function of the mobile terminal 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, a modem processor, a baseband processor, and the like. Wherein, the antenna 1 and the antenna 2 are used for transmitting and receiving electromagnetic wave signals; the mobile communication module 250 may provide a solution including wireless communication of 2G/3G/4G/5G, etc. applied to the mobile terminal 200; the modem processor may include a modulator and a demodulator; the Wireless communication module 260 may provide a solution for Wireless communication including a Wireless Local Area Network (WLAN) (e.g., a Wireless Fidelity (Wi-Fi) network), Bluetooth (BT), and the like, applied to the mobile terminal 200. In some embodiments, antenna 1 of the mobile terminal 200 is coupled to the mobile communication module 250 and antenna 2 is coupled to the wireless communication module 260, such that the mobile terminal 200 may communicate with networks and other devices via wireless communication techniques.
The mobile terminal 200 implements a display function through the GPU, the display screen 290, the application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 290 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 210 may include one or more GPUs that execute program instructions to generate or alter display information. In an exemplary embodiment, format change of the images in the image sequence can be implemented by the GPU, the display screen 290, the application processor, and the like, so as to obtain the images satisfying the format condition.
The mobile terminal 200 may implement a photographing function through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and the like. The ISP is used for processing data fed back by the camera module 291; the camera module 291 is used for capturing still images or videos; the digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals; the video codec is used to compress or decompress digital video, and the mobile terminal 200 may also support one or more video codecs. In an exemplary embodiment, the image sequence may be obtained by capturing images through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and the like.
The depth sensor 2801 is used to acquire depth information of a scene. In some embodiments, the depth sensor may be disposed in the camera module 291, and then the camera module disposed with the depth sensor collects depth data corresponding to the target scene to generate a depth annotation image corresponding to the current image.
The pressure sensor 2802 is used to sense a pressure signal and convert the pressure signal into an electrical signal. The gyro sensor 2803 may be used to determine a motion gesture of the mobile terminal 200. In addition, other functional sensors, such as an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc., may be provided in the sensor module 280 according to actual needs.
When processing image sequences with complex backgrounds, uneven illumination, and similar problems, a monocular depth prediction model obtained by the related optimization approaches tends to produce discontinuous prediction results across the images of the sequence. The main reason for these problems is that the related depth network models are usually trained with a single-image depth annotation as the only reference. In this case, since invariance under view transformations such as image translation and rotation is not considered, the monocular depth prediction model is prone to producing inconsistent predictions for a previous image and a following image when predicting an image sequence, which in turn causes flickering and jumping.
Based on one or more of the problems described above, the present example embodiment provides a training method of a depth prediction model. The training method of the depth prediction model may be applied to the server 105, and may also be applied to one or more of the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment. Referring to fig. 3, the training method of the depth prediction model may include the following steps S310 to S340:
in step S310, the image sequence is input into the model to be trained, and a depth prediction image set corresponding to the image sequence is obtained.
The image sequence includes at least two images corresponding to a target scene acquired from the same view angle; an image sequence generally refers to a plurality of images with continuity, and the target scene generally refers to a scene including a portrait. For example, the image sequence may be a video clip, or a sequence of images with content continuity extracted from a video. It should be noted that, in order to obtain the image sequence, sample data for training the model to be trained may be read in batches. For example, a preset number of video frames captured from a fixed viewpoint may be read as an image sequence.
In an exemplary embodiment, the model to be trained may include a monocular depth prediction network such as BTS or MIDAS, or a monocular depth prediction network of another structure, which is not particularly limited in this disclosure. Based on this, although the image sequence is input into the model to be trained as a whole, depth prediction is performed sequentially on each image in the sequence, yielding a depth prediction image corresponding to each image and thus the depth prediction image set corresponding to the image sequence.
It should be noted that, because the image formats supported by different models to be trained are different, the image sequence may be subjected to format conversion before being input into the model to be trained. For example, the model to be trained supports images with a resolution of 480 × 640, before the image sequence is input into the model to be trained, each image in the image sequence may be scaled to obtain an image with a resolution of 480 × 640, and then the obtained image sequence is input into the model to be trained.
In an exemplary embodiment, in order to facilitate fast convergence when training the model to be trained, each sample image in the image sequence may be normalized before the image sequence is input into the model. For example, each pixel value may be reduced by 127.5 and then divided by 127.5, normalizing all pixel values to the range [-1, 1].
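As a concrete illustration, a minimal preprocessing sketch is given below. It assumes OpenCV is available for resizing and uses the 480 × 640 input resolution from the example above; the function name and library choice are illustrative and not part of the disclosure.

```python
import cv2  # assumed image library; any resizing routine would do
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Scale an image to the resolution supported by the model to be trained
    (480 x 640 in the example above) and normalize pixel values to [-1, 1]."""
    resized = cv2.resize(image, (640, 480))  # cv2.resize takes (width, height)
    return (resized.astype(np.float32) - 127.5) / 127.5
```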
In step S320, a first depth prediction image corresponding to the current image and a second depth prediction image corresponding to a previous image of the current image in the image sequence are obtained in the depth prediction image set.
The current image is an image which is calculated at the current moment in the process of training the model to be trained so as to update the weight of the model to be trained.
In an exemplary embodiment, after obtaining the depth prediction image set corresponding to the image sequence, based on consistency of features and structures in the image sequence, when performing training based on the current image, a first depth prediction image corresponding to the current image and a second depth prediction image corresponding to a previous image of the current image in the image sequence may be obtained simultaneously.
In step S330, a target loss function corresponding to the current image is calculated based on the first depth prediction image and the second depth prediction image.
In an exemplary embodiment, after the first depth prediction image and the second depth prediction image are obtained, a target loss function can be calculated according to the consistency of features and structures in a current image and a previous image, and then the continuity and consistency of prediction results of a model to be trained are optimized according to the target loss function.
In an exemplary embodiment, when the target loss function corresponding to the current image is calculated based on the first depth prediction image and the second depth prediction image, as shown in fig. 4, the following steps S410 to S430 may be included:
in step S410, an absolute loss function corresponding to the current image is calculated based on the first depth prediction image.
The absolute loss function refers to a loss function calculated based on the current image itself or a prediction result, a labeling result, and the like corresponding to the current image, and the absolute loss function may include all kinds of loss functions determined according to the depth characteristic of the current image itself. For example, the absolute loss function may include a depth loss function determined based on a first depth prediction image corresponding to the current image and a depth annotation image corresponding to the current image; as another example, the absolute loss function can include a smoothing loss function determined based on the first depth prediction image corresponding to the current image. In addition, the absolute loss function may further include other loss functions determined according to the depth characteristics of the current image itself, which is not particularly limited in this disclosure.
In an exemplary embodiment, when the absolute loss function includes a depth loss function, the depth loss function may be calculated based on the first depth prediction image obtained by the model to be trained predicting the current image and a depth annotation image obtained by annotating the current image in advance. Specifically, a logarithmic difference between the corresponding depth values of the first depth prediction image and the depth annotation image may be calculated for each pixel, and the depth loss function corresponding to the current image may then be calculated based on these logarithmic differences. The logarithmic difference gi between the depth values of the first depth prediction image and the depth annotation image for each pixel can be calculated based on the following equation (1):

gi = log(di) − log(di*)    Equation (1)

where di denotes the predicted depth value of the i-th pixel and di* denotes the annotated depth value of the i-th pixel.
In an exemplary embodiment, when calculating the depth loss function corresponding to the current image based on the logarithmic differences, the depth loss function Ld can be calculated based on the following equation (2):

where gi denotes the logarithmic difference of the depth values for the i-th pixel, T denotes the number of pixels in the current image, and α and λ are preset parameters. For example, α may be set to 10 and λ to 0.5.
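Equation (2) itself is not reproduced in this text; the sketch below therefore assumes the commonly used scale-invariant log (BTS-style) formulation, which is consistent with the description of gi, T, α and λ above, and should be read as an assumption rather than the patent's exact equation.

```python
import numpy as np

def depth_loss(pred_depth: np.ndarray, gt_depth: np.ndarray,
               alpha: float = 10.0, lam: float = 0.5) -> float:
    """Assumed scale-invariant log depth loss: g is the per-pixel logarithmic
    difference between predicted and annotated depth, averaged over T pixels."""
    g = np.log(pred_depth) - np.log(gt_depth)  # equation (1)
    return float(alpha * np.sqrt(np.mean(g ** 2) - lam * np.mean(g) ** 2))
```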
In an exemplary embodiment, the absolute loss function may further include a smoothing loss function. At this time, the absolute loss function corresponding to the current image may be directly calculated based on the first depth prediction image. Specifically, a horizontal partial derivative and a vertical partial derivative of each pixel on the first depth prediction image may be calculated on the basis of the first depth prediction image, and then a smoothing loss function corresponding to the current image may be calculated based on the horizontal partial derivative and the vertical partial derivative corresponding to each pixel.
In an exemplary embodiment, when calculating the smoothing loss function corresponding to the current image based on the horizontal and vertical partial derivatives of each pixel, the smoothing loss function Lsmooth can be calculated based on the following equation (3):

where di,j denotes the predicted depth value of the pixel at coordinates (i, j) in the depth prediction image; δx di,j denotes the horizontal partial derivative at the pixel with coordinates (i, j) in the depth prediction image; and δy di,j denotes the vertical partial derivative at the pixel with coordinates (i, j) in the depth prediction image.
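Since equation (3) is likewise not shown in the text, the following sketch assumes a plain mean-absolute-gradient smoothness term built from the horizontal and vertical partial derivatives described above.

```python
import numpy as np

def smoothing_loss(pred_depth: np.ndarray) -> float:
    """Assumed smoothness term: finite-difference horizontal/vertical partial
    derivatives of the predicted depth, averaged over the image."""
    dx = pred_depth[:, 1:] - pred_depth[:, :-1]  # horizontal partial derivative
    dy = pred_depth[1:, :] - pred_depth[:-1, :]  # vertical partial derivative
    return float(np.mean(np.abs(dx)) + np.mean(np.abs(dy)))
```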
In step S420, a relative loss function corresponding to the current image is calculated based on the first depth prediction image and the second depth prediction image.
The relative loss function is calculated based on the current image and the previous image of the current image in the image sequence, or on the prediction results, annotation results, and the like corresponding to these two images; it may include any loss function that is determined from the depth characteristics of the current image and of the previous image and can characterize the depth change of the current image relative to the previous image.
It should be noted that the relative loss function includes all kinds of loss functions that can characterize the depth change of the current image relative to the previous image, and the disclosure is not limited thereto.
In an exemplary embodiment, the relative loss function may be calculated based on similarity. Specifically, the similarity may be calculated based on the first depth prediction image and the second depth prediction image, and the relative loss function corresponding to the current image may then be calculated based on the similarity.
When calculating the similarity based on the first depth prediction image and the second depth prediction image, the image similarity between the first depth prediction image and the second depth prediction image can be calculated directly; the feature similarity between the feature-layer outputs corresponding to the first depth prediction image and the second depth prediction image can be calculated; or the image similarity and the feature similarity can be calculated simultaneously. It should be noted that, if only one similarity is calculated, it may be used directly as the relative loss function; if multiple similarities are calculated simultaneously, the relative loss function may be calculated based on all of them. In the specific calculation, different formulas can be designed for different application scenarios. For example, the calculated similarities may simply be added to obtain the relative loss function.
In an exemplary embodiment, when the similarity includes both the image similarity and the feature similarity, the image similarity of the first depth prediction image and the second depth prediction image may be calculated first; then, when the model to be trained predicts the current image and the previous image, a first output of the target feature layer of the model corresponding to the current image and a second output of the target feature layer corresponding to the previous image are obtained respectively, and the feature similarity is calculated based on the first output and the second output.
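One way to obtain such intermediate outputs, assuming the model to be trained is implemented as a PyTorch module, is to register forward hooks on the target feature layers; the layer names used here are placeholders for whatever the actual backbone calls its 1/2- and 1/4-resolution feature layers.

```python
import torch

def capture_feature_outputs(model: torch.nn.Module, layer_names):
    """Register forward hooks so that, when the model predicts an image, the
    outputs of the named target feature layers are stored in `captured`."""
    captured, hooks = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, key=name: captured.update({key: out})))
    return captured, hooks
```

After one forward pass on the current image and another on the previous image, the stored outputs play the roles of the first output and the second output described above.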
Specifically, when the image similarity between the first depth prediction image and the second depth prediction image is calculated, downsampling may be performed on the first depth prediction image and the second depth prediction image, and then the mean square error of all pixel points after downsampling is calculated. The above-described process of calculating the image similarity S can be expressed by the following formula (4):
S=MSE(w0,w1) Equation (4)
where MSE denotes the mean square error over all pixels of the two images; w0 denotes the downsampling result corresponding to the current image; and w1 denotes the downsampling result corresponding to the previous image.
Specifically, when the feature similarity E is calculated based on the first output and the second output, the calculation can be performed by the following equation (5):
E=MSE(X0,X1) Equation (5)
where MSE denotes the mean square error over all elements, as above; X0 denotes the first output of the target feature layer corresponding to the first depth prediction image; and X1 denotes the second output of the target feature layer corresponding to the second depth prediction image.
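A compact sketch of equations (4) and (5) follows; simple strided downsampling with a factor of 2 is an assumption, since the patent does not fix the downsampling method.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean square error over all elements of the two inputs."""
    return float(np.mean((a - b) ** 2))

def image_similarity(d_curr: np.ndarray, d_prev: np.ndarray, factor: int = 2) -> float:
    """Equation (4): S = MSE(w0, w1) of the downsampled depth predictions."""
    return mse(d_curr[::factor, ::factor], d_prev[::factor, ::factor])

def feature_similarity(x_curr: np.ndarray, x_prev: np.ndarray) -> float:
    """Equation (5): E = MSE(X0, X1) of the target feature-layer outputs."""
    return mse(x_curr, x_prev)
```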
The process of calculating the relative loss function is described in detail below by taking the target feature layer as the x2 feature layer and the x4 feature layer of the model to be trained as an example.
Referring to fig. 5, a model to be trained that fuses hierarchical features of the scene to achieve depth prediction for different regions may include an encoding branch and a decoding branch. When depth prediction is performed based on the model to be trained, the output can be designed to include the following three parts: the depth prediction image, the output 1 of the 1/2-resolution feature output layer (x2 feature layer), and the output 2 of the 1/4-resolution feature output layer (x4 feature layer).
At this time, the image similarity and the feature similarity may be calculated separately. For the image similarity, the first depth prediction image and the second depth prediction image may be downsampled directly, and the image similarity S is then calculated based on formula (4). For the feature similarity, the first output a of the x2 feature layer in fig. 5 when the model to be trained processes the current image and the second output b when it processes the previous image are obtained first, and the first feature similarity E1 is calculated according to formula (5) based on the first output a and the second output b; meanwhile, the first output c of the x4 feature layer in fig. 5 when the model processes the current image and the second output d when it processes the previous image are also obtained, and the second feature similarity E2 is calculated using formula (5) based on the first output c and the second output d.
Note that the feature output resolutions of the x2 feature layer and the x4 feature layer are [batch_size, height, width, channel_number], where batch_size refers to the n value corresponding to the model to be trained, i.e., the number of stacked input images, and channel_number refers to the c value corresponding to the model to be trained, i.e., the number of channels.
When the image similarity, the first feature similarity, and the second feature similarity are obtained, they are added together to obtain the relative loss function.
In step S430, a target loss function corresponding to the current image is calculated based on the absolute loss function and the relative loss function.
In an exemplary embodiment, after the absolute loss function and the relative loss function are obtained, the target loss function corresponding to the current image may be calculated jointly from the two. It should be noted that, under different requirements, different weights may be set for the different loss functions when calculating the target loss function, based on their influence on those requirements. In addition, when setting weights for the absolute loss function or the relative loss function, different types of loss functions may be given different weights or the same weight. For example, the target loss function L may be calculated as a weighted sum of the absolute loss function and the relative loss function, e.g., by the following formula (6):

L = αs * (S + E1 + E2) + αd * Ld + αsmooth * Lsmooth    Equation (6)

where the depth loss function Ld and the smoothing loss function Lsmooth in the absolute loss function carry different weights αd and αsmooth, and the relative loss function S + E1 + E2 carries the weight αs.
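The weighted sum of equation (6) can be written directly, as in the short sketch below; the default weight values are placeholders, since the patent leaves the weights as configurable parameters.

```python
def target_loss(S: float, E1: float, E2: float, L_d: float, L_smooth: float,
                alpha_s: float = 1.0, alpha_d: float = 1.0,
                alpha_smooth: float = 1.0) -> float:
    """Equation (6): relative term (S + E1 + E2) weighted by alpha_s, plus the
    absolute terms L_d and L_smooth weighted by alpha_d and alpha_smooth."""
    return alpha_s * (S + E1 + E2) + alpha_d * L_d + alpha_smooth * L_smooth
```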
In step S340, weight updating is performed on the model to be trained based on the target loss function to obtain a depth prediction model.
In an exemplary embodiment, after the target loss function is obtained, the weights of the model to be trained may be updated based on the target loss function to obtain the depth prediction model. Specifically, as shown in fig. 6, the method includes the following steps: step S601, reading sample data in batches to obtain an image sequence; step S603, normalizing each image in the image sequence to obtain a normalized image sequence; step S605, inputting the normalized image sequence into the model to be trained for processing to obtain a depth prediction image set corresponding to the image sequence; step S607, calculating a target loss function based on a first depth prediction image corresponding to the current image and a second depth prediction image corresponding to the previous image of the current image in the image sequence; and step S609, calculating gradients based on the target loss function to update the weights of the model to be trained; after multiple epochs, the target loss function converges to a certain range, and the depth prediction model is thereby obtained.
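Steps S601-S609 can be summarized in a short training-loop sketch. It assumes a PyTorch module and a hypothetical `sequence_loss` helper that implements equations (1)-(6) with differentiable operations; it is an illustration of the flow in fig. 6, not the patent's exact procedure.

```python
import torch

def train(model: torch.nn.Module, loader, sequence_loss,
          epochs: int = 30, lr: float = 1e-4):
    """Batch-read image sequences, predict depth for every frame, compute the
    target loss of each frame against its previous frame, and update weights."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                              # repeat until the loss converges
        for frames, gt_depths in loader:                 # one normalized image sequence
            depth_set = [model(f) for f in frames]       # depth prediction image set
            loss = sum(sequence_loss(depth_set[t], depth_set[t - 1], gt_depths[t])
                       for t in range(1, len(depth_set)))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```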
In summary, in the exemplary embodiment, the consistency and continuity of depth prediction can be effectively optimized by using the consistency of features and structures in an image sequence, so that the depth prediction of a depth prediction model on a stationary object in the sequence can be basically guaranteed to be unchanged, and the transition smoothness of the depth prediction on a moving object can be guaranteed.
In addition, when the depth prediction model obtained by this method is applied to a video blurring function, the problem that the blurring strength is inaccurate or uneven can be effectively alleviated. For example, for the portrait scene containing a portrait shown in fig. 7, processing with the depth prediction model obtained by the training method of the present disclosure yields the depth prediction image 1 shown in fig. 8, while processing with a depth prediction model obtained by a conventional training method that uses only single-image depth annotations as reference yields the depth prediction image 2 shown in fig. 9. By comparing depth prediction image 1 with depth prediction image 2, the problem of inaccurate or uneven blurring strength can be identified and improved.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Further, referring to fig. 10, an embodiment of the present invention further provides a training apparatus 1000 for a depth prediction model, which includes a sequence processing module 1010, an image obtaining module 1020, a loss calculating module 1030, and a weight updating module 1040. Wherein:
the sequence processing module 1010 may be configured to input the image sequence into a model to be trained, so as to obtain a depth prediction image set corresponding to the image sequence; the image sequence comprises at least two images corresponding to the target scene acquired from the same view angle.
The image obtaining module 1020 may be configured to obtain, in the depth prediction image set, a first depth prediction image corresponding to the current image and a second depth prediction image corresponding to a previous image of the current image in the image sequence.
The loss calculation module 1030 can be configured to calculate a target loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image.
The weight update module 1040 may be configured to perform weight update on the model to be trained based on the target loss function to obtain the depth prediction model.
In an exemplary embodiment, the loss calculating module 1030 may be configured to calculate an absolute loss function corresponding to the current image based on the first depth prediction image; calculating a relative loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image; and calculating a target loss function corresponding to the current image based on the absolute loss function and the relative loss function.
In an exemplary embodiment, the absolute loss function includes a depth loss function, and the loss calculation module 1030 may be configured to calculate, for each pixel, a logarithmic difference between a depth value of the first depth prediction image and a depth value of a depth annotation image corresponding to the current image; and calculating a depth loss function corresponding to the current image based on the logarithmic difference.
In an exemplary embodiment, the absolute loss function further includes a smoothing loss function, and the loss calculation module 1030 is configured to calculate a horizontal partial derivative and a vertical partial derivative of each pixel on the first depth prediction image, and to calculate the smoothing loss function corresponding to the current image based on the horizontal and vertical partial derivatives of each pixel.
In an exemplary embodiment, the loss calculation module 1030 may be configured to calculate a similarity based on the first depth prediction image and the second depth prediction image; and calculating a corresponding relative loss function of the current image based on the similarity.
In an exemplary embodiment, the similarity includes an image similarity and a feature similarity, and the loss calculating module 1030 may be configured to calculate the image similarity of the first depth prediction image and the second depth prediction image; and acquiring a first output of a target characteristic layer corresponding to the first depth prediction image and a second output of the target characteristic layer corresponding to the second depth prediction image, and calculating the characteristic similarity based on the first output and the second output.
In an exemplary embodiment, the sequence processing module 1010 may be configured to normalize all images included in the image sequence to obtain a normalized image sequence.
The specific details of each module in the above apparatus have been described in detail in the method section, and details that are not disclosed may refer to the method section, and thus are not described again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the present disclosure may also be implemented in a form of a program product including program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the above section "exemplary method" of this specification, when the program product is run on the terminal device, for example, any one or more of the steps in fig. 3, fig. 4, and fig. 6 may be performed.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.
Claims (10)
1. A training method of a depth prediction model is characterized by comprising the following steps:
inputting an image sequence into a model to be trained to obtain a depth prediction image set corresponding to the image sequence; the image sequence comprises at least two images corresponding to a target scene acquired from the same visual angle;
acquiring a first depth prediction image corresponding to a current image and a second depth prediction image corresponding to a previous image of the current image in the image sequence from the depth prediction image set;
calculating a target loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image;
and updating the weight of the model to be trained based on the target loss function to obtain a depth prediction model.
2. The method according to claim 1, wherein the calculating the target loss function for the current picture based on the first depth prediction picture and the second depth prediction picture comprises:
calculating an absolute loss function corresponding to the current image based on the first depth prediction image;
calculating a relative loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image;
and calculating a target loss function corresponding to the current image based on the absolute loss function and the relative loss function.
3. The method of claim 2, wherein the absolute loss function comprises a depth loss function;
the calculating the absolute loss function corresponding to the current image based on the first depth prediction image comprises:
calculating a logarithmic difference between the depth value of the first depth prediction image and the depth value of the depth labeling image corresponding to the current image for each pixel;
and calculating a depth loss function corresponding to the current image based on the logarithmic difference.
4. The method according to claim 3, wherein the absolute loss function further comprises a smoothing loss function, and
calculating the absolute loss function corresponding to the current image based on the first depth prediction image further comprises:
calculating a horizontal partial derivative and a vertical partial derivative of each pixel of the first depth prediction image; and
calculating the smoothing loss function corresponding to the current image based on the horizontal partial derivative and the vertical partial derivative of each pixel.
5. The method according to claim 2, wherein calculating the relative loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image comprises:
calculating a similarity based on the first depth prediction image and the second depth prediction image; and
calculating the relative loss function corresponding to the current image based on the similarity.
6. The method according to claim 5, wherein the similarity comprises an image similarity and a feature similarity, and
calculating the similarity based on the first depth prediction image and the second depth prediction image comprises:
calculating the image similarity between the first depth prediction image and the second depth prediction image; and
acquiring a first output of a target feature layer corresponding to the first depth prediction image and a second output of the target feature layer corresponding to the second depth prediction image, and calculating the feature similarity based on the first output and the second output.
7. The method according to claim 1, wherein, before inputting the image sequence into the model to be trained, the method further comprises:
normalizing all images contained in the image sequence to obtain a normalized image sequence.
8. An apparatus for training a depth prediction model, comprising:
a sequence processing module configured to input an image sequence into a model to be trained to obtain a depth prediction image set corresponding to the image sequence, wherein the image sequence comprises at least two images of a target scene acquired from the same viewing angle;
an image acquisition module configured to acquire, from the depth prediction image set, a first depth prediction image corresponding to a current image in the image sequence and a second depth prediction image corresponding to the image preceding the current image;
a loss calculation module configured to calculate a target loss function corresponding to the current image based on the first depth prediction image and the second depth prediction image; and
a weight updating module configured to update the weights of the model to be trained based on the target loss function to obtain the depth prediction model.
9. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method according to any one of claims 1 to 7 via execution of the executable instructions.
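The claims above describe the absolute loss only functionally. As a non-authoritative illustration, the following PyTorch sketch shows one plausible form of the depth loss and smoothing loss of claims 2 to 4; the eps guard, the mean aggregation over pixels, and the weights alpha and beta are assumptions introduced here and are not specified in the claims.

```python
import torch

def depth_loss(pred_depth, annotated_depth, eps=1e-6):
    """Per-pixel log-difference depth loss (claim 3).

    Both tensors have shape (1, H, W); eps avoids log(0) and is an assumption.
    """
    log_diff = torch.log(pred_depth + eps) - torch.log(annotated_depth + eps)
    # Aggregating the per-pixel log differences with a mean is one common choice.
    return log_diff.abs().mean()

def smoothness_loss(pred_depth):
    """Smoothing loss built from per-pixel partial derivatives (claim 4)."""
    dx = pred_depth[..., :, 1:] - pred_depth[..., :, :-1]  # horizontal partial derivative
    dy = pred_depth[..., 1:, :] - pred_depth[..., :-1, :]  # vertical partial derivative
    return dx.abs().mean() + dy.abs().mean()

def absolute_loss(pred_depth, annotated_depth, alpha=1.0, beta=0.1):
    """Absolute loss = depth loss + weighted smoothing loss (claims 2-4).
    The weights alpha and beta are illustrative assumptions."""
    return alpha * depth_loss(pred_depth, annotated_depth) + beta * smoothness_loss(pred_depth)
```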
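Similarly, the relative loss of claims 5 and 6 compares consecutive depth predictions through an image similarity and a feature similarity. The claims do not fix the metrics, so the L1-based image similarity, the cosine feature similarity, and the idea of negating the combined similarity to turn it into a loss are all assumptions in this sketch.

```python
import torch
import torch.nn.functional as F

def image_similarity(d_curr, d_prev):
    """Image-level similarity of two depth prediction images (claim 6).
    A negative-mean-absolute-difference similarity is assumed here."""
    return 1.0 - (d_curr - d_prev).abs().mean()

def feature_similarity(feat_curr, feat_prev):
    """Similarity between the outputs of a target feature layer for the two
    predictions (claim 6); cosine similarity is an assumed choice."""
    return F.cosine_similarity(feat_curr.flatten(), feat_prev.flatten(), dim=0)

def relative_loss(d_curr, d_prev, feat_curr=None, feat_prev=None):
    """Relative loss between consecutive depth predictions (claim 5).
    Higher similarity should give a lower loss, hence the negation."""
    sim = image_similarity(d_curr, d_prev)
    if feat_curr is not None and feat_prev is not None:
        sim = sim + feature_similarity(feat_curr, feat_prev)
    return -sim
```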
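Putting the pieces together, a single training step in the spirit of claims 1, 2 and 7 could look like the sketch below. It reuses the absolute_loss and relative_loss helpers from the previous sketches; the min-max normalization, the loss weights w_abs and w_rel, and the assumption that the model maps a (T, C, H, W) sequence to (T, 1, H, W) depth predictions are illustrative choices rather than details fixed by the claims.

```python
import torch

def normalize_sequence(image_sequence, eps=1e-6):
    """Normalize all images in the sequence before feeding the model (claim 7).
    Per-sequence min-max normalization is an assumed concrete choice."""
    lo, hi = image_sequence.min(), image_sequence.max()
    return (image_sequence - lo) / (hi - lo + eps)

def train_step(model, optimizer, image_sequence, annotated_depths, w_abs=1.0, w_rel=0.5):
    """One weight update following the structure of claims 1 and 2."""
    images = normalize_sequence(image_sequence)   # claim 7
    depth_preds = model(images)                   # depth prediction image set (claim 1)

    total_loss = images.new_zeros(())
    for t in range(1, images.shape[0]):
        d_curr, d_prev = depth_preds[t], depth_preds[t - 1]  # first / second depth prediction image
        total_loss = total_loss + (
            w_abs * absolute_loss(d_curr, annotated_depths[t])  # compared against the annotation
            + w_rel * relative_loss(d_curr, d_prev)             # compared against the previous frame
        )

    # Update the weights of the model to be trained (claim 1, last step).
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

Driving train_step with, for example, torch.optim.Adam(model.parameters(), lr=1e-4) over many image sequences would then yield the trained depth prediction model of claim 1; the optimizer and learning rate are again assumptions, not part of the claims.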
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110852004.2A CN113610911A (en) | 2021-07-27 | 2021-07-27 | Training method and device of depth prediction model, medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113610911A true CN113610911A (en) | 2021-11-05 |
Family
ID=78305598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110852004.2A Pending CN113610911A (en) | 2021-07-27 | 2021-07-27 | Training method and device of depth prediction model, medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113610911A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200258249A1 (en) * | 2017-11-15 | 2020-08-13 | Google Llc | Unsupervised learning of image depth and ego-motion prediction neural networks |
CN112215248A (en) * | 2019-07-11 | 2021-01-12 | 深圳先进技术研究院 | Deep learning model training method and device, electronic equipment and storage medium |
CN111476835A (en) * | 2020-05-21 | 2020-07-31 | 中国科学院自动化研究所 | Unsupervised depth prediction method, system and device for consistency of multi-view images |
CN112348843A (en) * | 2020-10-29 | 2021-02-09 | 北京嘀嘀无限科技发展有限公司 | Method and device for adjusting depth image prediction model and electronic equipment |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114913098A (en) * | 2022-06-28 | 2022-08-16 | 苏州科达科技股份有限公司 | Image processing hyper-parameter optimization method, system, device and storage medium |
CN116703995A (en) * | 2022-10-31 | 2023-09-05 | 荣耀终端有限公司 | Video blurring processing method and device |
CN116703995B (en) * | 2022-10-31 | 2024-05-14 | 荣耀终端有限公司 | Video blurring processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11417014B2 (en) | Method and apparatus for constructing map | |
US20240046557A1 (en) | Method, device, and non-transitory computer-readable storage medium for reconstructing a three-dimensional model | |
CN112562019A (en) | Image color adjusting method and device, computer readable medium and electronic equipment | |
CN111967515B (en) | Image information extraction method, training method and device, medium and electronic equipment | |
US20220358675A1 (en) | Method for training model, method for processing video, device and storage medium | |
CN113610911A (en) | Training method and device of depth prediction model, medium and electronic equipment | |
CN111950570B (en) | Target image extraction method, neural network training method and device | |
CN111368668A (en) | Three-dimensional hand recognition method and device, electronic equipment and storage medium | |
CN114049417B (en) | Virtual character image generation method and device, readable medium and electronic equipment | |
CN113902636A (en) | Image deblurring method and device, computer readable medium and electronic equipment | |
CN114139703A (en) | Knowledge distillation method and device, storage medium and electronic equipment | |
CN113284206A (en) | Information acquisition method and device, computer readable storage medium and electronic equipment | |
CN116934591A (en) | Image stitching method, device and equipment for multi-scale feature extraction and storage medium | |
CN112001943B (en) | Motion estimation method and device, computer readable medium and electronic equipment | |
CN114694257B (en) | Multi-user real-time three-dimensional action recognition evaluation method, device, equipment and medium | |
CN116258800A (en) | Expression driving method, device, equipment and medium | |
CN111798385B (en) | Image processing method and device, computer readable medium and electronic equipment | |
CN114399627A (en) | Image annotation method and device, electronic equipment and computer readable medium | |
CN114119413A (en) | Image processing method and device, readable medium and mobile terminal | |
CN114494574A (en) | Deep learning monocular three-dimensional reconstruction method and system based on multi-loss function constraint | |
CN113610724A (en) | Image optimization method and device, storage medium and electronic equipment | |
CN113610879A (en) | Training method and device of depth prediction model, medium and electronic equipment | |
CN113205530A (en) | Shadow area processing method and device, computer readable medium and electronic equipment | |
CN112950516A (en) | Method and device for enhancing local contrast of image, storage medium and electronic equipment | |
CN113240796B (en) | Visual task processing method and device, computer readable medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||