CN111784757B - Training method of depth estimation model, depth estimation method, device and equipment


Info

Publication number
CN111784757B
Authority
CN
China
Prior art keywords
image set
image
target
original
depth map
Prior art date
Legal status
Active
Application number
CN202010611746.1A
Other languages
Chinese (zh)
Other versions
CN111784757A (en)
Inventor
叶晓青
谭啸
孙昊
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010611746.1A
Publication of CN111784757A
Application granted
Publication of CN111784757B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds

Abstract

The application discloses a training method of a depth estimation model, a depth estimation method, a device and equipment, and relates to the fields of computer vision and deep learning. The specific implementation scheme is as follows: acquiring a plurality of image sets, wherein the internal parameters of the plurality of image sets are different, each image set comprises at least one original depth map, and the original depth maps within the same image set share the same internal parameters; performing internal reference processing on the plurality of image sets to obtain a plurality of target image sets whose internal parameters are all the same; and training a neural network according to the plurality of target image sets to obtain a depth estimation model. Because the original depth maps of the image sets with different internal parameters undergo internal reference processing before the neural network is trained on them, their internal parameters are unified, and the accuracy of depth estimation is therefore improved when the neural network is trained.

Description

Training method of depth estimation model, depth estimation method, device and equipment
Technical Field
The embodiments of the application relate to the fields of computer vision and deep learning in image processing, and in particular to a training method of a depth estimation model, a depth estimation method, a device and equipment.
Background
Currently, monocular depth estimation is realized by training a neural network on images captured by a plurality of monocular cameras, and then inputting the image to be estimated into the trained neural network to obtain its depth estimate.
The training process requires a large number of training image sets. To generate the training image sets efficiently, a plurality of monocular cameras are often required to shoot simultaneously; however, the training image sets obtained in this way often cause inaccurate monocular depth estimation.
Disclosure of Invention
The application provides a training method of a depth estimation model, a depth estimation method, a device and equipment for improving the accuracy of monocular depth estimation.
According to a first aspect of the present application, there is provided a training method of a depth estimation model, including: acquiring a plurality of image sets, wherein the internal parameters of the plurality of image sets are different, each image set comprises at least one original depth map, and the internal parameters of the original depth maps included in the same image set are the same; performing internal reference processing on the plurality of image sets to obtain a plurality of target image sets, wherein the internal references of the plurality of target image sets are the same; and training the neural network according to the plurality of target image sets to obtain a depth estimation model.
According to a second aspect of the present application, there is provided a depth estimation method comprising: acquiring an image to be estimated; and inputting the image to be estimated into a depth estimation model obtained through training by the training method according to the first aspect, and obtaining a depth map corresponding to the image to be estimated.
According to a third aspect of the present application, there is provided a training apparatus of a depth estimation model, comprising: the first acquisition module is used for acquiring a plurality of image sets, wherein the internal references of the plurality of image sets are different, each image set comprises at least one original depth map, and the internal references of the original depth maps included in the same image set are the same; the internal reference processing module is used for carrying out internal reference processing on the plurality of image sets to obtain a plurality of target image sets, and the internal references of the plurality of target image sets are the same; and the training module is used for training the neural network according to the plurality of target image sets to obtain a depth estimation model.
According to a fourth aspect of the present application, there is provided a depth estimation apparatus comprising: the second acquisition module is used for acquiring an image to be estimated; and the input module is used for inputting the image to be estimated into a depth estimation model obtained through training by the training method according to the first aspect, and obtaining a depth map corresponding to the image to be estimated.
According to a fifth aspect of the present application, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a sixth aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the first aspect.
The technique according to the present application solves the problem of inaccurate existing monocular depth estimation.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
fig. 1 is a schematic view of an application scenario in an embodiment of the present application;
FIG. 2 is a schematic diagram of training principle of a depth estimation model according to an embodiment of the present application;
FIG. 3 is a flow chart of a training method of a depth estimation model according to an embodiment of the present application;
FIG. 4 is a schematic illustration of an exemplary internal reference process of the present application;
FIG. 5 is a schematic illustration of an internal reference process of another example of the present application;
FIG. 6 is a schematic diagram of a size conversion factor determination process according to an example of the present application;
FIG. 7 is a schematic diagram of a size conversion factor determination process of another example of the present application;
FIG. 8 is a flow chart of a depth estimation method of an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a training device of a depth estimation model according to an embodiment of the present application;
fig. 10 is a schematic structural view of a depth estimation device according to an embodiment of the present application;
FIG. 11 is a block diagram of an electronic device for implementing a training method and/or a depth estimation method for a depth estimation model according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is an application scenario diagram of an embodiment of the present application. As shown in fig. 1, the application scenario includes a data processing device 11 and a user device 12. The data processing device 11 is configured to train the neural network according to a sample image set, where the sample image set includes at least one sample image and a depth map corresponding to the sample image. As shown in fig. 2, the at least one sample image and the corresponding depth map are input into the neural network to obtain a depth prediction map of the sample image, and the network parameters of the neural network are adjusted according to the difference between the depth prediction map and the depth map until the neural network reaches a convergence state, thereby obtaining a depth estimation model. The trained depth estimation model may be stored in the user device 12, and a user may obtain the depth map corresponding to an image by inputting the image to be estimated into the depth estimation model. The data processing device 11 may be a desktop computer, a notebook computer, a smart phone, an iPad, a server, etc., and the user device 12 may be a desktop computer, a notebook computer, a smart phone, an iPad, etc.
A depth image (depth image), also called a range image, is an image whose pixel values are the distances (depths) from the image acquisition device to points in the scene; it directly reflects the geometry of the scene's visible surface. In the depth map, each pixel value represents the distance from the object at that particular (x, y) coordinate in the field of view of the depth sensor to the camera plane.
In the above process, the inventors found that: in a monocular depth estimation scenario, the sample images are often acquired by a plurality of different monocular cameras, and the perspective shortening (foreshortening) effect and scale uncertainty in a single image shot by a single camera make monocular depth estimation inaccurate. For example, a sample image 1 containing an object A is acquired by monocular camera 1, and a sample image 2 containing the same object A is acquired by monocular camera 2. Because monocular camera 1 and monocular camera 2 have different internal parameters, object A may appear at different depths in sample image 1 and sample image 2, that is, different depth information corresponds to object A in the two images, even though the actual depth information of sample image 1 and sample image 2 is the same. The neural network is therefore likely to learn different depth information for sample image 1 and sample image 2, which causes inaccurate monocular depth estimation.
In the embodiments of the present application, the image sets collected by different monocular cameras are standardized: the image sets collected by different monocular cameras with different internal parameters are unified under the same internal parameters, and only then are they used to train the neural network, so the accuracy of monocular depth estimation can be improved.
It should be noted that in this embodiment, the data processing device 11 and the user device 12 may be integrated into one electronic device, that is, the electronic device may be used for training a neural network or performing depth estimation on an image.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The application provides a training method, a depth estimation method, a device and equipment of a depth estimation model, which are applied to the fields of computer vision and deep learning in the field of image processing so as to achieve the aim of improving monocular depth estimation accuracy.
According to an embodiment of the application, a training method of a depth estimation model is provided.
As shown in fig. 3, a flowchart of a training method of a depth estimation model according to an embodiment of the present application is shown. The training method of the depth estimation model comprises the following steps:
Step S301, a plurality of image sets are acquired.
Wherein the internal parameters of the plurality of image sets are different, each image set comprises at least one original depth map, and the internal parameters of the original depth maps included in the same image set are the same.
In this embodiment, the internal parameters of an image set refer to the internal parameters (i.e., the intrinsic parameters) of the image acquisition device, for example the monocular camera, that collected it; these are parameters related to the characteristics of the camera itself, such as the focal length and the pixel size. One monocular camera collects one image set, and the plurality of image sets are collected by different monocular cameras; therefore, the internal parameters of the original depth maps in the same image set are the same, while the internal parameters of the plurality of image sets are different.
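For illustration only (not part of the disclosure), the internal parameters can be viewed as the entries of a pinhole-camera intrinsic matrix; the focal length f used in the formulas later in this description corresponds to such an entry. A minimal Python sketch, assuming square pixels and hypothetical values:

```python
import numpy as np

def intrinsic_matrix(f, cx, cy):
    """Pinhole-camera intrinsic matrix for focal length f (in pixels)
    and principal point (cx, cy); square pixels are assumed."""
    return np.array([[f, 0.0, cx],
                     [0.0, f, cy],
                     [0.0, 0.0, 1.0]])

# two monocular cameras with different internal parameters (hypothetical values)
K1 = intrinsic_matrix(f=1200.0, cx=640.0, cy=360.0)
K2 = intrinsic_matrix(f=800.0, cx=512.0, cy=288.0)
```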
And step S302, performing internal reference processing on the plurality of image sets to obtain a plurality of target image sets.
The internal parameters of the plurality of target image sets are the same, the plurality of target image sets are in one-to-one correspondence with the plurality of image sets, that is to say, one image set is subjected to internal parameter processing, and one target image set can be obtained. Each target image set includes at least one target depth map.
In the step, the original depth map in the plurality of image sets is subjected to internal reference processing to obtain a plurality of target image sets with the same internal reference. For example, assume that the image sets include image sets 1 to N, and the image sets 1 to N correspond to the references 1 to N, respectively, at least two references from the references 1 to N are different, and after the reference processing in this step, the references of the N image sets are processed into the same reference, and the image sets after the reference processing are multiple target image sets.
Step S303, training a neural network according to a plurality of target image sets to obtain a depth estimation model.
The method comprises the steps of inputting a target image set into a neural network to obtain a depth map prediction result, adjusting network parameters of the neural network according to the difference between the depth map prediction result and a corresponding original depth map, repeating the process until iterative training reaches a convergence state, and storing the network parameters obtained by current round training to obtain a depth estimation model after training is finished. For a specific training procedure of the neural network, reference may be made to the description of the prior art, and this embodiment will not be described in detail here.
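The following is a minimal sketch of such an iterative training loop, assuming a PyTorch setup; `model`, `data_loader` and `loss_fn` are placeholders rather than components specified by this embodiment:

```python
import torch

def train_depth_model(model, data_loader, loss_fn, epochs=20, lr=1e-4):
    """Sketch of the iterative training procedure described above:
    feed the target image set to the network, compare the prediction
    with its depth-map label, and adjust the network parameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for image, target_depth in data_loader:   # samples from the target image sets
            pred = model(image)                    # depth map prediction result
            loss = loss_fn(pred, target_depth)     # difference to the label
            optimizer.zero_grad()
            loss.backward()                        # adjust network parameters
            optimizer.step()
    return model  # parameters of the final round form the depth estimation model
```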
In this embodiment, a plurality of image sets with different internal parameters are acquired and processed into a plurality of target image sets with the same internal parameters; each image set comprises at least one original depth map, and the original depth maps within the same image set share the same internal parameters. A neural network is then trained according to the plurality of target image sets to obtain a depth estimation model. Because the original depth maps of the image sets with different internal parameters undergo internal reference processing before the neural network is trained on them, their internal parameters are unified, and the accuracy of depth estimation is therefore improved when the neural network is trained.
In the implementation process of processing the image sets with different internal parameters into the target image sets with the same internal parameters, a standard image set can be determined first, and then the internal parameters of the image sets are processed so that the internal parameters of the image sets are the same, wherein the standard image set can be one image set in the image sets or can be an image set outside the image sets. The following describes the above two implementation procedures in detail:
in a first alternative embodiment, performing an internal reference process on the plurality of image sets to obtain a plurality of target image sets, including:
and a1, selecting one image set from a plurality of image sets as a standard image set.
In this embodiment, the standard image set is one of a plurality of image sets.
And a2, performing internal reference processing on the original depth maps of the residual image sets in the plurality of image sets to obtain a target image set corresponding to the residual image sets.
The internal parameters of the target depth map included in the target image set are the same as those of the images in the standard image set. The target image sets corresponding to the standard image set and the residual image set form a plurality of target image sets.
The above steps a1 and a2 are illustrated by a specific example:
In an alternative example, as shown in fig. 4, assume that a plurality of image sets includes an image set 1, an image set 2, and an image set 3, and that the image set 1, the image set 2, and the image set 3 correspond to an internal reference 1, an internal reference 2, and an internal reference 3, respectively; taking the image set 1 as a standard image set, and performing internal reference processing on original depth maps in the image set 2 and the image set 3 to obtain an image set 2 '(target image set 2) and an image set 3' (target image set 3), wherein internal references of the image set 2 'and the image set 3' are internal reference 1 after internal reference processing.
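A sketch of this first embodiment, assuming each image set is a list of depth arrays and its internal parameter is a single focal-length value; the function and variable names are illustrative, `scales` are the size conversion factors defined further below, and the per-map transformation follows formula (1) below:

```python
def unify_internal_parameters(image_sets, focals, scales, std_index=0):
    """Turn image sets with different internal parameters (focal lengths
    `focals`) into target image sets that all share the internal
    parameters of the standard image set `image_sets[std_index]`."""
    f_std = focals[std_index]
    target_sets = []
    for i, depth_maps in enumerate(image_sets):
        if i == std_index:
            target_sets.append(depth_maps)          # standard image set kept as-is
        else:
            target_sets.append([scales[i] * (focals[i] / f_std) * d
                                for d in depth_maps])
    return target_sets
```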
In a second alternative embodiment, performing internal reference processing on the multiple image sets to obtain multiple target image sets, including:
and b1, determining a standard image set.
In this embodiment, the standard image set does not belong to the plurality of image sets.
And b2, performing internal reference processing on the original depth maps of the image sets to obtain target image sets corresponding to each image set.
The internal parameters of the target depth map included in the target image set are the same as those of the images in the standard image set.
The above steps b1 and b2 are illustrated by a specific example:
in another alternative example, as shown in fig. 5, assume that there are image set 0, image set 1, image set 2, and image set 3, and that image set 1, image set 2, and image set 3 constitute a plurality of image sets; image set 0, image set 1, image set 2 and image set 3 correspond to reference 0, reference 1, reference 2 and reference 3, respectively; taking the image set 0 as a standard image set, and performing internal reference processing on original depth maps in the image set 1, the image set 2 and the image set 3 to obtain an image set 1 '(target image set 1), an image set 2' (target image set 2) and an image set 3 '(target image set 3), wherein the internal reference of the image set 1', the image set 2 'and the image set 3' is the internal reference 0 after the internal reference processing.
On the basis of the first implementation manner, performing internal reference processing on the multiple image sets to obtain multiple target image sets, where the internal reference processing includes: and determining a target image set corresponding to the residual image set according to the internal parameters of the residual image set, the internal parameters of the standard image set and the size conversion factors. The implementation process of this embodiment may be represented by the following formula (1):
Depth_n' = scale_n * (f_n / f_1) * Depth_n;  (1)
where Depth_n' represents the target depth map of the nth target image set, scale_n represents the size conversion factor of the nth image set, f_n represents the internal parameter of the nth image set, f_1 represents the internal parameter of the standard image set, and Depth_n represents the original depth map of the nth image set; n takes values from 2 to N, and N is an integer greater than 0.
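A one-line sketch of formula (1); the function and argument names are illustrative, not taken from the disclosure:

```python
def transform_depth(depth_n, f_n, f_std, scale_n):
    """Formula (1): Depth_n' = scale_n * (f_n / f_std) * Depth_n.
    `depth_n` is the original depth map (e.g. a NumPy array), `f_n` the
    internal parameter (focal length) of image set n, `f_std` that of the
    standard image set, and `scale_n` the size conversion factor."""
    return scale_n * (f_n / f_std) * depth_n
```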
On the basis of the second implementation manner, performing internal reference processing on the multiple image sets to obtain multiple target image sets, where the internal reference processing includes: and determining a target image set corresponding to the plurality of image sets according to the internal parameters of the plurality of image sets, the internal parameters of the standard image set and the size conversion factors. The implementation process of this embodiment may be expressed by the following formula (2):
Depth_n' = scale_n * (f_n / f_0) * Depth_n;  (2)
where Depth_n' represents the target depth map of the nth target image set, scale_n represents the size conversion factor of the nth image set, f_n represents the internal parameter of the nth image set, f_0 represents the internal parameter of the standard image set, and Depth_n represents the original depth map of the nth image set; n takes values from 1 to N, and N is an integer greater than 0.
Wherein the size conversion factor in the above formula (1) and formula (2) is determined according to the following method:
and if the length-to-width ratio of the original depth map in the residual image set is equal to that of the original depth map in the standard image set, taking the length ratio of the original depth map in the residual image set to the original depth map in the standard image set as a size conversion factor.
For example, as shown in fig. 6, assume that the plurality of image sets includes N image sets, denoted as image set 1 (W_1, H_1), image set 2 (W_2, H_2), ..., image set N (W_N, H_N). The 1st image set (image set 1) of the N image sets is selected as the standard image set, and a remaining image set is denoted as image set i (W_i, H_i). If W_i/H_i = W_1/H_1, then W_i/W_1 is taken as the size conversion factor of the remaining image set i. Of course, in this embodiment, H_i/H_1 may also be taken as the size conversion factor of the remaining image set i.
And if the length-to-width ratio of the original depth map in the residual image set is not equal to that of the original depth map in the standard image set, taking the length ratio of the original depth map in the residual image set after the residual image set is processed to the original depth map in the standard image set as a size transformation factor, wherein the length-to-width ratio of the original depth map after the processing is equal to that of the original depth map in the standard image set.
Continuing with the above example, as shown in fig. 7, assume that the plurality of image sets includes N image sets, denoted as image set 1 (W_1, H_1), image set 2 (W_2, H_2), ..., image set N (W_N, H_N). The 1st image set is selected as the standard image set, and a remaining image set is denoted as image set i (W_i, H_i). If W_i/H_i ≠ W_1/H_1, the original depth map in the remaining image set i is first cropped into a depth image whose aspect ratio equals W_1/H_1; the length of the cropped depth image is denoted W'_i and its width H'_i, with W'_i/H'_i = W_1/H_1. The size conversion factor in this case is W'_i/W_1. Of course, in this embodiment, H'_i/H_1 may also be taken as the size conversion factor of the remaining image set i.
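A sketch of the size conversion factor determination covering both cases above, assuming the crop keeps the largest possible region with the standard aspect ratio (as described later for the size transformation); the function name is illustrative:

```python
def size_conversion_factor(w_i, h_i, w_std, h_std):
    """Size conversion factor of image set i relative to the standard image
    set of size (w_std, h_std), following the two cases described above."""
    if w_i * h_std == h_i * w_std:        # equal aspect ratios
        return w_i / w_std
    # otherwise crop to the standard aspect ratio first (largest possible region)
    w_cropped = min(w_i, h_i * w_std / h_std)
    return w_cropped / w_std
```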
After performing depth transformation on the original depth map through the embodiment, training a neural network according to a plurality of target image sets to obtain a depth estimation model, including:
and c1, inputting the target image set into a neural network to obtain the logarithmic value of the depth information.
The neural network of the present embodiment may be a self-encoding deep neural network based on an hourglass (hourglass) model, which is a multi-scale self-encoding network.
This step can be expressed as the following formula:
Target = log(Depth_n');  (3)
In formula (3), Depth_n' is the depth map output by the neural network, log(Depth_n') denotes taking the logarithm of Depth_n', and Target represents the logarithmic value of the depth information.
And c2, determining a loss function according to the logarithmic value of the depth information.
Wherein the loss function includes a first loss function and a second loss function, and determining the loss function according to the logarithmic value of the depth information includes: determining a first loss function and a second loss function according to the logarithmic value of the depth information; a loss function is determined based on the first loss function and the second loss function. Wherein the loss function may be a weighted sum of the first loss function and the second loss function. The above procedure can be expressed by the following formula (4):
L = L_depth + L_grad;  (4)
In formula (4), L_depth is an L1-norm loss function and L_grad is a gradient similarity loss function, where n is the index of an image set and N is the total number of image sets.
Here y_pred denotes the predicted depth map output by the neural network; its gradients along the x and y directions of the image are compared with the gradients of the true depth values along the x and y directions of the image. mask denotes the foreground region, so that regions of infinite depth are not considered.
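A sketch of formula (4) under the assumption that both terms are L1 penalties on log-depth restricted to the foreground mask; the exact forms of L_depth and L_grad are not reproduced in the text above, so the gradient formulation and the equal weighting below are assumptions:

```python
import torch

def depth_loss(pred_log_depth, target_depth, mask):
    """Sketch of L = L_depth + L_grad: an L1 term on log-depth plus a
    gradient similarity term, both restricted to the foreground mask."""
    target = torch.log(target_depth.clamp(min=1e-6))      # formula (3)
    l_depth = (mask * (pred_log_depth - target).abs()).mean()

    def grads(d):
        # image gradients along the x and y directions
        return d[..., :, 1:] - d[..., :, :-1], d[..., 1:, :] - d[..., :-1, :]

    px, py = grads(pred_log_depth)
    tx, ty = grads(target)
    l_grad = (mask[..., :, 1:] * (px - tx).abs()).mean() \
           + (mask[..., 1:, :] * (py - ty).abs()).mean()
    return l_depth + l_grad
```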
And step c3, adjusting network parameters of the neural network according to the loss function.
And adjusting network parameters of the neural network according to the loss function L until the neural network reaches a convergence state, thereby obtaining the depth estimation model.
In this embodiment, estimating the logarithmic value of the depth information reduces the weight of distant depths, where errors are larger, so that depths ranging from a few meters to hundreds of meters are kept within a limited range. The model is thus more sensitive to nearby depths and less sensitive to depth errors at a distance, so that the depth information can be better applied in practice.
Optionally, each image set further includes at least one original image, each original image corresponds to an original depth map, and after each original depth map is processed by the internal reference in the foregoing embodiment, a corresponding target depth map can be obtained, where the target depth map is used as label information of the corresponding original image in the training process of the neural network. Before the original image and the corresponding target depth map are input into the neural network, the original image and the target depth map need to be processed to adapt to the input picture size of the neural network.
In an alternative embodiment, before performing the internal reference processing on the multiple image sets to obtain multiple target image sets, the original image and the corresponding original depth map of each image set are subjected to size transformation, so that the sizes of the original image and the corresponding original depth map are the same as the input image size of the neural network.
It should be noted that, in this embodiment, Depth_n in formula (1) and formula (2) should be the original depth map.
Alternatively, an image region with the same size as the input picture of the neural network may be directly cut out from the original image, and a depth map region corresponding to the image region may be cut out from the original depth map corresponding to the original image.
Optionally, performing a size transformation on the original image and the corresponding original depth map of each image set so that the sizes of the original image and the corresponding original depth map are the same as the sizes of the input images of the neural network, which may be implemented by the following specific implementation procedures:
and d1, selecting one image set from the plurality of image sets as a standard image set.
Wherein the standard image set is one of a plurality of image sets. Of course, an image set that does not belong to a plurality of image sets may be selected as the standard image set, which is not particularly limited in this embodiment.
And d2, performing size transformation on the original images of the residual image sets and the corresponding original depth maps in the plurality of image sets so that the sizes of the original images of the residual image sets and the corresponding original depth maps are the same as the sizes of the original images of the standard image sets and the corresponding original depth maps.
Wherein step d2 comprises: if the aspect ratio of the original depth map in the residual image set is equal to that of the original depth map in the standard image set, the original depth map in the residual image set is directly subjected to telescopic operation (restore), wherein the telescopic operation comprises shrinking and amplifying, and the original depth map in the residual image set after the telescopic operation is the same as the original depth map in the standard image set in size.
If the aspect ratio of the original depth map in the remaining image set is not equal to that of the original depth map in the standard image set, a crop (crop) operation is first required to be performed on the original depth map in the remaining image set, and an image area with the same aspect ratio as that of the original depth map in the standard image set is cropped. In addition, the cut image area needs to be the maximum image area of the original depth map in the standard image set on the premise of ensuring that the length-width ratio of the cut image area is equal to that of the original depth map in the standard image set. And then, performing a telescoping operation on the cropped image area, wherein the telescoping operation comprises shrinking and enlarging, and the cropped image area has the same size as the original depth map in the standard image set after the telescoping operation.
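A sketch of this size transformation, assuming a centered crop when the aspect ratios differ (the description only requires the cropped region to be the largest one with the standard aspect ratio); OpenCV and the function name are used purely for illustration:

```python
import cv2

def match_standard_size(depth, w_std, h_std):
    """Resize (and, if needed, crop) an original depth map so that its size
    matches the depth maps of the standard image set (step d2)."""
    h, w = depth.shape[:2]
    if w * h_std != h * w_std:                   # aspect ratios differ: crop first
        if w * h_std > h * w_std:                # too wide -> crop width
            new_w = int(h * w_std / h_std)
            x0 = (w - new_w) // 2
            depth = depth[:, x0:x0 + new_w]
        else:                                    # too tall -> crop height
            new_h = int(w * h_std / w_std)
            y0 = (h - new_h) // 2
            depth = depth[y0:y0 + new_h, :]
    # scale (shrink or enlarge) to the size of the standard image set
    return cv2.resize(depth, (w_std, h_std), interpolation=cv2.INTER_NEAREST)
```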
After the image processing in the above two embodiments, the depth map may be subjected to depth transformation using the above formula (1) or formula (2) to obtain the target depth map.
The method in this embodiment further includes:
and d3, performing size transformation on the target sample image and the corresponding target depth map of each target image set so that the sizes of the target sample image and the corresponding target depth map are the same as the input picture size of the neural network.
After the above-mentioned processing in step d1 and step d2, it is also necessary to randomly cut out the images with the sizes required for training from the target sample image and the corresponding target depth map of each target image set.
In another alternative embodiment, after performing internal reference processing on the multiple image sets to obtain multiple target image sets, performing size transformation on the original image and the corresponding target depth map of each target image set so that the sizes of the original image and the corresponding target depth map are the same as the input image size of the neural network.
It should be noted that, in this embodiment, Depth_n in formula (1) and formula (2) should be the original depth map after the size transformation process.
Alternatively, an image area with the same size as the input picture of the neural network may be directly cut out from the original image of each target image set, and a depth map area corresponding to the image area may be cut out from the original depth map corresponding to the original image.
Optionally, performing size transformation on the original image and the corresponding target depth map of each target image set so that the sizes of the original image and the corresponding target depth map are the same as the sizes of the input images of the neural network, and the following specific implementation process may be adopted, which specifically includes:
and e1, selecting one target image set from a plurality of target image sets as a standard image set.
Wherein the standard image set is one of a plurality of image sets. Of course, an image set that does not belong to a plurality of image sets may be selected as the standard image set, which is not particularly limited in this embodiment.
And e2, performing size transformation on the original images of the residual target image sets and the corresponding target depth maps in the plurality of target image sets so that the sizes of the original images of the residual target image sets and the corresponding target depth maps are the same as the sizes of the original images of the standard image sets and the corresponding original target depth maps.
The implementation process of step e2 is similar to that of step d2, and the description of step d2 may be referred to herein, and will not be repeated here.
And e3, performing size transformation on the target sample image and the corresponding target depth map of each target image set so that the sizes of the target sample image and the corresponding target depth map are the same as the input picture size of the neural network.
Of course, in the embodiment of the present application, the original image and the original depth map may be processed by different processing methods described above, that is, the original image and the original depth map may be processed before the depth transformation, or the original image and the target depth map may be processed after the depth transformation, to obtain images with different sizes adapted to the input image of the neural network, and then the images obtained by different processing methods are used for training of the neural network, so that the training data space may be enlarged.
According to an embodiment of the application, the application further provides a depth estimation method.
As shown in fig. 8, a flow chart of a depth estimation method according to an embodiment of the present application is shown. The depth estimation method of the embodiment of the application comprises the following method steps:
Step 801, obtaining an image to be estimated;
step 802, inputting the image to be estimated into a depth estimation model obtained through training of a training method of the depth estimation model, and obtaining a depth map corresponding to the image to be estimated.
According to the method, the image to be estimated is acquired, the image to be estimated is input into the depth estimation model obtained through the training method of the depth estimation model, and the depth image corresponding to the image to be estimated is obtained.
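A sketch of this inference flow, assuming the model was trained to predict log-depth as in formula (3); the pre-processing (normalization, input size) and the function name are assumptions:

```python
import numpy as np
import torch
import cv2

def estimate_depth(model, image_path, input_size=(640, 480)):
    """Read the image to be estimated, feed it to the trained depth
    estimation model and recover the corresponding depth map."""
    bgr = cv2.imread(image_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    rgb = cv2.resize(rgb, input_size)
    tensor = torch.from_numpy(np.ascontiguousarray(rgb.transpose(2, 0, 1)))
    with torch.no_grad():
        log_depth = model(tensor.unsqueeze(0))        # network predicts log-depth
    return torch.exp(log_depth).squeeze().cpu().numpy()  # metric depth map
```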
According to the embodiment of the application, the application further provides a training device of the depth estimation model.
Fig. 9 is a schematic structural diagram of a training device of a depth estimation model according to an embodiment of the present application. The training device 90 for a depth estimation model according to the embodiment of the present application includes: a first obtaining module 91, configured to obtain a plurality of image sets, where internal references of the plurality of image sets are different, where each image set includes at least one original depth map, and internal references of original depth maps included in the same image set are the same; the internal reference processing module 92 is configured to perform internal reference processing on the multiple image sets to obtain multiple target image sets, where internal references of the multiple target image sets are the same; the training module 93 is configured to train the neural network according to the multiple target image sets to obtain a depth estimation model.
Optionally, the internal processing module 92 includes: a first selecting unit 921 for selecting one image set among the plurality of image sets as a standard image set; an internal reference processing unit 922, configured to perform internal reference processing on original depth maps of remaining image sets in the plurality of image sets, to obtain a target image set corresponding to the remaining image set, where internal references of a target depth map included in the target image set are the same as internal references of images in the standard image set; and the standard image sets and the target image sets corresponding to the residual image sets form the target image sets.
Optionally, the reference processing unit 922 performs reference processing on the original depth map of the remaining image sets in the plurality of image sets to obtain a target image set corresponding to the remaining image sets, which specifically includes: and determining a target image set corresponding to the residual image set according to the internal parameters of the residual image set, the internal parameters of the standard image set and the size transformation factor.
Optionally, each target image set includes at least one target depth map; the internal processing unit 922 determines a target image set corresponding to the remaining image set using the following formula:
Depth_n' = scale_n * (f_n / f_1) * Depth_n;
where Depth_n' represents the target depth map of the nth target image set, scale_n represents the size conversion factor of the nth image set, f_n represents the internal parameter of the nth image set, f_1 represents the internal parameter of the standard image set, and Depth_n represents the original depth map of the nth image set.
Optionally, the internal processing unit 922 determines the size conversion factor according to the following method: if the length-to-width ratio of the original depth map in the residual image set is equal to that of the original depth map in the standard image set, taking the length ratio of the original depth map in the residual image set to the original depth map in the standard image set as the size transformation factor; and if the length-to-width ratio of the original depth map in the residual image set is not equal to that of the original depth map in the standard image set, taking the length ratio of the original depth map in the residual image set after the residual image set processing to the original depth map in the standard image set as the size transformation factor, wherein the length-to-width ratio of the original depth map in the processed image set is equal to that of the original depth map in the standard image set.
Optionally, the training module 93 includes: an input unit 931 for inputting the target image set into a neural network to obtain a logarithmic value of depth information; a determining unit 932 for determining a loss function according to the logarithmic value of the depth information; an adjusting unit 933, configured to adjust a network parameter of the neural network according to the loss function.
Optionally, each image set further includes at least one original image, each original image corresponding to one of the original depth maps; the apparatus 90 further comprises: the size transformation module 94 is configured to perform a size transformation on the original image and the corresponding original depth map of each image set, so that the sizes of the original image and the corresponding original depth map are the same as the input image size of the neural network.
Optionally, the size transformation module 94 includes: a second selection unit 941 for selecting one image set among the plurality of image sets as a standard image set; a size conversion unit 942, configured to perform size conversion on an original image of a remaining image set and a corresponding original depth map in the plurality of image sets, so that the sizes of the original image of the remaining image set and the corresponding original depth map are the same as the sizes of the original image of the standard image set and the corresponding original depth map; the size transformation unit 942 is further configured to perform size transformation on the target sample image and the corresponding target depth map of each target image set, so that the sizes of the target sample image and the corresponding target depth map are the same as the input image size of the neural network.
Optionally, each target image set further includes at least one original image and a corresponding target depth map; the apparatus 90 further comprises: the size transformation module 94 is configured to perform a size transformation on the original image and the corresponding target depth map of each target image set, so that the sizes of the original image and the corresponding target depth map are the same as the input image size of the neural network.
Optionally, the size transformation module 94 includes: a second selection unit 941 for selecting one target image set among the plurality of target image sets as a standard image set; a size transformation unit 942, configured to perform size transformation on the original images of the remaining target image sets and the corresponding target depth maps in the plurality of target image sets, so that the sizes of the original images of the remaining target image sets and the corresponding target depth maps are the same as the sizes of the original images of the standard image sets and the corresponding target depth maps; the size transformation unit 942 is further configured to perform size transformation on the target sample image and the corresponding target depth map of each target image set, so that the sizes of the target sample image and the corresponding target depth map are the same as the input image size of the neural network.
In this embodiment, a plurality of image sets with different internal parameters are acquired and processed into a plurality of target image sets with the same internal parameters; each image set comprises at least one original depth map, and the original depth maps within the same image set share the same internal parameters. A neural network is then trained according to the plurality of target image sets to obtain a depth estimation model. Because the original depth maps of the image sets with different internal parameters undergo internal reference processing before the neural network is trained on them, their internal parameters are unified, and the accuracy of depth estimation is therefore improved when the neural network is trained.
According to an embodiment of the application, the application further provides a depth estimation device.
As shown in fig. 10, a schematic structural diagram of a depth estimation device according to an embodiment of the present application is shown. The depth estimation apparatus 100 of the embodiment of the present application includes: a second acquisition module 101 and an input module 102;
a second acquiring module 101, configured to acquire an image to be estimated;
and the input module 102 is used for inputting the image to be estimated into a depth estimation model obtained through training by the training method of the depth estimation model described above, and obtaining a depth map corresponding to the image to be estimated.
According to the method, the image to be estimated is acquired, the image to be estimated is input into the depth estimation model obtained through the training method of the depth estimation model, and the depth image corresponding to the image to be estimated is obtained.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 11, a block diagram of an electronic device of a training method and/or a depth estimation method of a depth estimation model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
The training method for executing the depth estimation model and the electronic device for executing the depth estimation method may be the same electronic device or different electronic devices, which is not particularly limited in this embodiment.
As shown in fig. 11, the electronic device includes: one or more processors 1101, memory 1102, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). In fig. 11, a processor 1101 is taken as an example.
Memory 1102 is a non-transitory computer-readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a training method and/or a depth estimation method for a depth estimation model provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the training method and/or the depth estimation method of the depth estimation model provided by the present application.
The memory 1102 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as training methods of a depth estimation model and/or program instructions/modules corresponding to the depth estimation methods in the embodiments of the present application (e.g., the first acquisition module 91, the reference processing module 92, the training module 93, and the size transformation module 94 shown in fig. 9, and the second acquisition module 101 and the input module 102 shown in fig. 10). The processor 1101 executes various functional applications of the server and data processing, i.e., implements the training method and/or depth estimation method of the depth estimation model in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 1102.
Memory 1102 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to a training method of the depth estimation model and/or use of an electronic device of the depth estimation method, etc. In addition, memory 1102 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1102 optionally includes memory remotely located relative to processor 1101, which may be connected to the training method of the depth estimation model and/or the electronics of the depth estimation method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The training method of the depth estimation model and/or the electronic device of the depth estimation method may further include: an input device 1103 and an output device 1104. The processor 1101, memory 1102, input device 1103 and output device 1104 may be connected by a bus or other means, for example in fig. 11.
The input device 1103 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the training method of the depth estimation model and/or the depth estimation method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, etc. The output device 1104 may include a display device, auxiliary lighting (e.g., LEDs), and haptic feedback (e.g., a vibration motor), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASIC (application specific integrated circuit), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (18)

1. A training method of a depth estimation model, comprising:
acquiring a plurality of image sets, wherein the internal parameters of the plurality of image sets are different, each image set comprises at least one original depth map, and the internal parameters of the original depth maps included in the same image set are the same; wherein, the internal parameters of the image set refer to the internal parameters of the image acquisition equipment for acquiring the image set;
selecting one image set from the plurality of image sets as a standard image set;
determining a target image set corresponding to each remaining image set according to the internal parameters of the remaining image set, the internal parameters of the standard image set and a size transformation factor, wherein the internal parameters of a target depth map included in the target image set are the same as the internal parameters of the images in the standard image set;
wherein the standard image set and the target image sets corresponding to the remaining image sets form a plurality of target image sets, and the internal parameters of the plurality of target image sets are the same;
Training a neural network according to the plurality of target image sets to obtain a depth estimation model;
wherein the determining of the target image set corresponding to the remaining image set according to the internal parameters of the remaining image set, the internal parameters of the standard image set and the size transformation factor comprises:
determining a first ratio, wherein the first ratio is a ratio of the internal parameters of the remaining image set to the internal parameters of the standard image set;
and determining the product of the size transformation factor of the remaining image set, the first ratio and the original depth map of the remaining image set as a target depth map of the target image set;
wherein the size conversion factor is determined according to the following method:
if the length-to-width ratio of the original depth map in the remaining image set is equal to the length-to-width ratio of the original depth map in the standard image set, taking the ratio of the length of the original depth map in the remaining image set to the length of the original depth map in the standard image set as the size transformation factor;
and if the length-to-width ratio of the original depth map in the remaining image set is not equal to the length-to-width ratio of the original depth map in the standard image set, processing the remaining image set so that the length-to-width ratio of its original depth map is equal to the length-to-width ratio of the original depth map in the standard image set, and taking the ratio of the length of the processed original depth map to the length of the original depth map in the standard image set as the size transformation factor.
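Purely as an illustrative sketch of the internal-parameter unification described in claim 1 above, and not as the claimed implementation: the Python/NumPy code below assumes the internal parameter is a focal length in pixels, that the optional aspect-ratio processing is a center crop, and that the first ratio follows the claim's wording (the remaining set's internal parameter over the standard set's); the names crop_to_aspect, unify_intrinsics, focal_n, and so on are hypothetical.

```python
import numpy as np

def crop_to_aspect(depth: np.ndarray, aspect_std: float) -> np.ndarray:
    """Hypothetical processing step: center-crop a depth map so that its
    length-to-width ratio matches that of the standard image set."""
    h, w = depth.shape
    if np.isclose(w / h, aspect_std):
        return depth
    if w / h > aspect_std:                    # too wide: crop the width
        target_w = int(round(h * aspect_std))
        off = (w - target_w) // 2
        return depth[:, off:off + target_w]
    target_h = int(round(w / aspect_std))     # too tall: crop the height
    off = (h - target_h) // 2
    return depth[off:off + target_h, :]

def unify_intrinsics(depth_n: np.ndarray, focal_n: float,
                     depth_std: np.ndarray, focal_std: float) -> np.ndarray:
    """Target depth map = size transformation factor x first ratio x original depth map."""
    h_std, w_std = depth_std.shape
    depth_proc = crop_to_aspect(depth_n, w_std / h_std)
    s_n = depth_proc.shape[1] / w_std          # size transformation factor (length ratio)
    first_ratio = focal_n / focal_std          # direction taken from the claim's wording; an assumption
    return s_n * first_ratio * depth_proc
```

In this sketch, applying the function to the standard image set itself leaves its depth maps unchanged (size factor and ratio both equal 1), which is consistent with the claim's statement that the standard image set is one of the target image sets.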
2. The method of claim 1, wherein each target image set comprises at least one target depth map;
and the determining of the target image set corresponding to the remaining image set according to the internal parameters of the remaining image set, the internal parameters of the standard image set and the size transformation factor is implemented by the following formula:
D′_n = s_n · (f_n / f_0) · D_n
where D′_n denotes the target depth map of the nth target image set; s_n denotes the size transformation factor of the nth image set; f_n denotes the internal parameter of the nth image set; f_0 denotes the internal parameter of the standard image set; and D_n denotes the original depth map of the nth image set.
3. The method according to claim 1 or 2, wherein the training the neural network according to the plurality of target image sets to obtain a depth estimation model comprises:
inputting the target image set into a neural network to obtain the logarithmic value of depth information;
determining a loss function according to the logarithmic value of the depth information;
and adjusting network parameters of the neural network according to the loss function.
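Claim 3 only requires that the network outputs the logarithm of the depth and that the loss is determined from that logarithm; the concrete loss below (a plain L1 distance in log-depth space, written with PyTorch) is an assumption added for illustration, as are the names log_depth_l1_loss, pred_log_depth and eps.

```python
import torch

def log_depth_l1_loss(pred_log_depth: torch.Tensor,
                      target_depth: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    """L1 loss between the predicted log-depth and the log of the target depth."""
    target_log_depth = torch.log(target_depth.clamp(min=eps))  # eps avoids log(0)
    return torch.abs(pred_log_depth - target_log_depth).mean()

# Sketch of one optimisation step (model and optimizer are assumed to exist):
#   loss = log_depth_l1_loss(model(images), target_depths)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```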
4. The method of claim 1 or 2, wherein each image set further comprises at least one original image, each original image corresponding to one of the original depth maps;
Before the internal reference processing is performed on the plurality of image sets to obtain a plurality of target image sets, the method further comprises:
and performing size transformation on the original image and the corresponding original depth map of each image set so that the sizes of the original image and the corresponding original depth map are the same as the size of the input picture of the neural network.
5. The method of claim 4, wherein the performing a size transformation on the original image and the corresponding original depth map for each image set such that the original image and the corresponding original depth map have the same size as the input picture size of the neural network comprises:
selecting one image set from the plurality of image sets as a standard image set;
performing size transformation on the original images and the corresponding original depth maps of the remaining image sets in the plurality of image sets, so that the sizes of the original images and the corresponding original depth maps of the remaining image sets are the same as the sizes of the original images and the corresponding original depth maps of the standard image set;
and, after performing internal reference processing on the depth map of each image set to obtain the target depth map, the method further comprises:
performing size transformation on the target sample image and the corresponding target depth map of each target image set so that the sizes of the target sample image and the corresponding target depth map are the same as the input picture size of the neural network.
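Claims 4 and 5 only require that the images and depth maps end up with matching sizes; the sketch below, using OpenCV, is one possible way to do the resizing. The choice of bilinear interpolation for images and nearest-neighbour interpolation for depth maps (to avoid mixing depth values across object boundaries) is an assumption, as are the names resize_pair and size_wh.

```python
import cv2
import numpy as np

def resize_pair(image: np.ndarray, depth: np.ndarray, size_wh: tuple) -> tuple:
    """Resize an image and its depth map to a common (width, height)."""
    w, h = size_wh
    image_r = cv2.resize(image, (w, h), interpolation=cv2.INTER_LINEAR)   # smooth for RGB
    depth_r = cv2.resize(depth, (w, h), interpolation=cv2.INTER_NEAREST)  # no value mixing for depth
    return image_r, depth_r

# Claim 5's two-stage flow, with placeholder sizes:
#   1) resize each remaining image set to the standard image set's size;
#   2) after the internal-parameter processing, resize every target image set
#      to the neural network's input picture size, e.g. (640, 480).
```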
6. The method of claim 1 or 2, wherein each target image set further comprises at least one original image and a corresponding target depth map;
after the internal reference processing is performed on the plurality of image sets to obtain a plurality of target image sets, the method further comprises:
and performing size transformation on the original image and the corresponding target depth map of each target image set so that the sizes of the original image and the corresponding target depth map are the same as the input picture size of the neural network.
7. The method of claim 6, wherein the performing a size transformation on the original image and the corresponding target depth map for each target image set such that the original image and the corresponding target depth map are the same size as the input picture size of the neural network comprises:
selecting one target image set from the plurality of target image sets as a standard image set;
performing size transformation on original images of the remaining target image sets and corresponding target depth maps in the plurality of target image sets so that the sizes of the original images of the remaining target image sets and the corresponding target depth maps are the same as the sizes of the original images of the standard image sets and the corresponding target depth maps;
And performing size transformation on the target sample image and the corresponding target depth map of each target image set so that the sizes of the target sample image and the corresponding target depth map are the same as the input picture size of the neural network.
8. A depth estimation method, comprising:
acquiring an image to be estimated;
inputting the image to be estimated into a depth estimation model trained by the training method according to any one of claims 1-7, to obtain a depth map corresponding to the image to be estimated.
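As an illustrative sketch of the depth estimation method of claim 8, and not the patent's implementation: the code assumes an HWC uint8 input image, division by 255 as the only preprocessing, and a network that outputs log-depth of shape (1, 1, H, W), which is exponentiated back to depth; estimate_depth and these conventions are assumptions.

```python
import numpy as np
import torch

def estimate_depth(model: torch.nn.Module, image: np.ndarray) -> np.ndarray:
    """Run a trained depth estimation model on one image and return a depth map."""
    model.eval()
    with torch.no_grad():
        x = torch.from_numpy(image).float().permute(2, 0, 1).unsqueeze(0) / 255.0
        log_depth = model(x)                   # assumed output: log of depth, (1, 1, H, W)
        depth = torch.exp(log_depth)[0, 0]     # back to metric depth
    return depth.cpu().numpy()
```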
9. A training apparatus for a depth estimation model, comprising:
the first acquisition module is used for acquiring a plurality of image sets, wherein the internal references of the plurality of image sets are different, each image set comprises at least one original depth map, and the internal references of the original depth maps included in the same image set are the same; wherein, the internal parameters of the image set refer to the internal parameters of the image acquisition equipment for acquiring the image set;
the internal reference processing module is used for carrying out internal reference processing on the plurality of image sets to obtain a plurality of target image sets, and the internal references of the plurality of target image sets are the same;
the training module is used for training the neural network according to the plurality of target image sets to obtain a depth estimation model;
Wherein, the internal reference processing module comprises:
a selection unit configured to select one image set among the plurality of image sets as a standard image set;
the internal reference processing unit is used for determining a target image set corresponding to each remaining image set according to the internal parameters of the remaining image set, the internal parameters of the standard image set and a size transformation factor, wherein the internal parameters of a target depth map included in the target image set are the same as the internal parameters of the images in the standard image set; the standard image set and the target image sets corresponding to the remaining image sets form a plurality of target image sets, and the internal parameters of the plurality of target image sets are the same;
the internal reference processing unit is specifically configured to determine a first ratio, wherein the first ratio is a ratio of the internal parameters of the remaining image set to the internal parameters of the standard image set; and to determine the product of the size transformation factor of the remaining image set, the first ratio and the original depth map of the remaining image set as a target depth map of the target image set;
wherein the internal reference processing unit determines the size transformation factor according to the following method:
if the length-to-width ratio of the original depth map in the remaining image set is equal to the length-to-width ratio of the original depth map in the standard image set, taking the ratio of the length of the original depth map in the remaining image set to the length of the original depth map in the standard image set as the size transformation factor;
and if the length-to-width ratio of the original depth map in the remaining image set is not equal to the length-to-width ratio of the original depth map in the standard image set, processing the remaining image set so that the length-to-width ratio of its original depth map is equal to the length-to-width ratio of the original depth map in the standard image set, and taking the ratio of the length of the processed original depth map to the length of the original depth map in the standard image set as the size transformation factor.
10. The apparatus of claim 9, wherein each target image set comprises at least one target depth map; and the internal reference processing unit determines the target image set corresponding to the remaining image set using the following formula:
D′_n = s_n · (f_n / f_0) · D_n
where D′_n denotes the target depth map of the nth target image set; s_n denotes the size transformation factor of the nth image set; f_n denotes the internal parameter of the nth image set; f_0 denotes the internal parameter of the standard image set; and D_n denotes the original depth map of the nth image set.
11. The apparatus of claim 9 or 10, wherein the training module comprises:
the input unit is used for inputting the target image set into a neural network to obtain the logarithmic value of the depth information;
a determining unit, configured to determine a loss function according to a logarithmic value of the depth information;
And the adjusting unit is used for adjusting the network parameters of the neural network according to the loss function.
12. The apparatus of claim 9 or 10, wherein each image set further comprises at least one original image, each original image corresponding to one of the original depth maps;
the apparatus further comprises:
and the size transformation module is used for carrying out size transformation on the original image and the corresponding original depth map of each image set so that the sizes of the original image and the corresponding original depth map are the same as the size of the input picture of the neural network.
13. The apparatus of claim 12, wherein the size transformation module comprises:
a selection unit configured to select one image set among the plurality of image sets as a standard image set;
a size conversion unit, configured to perform size conversion on original images of a remaining image set and corresponding original depth maps in the plurality of image sets, so that the sizes of the original images of the remaining image set and the corresponding original depth maps are the same as the sizes of the original images of the standard image set and the corresponding original depth maps;
the size transformation unit is further configured to perform size transformation on the target sample image and the corresponding target depth map of each target image set, so that the sizes of the target sample image and the corresponding target depth map are the same as the input image size of the neural network.
14. The apparatus of claim 9 or 10, wherein each target image set further comprises at least one original image and a corresponding target depth map;
the apparatus further comprises:
and the size transformation module is used for carrying out size transformation on the original image and the corresponding target depth map of each target image set so that the sizes of the original image and the corresponding target depth map are the same as the size of the input picture of the neural network.
15. The apparatus of claim 14, wherein the size transformation module comprises:
a selection unit configured to select one target image set from the plurality of target image sets as a standard image set;
the size transformation unit is used for performing size transformation on the original images of the remaining target image sets and the corresponding target depth maps in the plurality of target image sets, so that the sizes of the original images of the remaining target image sets and the corresponding target depth maps are the same as the sizes of the original images of the standard image set and the corresponding target depth maps;
the size transformation unit is further configured to perform size transformation on the target sample image and the corresponding target depth map of each target image set, so that the sizes of the target sample image and the corresponding target depth map are the same as the input image size of the neural network.
16. A depth estimation apparatus comprising:
the second acquisition module is used for acquiring an image to be estimated;
the input module is used for inputting the image to be estimated into a depth estimation model trained by the training method according to any one of claims 1-7, to obtain a depth map corresponding to the image to be estimated.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202010611746.1A 2020-06-30 2020-06-30 Training method of depth estimation model, depth estimation method, device and equipment Active CN111784757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010611746.1A CN111784757B (en) 2020-06-30 2020-06-30 Training method of depth estimation model, depth estimation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010611746.1A CN111784757B (en) 2020-06-30 2020-06-30 Training method of depth estimation model, depth estimation method, device and equipment

Publications (2)

Publication Number Publication Date
CN111784757A CN111784757A (en) 2020-10-16
CN111784757B true CN111784757B (en) 2024-01-23

Family

ID=72760793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010611746.1A Active CN111784757B (en) 2020-06-30 2020-06-30 Training method of depth estimation model, depth estimation method, device and equipment

Country Status (1)

Country Link
CN (1) CN111784757B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561978B (en) * 2020-12-18 2023-11-17 北京百度网讯科技有限公司 Training method of depth estimation network, depth estimation method of image and equipment
CN114549612A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Model training and image processing method, device, equipment and storage medium
CN114612544B (en) * 2022-03-11 2024-01-02 北京百度网讯科技有限公司 Image processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021364B (en) * 2016-05-10 2017-12-12 百度在线网络技术(北京)有限公司 Foundation, image searching method and the device of picture searching dependency prediction model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102168954A (en) * 2011-01-14 2011-08-31 浙江大学 Monocular-camera-based method for measuring depth, depth field and sizes of objects
CN102867304A (en) * 2012-09-04 2013-01-09 南京航空航天大学 Method for establishing relation between scene stereoscopic depth and vision difference in binocular stereoscopic vision system
CN107507235A (en) * 2017-08-31 2017-12-22 山东大学 A kind of method for registering of coloured image and depth image based on the collection of RGB D equipment
WO2019071754A1 (en) * 2017-10-09 2019-04-18 哈尔滨工业大学深圳研究生院 Method for sensing image privacy on the basis of deep learning
CN107945265A (en) * 2017-11-29 2018-04-20 华中科技大学 Real-time dense monocular SLAM method and systems based on on-line study depth prediction network
WO2019223382A1 (en) * 2018-05-22 2019-11-28 深圳市商汤科技有限公司 Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN109377564A (en) * 2018-09-30 2019-02-22 清华大学 Virtual fit method and device based on monocular depth camera
CN110211061A (en) * 2019-05-20 2019-09-06 清华大学 List depth camera depth map real time enhancing method and device neural network based

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Survey of advances in monocular depth estimation technology; Huang Jun; Wang Cong; Liu Yue; Bi Tianteng; Journal of Image and Graphics (Issue 12); 138-145 *

Also Published As

Publication number Publication date
CN111784757A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111784757B (en) Training method of depth estimation model, depth estimation method, device and equipment
CN111291885B (en) Near infrared image generation method, training method and device for generation network
CN111753961B (en) Model training method and device, prediction method and device
US9990536B2 (en) Combining images aligned to reference frame
CN114186632B (en) Method, device, equipment and storage medium for training key point detection model
CN111612852B (en) Method and apparatus for verifying camera parameters
CN111739005B (en) Image detection method, device, electronic equipment and storage medium
CN111415298B (en) Image stitching method and device, electronic equipment and computer readable storage medium
JP7189270B2 (en) Three-dimensional object detection method, three-dimensional object detection device, electronic device, storage medium, and computer program
CN112487979B (en) Target detection method, model training method, device, electronic equipment and medium
CN112241716B (en) Training sample generation method and device
CN112288699B (en) Method, device, equipment and medium for evaluating relative definition of image
CN110706262A (en) Image processing method, device, equipment and storage medium
CN111767853A (en) Lane line detection method and device
KR102566300B1 (en) Method for indoor localization and electronic device
CN111833391B (en) Image depth information estimation method and device
CN112102417A (en) Method and device for determining world coordinates and external reference calibration method for vehicle-road cooperative roadside camera
CN112488126A (en) Feature map processing method, device, equipment and storage medium
CN111507924A (en) Video frame processing method and device
CN116167426A (en) Training method of face key point positioning model and face key point positioning method
CN111967299B (en) Unmanned aerial vehicle inspection method, unmanned aerial vehicle inspection device, unmanned aerial vehicle inspection equipment and storage medium
CN112381877B (en) Positioning fusion and indoor positioning method, device, equipment and medium
US11113871B1 (en) Scene crop via adaptive view-depth discontinuity
CN113658277B (en) Stereo matching method, model training method, related device and electronic equipment
CN113705620B (en) Training method and device for image display model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant