CN110189372A - Depth map model training method and device

Depth map model training method and device

Info

Publication number
CN110189372A
CN110189372A
Authority
CN
China
Prior art keywords
loss, value, model, depth, depth map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910464165.7A
Other languages
Chinese (zh)
Inventor
秦硕 (Qin Shuo)
李金鹏 (Li Jinpeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Intelligent Technology Beijing Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910464165.7A priority Critical patent/CN110189372A/en
Publication of CN110189372A publication Critical patent/CN110189372A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the present invention provides a depth map model training method and device. The method includes: obtaining a depth map of a training sample through an initial model, the depth map including effective pixel points with predicted depth values; calculating a model loss value based on each effective pixel point by using the loss values of a plurality of loss functions, the loss values including at least two of a scale-invariant loss value, a gradient loss value and a sequence loss value; and optimizing the initial model according to the model loss value so as to train a depth map model. By calculating the scale-invariant loss value, the gradient loss value and the sequence loss value through the loss functions, the calculated loss value is more conducive to model optimization, which improves the learning efficiency of the model and accelerates the convergence of model training.

Description

Depth map model training method and device
Technical Field
The invention relates to the technical field of image recognition, in particular to a depth map model training method and device.
Background
In a visual perception system, three-dimensional scene information provides more possibilities for computer vision applications such as image segmentation, object detection and object tracking, and the depth map (Depth map) is widely used as a representation of three-dimensional scene information. Each pixel value of a depth map characterizes how far the corresponding point in the scene is from the camera, and thus directly reflects the geometry of the visible surfaces of objects. Existing ways of obtaining depth maps include object detection by radar and deep-learning-based methods. Radar can detect obstacles, but it is prone to missed or false detections for objects with poor echo characteristics. Deep-learning-based methods are trained on labeled data; however, when the model is poorly designed, learning during training is slow, a large number of labeled samples take a long time to learn, and the convergence speed of the model decreases.
Disclosure of Invention
The embodiment of the invention provides a depth map model training method and device, and aims to solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a depth map model training method, including:
obtaining a depth map of a training sample through an initial model, wherein the depth map comprises effective pixel points with predicted depth values;
calculating a model loss value by using loss values of a plurality of loss functions based on each effective pixel point, wherein the loss values of the plurality of loss functions comprise at least two of a scale invariant loss value, a gradient loss value and a sequence loss value;
and optimizing the initial model according to the model loss value so as to train to obtain a depth map model.
In one embodiment, the method further comprises:
and acquiring the labeled depth value and the physical depth of each pixel point of the training sample according to the training sample.
In one embodiment, the scale-invariant loss values are calculated by a first loss function comprising:
wherein n expresses the number of the effective pixel points, i expresses the index of an effective pixel point, Ri is the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, and the value of the function is the scale-invariant loss value.
In one embodiment, the gradient loss value is calculated by a second loss function comprising:
wherein n expresses the number of the effective pixel points, i expresses the index of an effective pixel point, Ri expresses the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, the remaining terms are the gradients of the ith effective pixel point in the x-axis direction and the y-axis direction, and the value of the function is the gradient loss value.
In one embodiment, the sequential loss value is calculated by a third loss function comprising:
wherein Pij relates to the pair formed by the ith and jth effective pixel points, c represents a constant, τ represents a constant, and the value of the function is the sequential loss value.
In one embodiment, calculating a model loss value using loss values of a plurality of loss functions includes:
and adding the scale invariant loss value calculated by the first loss function, the gradient loss value calculated by the second loss function and the sequential loss value calculated by the third loss function to obtain a model loss value.
In a second aspect, an embodiment of the present invention provides a depth map model training apparatus, including:
the first acquisition module is used for acquiring a depth map of a training sample through an initial model, wherein the depth map comprises effective pixel points with predicted depth values;
the calculation module is used for calculating model loss values by utilizing loss values of a plurality of loss functions based on the effective pixel points, wherein the loss values of the plurality of loss functions comprise at least two of scale-invariant loss values, gradient loss values and sequence loss values;
and the optimization module is used for optimizing the initial model according to the model loss value so as to train and obtain a depth map model.
In one embodiment, the method further comprises:
and the second acquisition module is used for acquiring the labeled depth value and the physical depth of each pixel point of the training sample according to the training sample.
In one embodiment, the calculation module comprises:
a first loss submodule for calculating the scale invariant loss value by a first loss function;
the first loss function includes:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri is the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, and the value of the function is the scale-invariant loss value.
In one embodiment, the calculation module comprises:
a second loss submodule for calculating the gradient loss value by a second loss function;
the second loss function includes:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri represents the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, the remaining terms represent the gradients of the ith effective pixel point in the x-axis direction and the y-axis direction, and the value of the function is the gradient loss value.
In one embodiment, the calculation module comprises:
a third loss submodule for calculating the sequential loss value by a third loss function;
the third loss function includes:
wherein Pij relates to the pair formed by the ith and jth effective pixel points, c represents a constant, τ represents a constant, and the value of the function is the sequential loss value.
In one embodiment, the calculation module comprises:
and the model loss submodule is used for adding the scale invariant loss value calculated by the first loss function, the gradient loss value calculated by the second loss function and the sequential loss value calculated by the third loss function to obtain the model loss value.
In a third aspect, an embodiment of the present invention provides a depth map model training terminal, where functions of the depth map model training terminal may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the structure of the depth map model training terminal includes a processor and a memory, the memory is used for storing a program supporting the depth map model training terminal to execute the depth map model training method, and the processor is configured to execute the program stored in the memory. The depth map model training terminal can further comprise a communication interface for communicating with other equipment or a communication network.
In a fourth aspect, an embodiment of the present invention provides a depth map obtaining apparatus, where a depth map model obtained in the third aspect is used to obtain a depth map.
In a fifth aspect, an embodiment of the present invention provides a depth map acquisition system, including: the depth map model training terminal of the third aspect, and the depth map obtaining device of the fourth aspect.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer software instructions for a depth map model training terminal, which includes a program for executing the depth map model training method.
One of the above technical solutions has the following advantages or beneficial effects: according to the embodiment of the invention, the scale invariant loss value, the gradient loss value and the sequential loss value are calculated through the loss function, so that the calculated loss value is more beneficial to model optimization, and the convergence speed of model training is improved.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 shows a flow diagram of a depth map model training method according to an embodiment of the invention.
Fig. 2 shows a detailed flowchart of step S300 of the depth map model training method according to an embodiment of the present invention.
FIG. 3 shows a flow diagram of a depth map model training method according to another embodiment of the invention.
Fig. 4 shows a detailed flowchart of step S200 of the depth map model training method according to an embodiment of the present invention.
Fig. 5 is a block diagram showing a structure of a depth map model training apparatus according to an embodiment of the present invention.
Fig. 6 is a block diagram showing a structure of a depth map model training apparatus according to another embodiment of the present invention.
Fig. 7 is a block diagram showing a configuration of a computation module of the depth map model training apparatus according to the embodiment of the present invention.
Fig. 8 is a block diagram showing a configuration of a computation module of a depth map model training apparatus according to another embodiment of the present invention.
FIG. 9 is a schematic structural diagram of a depth map model training terminal according to an embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
FIG. 1 shows a flow diagram of a depth map model training method according to an embodiment of the invention. As shown in fig. 1, the depth map model training method includes:
s100: and obtaining a depth map of the training sample through the initial model, wherein the depth map comprises effective pixel points with predicted depth values.
The initial model may be any model in the prior art, with any network structure, that can process an input picture to obtain a corresponding depth map. For example, an H × W picture may be input into the initial model, which outputs an H × W depth map.
It should be noted that, because input pictures vary in quality, not every pixel in the depth map obtained by the initial model necessarily contains depth information. That is, the depth map may contain invalid pixel points that do not include depth information and effective pixel points that do include depth information. The depth value in the depth information of an effective pixel point can directly or indirectly represent the depth (distance) between an object in the picture and the device that captured the picture.
In one example, there is a conversion relationship between the depth value of an effective pixel point and its physical depth (the distance of the pixel from the capture device in the actual scene): if the physical depth is D, the corresponding depth value is log(D).
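As a minimal illustration of the notions above (the function and variable names are hypothetical and not taken from the patent), the following sketch extracts the effective pixel points from a predicted depth map, given a labeled physical-depth map in which unlabeled pixels carry a sentinel value, and converts the physical depths to log-scale depth values:

```python
import numpy as np

def effective_pixels(pred_depth, labeled_physical_depth, invalid_value=0.0):
    """Return predicted and labeled log-scale depth values at effective pixel points.

    pred_depth: H x W array of predicted depth values (log scale).
    labeled_physical_depth: H x W array of physical depths D; pixels without a
        label carry the sentinel `invalid_value`.
    """
    mask = labeled_physical_depth != invalid_value        # effective pixel points
    labeled_depth = np.log(labeled_physical_depth[mask])  # depth value = log(D)
    return pred_depth[mask], labeled_depth, mask
```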
S200: based on each effective pixel point, a model loss value (loss) is calculated by using the loss values of the plurality of loss functions. The loss values of the plurality of loss functions include at least two of scale-invariant loss values, gradient loss values, and sequential loss values.
In one example, the scale invariance loss value is mainly based on scale invariance, that is, stable extreme points can be found under a scale space containing all scales regardless of the scale of the original image, so that the scale invariance is realized. The gradient loss value is mainly based on the gradient condition of each pixel point on the x and y axes. The sequence loss value mainly utilizes the depth sequence relation of any two pixel points.
S300: and optimizing the initial model according to the model loss value so as to train to obtain the depth map model.
In one example, as shown in fig. 2, the initial model is optimized according to the model loss value to train a depth map model (a minimal training-loop sketch is given after these steps), including:
s310: judging whether the model converges according to the loss value;
s320: if not, performing back propagation through a loss value by using a Stochastic Gradient Descent (SGD) method, and updating parameters of the optimization model;
s330: based on the optimized parameters, obtaining the depth map again by utilizing a forward propagation algorithm and a linear regression mode;
s340: calculating a loss value of the depth map obtained again and judging whether the model is converged;
s350: if so, taking the model as a trained depth map model;
s360: if not, steps S320-S340 are repeated until the model converges.
In one embodiment, as shown in fig. 3, the method further includes:
s400: and acquiring the labeled depth value and the physical depth of each pixel point of the training sample according to the training sample.
In one embodiment, the scale-invariant loss values are calculated by a first loss function comprising:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri is the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, and the value of the function is the scale-invariant loss value.
It should be noted that the depth value of an effective pixel point may include a true value of the actual depth of the training sample.
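The formula of the first loss function appears as an image in the original publication and is not reproduced in this text. For illustration only, an assumed scale-invariant loss of the kind used in the MegaDepth work cited among the non-patent references (the symbol L_si is notation introduced here) is:

\mathcal{L}_{si} = \frac{1}{n}\sum_{i=1}^{n} R_i^{2} \;-\; \frac{1}{n^{2}}\Bigl(\sum_{i=1}^{n} R_i\Bigr)^{2}

This is a sketch of a typical form over the n effective pixel points, not the patent's exact formula.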
In one embodiment, the gradient loss value is calculated by a second loss function comprising:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri represents the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, the remaining terms represent the gradients of the ith effective pixel point in the x-axis direction and in the y-axis direction, and the value of the function is the gradient loss value.
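The formula of the second loss function likewise appears as an image in the original publication. For illustration only, an assumed single-scale gradient-matching term (notation introduced here) is:

\mathcal{L}_{grad} = \frac{1}{n}\sum_{i=1}^{n}\bigl(\lvert\nabla_{x} R_i\rvert + \lvert\nabla_{y} R_i\rvert\bigr)

where \nabla_{x} R_i and \nabla_{y} R_i are the gradients of the depth-difference map R at the ith effective pixel point along the x-axis and y-axis. This is a typical form from the monocular depth literature, not the patent's exact formula.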
In one embodiment, the sequential loss values are calculated by a third loss function, the third loss function comprising:
wherein Pij relates to the pair formed by the ith and jth effective pixel points, c represents a constant, τ represents a constant, and the value of the function is the sequential loss value. The values of the constants c and τ can be adjusted as required. In one example, τ = 0.25 and c = 1.4813756.
In one embodiment, Pij = |Li − Lj|, where Li represents the physical depth of the ith effective pixel point in the actual scene and Lj represents the physical depth of the jth effective pixel point in the actual scene.
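The formula of the third loss function also appears as an image in the original publication. Purely as an assumption consistent with the constants τ and c and the definition of Pij above, one plausible robust ordinal form (notation introduced here; di and dj are predicted depth values) is:

\mathcal{L}_{ord} = \frac{1}{N}\sum_{(i,j)} \ell_{ij}, \qquad
\ell_{ij} =
\begin{cases}
\log\bigl(1 + \exp\bigl(-\operatorname{sign}(L_i - L_j)\,(d_i - d_j)\bigr)\bigr), & P_{ij} > \tau,\\
c\,(d_i - d_j)^{2}, & P_{ij} \le \tau,
\end{cases}

in which pairs whose physical depths differ by more than τ are penalized for a wrong predicted depth order, and near-equal-depth pairs are penalized for large predicted differences. This is a sketch of one possibility, not the patent's exact formula.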
In one embodiment, as shown in FIG. 4, calculating a model loss value using loss values of a plurality of loss functions includes:
s210: and calculating a scale invariant loss value through a first loss function based on each effective pixel point.
S220: and calculating a gradient loss value through a second loss function based on each effective pixel point.
S230: and calculating a sequential loss value through a third loss function based on each effective pixel point.
S240: and adding the scale invariant loss value, the gradient loss value and the sequence loss value to obtain a model loss value. The formula for calculating the model loss value is as follows:
wherein the three terms are the scale-invariant loss value, the gradient loss value and the sequence loss value, respectively.
In one example, to balance the effects of the gradient loss value and the sequence loss value on the model loss value, empirical parameters α and β may be introduced to weight the gradient loss value and the sequence loss value.
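A sketch of steps S210-S240 in code, combining the three loss values with the empirical weights α and β mentioned above. The individual loss implementations follow the assumed illustrative formulas given earlier, not the patent's exact functions, and the pair-sampling interface is hypothetical:

```python
import torch
import torch.nn.functional as F

def scale_invariant_loss(pred, target):
    r = (target - pred).flatten()              # R_i over the effective pixel points
    return (r ** 2).mean() - r.mean() ** 2     # assumed Eigen-style scale-invariant form

def gradient_loss(pred, target):
    r = target - pred
    gx = r[:, 1:] - r[:, :-1]                  # finite-difference gradient along the x-axis
    gy = r[1:, :] - r[:-1, :]                  # finite-difference gradient along the y-axis
    return gx.abs().mean() + gy.abs().mean()

def sequence_loss(pred_i, pred_j, phys_i, phys_j, tau=0.25, c=1.4813756):
    p_ij = (phys_i - phys_j).abs()             # P_ij = |L_i - L_j|
    ordered = F.softplus(-torch.sign(phys_i - phys_j) * (pred_i - pred_j))
    equal = c * (pred_i - pred_j) ** 2         # pairs with near-equal physical depth
    return torch.where(p_ij > tau, ordered, equal).mean()

def model_loss(pred, target, phys, pairs, alpha=0.5, beta=0.1):
    """pred/target: H x W predicted and labeled depth maps (log scale);
    phys: H x W physical depths; pairs: (i, j) flat indices of sampled pixel pairs."""
    i, j = pairs
    p, d = pred.flatten(), phys.flatten()
    return (scale_invariant_loss(pred, target)
            + alpha * gradient_loss(pred, target)
            + beta * sequence_loss(p[i], p[j], d[i], d[j]))
```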
In an embodiment, the selection of training samples and the acquisition of their labeled depth values and physical depths may be performed by any method in the prior art. For example, a plurality of images may be acquired with a camera, and the three-dimensional structure and depth information of the scene may then be recovered with a structure-from-motion method, yielding training samples with labeled depth values and physical depths. Alternatively, manual annotation may be used: a number of discrete positions are selected in a picture and encoded according to their distance from the camera, yielding a training sample with labeled depth values and physical depths.
In one embodiment, the depth map model trained with the above method may be applied in an autonomous driving scenario. An autonomous vehicle can then complete a depth-map-based obstacle detection task with only one camera, achieving monocular depth prediction. A camera can be used without being constrained by sensor characteristics: millimeter-wave radar and lidar each have a limited application range (some have short detection distances, and some respond only to specific materials), whereas a camera-based vision scheme is not affected by the material of the obstacle. In addition, a depth map model trained with this method is easy to fuse into other systems (such as a multi-task neural network), which facilitates system integration.
Fig. 5 is a block diagram showing a structure of a depth map model training apparatus according to an embodiment of the present invention. As shown in fig. 5, the depth map model training apparatus includes:
a first obtaining module 10, configured to obtain a depth map of the training sample through the initial model, where the depth map includes valid pixel points with predicted depth values.
The calculating module 20 is configured to calculate a model loss value by using loss values of a plurality of loss functions based on each effective pixel point, where the loss values of the plurality of loss functions include at least two of a scale invariant loss value, a gradient loss value, and a sequential loss value.
And the optimization module 30 is configured to optimize the initial model according to the model loss value to train to obtain the depth map model.
In one embodiment, as shown in fig. 6, the depth map model training apparatus further includes:
and the second obtaining module 40 is configured to obtain, according to the training sample, a labeled depth value and a physical depth value of each pixel point of the training sample.
In one embodiment, as shown in fig. 7, the calculation module 20 includes:
a first loss submodule 21 for calculating a scale invariant loss value by means of a first loss function.
The first loss function includes:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri is the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, and the value of the function is the scale-invariant loss value.
In one embodiment, as shown in fig. 7, the calculation module 20 includes:
a second loss submodule 22 for calculating a gradient loss value by means of a second loss function.
The second loss function includes:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri represents the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, the remaining terms represent the gradients of the ith effective pixel point in the x-axis direction and the y-axis direction, and the value of the function is the gradient loss value.
In one embodiment, as shown in fig. 7, the calculation module 20 includes:
a third loss submodule 23 for calculating a sequential loss value by means of a third loss function.
The third loss function includes:
wherein Pij relates to the pair formed by the ith and jth effective pixel points, c represents a constant, τ represents a constant, and the value of the function is the sequential loss value.
In one embodiment, as shown in fig. 8, the calculation module 20 includes:
and the model loss submodule 24 is configured to add the scale invariant loss value calculated by the first loss function, the gradient loss value calculated by the second loss function, and the sequential loss value calculated by the third loss function to obtain a model loss value.
The functions of each module in each apparatus in the embodiments of the present invention may refer to the corresponding description in the above method, and are not described herein again.
FIG. 9 is a block diagram illustrating a structure of a depth map model training terminal according to an embodiment of the present invention. As shown in fig. 9, the terminal includes: a memory 910 and a processor 920, the memory 910 having stored therein computer programs operable on the processor 920. The processor 920, when executing the computer program, implements the depth map model training method in the above embodiments. The number of the memory 910 and the processor 920 may be one or more.
The terminal further includes:
and a communication interface 930, configured to communicate with external devices and to transmit data for depth map model training.
Memory 910 may include high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, the memory 910, the processor 920 and the communication interface 930 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
The embodiment of the invention provides a depth map obtaining device, which can obtain a depth map by using a depth map model obtained by a depth map model training terminal of the embodiment.
The embodiment of the invention provides a depth map acquisition system which comprises the depth map model training terminal and the depth map acquisition equipment of the embodiment.
An embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and the computer program is executed by a processor to implement the method in any one of the above embodiments.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (16)

1. A depth map model training method is characterized by comprising the following steps:
obtaining a depth map of a training sample through an initial model, wherein the depth map comprises effective pixel points with predicted depth values;
calculating a model loss value by using loss values of a plurality of loss functions based on each effective pixel point, wherein the loss values of the plurality of loss functions comprise at least two of a scale invariant loss value, a gradient loss value and a sequence loss value;
and optimizing the initial model according to the model loss value so as to train to obtain a depth map model.
2. The method of claim 1, further comprising:
and acquiring the labeled depth value and the physical depth of each pixel point of the training sample according to the training sample.
3. The method of claim 1, wherein the scale-invariant loss values are computed by a first loss function comprising:
wherein n expresses the number of the effective pixel points, i expresses the index of an effective pixel point, Ri is the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, and the value of the first loss function is the scale-invariant loss value.
4. The method of claim 1, wherein the gradient loss value is calculated by a second loss function comprising:
wherein n expresses the number of the effective pixel points, i expresses the index of an effective pixel point, Ri expresses the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, the remaining terms are the gradients of the ith effective pixel point in the x-axis direction and the y-axis direction, and the value of the second loss function is the gradient loss value.
5. The method of claim 1, wherein the sequential loss values are calculated by a third loss function comprising:
wherein Pij relates to the pair formed by the ith and jth effective pixel points, c represents a constant, τ represents a constant, and the value of the third loss function is the sequential loss value.
6. The method of claim 1, wherein calculating a model loss value using loss values of a plurality of loss functions comprises:
and adding the scale invariant loss value calculated by the first loss function, the gradient loss value calculated by the second loss function and the sequential loss value calculated by the third loss function to obtain a model loss value.
7. A depth map model training device, comprising:
the first acquisition module is used for acquiring a depth map of a training sample through an initial model, wherein the depth map comprises effective pixel points with predicted depth values;
the calculation module is used for calculating model loss values by utilizing loss values of a plurality of loss functions based on the effective pixel points, wherein the loss values of the plurality of loss functions comprise at least two of scale-invariant loss values, gradient loss values and sequence loss values;
and the optimization module is used for optimizing the initial model according to the model loss value so as to train and obtain a depth map model.
8. The apparatus of claim 7, further comprising:
and the second acquisition module is used for acquiring the labeled depth value and the physical depth of each pixel point of the training sample according to the training sample.
9. The apparatus of claim 7, wherein the computing module comprises:
a first loss submodule for calculating the scale invariant loss value by a first loss function;
the first loss function includes:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri is the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, and the value of the first loss function is the scale-invariant loss value.
10. The apparatus of claim 7, wherein the computing module comprises:
a second loss submodule for calculating the gradient loss value by a second loss function;
the second loss function includes:
wherein n represents the number of effective pixel points, i represents the index of an effective pixel point, Ri represents the difference between the labeled depth value and the predicted depth value of the ith effective pixel point, the remaining terms represent the gradients of the ith effective pixel point in the x-axis direction and the y-axis direction, and the value of the second loss function is the gradient loss value.
11. The apparatus of claim 7, wherein the computing module comprises:
a third loss submodule for calculating the sequential loss value by a third loss function;
the third loss function includes:
wherein Pij relates to the pair formed by the ith and jth effective pixel points, c represents a constant, τ represents a constant, and the value of the third loss function is the sequential loss value.
12. The apparatus of claim 7, wherein the computing module comprises:
and the model loss submodule is used for adding the scale invariant loss value calculated by the first loss function, the gradient loss value calculated by the second loss function and the sequential loss value calculated by the third loss function to obtain the model loss value.
13. A depth map model training terminal, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
14. A depth map acquisition apparatus characterized by acquiring a depth map using the depth map model obtained in claim 13.
15. A depth map acquisition system, comprising:
the depth map model training terminal of claim 13;
the depth map acquiring apparatus according to claim 14.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN201910464165.7A 2019-05-30 2019-05-30 Depth map model training method and device Pending CN110189372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910464165.7A CN110189372A (en) 2019-05-30 2019-05-30 Depth map model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910464165.7A CN110189372A (en) 2019-05-30 2019-05-30 Depth map model training method and device

Publications (1)

Publication Number Publication Date
CN110189372A true CN110189372A (en) 2019-08-30

Family

ID=67719043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910464165.7A Pending CN110189372A (en) 2019-05-30 2019-05-30 Depth map model training method and device

Country Status (1)

Country Link
CN (1) CN110189372A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110599532A (en) * 2019-09-18 2019-12-20 厦门美图之家科技有限公司 Depth estimation model optimization and depth estimation processing method and device for image
CN112991416A (en) * 2021-04-13 2021-06-18 Oppo广东移动通信有限公司 Depth estimation method, model training method, device, equipment and storage medium
CN113657617A (en) * 2020-04-23 2021-11-16 支付宝(杭州)信息技术有限公司 Method and system for model joint training

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN109191514A (en) * 2018-10-23 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for generating depth detection model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN109191514A (en) * 2018-10-23 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for generating depth detection model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENGQI LI et al.: "MegaDepth: Learning Single-View Depth Prediction from Internet Photos", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110599532A (en) * 2019-09-18 2019-12-20 厦门美图之家科技有限公司 Depth estimation model optimization and depth estimation processing method and device for image
CN113657617A (en) * 2020-04-23 2021-11-16 支付宝(杭州)信息技术有限公司 Method and system for model joint training
CN112991416A (en) * 2021-04-13 2021-06-18 Oppo广东移动通信有限公司 Depth estimation method, model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110070572B (en) Method and system for generating range images using sparse depth data
CN108307113B (en) Image acquisition method, image acquisition control method and related device
JP2019517701A (en) Method and system for detecting an object in an image
CN110189372A (en) Depth map model training method and device
CN109145860B (en) lane line tracking method and device
CN109961509B (en) Three-dimensional map generation and model training method and device and electronic equipment
JP2021033510A (en) Driving assistance device
CN114926726A (en) Unmanned ship sensing method based on multitask network and related equipment
CN112837384B (en) Vehicle marking method and device and electronic equipment
CN114919584A (en) Motor vehicle fixed point target distance measuring method and device and computer readable storage medium
CN111553474A (en) Ship detection model training method and ship tracking method based on unmanned aerial vehicle video
CN113869440A (en) Image processing method, apparatus, device, medium, and program product
CN103841340A (en) Image sensor and operating method thereof
CN109766799B (en) Parking space recognition model training method and device and parking space recognition method and device
CN113256697A (en) Three-dimensional reconstruction method, system and device of underwater scene and storage medium
CN116363615B (en) Data fusion method, device, vehicle and storage medium
CN116520338A (en) Method and device for determining distance between target vehicles, electronic equipment and storage medium
CN116363628A (en) Mark detection method and device, nonvolatile storage medium and computer equipment
CN111611836A (en) Ship detection model training and ship tracking method based on background elimination method
CN114549429B (en) Depth data quality evaluation method and device based on hypergraph structure
CN116385369A (en) Depth image quality evaluation method and device, electronic equipment and storage medium
CN110880003A (en) Image matching method and device, storage medium and automobile
CN113033578B (en) Image calibration method, system, terminal and medium based on multi-scale feature matching
CN115346184A (en) Lane information detection method, terminal and computer storage medium
CN110544256B (en) Deep learning image segmentation method and device based on sparse features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20211012

Address after: 105 / F, building 1, No. 10, Shangdi 10th Street, Haidian District, Beijing 100085

Applicant after: Apollo Intelligent Technology (Beijing) Co.,Ltd.

Address before: 2 / F, baidu building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.