CN114663483A

CN114663483A - Training method, device and equipment of monocular depth estimation model and storage medium

Info

Publication number: CN114663483A
Application number: CN202210224721.5A
Authority: CN
Inventors: 郑喜民; 胡浩楠; 舒畅; 陈又新
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-03-09
Filing date: 2022-03-09
Publication date: 2022-06-24
Also published as: WO2023168815A1

Abstract

The application relates to the technical field of artificial intelligence, and discloses a training method, a device, equipment and a storage medium for a monocular depth estimation model, wherein the method comprises the following steps: acquiring a to-be-predicted image; inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted; the training method of the target monocular depth estimation model comprises the following steps: performing fine tuning training on a preset monocular depth estimation interpretable model by adopting a preset training sample set and a target loss function; and taking the monocular depth estimation interpretable model with the fine tuning training finished as the target monocular depth estimation model, wherein the target loss function is a loss function obtained based on depth error loss. And carrying out fine tuning training on the monocular depth estimation interpretable model through a loss function obtained based on the depth error loss, and improving the accuracy of the network based on an interpretable method.

Description

Training method, device and equipment of monocular depth estimation model and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training a monocular depth estimation model.

Background

Monocular depth estimation is widely applied to tasks such as automatic driving, three-dimensional reconstruction, augmented reality, scene understanding and the like due to the advantages of low cost, rich information and the like. With the rapid development of artificial intelligence technology, monocular depth estimation based on deep learning shows outstanding performance, but is also limited by the 'black box' property of a deep neural network, and the industry proposes an interpretable monocular depth estimation network to quantify the interpretability of the monocular depth estimation network through the depth selectivity of a hidden unit, thereby providing more development possibilities for monocular depth estimation based on deep learning. The interpretable method does not change the architecture of the original network, the depth range is distributed to each network unit in the network during training, and the average response of each network unit to a series of depths is calculated to illustrate the depth selectivity of each network unit, so that the monocular depth estimation network can be explained, but the improvement on the accuracy of the network is limited.

Disclosure of Invention

The present application mainly aims to provide a training method, an apparatus, a device and a storage medium for a monocular depth estimation model, and aims to solve the technical problem that the interpretability method of the monocular depth estimation network in the prior art is limited in improving the accuracy of the network.

In order to achieve the above object, the present application provides a training method of a monocular depth estimation model, the method including:

acquiring a to-be-predicted image;

inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted;

the training method of the target monocular depth estimation model comprises the following steps:

performing fine tuning training on a preset monocular depth estimation interpretable model by adopting a preset training sample set and a target loss function, wherein the target loss function is a loss function obtained based on depth error loss;

and taking the monocular depth estimation interpretable model with the fine tuning training finished as the target monocular depth estimation model.

Further, before the step of inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted, the method further includes:

obtaining the training sample set and the monocular depth estimation interpretable model;

dividing each training sample in the training sample set in batches by adopting a preset batch sample quantity to obtain a plurality of single batch sample sets;

taking any one of the single batch sample sets as a target sample set;

respectively inputting the image sample of each training sample in the target sample set into the monocular depth estimation interpretable model for monocular depth estimation to obtain depth image prediction data;

calculating a loss value according to each depth image prediction data, a depth image calibration value corresponding to each training sample in the target sample set and the target loss function to obtain a target loss value;

updating network parameters of the monocular depth estimation interpretable model according to the target loss value;

and repeatedly executing the step of taking any one single batch of sample sets as a target sample set until a preset model fine tuning training end condition is reached, and taking the monocular depth estimation interpretable model reaching the model fine tuning training end condition as the target monocular depth estimation model.

Further, the step of calculating a loss value according to each depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set, and the target loss function to obtain a target loss value includes:

performing depth error calculation of pixel points at the same position according to a first depth image and a second depth image to obtain an initial depth error set, wherein the first depth image is the depth image calibration value corresponding to any one training sample in the target sample set, and the second depth image is the depth image prediction data corresponding to the first depth image;

taking each depth error in the initial depth error set, which is greater than a preset depth error threshold value, as a target depth error set;

taking each pixel point in the first depth image corresponding to the target depth error set as an error pixel point set;

generating a depth range with the most pixel points according to the error pixel point set to obtain a target depth range;

and calculating loss values according to the target depth ranges, the depth image prediction data, the depth image calibration values corresponding to the training samples in the target sample set and the target loss function to obtain the target loss values.

Further, the step of updating network parameters of the monocular depth estimation interpretable model according to the target loss value includes:

finding out each network unit corresponding to each target depth range from a depth range and a network unit mapping table corresponding to the monocular depth estimation interpretable model to obtain a single image network unit set;

collecting each single image network unit set to obtain a network unit set to be deduplicated;

carrying out duplicate removal processing on the network units to be subjected to duplicate removal to obtain a target network unit set;

updating network parameters in the monocular depth estimation interpretable model corresponding to the target set of network elements according to the target loss value.

Further, the step of generating the depth range with the largest number of pixels according to the error pixel point set to obtain a target depth range includes:

according to a preset depth range list, carrying out set division on the error pixel point set to obtain a single-depth-range pixel point set;

finding out the single-depth-range pixel point set with the most pixels from each single-depth-range pixel point set to obtain a target pixel point set;

and taking the depth range corresponding to the target pixel point set as the target depth range.

Further, the step of calculating a loss value according to the depth image calibration value and the target loss function corresponding to each target depth range, each depth image prediction data, each training sample in the target sample set, and the like, to obtain the target loss value includes:

generating a binary mask in the target depth range corresponding to the first depth image according to the first depth image and the second depth image to obtain a target binary mask;

and inputting the target binary mask, the depth image prediction data and the depth image calibration value corresponding to each training sample in the target sample set into the target loss function for loss value calculation to obtain the target loss value.

Further, the calculation formula L of the objective loss function_errorComprises the following steps:

where λ is a hyper-parameter, P is the number of training samples in the target sample set, N_kIs the number of pixel points in the image sample of the kth of the training sample in the target sample set,

is the target binarization mask corresponding to the kth training sample in the target sample set, d_kIs the depth image prediction data corresponding to the kth of the training samples in the target sample set,

is the depth image calibration value corresponding to the kth training sample in the target sample set,

is to be

And

the co-located vector elements of (a) are multiplied.

The application also provides a training device of the monocular depth estimation model, the device comprises:

the image acquisition module is used for acquiring an image to be predicted;

the target depth image determining module is used for inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted;

the model fine tuning training module is used for carrying out fine tuning training on a preset monocular depth estimation interpretable model by adopting a preset training sample set and a target loss function, wherein the target loss function is a loss function obtained based on depth error loss; and taking the monocular depth estimation interpretable model with the fine tuning training finished as the target monocular depth estimation model.

The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.

The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.

The method comprises the steps of adopting a preset training sample set and a target loss function to conduct fine tuning training on a preset monocular depth estimation interpretable model, wherein the target loss function is a loss function obtained based on depth error loss; and taking the monocular depth estimation interpretable model with the fine tuning training finished as the target monocular depth estimation model. And carrying out fine tuning training on the monocular depth estimation interpretable model through a loss function obtained based on the depth error loss, and improving the accuracy of the network based on an interpretable method.

Drawings

Fig. 1 is a schematic flowchart illustrating a training method of a monocular depth estimation model according to an embodiment of the present application;

FIG. 2 is a block diagram illustrating the structure of a training apparatus for a monocular depth estimation model according to an embodiment of the present application;

fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Referring to fig. 1, an embodiment of the present application provides a training method for a monocular depth estimation model, where the method includes:

s1: acquiring a to-be-predicted image;

s2: inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted;

In the embodiment, the monocular depth estimation interpretable model is subjected to fine tuning training through the loss function obtained based on the depth error loss, so that the accuracy of the network is improved based on the interpretability method.

For S1, the image to be predicted input by the user may be acquired, the image to be predicted may be acquired from a third-party application system, or the image to be predicted may be acquired from a database.

The image to be predicted is an image needing monocular depth estimation. The image to be predicted is an image photographed by a monocular image pickup device.

Monocular depth estimation is to estimate the distance between each pixel point in an image and the lens of a monocular imaging device. Thus, depth refers to the distance between a pixel point in an image and the lens of a monocular imaging device.

And S2, inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation, and taking a depth image obtained by monocular depth estimation as a target depth image corresponding to the image to be predicted.

Performing interpretability training of monocular depth estimation on an initial model, wherein the initial model is a model obtained based on a neural network, and taking the initial model after training as a monocular depth estimation interpretable model, so that the monocular depth estimation interpretable model is the monocular depth estimation model with interpretability.

Each network unit in the neural network is allocated with an initial depth range, and then the neural network added with the initial depth range is used as an initial model; updating network parameters of the initial model in the course of training monocular depth estimation of the initial model; and carrying out depth range updating according to the initial model after training is finished, and taking the initial model after the depth range updating as a monocular depth estimation interpretable model.

The depth ranges include: a starting depth and an ending depth. The start depth and the end depth are both distances between a pixel point in the image and a lens of the monocular imaging device.

The updating the depth range according to the initial model after training comprises: inputting an image set to be processed into the initial model after training is finished; each network unit in the initial model after training has a response value for each pixel point in each image in the image set to be processed; taking any network unit in the initial model after training as a network unit to be analyzed; carrying out set division on each response value corresponding to each network unit to be analyzed according to a depth range to obtain a plurality of single network unit response value sets; carrying out average value calculation on each single network unit response value set to obtain a response average value; and finding out each response average value corresponding to the depth range to be analyzed from each response average value corresponding to each network unit, taking the maximum value of the found response average values as a target response average value, and taking the network unit corresponding to the target response average value as the network unit corresponding to the depth range to be analyzed, wherein the depth range to be analyzed is any one depth range.

That is, the network element identifier of the network element corresponding to the target response average value and the depth range to be analyzed are used as the associated data to update the depth range and network element mapping table.

The depth range to network element mapping table comprises: depth range and network element identification. The network element identification may be a network element name, a network element ID, or the like, that uniquely identifies a network element.

Each training sample in the set of training samples comprises: image sample and depth image calibration. The image sample is an image taken with a monocular imaging device. The depth image calibration value is an accurate calibration result of the depth image of the image sample.

The pixel value of each pixel point in the depth image is the depth, that is, the pixel value of each pixel point in the depth image is the distance between the pixel point in the image sample and the lens of the monocular image capturing device adopted for capturing the image sample.

The target loss function is a function obtained by adding depth error loss on the basis of a loss function of the training initial model. By using the depth error loss for model training, the accuracy of the network is improved based on the interpretability method.

It can be understood that, when the preset monocular depth estimation interpretable model is subjected to fine tuning training, the depth range of each network unit in the monocular depth estimation interpretable model is not updated any more, and only the network parameters of the monocular depth estimation interpretable model need to be updated.

In an embodiment, before the step of inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted, the method further includes:

s21: obtaining the training sample set and the monocular depth estimation interpretable model;

s22: dividing each training sample in the training sample set in batches by adopting a preset batch sample quantity to obtain a plurality of single batch sample sets;

s23: taking any one of the single batch sample sets as a target sample set;

s24: respectively inputting the image sample of each training sample in the target sample set into the monocular depth estimation interpretable model for monocular depth estimation to obtain depth image prediction data;

s25: calculating a loss value according to the depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set and the target loss function to obtain a target loss value;

s26: updating network parameters of the monocular depth estimation interpretable model according to the target loss value;

s27: and repeatedly executing the step of taking any one single batch of sample sets as a target sample set until a preset model fine tuning training end condition is reached, and taking the monocular depth estimation interpretable model reaching the model fine tuning training end condition as the target monocular depth estimation model.

In this embodiment, loss value calculation is performed according to each depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set, and the target loss function, so that the accuracy of a network is improved based on an interpretable method.

For S21, the training sample set input by the user may be obtained, the training sample set may be obtained from a third-party application system, or the training sample set may be obtained from a database.

The model for interpreting the monocular depth estimation input by the user can be obtained, the model for interpreting the monocular depth estimation can be obtained from a third-party application system, and the model for interpreting the monocular depth estimation can be obtained from a database.

For S22, the batch sample number is any one of 1, 32, 64, 128. It is understood that the number of batch samples may be other values, and is not limited herein.

And dividing each training sample in the training sample set in batches by adopting a preset batch sample quantity, and taking each set obtained by division as a single batch sample set.

When the single batch sample set is not the last set of batch division, the number of training samples in the single batch sample set is the same as the number of batch samples; when the single batch sample set is not the last set of the batch division, the number of training samples in the single batch sample set is less than or equal to the number of batch samples.

For step S23, any one of the single batch sample sets is obtained as a target sample set from each of the single batch sample sets obtained by batch division.

For S24, the image sample of each training sample in the target sample set is respectively input into the model interpretable for monocular depth estimation, and each depth image obtained by monocular depth estimation is used as one depth image prediction data. That is, the number of depth image prediction data is the same as the number of training samples in the target sample set.

For S25, first, according to each depth image prediction data and a depth image calibration value corresponding to each training sample in the target sample set, a parameter value of each parameter in the target loss function is determined, then, each parameter value is substituted into the target loss function to perform loss value calculation, and the calculated loss value is used as a target loss value.

For S26, updating the network parameters of the monocular depth estimation interpretable model according to the target loss value, wherein all the network parameters of the monocular depth estimation interpretable model are updated.

Optionally, according to the target loss value, network parameters of the monocular depth estimation interpretable model are updated, and the network parameters corresponding to the network unit of which the depth error of the monocular depth estimation interpretable model exceeds a threshold value are updated.

For S27, the step of taking any one of the single batch sample sets as the target sample set is repeatedly executed, that is, the steps S23 to S27 are repeatedly executed until a preset model fine tuning training end condition is reached.

Optionally, the condition for ending the model fine tuning training is that the target loss value reaches a preset loss value threshold.

In an embodiment, the step of performing loss value calculation according to each depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set, and the target loss function to obtain a target loss value includes:

s251: performing depth error calculation of pixel points at the same position according to a first depth image and a second depth image to obtain an initial depth error set, wherein the first depth image is the depth image calibration value corresponding to any one training sample in the target sample set, and the second depth image is the depth image prediction data corresponding to the first depth image;

s252: taking each depth error in the initial depth error set, which is greater than a preset depth error threshold value, as a target depth error set;

s253: taking each pixel point in the first depth image corresponding to the target depth error set as an error pixel point set;

s254: generating a depth range with the most pixels according to the error pixel point set to obtain a target depth range;

s255: and calculating loss values according to the target depth ranges, the depth image prediction data, the depth image calibration values corresponding to the training samples in the target sample set and the target loss function to obtain the target loss values.

In this embodiment, first, a depth range with the most error points is used as a target depth range, and then loss value calculation is performed according to each target depth range, each depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set, and the target loss function, so that the accuracy of the network is improved based on an interpretable method; by taking each pixel point in the first depth image corresponding to the target depth error set as an error pixel point set, error tracing based on accurate depth is realized, the accuracy of error tracing is improved, and the accuracy of the determined target depth range is improved.

For S251, difference calculation is performed on pixel values of pixel points at the same position according to the first depth image and the second depth image, then absolute value calculation is performed on each difference, each absolute value obtained by calculation is used as a depth error, and each depth error is used as an initial depth error set. That is, the number of depth errors in the initial depth error set is the same as the number of pixel points in the first depth image.

For step S252, each depth error in the initial depth error set that is greater than a preset depth error threshold is used as a target depth error set, so as to provide a basis for finding out an error pixel point.

Optionally, the depth error threshold is set to 1.25.

For S253, each pixel point in the first depth image corresponding to the target depth error set is used as an error pixel point set, so as to find an error pixel point for performing monocular depth estimation on the image sample corresponding to the first depth image by using the monocular depth estimation interpretable model.

For S254, according to the depth of each error pixel in the error pixel set, each error pixel in the error pixel set is divided into each depth set corresponding to each depth range, and the depth range corresponding to the depth set with the largest error pixel is used as the target depth range.

For step S255, first, according to each target depth range, each depth image prediction data, and the depth image calibration value corresponding to each training sample in the target sample set, a parameter value of each parameter in the target loss function is determined, then each parameter value is substituted into the target loss function to perform loss value calculation, and the calculated loss value is used as a target loss value.

In an embodiment, the step of updating the network parameters of the monocular depth estimation interpretable model according to the target loss value includes:

s261: finding out each network unit corresponding to each target depth range from a depth range and network unit mapping table corresponding to the monocular depth estimation interpretable model to obtain a single image network unit set;

s262: collecting each single image network unit set to obtain a network unit set to be deduplicated;

s263: carrying out duplicate removal processing on the network units to be subjected to duplicate removal to obtain a target network unit set;

s264: updating network parameters in the monocular depth estimation interpretable model corresponding to the target set of network elements according to the target loss value.

In the embodiment, the network parameters corresponding to each network unit corresponding to each target depth range in the model capable of interpreting monocular depth estimation are updated, so that the fine adjustment aiming at the main source of the error is realized, and the accuracy of monocular depth estimation of the model capable of interpreting monocular depth estimation is improved accurately.

For S261, performing depth range lookup on each target depth range from the depth range and network element mapping table, and using each network element corresponding to each network element identifier corresponding to the depth range in the depth range and network element mapping table as a single image network element set.

And S262, performing collection processing on each single-image network unit set, and taking a set obtained by the collection processing as a network unit set to be deduplicated.

For S263, performing deduplication processing on the network element set to be deduplicated, and taking the network element set to be deduplicated after the deduplication processing as a target network element set.

For S264, the network parameters corresponding to the target network element set in the monocular depth estimation interpretable model are updated according to the target loss value, thereby achieving fine tuning individually for the main source of error.

In an embodiment, the step of generating the depth range with the largest number of pixels according to the error pixel point set to obtain the target depth range includes:

s2541: according to a preset depth range list, carrying out set division on the error pixel point set to obtain a single-depth-range pixel point set;

s2542: finding out the single-depth-range pixel point set with the most pixels from each single-depth-range pixel point set to obtain a target pixel point set;

s2543: and taking the depth range corresponding to the target pixel point set as the target depth range.

In the embodiment, the depth range with the most error pixels is used as the target depth range, and a basis is provided for fine tuning training of the monocular depth estimation interpretable model based on the loss function obtained by the depth error loss.

For S2541, according to a preset depth range list, set division is performed on the error pixel point sets, and each set obtained by the division is used as a single-depth-range pixel point set, that is, the depth of an error pixel point in each single-depth-range pixel point set belongs to the same depth range.

For S2542, the single-depth-range pixel point set with the most pixels is found from each single-depth-range pixel point set, and a target pixel point set is obtained, that is, the target pixel point set is the single-depth-range pixel point set with the most pixels in each single-depth-range pixel point set.

For S2543, the depth range corresponding to the target pixel point set is the depth range with the largest error pixel points, so that the depth range corresponding to the target pixel point set is used as the target depth range, which improves accuracy of calculating a loss value based on the target depth range, also improves accuracy of determining a network unit with an error based on the target depth range, and realizes accurate error tracing.

In an embodiment, the step of calculating a loss value according to each of the target depth range, each of the depth image prediction data, the depth image calibration value corresponding to each of the training samples in the target sample set, and the target loss function to obtain the target loss value includes:

s2551: generating a binarization mask in the target depth range corresponding to the first depth image according to the first depth image and the second depth image to obtain a target binarization mask;

s2552: and inputting the target binary mask, the depth image prediction data and the depth image calibration value corresponding to each training sample in the target sample set into the target loss function for loss value calculation to obtain the target loss value.

In this embodiment, according to the first depth image and the second depth image, a binarization mask is generated in the target depth range corresponding to the first depth image, and then the binarization mask is used for loss value calculation, so that fine tuning training of a monocular depth estimation interpretable model based on a loss function obtained by a depth error loss is realized, and accuracy of a network is improved based on an interpretable method.

For S2551, any pixel position within the target depth range corresponding to the first depth image is taken as a target pixel position; when the depth range corresponding to the depth corresponding to the target pixel position in the first depth image is the same as the depth range corresponding to the depth corresponding to the target pixel position in the second depth image, setting the mask value corresponding to the target pixel position to 1; and when the depth range corresponding to the depth corresponding to the target pixel position in the first depth image is different from the depth range corresponding to the depth corresponding to the target pixel position in the second depth image, setting the mask value corresponding to the target pixel position to be 0.

The target binary mask includes: pixel location and mask value.

For step S2552, the depth image calibration values corresponding to the target binarization masks, the depth image prediction data, and the training samples in the target sample set are input to the target loss function to perform loss value calculation, and the calculated loss value is used as the target loss value.

In one embodiment, the formula for calculating the objective loss function is L_errorComprises the following steps:

is to be

And

the co-located vector elements of (a) are multiplied.

According to the embodiment, the fine tuning training of the monocular depth estimation interpretable model is carried out on the basis of the loss function obtained by the depth error loss, and the accuracy of the network is improved on the basis of the interpretable method.

λ is a hyper-parameter, which is a preset constant.

Referring to fig. 2, the present application further proposes a training apparatus for a monocular depth estimation model, the apparatus comprising:

an image obtaining module 100, configured to obtain an image to be predicted;

a target depth image determining module 200, configured to input the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation, so as to obtain a target depth image corresponding to the image to be predicted;

a model fine tuning training module 300, configured to perform fine tuning training on a preset monocular depth estimation interpretable model by using a preset training sample set and a target loss function, where the target loss function is a loss function obtained based on a depth error loss; and taking the monocular depth estimation interpretable model with the fine tuning training finished as the target monocular depth estimation model.

In one embodiment, the model fine tuning training module 300 further comprises: the system comprises a data acquisition sub-module, a batch division sub-module, a model fine tuning training sub-module and a cycle control sub-module;

the data acquisition sub-module is used for acquiring the training sample set and the monocular depth estimation interpretable model;

the batch dividing submodule is used for carrying out batch division on each training sample in the training sample set by adopting a preset batch sample number to obtain a plurality of single batch sample sets;

the model fine tuning training submodule is used for taking any one single batch of sample set as a target sample set, inputting the image sample of each training sample in the target sample set into the monocular depth estimation interpretable model for monocular depth estimation to obtain depth image prediction data, calculating loss values according to each depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set and the target loss function to obtain a target loss value, and updating network parameters of the monocular depth estimation interpretable model according to the target loss value;

and the circulation control sub-module is used for repeatedly executing the step of taking any one single batch of sample sets as a target sample set until a preset model fine tuning training end condition is reached, and taking the model which can be interpreted by the monocular depth estimation and reaches the model fine tuning training end condition as the target monocular depth estimation model.

In one embodiment, the model fine tuning training sub-module further includes: a target depth range determining unit and a target loss value determining unit;

the target depth range determining unit is configured to perform depth error calculation on pixel points at the same position according to a first depth image and a second depth image to obtain an initial depth error set, where the first depth image is the depth image calibration value corresponding to any one of the training samples in the target sample set, and the second depth image is the depth image prediction data corresponding to the first depth image; taking each depth error in the initial depth error set, which is greater than a preset depth error threshold value, as a target depth error set; taking each pixel point in the first depth image corresponding to the target depth error set as an error pixel point set; generating a depth range with the most pixel points according to the error pixel point set to obtain a target depth range;

the target loss value determining unit is configured to perform loss value calculation according to each target depth range, each depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set, and the target loss function, so as to obtain the target loss value.

In one embodiment, the model fine tuning training sub-module further includes: a network parameter updating unit;

the network parameter updating unit is configured to find each network unit corresponding to each target depth range from a depth range and a network unit mapping table corresponding to the monocular depth estimation interpretable model to obtain a single image network unit set, perform aggregation processing on each single image network unit set to obtain a network unit set to be deduplicated, perform deduplication processing on the network unit set to be deduplicated to obtain a target network unit set, and update a network parameter corresponding to the target network unit set in the monocular depth estimation interpretable model according to the target loss value.

In one embodiment, the target depth range determining unit further includes: according to a preset depth range list, carrying out set division on the error pixel point set to obtain a single-depth-range pixel point set; finding out the single-depth-range pixel point set with the most pixels from each single-depth-range pixel point set to obtain a target pixel point set; and taking the depth range corresponding to the target pixel point set as the target depth range.

In one embodiment, the target loss value determining unit further includes: generating a binarization mask in the target depth range corresponding to the first depth image according to the first depth image and the second depth image to obtain a target binarization mask; and inputting the target binary mask, the depth image prediction data and the depth image calibration value corresponding to each training sample in the target sample set into the target loss function for loss value calculation to obtain the target loss value.

In one embodiment, the above formula L for calculating the objective loss function_errorComprises the following steps:

is to be

And with

The same position vector elements of (a) are added.

Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the computer designed processor is used to provide computational and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as a training method of the monocular depth estimation model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of training a monocular depth estimation model. The training method of the monocular depth estimation model comprises the following steps: acquiring a picture to be predicted; inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted; the training method of the target monocular depth estimation model comprises the following steps: performing fine tuning training on a preset monocular depth estimation interpretable model by adopting a preset training sample set and a target loss function, wherein the target loss function is a loss function obtained based on depth error loss; and taking the monocular depth estimation interpretable model with the fine tuning training finished as the target monocular depth estimation model.

In an embodiment, before the step of inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted, the method further includes: obtaining the training sample set and the monocular depth estimation interpretable model; dividing each training sample in the training sample set in batches by adopting a preset batch sample quantity to obtain a plurality of single batch sample sets; taking any one of the single batch sample sets as a target sample set; respectively inputting the image sample of each training sample in the target sample set into the monocular depth estimation interpretable model for monocular depth estimation to obtain depth image prediction data; calculating a loss value according to the depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set and the target loss function to obtain a target loss value; updating network parameters of the monocular depth estimation interpretable model according to the target loss value; and repeatedly executing the step of taking any one single batch of sample sets as a target sample set until a preset model fine tuning training end condition is reached, and taking the monocular depth estimation interpretable model reaching the model fine tuning training end condition as the target monocular depth estimation model.

In an embodiment, the step of performing loss value calculation according to each depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set, and the target loss function to obtain a target loss value includes: performing depth error calculation of pixel points at the same position according to a first depth image and a second depth image to obtain an initial depth error set, wherein the first depth image is the depth image calibration value corresponding to any one training sample in the target sample set, and the second depth image is the depth image prediction data corresponding to the first depth image; taking each depth error in the initial depth error set, which is greater than a preset depth error threshold value, as a target depth error set; taking each pixel point in the first depth image corresponding to the target depth error set as an error pixel point set; generating a depth range with the most pixel points according to the error pixel point set to obtain a target depth range; and calculating loss values according to the target depth ranges, the depth image prediction data, the depth image calibration values corresponding to the training samples in the target sample set and the target loss function to obtain the target loss values.

In an embodiment, the step of updating the network parameters of the monocular depth estimation interpretable model according to the target loss value includes: finding out each network unit corresponding to each target depth range from a depth range and a network unit mapping table corresponding to the monocular depth estimation interpretable model to obtain a single image network unit set; collecting each single image network unit set to obtain a network unit set to be deduplicated; carrying out duplicate removal processing on the network units to be subjected to duplicate removal to obtain a target network unit set; updating network parameters corresponding to the target network element set in the monocular depth estimation interpretable model according to the target loss value.

In an embodiment, the step of generating the depth range with the largest number of pixels according to the error pixel point set to obtain the target depth range includes: according to a preset depth range list, carrying out set division on the error pixel point set to obtain a single-depth-range pixel point set; finding out the single-depth-range pixel point set with the most pixels from each single-depth-range pixel point set to obtain a target pixel point set; and taking the depth range corresponding to the target pixel point set as the target depth range.

In an embodiment, the step of calculating a loss value according to each of the target depth range, each of the depth image prediction data, the depth image calibration value corresponding to each of the training samples in the target sample set, and the target loss function to obtain the target loss value includes: generating a binarization mask in the target depth range corresponding to the first depth image according to the first depth image and the second depth image to obtain a target binarization mask; and inputting the target binary mask, the depth image prediction data and the depth image calibration value corresponding to each training sample in the target sample set into the target loss function for loss value calculation to obtain the target loss value.

is to be

And

the same position vector elements of (a) are added.

An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a method for training a monocular depth estimation model, including the steps of: acquiring a picture to be predicted; inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain a target depth image corresponding to the image to be predicted; the training method of the target monocular depth estimation model comprises the following steps: performing fine tuning training on a preset monocular depth estimation interpretable model by adopting a preset training sample set and a target loss function, wherein the target loss function is a loss function obtained based on depth error loss; and taking the monocular depth estimation interpretable model with the fine tuning training finished as the target monocular depth estimation model.

According to the training method of the monocular depth estimation model, fine tuning training is carried out on the monocular depth estimation interpretable model through the loss function obtained based on the depth error loss, and the accuracy of the network is improved based on the interpretable method.

wherein,λ is a hyper-parameter, P is the number of training samples in the target sample set, N_kIs the number of pixel points in the image sample of the kth of the training sample in the target sample set,

is to be

And

the same position vector elements of (a) are added.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double-rate SDRAM (SSRSDRAM), Enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and bus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims

1. A method for training a monocular depth estimation model, the method comprising:

acquiring a to-be-predicted image;

2. The method for training a monocular depth estimation model according to claim 1, wherein before the step of inputting the image to be predicted into a preset target monocular depth estimation model for monocular depth estimation to obtain the target depth image corresponding to the image to be predicted, the method further comprises:

taking any one of the single batch sample sets as a target sample set;

calculating a loss value according to the depth image prediction data, the depth image calibration value corresponding to each training sample in the target sample set and the target loss function to obtain a target loss value;

3. The method of claim 2, wherein the step of performing a loss value calculation according to the depth image prediction data, the depth image calibration value corresponding to each of the training samples in the target sample set, and the target loss function to obtain a target loss value includes:

4. A method for training a monocular depth estimation model according to claim 3, wherein the step of updating network parameters of the monocular depth estimation interpretable model according to the target loss value comprises:

updating network parameters corresponding to the target network element set in the monocular depth estimation interpretable model according to the target loss value.

5. The method for training a monocular depth estimation model according to claim 3, wherein the step of generating a depth range with the largest number of pixels according to the error pixel point set to obtain a target depth range includes:

finding out the single-depth range pixel point set with the most pixels from each single-depth range pixel point set to obtain a target pixel point set;

6. The method of claim 3, wherein the step of calculating a loss value according to the target depth range, the depth image prediction data, the depth image calibration value corresponding to each of the training samples in the target sample set, and the target loss function to obtain the target loss value comprises:

generating a binarization mask in the target depth range corresponding to the first depth image according to the first depth image and the second depth image to obtain a target binarization mask;

7. The method for training a monocular depth estimation model according to claim 6, wherein the formula L for calculating the objective loss function_errorComprises the following steps:

where λ is a hyper-parameter, P is the number of training samples in the target sample set, N_kIs the number of pixel points in the image sample of the kth of the training samples in the target sample set,

is to be

And

the co-located vector elements of (a) are multiplied.

8. An apparatus for training a monocular depth estimation model, the apparatus comprising:

the image acquisition module is used for acquiring an image to be predicted;

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program performs the steps of the method according to any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.