CN115170971A

CN115170971A - Construction safety-oriented self-supervision monocular depth estimation transfer learning method and system

Info

Publication number: CN115170971A
Application number: CN202210924807.9A
Authority: CN
Inventors: 郑小玉; 刘自强; 童宁波; 朱华蓉
Original assignee: PowerChina Chengdu Engineering Co Ltd
Current assignee: PowerChina Chengdu Engineering Co Ltd
Priority date: 2022-08-02
Filing date: 2022-08-02
Publication date: 2022-10-11

Abstract

The invention discloses a construction safety-oriented self-supervised monocular depth estimation transfer learning method and system in the technical field of artificial intelligence. The key points of the technical solution are as follows: acquiring construction site images that include construction equipment and workers, labeling and partitioning all the construction site images, and constructing a migration data set; establishing a self-supervised monocular depth estimation network model and loading its pre-trained parameters; performing migration training on the model with the migration data set, with the training process constrained by a loss function; and evaluating the performance of the model after each training round with quantitative evaluation indexes, and screening to obtain the optimal monocular depth estimation network model. The invention exploits the encoder-decoder structure of the self-supervised monocular depth estimation model and uses the hierarchical distinction of scene depth to migrate the self-supervised monocular depth estimation network model from scenes such as autonomous driving to construction scenes.

Description

Construction safety-oriented self-supervision monocular depth estimation transfer learning method and system

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a construction safety-oriented self-supervision monocular depth estimation transfer learning method and system.

Background

Estimating the depth of a construction scene from a single RGB image is a key prerequisite for many applications, including workspace safety, localization, productivity analysis, activity recognition and scene understanding. Self-supervised monocular depth estimation methods train depth networks on large amounts of unlabeled data and can even outperform some supervised approaches.

At present, applying this type of method in practice on construction sites has inherent limitations. In self-supervised monocular depth estimation, the main datasets used for training and evaluation, such as KITTI and DDAD, target the autonomous driving task and include ground-truth depth data for training, validating and testing models. Knowledge transfer between autonomous driving and construction scene analysis, which differ in feature space and data distribution, would improve representation learning in downstream tasks. However, although transfer learning between autonomous driving tasks, such as migrating a model from the KITTI dataset to the DDAD dataset, is readily achieved, migrating to building construction scenes remains a major challenge: ground-truth depth such as lidar data is not available on construction sites, so both migrating and evaluating models are difficult.

Therefore, designing a construction safety-oriented self-supervised monocular depth estimation transfer learning method and system that overcomes the above defects is a problem that urgently needs to be solved.

Disclosure of Invention

To overcome the defects of the prior art, the invention aims to provide a construction safety-oriented self-supervised monocular depth estimation transfer learning method and system that exploit the encoder-decoder structure of the self-supervised monocular depth estimation model, use the hierarchical distinction of scene depth to migrate the self-supervised monocular depth estimation network model from autonomous driving and other scenes to construction scenes, and provide reliable evaluation indexes for assessing the migration effect.

The technical purpose of the invention is realized by the following technical scheme:

in a first aspect, a construction safety-oriented self-supervision monocular depth estimation transfer learning method is provided, which comprises the following steps:

acquiring construction site images including construction equipment and workers, marking and dividing all the construction site images, and constructing a migration data set;

establishing a self-supervision monocular depth estimation network model, and loading pre-training parameters of the monocular depth estimation network model;

carrying out migration training on the monocular depth estimation network model according to the migration data set, wherein loss function constraint is adopted in the training process;

and evaluating the performance of the monocular depth estimation network model after each training by using the quantitative evaluation index, and screening to obtain the optimal monocular depth estimation network model.

Furthermore, the labeling boxes in the construction site images are annotated as rectangular bounding boxes;

each rectangular bounding box is drawn over a region that lies at a clearly distinguishable depth level and has uniform depth inside;

the label of each rectangular bounding box is its serial number when all selected boxes are sorted by depth from small to large.
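A labeled image under this scheme needs very little structure: a list of rectangles, each carrying its depth serial number. The sketch below is only an illustration, not the patent's storage format; `DepthBox`, `check_ranks` and `box_pixels` are hypothetical names, and half-open pixel coordinates and a nested-list depth map are assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DepthBox:
    """Hypothetical record for one depth-level rectangle and its serial number."""
    x0: int    # left column, inclusive (assumed convention)
    y0: int    # top row, inclusive
    x1: int    # right column, exclusive
    y1: int    # bottom row, exclusive
    rank: int  # serial number: 1 = nearest box, larger = farther


def check_ranks(boxes: Sequence[DepthBox]) -> None:
    """Raise if the serial numbers are not exactly 1..K."""
    ranks = sorted(b.rank for b in boxes)
    if ranks != list(range(1, len(boxes) + 1)):
        raise ValueError(f"expected ranks 1..{len(boxes)}, got {ranks}")


def box_pixels(depth: List[List[float]], box: DepthBox) -> List[float]:
    """Collect the depth values inside one box in row-major order."""
    return [depth[r][c] for r in range(box.y0, box.y1) for c in range(box.x0, box.x1)]


# Toy 2x3 depth map: a near 2x2 region on the left, a far column on the right.
depth = [[1.0, 1.0, 5.0],
         [1.0, 1.0, 5.0]]
near = DepthBox(0, 0, 2, 2, rank=1)
far = DepthBox(2, 0, 3, 2, rank=2)
check_ranks([near, far])  # passes: ranks are exactly 1..2
```

Storing only the serial number, not a depth value, matches the fact that no ground-truth depth exists on site: the annotator asserts an ordering, never a distance.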

Further, the depth loss calculated by the loss function includes:

loss due to depth deviation between the random variable and the corresponding average value in each labeling box;

and the loss corresponding to the depth sequence among all the labeling frames.

Further, a calculation formula of the loss caused by the depth deviation is specifically as follows:

where L_dd denotes the loss caused by the depth deviation; N denotes the number of labeling boxes in the current image; D'_i denotes the region of D' inside the i-th labeling box of the current image; D' denotes the normalized version of the depth map predicted by the depth estimation network for the current image; and σ(D'_i) denotes the standard deviation of the random variable D'_i;

the normalized depth map is calculated as:

$$D' = \frac{D - D_{min}}{D_{max} - D_{min}}$$

where D denotes the depth map predicted by the depth estimation network for the current image, and D_max and D_min denote the maximum and minimum values in D, respectively.
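The normalization is min-max scaling of the predicted depth map to [0, 1]. The L_dd formula itself is not reproduced in this text, so the sketch below assumes the natural reading of the listed symbols, L_dd = (1/N) * sum_i sigma(D'_i), i.e. the average within-box standard deviation of the normalized depth. Using the population standard deviation is also an assumption, and the function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def min_max_normalize(depth: List[List[float]]) -> List[List[float]]:
    """D' = (D - D_min) / (D_max - D_min): rescale a predicted depth map to [0, 1]."""
    flat = [v for row in depth for v in row]
    d_min, d_max = min(flat), max(flat)
    span = d_max - d_min
    if span == 0:  # constant map: avoid division by zero
        return [[0.0] * len(row) for row in depth]
    return [[(v - d_min) / span for v in row] for row in depth]


def depth_deviation_loss(boxes: Sequence[Sequence[float]]) -> float:
    """ASSUMED L_dd: mean over the N boxes of the std. dev. of normalized depth inside each box."""
    return sum(statistics.pstdev(vals) for vals in boxes) / len(boxes)


normed = min_max_normalize([[2.0, 4.0], [6.0, 10.0]])  # [[0.0, 0.25], [0.5, 1.0]]
loss = depth_deviation_loss([normed[0], normed[1]])   # boxes = top row, bottom row
```

Because each box is annotated as a region of uniform depth, any spread of predicted depth inside it counts as error; normalizing first keeps the penalty independent of the network's output scale.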

Further, the loss calculation formula corresponding to the depth order is specifically:

where L_do denotes the loss corresponding to the depth order; N denotes the number of labeling boxes in the current image; v_ij denotes the inter-box distance estimation loss between the i-th and j-th labeling boxes; i and j are integers denoting the serial numbers of the labeling boxes in the image, and a larger serial number indicates a larger average depth value of the labeling box;

the calculation formula of the distance estimation loss is specifically as follows:

where μ(·) denotes the function that computes the mean of the corresponding random variable;

to avoid the perturbation introduced by the trivial solution.
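The exact forms of L_do and v_ij are not reproduced in this text. As a hedged stand-in, the sketch below uses a margin ranking (hinge) loss, a common technique for ordinal constraints and not necessarily the patent's formula: for every pair i < j, box i should have a smaller mean normalized depth mu(D'_i) than box j by at least a margin. Averaging over pairs, the margin value 0.05 and all function names are assumptions.

```python
import statistics
from itertools import combinations
from typing import Sequence


def pair_loss(mean_near: float, mean_far: float, margin: float) -> float:
    """ASSUMED v_ij (hinge): zero once the nearer box is shallower by at least `margin`."""
    return max(0.0, mean_near - mean_far + margin)


def depth_order_loss(boxes_by_rank: Sequence[Sequence[float]], margin: float = 0.05) -> float:
    """boxes_by_rank[k] holds the normalized depths of the box with serial number k+1."""
    means = [statistics.fmean(vals) for vals in boxes_by_rank]  # mu(D'_i)
    pairs = list(combinations(range(len(means)), 2))  # all i < j: box i should be nearer
    if not pairs:
        return 0.0
    return sum(pair_loss(means[i], means[j], margin) for i, j in pairs) / len(pairs)


ordered = depth_order_loss([[0.1, 0.1], [0.5], [0.9]])  # correct order gives 0.0
swapped = depth_order_loss([[0.1], [0.5], [0.3]])       # boxes 2 and 3 inverted gives > 0
```

The margin plays the role the text assigns to the anti-trivial-solution term: with margin 0, a constant depth map (all box means equal) would already drive every pair loss to zero.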

Further, the migration training process of the monocular depth estimation network model specifically includes:

loading a PackNet network model pre-trained on the KITTI autonomous driving dataset, and freezing its Encoder module;

loading the training images and labels of the data set;

training the PackNet model, with the Decoder module parameters constrained by the migration loss function;

testing the model after each training round with the validation set of the data set and the evaluation indexes.
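Freezing the encoder means its parameters receive no updates while the decoder is optimized against the migration loss. The framework-free toy below only illustrates that update rule and the per-round validation; it is not PackNet, and `train_decoder_only`, the one-parameter-per-module "model" and the toy loss are hypothetical. In PyTorch the same effect is usually obtained by setting `requires_grad = False` on the encoder's parameters and passing only the decoder's parameters to the optimizer.

```python
from typing import Callable, Dict, List

Params = Dict[str, float]


def train_decoder_only(encoder: Params, decoder: Params,
                       grad_fn: Callable[[Params, Params], Params],
                       val_fn: Callable[[Params, Params], float],
                       lr: float = 0.1, rounds: int = 3) -> List[float]:
    """Gradient descent on decoder parameters only; the encoder is read but never written."""
    history = []
    for _ in range(rounds):
        grads = grad_fn(encoder, decoder)   # gradients w.r.t. decoder parameters only
        for name, g in grads.items():
            decoder[name] -= lr * g         # the update step touches the decoder dict alone
        history.append(val_fn(encoder, decoder))  # validate after every training round
    return history


def toy_loss(enc: Params, dec: Params) -> float:
    """Hypothetical stand-in for the migration loss: (w_enc * w_dec - 1)^2."""
    return (enc["w_enc"] * dec["w_dec"] - 1.0) ** 2


def toy_grad(enc: Params, dec: Params) -> Params:
    """Derivative of the toy loss with respect to w_dec."""
    return {"w_dec": 2.0 * (enc["w_enc"] * dec["w_dec"] - 1.0) * enc["w_enc"]}


encoder = {"w_enc": 2.0}   # "pre-trained" and frozen
decoder = {"w_dec": 0.0}
history = train_decoder_only(encoder, decoder, toy_grad, toy_loss)
```

After three rounds the decoder weight has moved from 0.0 to about 0.496 while the encoder weight is still 2.0.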

Further, the performance evaluation process of the monocular depth estimation network model specifically comprises the following steps:

performing a comprehensive evaluation based on the relative mean error of the depth deviation and the order accuracy of the depth order;

the lower the relative mean error and the higher the order accuracy, the better the performance of the monocular depth estimation network model.
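The text does not state how the two indexes are combined when they disagree across training rounds. The sketch below assumes one simple screening rule, highest order accuracy first with ties broken by lower relative mean error; `EvalResult` and `select_best` are hypothetical names, and the scores in the usage lines are made up.

```python
from typing import Dict, NamedTuple


class EvalResult(NamedTuple):
    rme: float   # relative mean error: lower is better
    oacc: float  # order accuracy: higher is better


def select_best(results: Dict[str, EvalResult]) -> str:
    """ASSUMED screening rule: highest OAcc, ties broken by lowest RME."""
    return min(results, key=lambda name: (-results[name].oacc, results[name].rme))


# Made-up validation scores for three training rounds.
results = {
    "round_1": EvalResult(rme=0.20, oacc=0.70),
    "round_2": EvalResult(rme=0.15, oacc=0.82),
    "round_3": EvalResult(rme=0.12, oacc=0.82),
}
best = select_best(results)  # "round_3": same OAcc as round_2, lower RME
```

Any other monotone combination (for example, a weighted score) would pick the same round whenever one round is best on both indexes; the rule only matters when they conflict.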

In a second aspect, a construction safety-oriented self-supervision monocular depth estimation transfer learning system is provided, which includes:

the data acquisition module is used for acquiring construction site images including construction equipment and workers, marking and dividing all the construction site images and constructing a migration data set;

the model building module is used for building a self-supervision monocular depth estimation network model and loading pre-training parameters of the monocular depth estimation network model;

the migration training module is used for carrying out migration training on the monocular depth estimation network model according to the migration data set, and loss function constraint is adopted in the training process;

and the evaluation screening module is used for evaluating the performance of the monocular depth estimation network model after each training by utilizing the quantitative evaluation indexes and screening to obtain the optimal monocular depth estimation network model.

In a third aspect, a computer terminal is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the construction safety-oriented self-supervised monocular depth estimation transfer learning method according to any implementation of the first aspect.

In a fourth aspect, a computer-readable medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the construction safety-oriented self-supervised monocular depth estimation transfer learning method according to any implementation of the first aspect.

Compared with the prior art, the invention has the following beneficial effects:

1. The construction safety-oriented self-supervised monocular depth estimation transfer learning method provided by the invention exploits the encoder-decoder structure of the self-supervised monocular depth estimation model, uses the hierarchical distinction of scene depth to migrate the self-supervised monocular depth estimation network model from scenes such as autonomous driving to construction scenes, and provides reliable evaluation indexes for assessing the migration effect;

2. The method establishes a new learning paradigm comprising a self-supervised monocular depth estimation model migration method, a loss function, evaluation indexes and a depth labeling method. This solves the model domain shift problem that arises because no ground-truth depth data are available for self-supervised monocular depth prediction models in construction scenes, and improves depth estimation accuracy in downstream tasks.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a flow chart in an embodiment of the invention;

FIG. 2 is a block diagram of a system in an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.

Example 1: as shown in FIG. 1, the construction safety-oriented self-supervised monocular depth estimation transfer learning method comprises the following steps:

S1: acquiring construction site images including construction equipment and workers, marking and dividing all the construction site images, and constructing a migration data set;

S2: establishing a self-supervised monocular depth estimation network model, and loading the pre-trained parameters of the monocular depth estimation network model;

S3: performing migration training on the monocular depth estimation network model with the migration data set, with the training process constrained by a loss function;

S4: evaluating the performance of the monocular depth estimation network model after each training round with the quantitative evaluation indexes, and screening to obtain the optimal monocular depth estimation network model.

The data set images are labeled with a method similar to the rectangular bounding box labeling commonly used for objects in the field of object detection; the difference is that each rectangular bounding box marks a region at a distinct depth level whose interior has uniform depth. The label of each rectangular bounding box is its serial number when all selected boxes are sorted by depth from small to large.

Instead of directly calculating a per-pixel depth loss, the migration loss function compares the deviation of the depth random variable from its mean within each labeled box, and the depth order among all labeled boxes; these two comparisons together constitute the depth loss.

The calculation formula of the loss function is specifically as follows:

$$L = \alpha L_{do} + L_{dd}$$

where L denotes the depth loss calculated by the loss function; L_dd is the loss caused by depth deviation; L_do is the loss of depth order; and α is a weight adjustment parameter, with α = 1/3.
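As a worked illustration with made-up loss values, the weight α = 1/3 scales the depth-order term before it is added to the depth-deviation term:

```latex
\begin{aligned}
L &= \alpha L_{do} + L_{dd}, \qquad \alpha = \tfrac{1}{3},\\
L_{do} = 0.6,\; L_{dd} = 0.1 \;\Rightarrow\; L &= \tfrac{1}{3}(0.6) + 0.1 = 0.2 + 0.1 = 0.3
\end{aligned}
```

With these illustrative values the order term still contributes 0.2 of the total 0.3, so down-weighting it does not make it negligible when the ordering is badly violated.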

The formula for calculating the loss caused by the depth deviation is specifically as follows:

where L_dd denotes the loss caused by depth deviation; N denotes the number of labeling boxes in the current image; D'_i denotes the region of D' inside the i-th labeling box of the current image; D' denotes the normalized version of the depth map predicted by the depth estimation network for the current image; and σ(D'_i) denotes the standard deviation of the random variable D'_i.

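The normalized depth map is calculated as:

$$D' = \frac{D - D_{min}}{D_{max} - D_{min}}$$

where D is the depth map predicted for the current image, and D_max and D_min are its maximum and minimum values.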

The loss calculation formula corresponding to the depth order is specifically as follows:

where L_do denotes the loss corresponding to the depth order; N denotes the number of labeling boxes in the current image; v_ij denotes the inter-box distance estimation loss between the i-th and j-th labeling boxes; i and j are integers denoting the serial numbers of the labeling boxes in the image, and a larger serial number indicates a larger average depth value of the labeling box.

to avoid the perturbation introduced by the trivial solution.

The migration training process of the monocular depth estimation network model specifically comprises: loading a PackNet network model pre-trained on the KITTI autonomous driving dataset and freezing its Encoder module; loading the training images and labels of the data set; training the PackNet model, with the Decoder module parameters constrained by the migration loss function; and testing the model after each training round with the validation set of the data set and the evaluation indexes.

The evaluation uses two accuracy indexes, which assess each tested model with respect to depth deviation and depth order, respectively. The first is the Relative Mean Error (RME) index, defined as:

where d_j is the j-th pixel of the region of annotation box Y_i in the predicted depth image; μX_i is the mean depth of the region of annotation box Y_i; and n is the number of pixels in the region of annotation box Y_i.
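The RME formula is not reproduced in this text. A minimal sketch, assuming the per-box form RME = (1/n) * sum_j |d_j - muX_i| / muX_i built from the symbols defined above, and assuming the image-level value is the average over all boxes; the function names are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def box_rme(pixels: Sequence[float]) -> float:
    """ASSUMED per-box RME: mean of |d_j - mu| / mu over the n pixels d_j of one box."""
    mu = fmean(pixels)  # muX_i, the box's mean predicted depth
    if mu == 0:
        raise ValueError("mean depth is zero; relative error is undefined")
    return fmean(abs(d - mu) / mu for d in pixels)


def image_rme(boxes: Sequence[Sequence[float]]) -> float:
    """ASSUMED image-level RME: average of the per-box values."""
    return fmean(box_rme(b) for b in boxes)


rme = image_rme([[9.0, 10.0, 11.0], [5.0, 5.0]])  # (1/15 + 0) / 2 = 1/30, up to rounding
```

Dividing by the box mean makes this index invariant to a global rescaling of the predicted depth, which suits self-supervised monocular models that typically recover depth only up to an unknown scale.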

The other is the Order Accuracy (OAcc) index, defined as:

where K is the total number of annotation boxes in the image.

δ_ij is calculated as:

where

represents the true depth relation of the two labeling boxes (IO denotes ascending order and DO denotes descending order);

represents the normal distribution of the difference between the two random variables; X_i denotes the distribution of depth values in the region of the i-th labeling box; and R(X_i, X_j) denotes the function that computes the depth relation of the two annotation boxes from the predicted depth.

Label(x) is defined as:

When evaluating depth estimation models, a higher OAcc and a lower RME indicate a better model.
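The OAcc, δ_ij and Label(x) formulas are not reproduced in this text. The sketch below is one hedged reading of the surrounding definitions: R(X_i, X_j) is taken as P(X_i - X_j < 0) under the normal difference distribution N(mu_i - mu_j, s_i^2 + s_j^2) (treating the boxes as independent), Label(x) thresholds that probability at 0.5, δ_ij is 1 when the predicted relation matches the one implied by the serial numbers, and OAcc is the fraction of the K(K-1)/2 box pairs with δ_ij = 1. All of these forms and names are assumptions.

```python
from itertools import combinations
from statistics import NormalDist, fmean, pstdev
from typing import Sequence


def relation_prob(xi: Sequence[float], xj: Sequence[float]) -> float:
    """ASSUMED R(X_i, X_j): P(X_i - X_j < 0), with X_i - X_j ~ N(mu_i - mu_j, s_i^2 + s_j^2)."""
    mu = fmean(xi) - fmean(xj)
    sigma = (pstdev(xi) ** 2 + pstdev(xj) ** 2) ** 0.5
    if sigma == 0:  # both boxes constant: NormalDist.cdf is undefined, compare means directly
        return 0.5 if mu == 0 else (1.0 if mu < 0 else 0.0)
    return NormalDist(mu, sigma).cdf(0.0)


def label(x: float) -> str:
    """ASSUMED Label(x): 'IO' (box i shallower than box j) when x >= 0.5, else 'DO'."""
    return "IO" if x >= 0.5 else "DO"


def order_accuracy(boxes: Sequence[Sequence[float]], ranks: Sequence[int]) -> float:
    """OAcc: share of the K(K-1)/2 box pairs whose predicted relation matches the labels."""
    pairs = list(combinations(range(len(boxes)), 2))
    if not pairs:
        raise ValueError("need at least two labeled boxes")
    hits = 0
    for i, j in pairs:
        truth = "IO" if ranks[i] < ranks[j] else "DO"  # true relation from serial numbers
        if label(relation_prob(boxes[i], boxes[j])) == truth:
            hits += 1  # delta_ij = 1
    return hits / len(pairs)


# Made-up predicted depths for boxes labeled 1 (near), 2 and 3 (far).
boxes = [[1.0, 1.2], [3.0, 3.1], [2.0, 2.2]]
oacc = order_accuracy(boxes, ranks=[1, 2, 3])  # box 2 is predicted farther than box 3
```

Using the spread of depths inside each box, rather than comparing bare means, makes a pair count as correctly ordered only when the two depth distributions are separated in the right direction with probability above one half.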

Example 2: as shown in FIG. 2, the construction safety-oriented self-supervised monocular depth estimation transfer learning system implements the method described in Example 1 and comprises a data acquisition module, a model building module, a migration training module and an evaluation screening module.

The data acquisition module is used for acquiring construction site images including construction equipment and workers, marking and dividing all the construction site images and constructing a migration data set; the model building module is used for building a self-supervision monocular depth estimation network model and loading pre-training parameters of the monocular depth estimation network model; the migration training module is used for carrying out migration training on the monocular depth estimation network model according to the migration data set, and loss function constraint is adopted in the training process; and the evaluation screening module is used for evaluating the performance of the monocular depth estimation network model after each training by utilizing the quantitative evaluation indexes and screening to obtain the optimal monocular depth estimation network model.
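The four modules form a simple pipeline. The sketch below is a hypothetical wiring in which each module is a callable; the class and field names are not from the patent, and reducing the evaluation to a single score (which in practice would have to combine RME and OAcc) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TransferLearningSystem:
    """Hypothetical wiring of the four modules as pluggable callables."""
    acquire_data: Callable[[], Dict[str, List[Any]]]   # data acquisition module
    build_model: Callable[[], Any]                     # model building module
    train_one_round: Callable[[Any, List[Any]], Any]   # migration training module
    evaluate: Callable[[Any, List[Any]], float]        # evaluation screening: higher score = better

    def run(self, rounds: int) -> Any:
        data = self.acquire_data()
        model = self.build_model()
        best_model, best_score = None, float("-inf")
        for _ in range(rounds):
            model = self.train_one_round(model, data["train"])
            score = self.evaluate(model, data["val"])
            if score > best_score:  # screening: keep the best model seen so far
                best_model, best_score = model, score
        return best_model


# Toy usage: the "model" is an integer that each round increments; the best value is 2.
system = TransferLearningSystem(
    acquire_data=lambda: {"train": [], "val": []},
    build_model=lambda: 0,
    train_one_round=lambda m, _train: m + 1,
    evaluate=lambda m, _val: -(m - 2) ** 2,
)
best = system.run(rounds=4)
```

With a real network, `train_one_round` usually updates the model in place, so the screening step would need to store a copy or checkpoint of the weights rather than a reference to the live model.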

The working principle is as follows: the invention exploits the encoder-decoder structure of the self-supervised monocular depth estimation model, uses the hierarchical distinction of scene depth to migrate the self-supervised monocular depth estimation network model from scenes such as autonomous driving to construction scenes, and provides reliable evaluation indexes for assessing the migration effect. By constructing a new learning paradigm comprising a self-supervised monocular depth estimation model migration method, a loss function, evaluation indexes and a depth labeling method, it solves the model domain shift problem that arises because no ground-truth depth data are available for self-supervised monocular depth prediction models in construction scenes, and improves depth estimation accuracy in downstream tasks.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The specific embodiments described above further explain the objects, technical solutions and advantages of the present invention in detail. It should be understood that they are only examples of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

1. The construction safety-oriented self-supervision monocular depth estimation transfer learning method is characterized by comprising the following steps of:

and evaluating the performance of the monocular depth estimation network model after each training by using the quantitative evaluation indexes, and screening to obtain the optimal monocular depth estimation network model.

2. The construction safety-oriented self-supervision monocular depth estimation migration learning method of claim 1, characterized in that a labeling frame in the construction site image is labeled by a rectangular bounding box;

3. The construction safety-oriented self-supervision monocular depth estimation transfer learning method of claim 1, wherein the depth loss calculated by the loss function comprises:

and the loss corresponding to the depth sequence among all the labeling frames.

4. The construction safety-oriented self-supervision monocular depth estimation transfer learning method of claim 3, wherein the loss calculation formula caused by the depth deviation is specifically:

wherein L_dd denotes the loss caused by the depth deviation; N denotes the number of labeling boxes in the current image; D'_i denotes the region of D' inside the i-th labeling box of the current image; D' denotes the normalized version of the depth map predicted by the depth estimation network for the current image; and σ(D'_i) denotes the standard deviation of the random variable D'_i;

the calculation formula of the depth map is specifically as follows:

5. The construction safety-oriented self-supervision monocular depth estimation migration learning method of claim 3, wherein the loss calculation formula corresponding to the depth order is specifically:

wherein L_do denotes the loss corresponding to the depth order; N denotes the number of labeling boxes in the current image; v_ij denotes the inter-box distance estimation loss between the i-th and j-th labeling boxes; i and j are integers denoting the serial numbers of the labeling boxes in the image, and a larger serial number indicates a larger average depth value of the labeling box;

in order to avoid perturbations added by trivial solutions.

6. The construction safety-oriented self-supervision monocular depth estimation migration learning method according to any one of claims 1 to 5, characterized in that the migration training process of the monocular depth estimation network model specifically comprises:

loading a training picture and a label of a data set;

training a PackNet model, and constraining Decoder module parameters of the model by adopting a migration loss function;

7. The construction safety-oriented self-supervision monocular depth estimation migration learning method according to any one of claims 1 to 5, characterized in that the performance evaluation process of the monocular depth estimation network model specifically comprises:

8. A construction safety-oriented self-supervision monocular depth estimation transfer learning system is characterized by comprising:

9. A computer terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for construction safety oriented self-supervision monocular depth estimation migration learning according to any one of claims 1-7 when executing the program.

10. A computer-readable medium, on which a computer program is stored, the computer program being executable by a processor to implement the method for self-supervised monocular depth estimation migration learning for construction safety according to any one of claims 1 to 7.