CN116433856B - Three-dimensional reconstruction method and system for lower scene of tower crane based on monocular camera - Google Patents


Info

Publication number
CN116433856B
Authority
CN
China
Prior art keywords
tower crane
monocular camera
relative pose
semantic
reconstruction
Prior art date
Legal status
Active
Application number
CN202310148891.4A
Other languages
Chinese (zh)
Other versions
CN116433856A (en)
Inventor
安民洙
米文忠
房新奥
郭振威
Current Assignee
Guangdong Light Speed Intelligent Equipment Co ltd
Tenghui Technology Building Intelligence Shenzhen Co ltd
Original Assignee
Guangdong Light Speed Intelligent Equipment Co ltd
Tenghui Technology Building Intelligence Shenzhen Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Light Speed Intelligent Equipment Co ltd, Tenghui Technology Building Intelligence Shenzhen Co ltd filed Critical Guangdong Light Speed Intelligent Equipment Co ltd
Priority to CN202310148891.4A
Publication of CN116433856A
Application granted
Publication of CN116433856B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05: Geographic models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention provides a monocular-camera-based three-dimensional reconstruction method and system for the scene below a tower crane. The method comprises the following steps: acquiring target images of adjacent frames on a time axis and performing deep-learning-based target semantic segmentation to obtain several segmentation instance results, where each segmentation instance result comprises feature information and semantic information; adopting a multi-stage solving mode: first making a two-dimensional plane assumption on the relative motion of the monocular camera according to the tower crane motion parameters and calculating an initial rotation and translation matrix R₀ and T₀ between the two target images, then calculating the relative pose R and T between the two frames by feature-information matching to obtain the relative pose of the monocular camera; performing error redistribution on the inter-frame relative-pose results of the long time sequence in combination with the semantic segmentation loss, homogenizing the accumulated error over each frame to obtain the error-optimized relative pose; and performing dense reconstruction according to the error-optimized relative pose and the semantic information. The invention has the advantages of small computation, high efficiency, high stability, high precision and low cost.

Description

Three-dimensional reconstruction method and system for lower scene of tower crane based on monocular camera
Technical Field
The invention belongs to the technical field of map reconstruction, and particularly relates to a method and a system for reconstructing a scene under a tower crane based on a monocular camera.
Background
In the field of tower crane construction, the topography and scene information below the large arm of the tower crane has important significance for the operation safety of the tower crane. How to perform stable reconstruction of the topography and the scene below the tower crane is an important problem in the aspects of active safety and automatic driving of the tower crane.
At present, three-dimensional reconstruction of a scene below a tower crane mainly comprises three technical approaches: 1) A binocular camera-based method; 2) A lidar-based method; 3) Methods based on monocular cameras and sequence data analysis.
The binocular-camera-based method is the most classical vision-based terrain reconstruction approach, but a binocular camera requires strict relative-pose calibration before use, and its baseline length must be set according to the height of the camera above the scene, which imposes significant limitations.
The lidar-based method has the best stability and precision of the three, but a lidar that can cover a tower crane operation scene is very expensive, costing 100 or even 1000 times as much as a camera. Moreover, the field of view of a lidar is generally small, so several lidars must work cooperatively, which further increases the cost and hinders large-scale popularization and application.
The method based on a monocular camera is the cheapest of the three and imposes the fewest restrictions on installation and use. However, it places high demands on the reconstruction algorithm: accurate sequence-data analysis and the solution of a large-scale nonlinear optimization problem are required, so its stability is poor, and the reconstruction error grows and the accuracy worsens as the image sequence lengthens. The many moving targets and rapid scene changes of a tower crane construction site make sequence-image analysis even more challenging, further increasing the difficulty of applying monocular-camera sequence data to the reconstruction of tower crane construction scenes.
Therefore, how to complete the three-dimensional reconstruction of the scene below the tower crane efficiently and with small error using a monocular camera is a technical problem that remains to be solved in the art.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a monocular-camera-based three-dimensional reconstruction method and system for the scene below a tower crane, mainly to solve the problem that the prior art cannot complete such reconstruction with a monocular camera both efficiently and with small error.
In order to achieve the above object, in a first aspect, the present invention provides a monocular-camera-based three-dimensional reconstruction method for the scene below a tower crane, where the monocular camera is installed below the tower crane trolley and used for capturing images of the scene beneath it, the method comprising the following steps:
s10, acquiring target images of adjacent frames on a time axis, and performing target semantic segmentation based on deep learning to obtain a plurality of segmentation example results, wherein the segmentation example results comprise characteristic information and semantic information;
s20, adopting a multistage solving mode, firstly carrying out two-dimensional plane assumption on the relative motion of the monocular camera according to tower crane motion parameters, and calculating a rotation and translation matrix R between two target images 0 And T 0 Calculating relative positions R and T between two frames of target images by utilizing characteristic information matching to obtain the relative pose of the monocular camera;
s30, combining semantic segmentation loss, carrying out error redistribution on the inter-frame relative pose results of the long-time sequence, homogenizing the accumulated errors to each frame, and obtaining the relative pose after error optimization;
and S40, performing dense reconstruction according to the relative pose and semantic information after error optimization to obtain a scene reconstruction result below the tower crane.
In some embodiments, the tower crane motion parameters comprise at least one of the radial distance of the tower crane trolley, the hook height and the boom slew angle.
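The first-stage planar pose prediction from these motion parameters can be illustrated with a short sketch. The geometry below (the camera yawing with the boom, the translation split into the trolley's radial travel and the slew arc) is an illustrative assumption, not a formula given in the patent:

```python
import numpy as np

def planar_pose_guess(delta_theta, delta_r, trolley_radius):
    """Predict inter-frame camera motion under the 2-D plane assumption.

    delta_theta   : boom slew angle between the two frames (rad)
    delta_r       : radial travel of the trolley along the boom (m)
    trolley_radius: current radial distance of the trolley (m)

    Returns a rotation matrix R0 (yaw about the vertical axis only) and a
    translation T0 (radial plus tangential motion, no vertical component).
    """
    c, s = np.cos(delta_theta), np.sin(delta_theta)
    # rotation about the vertical axis only: the planar assumption
    R0 = np.array([[c, -s, 0.0],
                   [s,  c, 0.0],
                   [0.0, 0.0, 1.0]])
    # small-angle arc length approximates the tangential displacement
    T0 = np.array([delta_r, trolley_radius * delta_theta, 0.0])
    return R0, T0
```

These metre-level R₀ and T₀ then serve only as the starting constraint for the feature-based correction stages.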
In some embodiments, while the computation of step S20 is performed, the alignment of feature information and semantic information across the several segmentation instance results is maintained to help constrain the relative-pose computation.
In some embodiments, in step S20, the multi-stage solution includes a first-stage solution and a second-stage solution;
at the first stage of solvingWhen the tower crane motion parameter is used as an initial value, two-dimensional plane assumption is made on the relative motion of the monocular camera, and a rotation and translation matrix R between two target images is calculated according to the initial value 0 And T 0 Downsampling the image with magnification, extracting single-scale feature points, selecting a local window, and performing R 0 、T 0 Under the constraint of the segmentation example result, matching the characteristic points, constructing homonymy point pairs, calculating affine transformation parameters, and completing the relative pose R under the planar assumption 0 And T 0 Is corrected to obtain R 1 And T 1
in the second-stage solution, feature points are extracted at the original-resolution layer of the target image to obtain multi-scale feature points; under the constraint of R₁ and T₁ the feature points are matched, and the relative pose R and T between the two frames of target images is calculated by feature-information matching to obtain the relative pose of the monocular camera.
In some embodiments, no two-dimensional plane assumption is made when solving for R and T; when the relative pose R and T is calculated, the three-dimensional position and rotation are solved according to the interior orientation parameters of the monocular camera to obtain the relative pose of the monocular camera.
In some embodiments, in step S30, a modified bundle adjustment algorithm is used in which a semantic segmentation loss is added to the objective function, the semantic segmentation loss being the Euclidean distance of the static-target semantic boundary and the Euclidean distance of the dynamic-target semantic boundary, as shown in the following formula:
g(c,x,instance)=∑‖q−p(c,x)‖+λ∑‖ins−ins′‖
In the above formula, ∑‖q−p(c,x)‖ is the standard bundle adjustment term computing the re-projection error of the key points; ∑‖ins−ins′‖ is the re-projection error of the static-target semantic boundary points between adjacent frames; λ is a penalty factor.
In some embodiments, the Euclidean distance of the dynamic-target semantic boundary is required to be less than a dynamic boundary threshold, and the penalty factor λ is 0.2.
In some embodiments, step S40 includes the following steps:
acquiring homonymous (epipolar) lines according to the error-optimized relative pose and the basic principles of photogrammetric epipolar geometry;
matching a first processing window for each pixel on the homonymous line; semantic information is introduced into the matching relation so that only first processing windows with corresponding semantic information on the homonymous line are matched, and the optimal homonymous point is searched within the first processing window using the normalized gray-scale correlation coefficient as the similarity measure;
and performing dense reconstruction to obtain the reconstruction result of the scene below the tower crane.
In some embodiments, in step S40, semantic information with planar characteristics is identified; for such instances only boundary matching of the segmentation instance result is performed, the interior of the instance is handled under a plane hypothesis, and per-pixel homonymous-point matching is not performed.
In a second aspect, the present invention provides a system for the above monocular-camera-based three-dimensional reconstruction method of the scene below a tower crane, comprising:
the monocular camera is arranged below the tower crane trolley and is used for acquiring a lower image;
the semantic segmentation module, used for acquiring target images of adjacent frames on a time axis and performing deep-learning-based target semantic segmentation to obtain several segmentation instance results, wherein the segmentation instance results comprise feature information and semantic information;
the relative pose solving module, used for adopting the multi-stage solving mode: first making a two-dimensional plane assumption on the relative motion of the monocular camera according to the tower crane motion parameters and calculating the initial rotation and translation matrix R₀ and T₀ between the two target images, then calculating the relative pose R and T between the two frames of target images by feature-information matching to obtain the relative pose of the monocular camera;
the error optimization processing module is used for carrying out error redistribution on the inter-frame relative pose results of the long-time sequence by combining semantic segmentation loss, homogenizing the accumulated errors to each frame, and obtaining the relative pose after error optimization;
and the three-dimensional reconstruction module is used for carrying out dense reconstruction according to the relative pose and semantic information after error optimization to obtain a scene reconstruction result below the tower crane.
Compared with the prior art, the invention has the beneficial effects that at least:
image acquisition is carried out by using a monocular camera, target semantic segmentation is carried out based on deep learning, constraints are provided for relative pose calculation and dense reconstruction of continuous frames, stability is improved, calculated amount is reduced, cost is low, and popularization and application are facilitated;
the method is characterized in that a multistage solving mode is adopted, the relative pose of a camera in an image sequence is calculated based on semantic information and image characteristic information, tower crane information and semantic information are used for restraining, the method is more reliable than the method which uses image characteristic points alone, the relative pose is used as an initial value of dense reconstruction, and the method has a decisive effect on the final dense reconstruction effect;
aiming at the problem of error accumulation in the relative pose calculation of a long-time sequence, the error redistribution is carried out by combining the semantic segmentation loss, and the accumulated errors are homogenized to each frame, so that the accuracy of the relative pose calculation result is improved, and the accuracy of scene reconstruction is ensured.
Drawings
The invention will be further described with reference to the accompanying drawings, in which embodiments do not constitute any limitation of the invention, and other drawings can be obtained by one of ordinary skill in the art without inventive effort from the following drawings.
Fig. 1 is a flow chart of a three-dimensional reconstruction method of a scene under a tower crane based on a monocular camera according to an embodiment.
FIG. 2 is a schematic diagram of a deployment location of a monocular camera under an embodiment.
Fig. 3 is a flow chart of a three-dimensional reconstruction method of a scene under a tower crane based on a monocular camera according to another embodiment.
FIG. 4 is a schematic diagram of a process for object semantic segmentation of an object image in one embodiment.
Fig. 5 is a schematic diagram of a result of three-dimensional reconstruction of a scene under a tower crane in one embodiment.
Fig. 6 is a schematic diagram of a three-dimensional reconstruction system of a scene under a tower crane based on a monocular camera according to an embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Referring to fig. 1 to 2, in a first aspect, the present invention provides a three-dimensional reconstruction method of a scene under a tower crane based on a monocular camera, wherein the monocular camera is installed under a tower crane trolley and is used for acquiring an image of the lower part, and the method comprises the following steps:
s10, acquiring target images of adjacent frames on a time axis, performing target semantic segmentation based on deep learning, wherein each frame of target image is segmented to obtain a plurality of segmentation example results, and the target images comprise static targets and dynamic targets, so that the segmentation example results comprise the static segmentation results and the dynamic segmentation results, the segmentation example results comprise characteristic information and semantic information, and boundaries of each segmentation result are obtained after the target semantic segmentation;
s20, adopting a multi-stage solving mode, firstly according to the movement of the tower craneParameters two-dimensional plane assumption of relative motion of monocular camera, calculating rotation and translation matrix R between two target images 0 And T 0 In this stage of solution, a two-dimensional plane assumption is made, and the rotation and translation matrix R between two target images is easily calculated due to the addition of tower crane motion parameters 0 And T 0 The method comprises the steps of carrying out a first treatment on the surface of the Still further, in the next level of solution, using feature information matching, at R 0 And T 0 Under the constraint of the (2), calculating the relative positions R and T between two frames of target images to obtain the relative pose of the monocular camera;
s30, combining semantic segmentation loss, carrying out error redistribution on the inter-frame relative pose results of the long-time sequence, homogenizing the accumulated errors to each frame, and obtaining the relative pose after error optimization; because the three-dimensional reconstruction is a long-term continuous dynamic process, when the acquired target image sequence is longer and longer, error accumulation exists for the relative pose result calculated according to the adjacent images, the longer the sequence is, the larger the accumulated error is, if error redistribution is not carried out, the front data reconstruction precision is good, and the rear data reconstruction precision is poor, so that the semantic segmentation loss is combined in the step, the error is redistributed, and each frame is uniformly distributed as far as possible;
and S40, performing dense reconstruction according to the relative pose and semantic information after error optimization to obtain a scene reconstruction result below the tower crane.
According to the method, while the tower crane trolley moves and the boom slews, an image sequence of the scene below the tower crane is acquired with the monocular camera, from which target images of adjacent frames on the time axis are obtained. Note that, to optimize computational efficiency, the monocular camera can be set to capture images at a fixed frequency, for example once every 3 seconds, so that the time interval between two adjacent target frames is 3 seconds; each three-dimensional reconstruction step then also takes 3 seconds and overlaps the capture of the next pair of images, which effectively improves efficiency and reduces redundant computation. After target semantic segmentation of the target images, the boundaries of dynamic and static targets are identified; the relative pose of the camera is then solved based on the image feature information and the segmentation results; the relative-pose results of the long image sequence undergo error redistribution and global adjustment optimization; and finally dense reconstruction is performed according to the relative pose and the semantic information. Three-dimensional reconstruction of the scene below the tower crane during dynamic operation can thus be completed with only one monocular camera, without complicated sensor calibration and at extremely low cost, which favors large-scale popularization and application. Because tower crane motion parameters, semantic-information constraints and error redistribution are all taken into account during computation, reconstruction stability and accuracy are improved while the amount of computation is reduced and efficiency is increased.
As one implementation, the tower crane motion parameters include at least one of the radial distance of the tower crane trolley, the hook height and the boom slew angle. The image below the camera changes as the trolley moves, the boom slews and the hook rises or falls: either the terrain of the scene below changes or the height of the lifted object changes. To improve efficiency and precision when calculating the relative pose, the rotation and translation matrix R₀ and T₀ between two images can therefore be calculated rapidly under the two-dimensional plane assumption by combining the tower crane motion parameters. Note that the acquisition of the tower crane motion parameters and the first-stage solution with the two-dimensional plane assumption form an integral technique: the first-stage solution uses the tower crane motion parameters directly to calculate R₀ and T₀ quickly, constituting the top-level calculation of the frame pose and providing the basis for the subsequent accurate calculation.
In one embodiment, during the calculation in step S20, the alignment of feature information and semantic information across the segmentation instance results is maintained, that is, corresponding features and semantics in different frames are kept consistent and aligned, to help constrain the relative-pose computation.
Further, in step S10, a U-Net-style segmentation network with cross-layer (skip) connections is adopted, with MobileNetV3 as the backbone. Ablation experiments are performed on the network for the actual scene, and the best balance between computational load and application effect is reached by adjusting the resolution factor, the width factor and the number of network layers; preferably, the width factor is set to 0.75, the resolution factor to 0.5, and the depth to 0.8 of the standard MobileNetV3. Under this configuration, and given the large quantities of loose building materials in a tower crane operation scene (steel bars, steel pipes, wooden battens and the like), a spatially clustered group or stack of loose building materials is labeled as one segmentation instance, and the U-shaped convolutional neural network with cross-layer connections segments the image to obtain the final semantic instance segmentation result.
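For illustration, the width-factor and resolution-factor adjustments can be sketched as follows; the channel-rounding rule is the usual MobileNet "make divisible" convention, assumed here rather than specified by the patent:

```python
def scaled_channels(base_channels, width_factor=0.75, divisor=8):
    """Apply a MobileNet-style width multiplier and round the channel
    count to a multiple of `divisor`, never dropping more than 10%
    below the unrounded value (the standard MobileNet convention)."""
    v = int(base_channels * width_factor + divisor / 2) // divisor * divisor
    v = max(divisor, v)
    if v < 0.9 * base_channels * width_factor:
        v += divisor
    return v

def scaled_resolution(base_hw, resolution_factor=0.5):
    """Apply the resolution factor to an input size (H, W)."""
    return tuple(int(round(d * resolution_factor)) for d in base_hw)
```

With the preferred factors above, a 32-channel layer shrinks to 24 channels and a 1080x1920 frame is fed to the network at 540x960, which is where the computational saving comes from.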
Referring to fig. 3, in the present embodiment, in step S20, the multi-stage solving manner includes a first-stage solving and a second-stage solving;
when the first-stage solution is carried out, the tower crane motion parameter is taken as an initial value, a two-dimensional plane assumption is carried out on the relative motion of the monocular camera, and a rotation and translation matrix R between two target images is calculated according to the initial value 0 And T 0 Because the error of the tower crane motion parameter is larger, the R is generally calculated on the meter level through the tower crane motion parameter and the two-dimensional plane assumption 0 And T 0 In this stage, the image is further downsampled by a magnification factor, for example, 3-5 times, to extract single-scale feature points, the feature point extraction algorithm selects Harris operator, selects a larger local window, and uses the Harris operator to extract the feature points in R 0 、T 0 Under the constraint of a segmentation example result, matching the characteristic points, calculating HOG in a local window by using the corner positions extracted by Harris in the matching, constructing homonymous point pairs by using the Euclidean distance of the HOG as a matching measure, and eliminating that R is not satisfied 0 And T 0 Constrained homonymy point pairs are constructed so far, affine transformation parameters are calculated according to all homonymy point pairs, and relative pose R under plane assumption is completed 0 And T 0 Is corrected to obtain R 1 And T 1
In the second-stage solution, feature points are extracted at the original-resolution layer of the target image, obtaining multi-scale feature points with SIFT. Under the constraint of R₁ and T₁ the feature points are matched, and the relative pose R and T between the two frames of target images is calculated by feature-information matching to obtain the relative pose of the monocular camera. In this second-stage solution no two-dimensional plane assumption is made: when R and T are solved, the three-dimensional position and rotation are solved according to the interior orientation parameters of the monocular camera.
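The patent does not name a solver for R and T; one conventional possibility is to estimate the essential matrix from the matched multi-scale feature points and decompose it into R and T. The sketch below covers only the estimation of E with the normalized 8-point algorithm (decomposition and cheirality checks omitted), and is an assumption, not the patent's stated method:

```python
import numpy as np

def essential_from_matches(pts_a, pts_b, K):
    """Estimate the essential matrix E from matched pixel coordinates.

    pts_a, pts_b : (N, 2) arrays of matched points, N >= 8
    K            : (3, 3) intrinsic matrix (interior orientation parameters)
    The result satisfies x_b^T E x_a ~ 0 for normalized x = K^-1 [u v 1]^T.
    """
    Kinv = np.linalg.inv(K)

    def normalize(p):
        h = np.hstack([p, np.ones((len(p), 1))])
        return (Kinv @ h.T).T

    xa, xb = normalize(pts_a), normalize(pts_b)
    # each correspondence gives one linear constraint on the 9 entries of E
    A = np.column_stack([
        xb[:, 0] * xa[:, 0], xb[:, 0] * xa[:, 1], xb[:, 0],
        xb[:, 1] * xa[:, 0], xb[:, 1] * xa[:, 1], xb[:, 1],
        xa[:, 0], xa[:, 1], np.ones(len(xa)),
    ])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # project onto the essential-matrix manifold: singular values (s, s, 0)
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt
```

In practice the R₁/T₁ constraint from the first stage would prune outlier matches before this linear solve, keeping the estimation stable.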
The first-stage solution is the top-level calculation of the frame pose, and the second-stage solution is its accurate calculation. Using this hierarchical method, constrained by tower crane information and semantic information, pose estimation gains both stability and precision while the amount of computation is reduced.
In this embodiment, in step S30, a modified bundle adjustment algorithm is adopted to perform global adjustment optimization on the relative-pose results. In more detail, a semantic segmentation loss is added to the bundle adjustment objective function; the semantic segmentation loss is defined as the Euclidean distance of the static-target semantic boundary and the Euclidean distance of the dynamic-target semantic boundary, and the final objective function is shown in the following formula:
g(c,x,instance)=∑‖q-p(c,x)‖+λ∑‖ins-ins′‖
In the above formula, ∑‖q−p(c,x)‖ is the standard bundle adjustment term computing the re-projection error of the key points; ∑‖ins−ins′‖ is the re-projection error of the static-target semantic boundary points between temporally adjacent frames; λ is a penalty factor.
In the definition of the semantic segmentation loss, the Euclidean distance of the static-target semantic boundary is required to be as small as possible, achieving spatial alignment as far as possible, while the Euclidean distance of the dynamic-target semantic boundary is only required to be smaller than a dynamic boundary threshold, since a dynamic target need not be spatially aligned but should not undergo drastic spatial jumps; the penalty factor λ is 0.2.
Using the Euclidean distance of the semantic boundary as a regularization term with a penalty factor in the bundle adjustment objective function, error redistribution can be achieved over the relative-pose results of the long image sequence, spreading the accumulated error as evenly as possible over each frame, thereby reducing the error and improving accuracy.
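The objective g(c, x, instance) maps directly onto code. In this minimal sketch the re-projected key points and the corresponding semantic boundary points are assumed to be pre-computed arrays (the boundary-point correspondence step is not shown):

```python
import numpy as np

def semantic_ba_objective(q, p_cx, ins, ins_prime, lam=0.2):
    """g(c, x, instance) = sum ||q - p(c, x)|| + lambda * sum ||ins - ins'||

    q, p_cx        : (N, 2) observed key points and their re-projections
    ins, ins_prime : (M, 2) static-target semantic boundary points in one
                     frame and re-projected from the adjacent frame
    lam            : penalty factor (0.2 in the embodiment above)
    """
    reproj = np.linalg.norm(q - p_cx, axis=1).sum()       # standard BA term
    semantic = np.linalg.norm(ins - ins_prime, axis=1).sum()
    return reproj + lam * semantic
```

A nonlinear least-squares solver would minimize this value over the camera poses c and structure x of the whole sequence, which is what spreads the accumulated error over every frame.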
Referring to fig. 5, in the present embodiment, step S40 includes the following steps:
acquiring homonymous (epipolar) lines according to the error-optimized relative pose and the basic principles of photogrammetric epipolar geometry;
matching a first processing window for each pixel on the homonymous line; semantic information is introduced into the matching relation so that only first processing windows with corresponding semantic information on the homonymous line are matched, and the optimal homonymous point is searched within the first processing window using the normalized gray-scale correlation coefficient as the similarity measure;
and performing dense reconstruction to obtain the reconstruction result of the scene below the tower crane.
Further, since semantic information is introduced into the matching relation, semantic information with planar characteristics, such as roofs and terraces, is identified; for these instances boundary matching of the segmentation instance result is performed directly, the interior of the instance is handled under a plane hypothesis without homonymous-point matching, and jumps in the dense reconstruction of planar areas are thereby suppressed.
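The homonymous-point search with the normalized gray-scale correlation coefficient can be sketched as follows; the window half-size and the representation of the homonymous line as a list of candidate pixel centres are illustrative assumptions:

```python
import numpy as np

def ncc(a, b):
    """Normalized gray-scale correlation coefficient of two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / d) if d > 0 else 0.0

def best_homonymous_point(ref_patch, img, line_pts, win=5):
    """Scan candidate centres along the homonymous (epipolar) line and keep
    the one whose window maximizes NCC against the reference window."""
    best, best_score = None, -2.0
    for x, y in line_pts:
        patch = img[y - win:y + win + 1, x - win:x + win + 1]
        if patch.shape != ref_patch.shape:
            continue  # window falls off the image border
        score = ncc(ref_patch, patch)
        if score > best_score:
            best, best_score = (x, y), score
    return best, best_score
```

The semantic constraint of the text would simply filter `line_pts` to pixels carrying the same instance label before this scan, shrinking the search and avoiding mismatches across instance boundaries.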
As shown in fig. 5, the left side is a diagram of the computation during reconstruction; after dense reconstruction, the reconstruction result shown on the right is obtained.
As one embodiment, since various terrains and building materials exist in the scene below the tower crane, including both horizontal and vertical construction surfaces, the static targets are subdivided during the deep-learning-based target semantic segmentation: the targets closest to the tower crane target are identified and the identified segmentation instance results are labeled, for example, targets less than 10 m from the tower crane target are labeled class 1, targets between 10 m and 30 m are labeled class 2, and targets more than 30 m away are labeled class 3. The label belongs to the semantic information, and different semantic information is distinguished by the priority of its label; during dense reconstruction, regions whose semantic information has higher priority are reconstructed first, forming a reconstruction order from the center to the periphery, i.e. from high priority to low.
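The distance-based labeling and the resulting reconstruction order can be sketched as follows; the 10 m / 30 m thresholds follow the example above, while the instance list and function names are illustrative assumptions:

```python
def distance_label(dist_m):
    """Label a segmentation instance by its distance to the tower crane
    target: < 10 m -> class 1, 10-30 m -> class 2, > 30 m -> class 3."""
    return 1 if dist_m < 10 else (2 if dist_m < 30 else 3)

def reconstruction_order(instances):
    """Sort instances so that higher-priority (lower class number) regions
    are densely reconstructed first; 'instances' is a hypothetical list of
    (name, distance_m) pairs."""
    return [name for name, d in
            sorted(instances, key=lambda item: distance_label(item[1]))]
```

This yields the center-outward schedule described above: the area directly under the hook is reconstructed before the far periphery.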
Referring to fig. 6, in a second aspect, the present invention provides a system applying the above monocular-camera-based three-dimensional reconstruction method for the scene below a tower crane, comprising:
the monocular camera, which is arranged below the tower crane trolley and is used for continuously acquiring images of the scene below;
the semantic segmentation module, which is used for acquiring target images of adjacent frames on the time axis and performing deep-learning-based target semantic segmentation to obtain a plurality of segmentation instance results, wherein the segmentation instance results comprise feature information and semantic information;
the relative pose solving module, which adopts a multi-stage solving mode: first, a two-dimensional plane assumption is made on the relative motion of the monocular camera according to the tower crane motion parameters and the rotation and translation matrices R0 and T0 between the two target images are calculated; then the relative pose R and T between the two frames of target images is calculated using feature information matching, obtaining the relative pose of the monocular camera;
the error optimization processing module, which is used for performing error redistribution on the inter-frame relative pose results of the long sequence in combination with the semantic segmentation loss, homogenizing the accumulated error over every frame and obtaining the error-optimized relative pose;
and the three-dimensional reconstruction module, which is used for performing dense reconstruction according to the error-optimized relative pose and the semantic information to obtain the scene reconstruction result below the tower crane.
All of the above modules are used to implement the three-dimensional reconstruction method for the scene below the tower crane in the above embodiments; the specific implementations are not described here again.
Compared with the prior art, the present invention provides a monocular-camera-based three-dimensional reconstruction method and system for the scene below a tower crane. A monocular camera is used for image acquisition, target semantic segmentation is performed based on deep learning, and constraints are provided for the relative pose calculation of consecutive frames and for dense reconstruction, which improves stability, reduces the amount of computation, keeps cost low, and facilitates popularization and application;
a multi-stage solving mode is adopted, the relative pose of the camera over the image sequence is calculated based on semantic information and image feature information, and the tower crane information and semantic information serve as constraints, which is more reliable than using image feature points alone; the relative pose serves as the initial value of dense reconstruction and has a decisive effect on the final dense reconstruction result;
aiming at the problem of error accumulation in relative pose calculation over long sequences, error redistribution is performed in combination with the semantic segmentation loss and the accumulated error is spread evenly over every frame, which improves the accuracy of the relative pose results and ensures the accuracy of the scene reconstruction.
Finally, it should be emphasized that the above-described embodiments are merely preferred embodiments of the invention and do not limit it; any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the invention are intended to be included within its scope.
The above description presents the main flow steps of the invention; other functional steps may be inserted into this flow, and the logical order of the steps may be changed. If the data processing manner is similar to that of the flow steps, or the core idea of the data processing is the same or similar, it should likewise be protected.

Claims (9)

1. A three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera, characterized by comprising the following steps:
S10, acquiring target images of adjacent frames on the time axis, and performing deep-learning-based target semantic segmentation to obtain a plurality of segmentation instance results, wherein the segmentation instance results comprise feature information and semantic information;
S20, adopting a multi-stage solving mode: first making a two-dimensional plane assumption on the relative motion of the monocular camera according to the tower crane motion parameters and calculating the rotation and translation matrices R0 and T0 between the two target images; then calculating the relative pose R and T between the two frames of target images using feature information matching, to obtain the relative pose of the monocular camera;
S30, in combination with the semantic segmentation loss, performing error redistribution on the inter-frame relative pose results of the long sequence, homogenizing the accumulated error over every frame, and obtaining the error-optimized relative pose;
S40, performing dense reconstruction according to the error-optimized relative pose and the semantic information to obtain the scene reconstruction result below the tower crane;
wherein in step S20, the multi-stage solving mode includes a first-stage solution and a second-stage solution;
in the first-stage solution, the tower crane motion parameters are taken as initial values, a two-dimensional plane assumption is made on the relative motion of the monocular camera, and the rotation and translation matrices R0 and T0 between the two target images are calculated from these initial values; the image is downsampled by a magnification factor, single-scale feature points are extracted, a local window is selected, and under the constraints of R0, T0 and the segmentation instance results, the feature points are matched, homonymous point pairs are constructed and affine transformation parameters are calculated, thereby correcting the relative pose R0 and T0 obtained under the plane assumption to yield R1 and T1;
in the second-stage solution, feature points are extracted at the original resolution layer of the target images to obtain multi-scale feature points; under the constraint of R1 and T1, the feature points are matched, and the relative pose R and T between the two frames of target images is calculated using feature information matching, to obtain the relative pose of the monocular camera.
2. The three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera according to claim 1, wherein the tower crane motion parameters comprise at least one of the radial distance of the tower crane trolley, the height of the lifting hook and the slewing angle of the jib.
3. The three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera according to claim 1, wherein during the calculation of step S20, the alignment of the feature information and the semantic information in the plurality of segmentation instance results is maintained to assist in constraining the relative pose calculation.
4. The three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera according to claim 3, wherein the relative pose R and T is solved without a two-dimensional plane assumption, the three-dimensional position and rotation being solved according to the interior orientation parameters of the monocular camera to obtain the relative pose of the monocular camera.
5. The three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera according to claim 4, wherein in step S30 an improved bundle adjustment algorithm is adopted and the semantic segmentation loss is added into the objective function, the semantic segmentation loss being the Euclidean distance of the static-target semantic boundaries and the Euclidean distance of the dynamic-target semantic boundaries, represented by the following formula:
g(c, x, instance) = Σ‖q − p(c, x)‖ + λ Σ‖ins − ins′‖
in the above formula, Σ‖q − p(c, x)‖ is the standard bundle adjustment term computing the reprojection error of the key points; Σ‖ins − ins′‖ is the reprojection error of the static-target semantic boundary points between adjacent frames; λ is the penalty factor.
6. The three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera according to claim 5, wherein the Euclidean distance of the dynamic-target semantic boundaries is smaller than a dynamic boundary threshold, and the penalty factor λ is 0.2.
7. The three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera according to claim 5, wherein step S40 comprises the following steps:
acquiring the corresponding epipolar lines (homonymous lines) according to the error-optimized relative pose and the basic principles of epipolar geometry;
assigning a first processing window to each pixel on the homonymous line, introducing semantic information into the matching relation so that only first processing windows with consistent semantic information on the homonymous line are matched, and searching for the optimal homonymous point within the first processing window using the normalized gray-level correlation coefficient as the similarity measure;
performing dense reconstruction to obtain the scene reconstruction result below the tower crane.
8. The three-dimensional reconstruction method for the scene below a tower crane based on a monocular camera according to claim 7, wherein in step S40, semantic information with planar characteristics is identified, boundary matching of the segmentation instance result is performed, the interior of the segmentation instance result is processed under a plane hypothesis, and homonymous-point matching is not performed therein.
9. A system applying the monocular-camera-based three-dimensional reconstruction method for the scene below a tower crane according to any one of claims 1 to 8, comprising:
the monocular camera, which is arranged below the tower crane trolley and is used for acquiring images of the scene below;
the semantic segmentation module, which is used for acquiring target images of adjacent frames on the time axis and performing deep-learning-based target semantic segmentation to obtain a plurality of segmentation instance results, wherein the segmentation instance results comprise feature information and semantic information;
the relative pose solving module, which adopts a multi-stage solving mode: first, a two-dimensional plane assumption is made on the relative motion of the monocular camera according to the tower crane motion parameters and the rotation and translation matrices R0 and T0 between the two target images are calculated; then the relative pose R and T between the two frames of target images is calculated using feature information matching, obtaining the relative pose of the monocular camera;
the error optimization processing module, which is used for performing error redistribution on the inter-frame relative pose results of the long sequence in combination with the semantic segmentation loss, homogenizing the accumulated error over every frame and obtaining the error-optimized relative pose;
and the three-dimensional reconstruction module, which is used for performing dense reconstruction according to the error-optimized relative pose and the semantic information to obtain the scene reconstruction result below the tower crane.
CN202310148891.4A 2023-02-14 2023-02-14 Three-dimensional reconstruction method and system for lower scene of tower crane based on monocular camera Active CN116433856B (en)


Publications (2)

Publication Number Publication Date
CN116433856A (en) 2023-07-14
CN116433856B (en) 2023-12-05 (grant)

Family

ID=87080447


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416840A (en) * 2018-03-14 2018-08-17 大连理工大学 A kind of dense method for reconstructing of three-dimensional scenic based on monocular camera
CN110827305A (en) * 2019-10-30 2020-02-21 中山大学 Semantic segmentation and visual SLAM tight coupling method oriented to dynamic environment
CN111968129A (en) * 2020-07-15 2020-11-20 上海交通大学 Instant positioning and map construction system and method with semantic perception
CN112509115A (en) * 2020-11-26 2021-03-16 中国人民解放军战略支援部队信息工程大学 Three-dimensional time-varying unconstrained reconstruction method and system for dynamic scene of sequence image
CN113674416A (en) * 2021-08-26 2021-11-19 中国电子科技集团公司信息科学研究院 Three-dimensional map construction method and device, electronic equipment and storage medium
WO2022179690A1 (en) * 2021-02-25 2022-09-01 Telefonaktiebolaget Lm Ericsson (Publ) Map processing device and method thereof
CN115661453A (en) * 2022-10-25 2023-01-31 腾晖科技建筑智能(深圳)有限公司 Tower crane hanging object detection and segmentation method and system based on downward viewing angle camera

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11748913B2 (en) * 2021-03-01 2023-09-05 Qualcomm Incorporated Modeling objects from monocular camera outputs


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Fast 3D Reconstruction Method for UAV Images Based on SLAM"; Song Zhiyong et al.; Technology Innovation and Application; full text *
"3D Scene Reconstruction Based on Monocular Multi-View Images"; Wu Zhengzheng, Kou Zhan; Optics & Optoelectronic Technology (No. 05); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant