CN110021041B - Unmanned scene incremental gridding structure reconstruction method based on binocular camera


Info

Publication number
CN110021041B
CN110021041B (application CN201910156872.XA)
Authority
CN
China
Prior art keywords
scene
visual
frame
gridding
camera
Prior art date
Legal status
Active
Application number
CN201910156872.XA
Other languages
Chinese (zh)
Other versions
CN110021041A (en)
Inventor
朱建科
李昱辰
章国锋
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201910156872.XA
Publication of CN110021041A
Application granted
Publication of CN110021041B

Classifications

    • G06T 7/579: Image analysis; depth or shape recovery from multiple images, from motion
    • G06T 2207/10016: Image acquisition modality; video or image sequence
    • G06T 2207/20228: Special algorithmic details; disparity calculation for image-based rendering

Abstract

The invention discloses a binocular-camera-based incremental gridding structure reconstruction method for unmanned-driving scenes. An input binocular video frame is preprocessed, its disparity map is computed, and the corresponding depth map is derived from the disparity map; an initial scene gridding structure is then built through triangulation and mesh subdivision stages. After mesh "spiking" artifacts are eliminated, the mesh structure is smoothed. Positioning information acquired by satellite navigation equipment is used to compute the motion between the current frame and the previous frame; combined with the current frame's scene gridding structure reconstruction result, the new mesh patches to be added to the global scene gridding structure are computed, and the splicing of the scene gridding structures is finally completed by a local re-triangulation method. The invention innovatively uses the visual difference between two image frames to find newly visible areas and incrementally updates the overall scene reconstruction result in gridding-structure form, yielding a three-dimensional scene reconstruction method with better performance and higher robustness across a variety of environments.

Description

Unmanned scene incremental gridding structure reconstruction method based on binocular camera
Technical Field
The invention belongs to the technical field of three-dimensional scene reconstruction, and particularly relates to an unmanned scene incremental gridding structure reconstruction method based on a binocular camera.
Background
Unmanned vehicles such as self-driving cars and autonomous drones are developing rapidly, and for them automatic navigation and path planning during driving or flight are key technical links. Almost all automatic navigation and path planning techniques depend on some type of scene structure map, and conventional scene reconstruction generally adopts an offline reconstruction scheme based on various kinds of high-precision hardware. For example, in the field of autonomous driving, researchers often collect and reconstruct scene structure data with a manned vehicle carrying a high-precision RTK (real-time kinematic, carrier-phase differential) device and a multi-beam LiDAR. This acquisition method has two drawbacks: 1) the constructed scene structure consists of dense laser point clouds and is essentially a scattered-point structure, whereas real scene structures consist of continuous lines or surfaces, so a scattered-point structure loses important continuous structural information; 2) the scheme requires high hardware cost and is not affordable for all research institutions and development teams.
To address the shortcomings of offline reconstruction schemes based on high-precision hardware, and with the development of SLAM (simultaneous localization and mapping), some advanced SLAM techniques can directly construct a dense scene map through state estimation; however, this approach still has defects that are currently hard to solve perfectly: 1) the constructed scene structure consists of dense colored point clouds and is essentially a scattered-point structure, which again loses important continuous structural information; 2) owing to the diversity of scenes and various unexpected events within a scene, even the most advanced SLAM techniques cannot maintain excellent state estimation in all types of environments, so a dense scene map built entirely by SLAM often cannot reach satisfactory accuracy.
In summary, the present invention needs to solve the following problems:
1. Discontinuous scene structure reconstruction results: the dense maps built by many conventional reconstruction schemes are scattered-point "pseudo-dense" maps, which clearly cannot directly support tasks such as automatic navigation and path planning for unmanned vehicles;
2. Excessive reconstruction cost: schemes based on high-precision RTK devices and multi-beam LiDAR require very expensive hardware such as high-precision satellite navigation equipment and 64-beam LiDAR; for most research and development teams this hardware cost is a significant economic burden, and for companies such a scheme obviously cannot meet mass-production requirements because of its cost;
3. Insufficient accuracy of reconstruction results: many conventional schemes, especially those based on motion-information solving techniques, often produce scene structures of unsatisfactory accuracy; among the many causes, a fundamental one is that a single type of algorithm usually cannot produce satisfactory results in all situations, a widely inherent disadvantage of purely algorithmic solutions.
Disclosure of Invention
In order to solve the problems in the background art, the invention combines stereoscopic vision, triangulation, mesh subdivision, mesh optimization and related techniques to develop a complete incremental scene gridding structure reconstruction method that maintains high stability in outdoor open environments, and the effectiveness of the system has been verified by a large number of experiments.
The technical scheme adopted by the invention comprises the following steps:
1) A single-frame binocular video frame of the scene captured by the vehicle-mounted binocular camera is input, and scene gridding structure reconstruction is performed on the input frame to obtain mesh structure features.
The method applies to unmanned driving and to other indoor or outdoor scenes with distinct visual texture features; the unmanned-driving scenes specifically include common road scenes such as urban roads, rural roads and expressways.
The step 1) is specifically as follows:
1.1) Visual feature points are extracted separately on the two images of a single-frame binocular video frame acquired by the binocular camera. Various visual feature point schemes are possible, such as FAST, ORB or BRIEF feature points, chosen according to the environmental characteristics, the system frame-rate requirements, the hardware devices and so on; the best-performing feature type may differ between environment types. The extracted FAST, ORB or BRIEF feature points serve as the visual feature points.
1.2) Visual feature point matching stage: the visual feature points obtained in step 1.1) are matched by a brute-force matching method to obtain visual feature point pairs, and the disparity value of each pair is calculated;
1.3) visual feature point depth estimation stage: establishing a camera coordinate system by taking the initial position of the left eye camera as a coordinate origin, and calculating the three-dimensional position of the visual feature point of the left eye image in the visual feature point pair in the camera coordinate system according to the parallax value of the visual feature point pair in the step 1.2), wherein the specific calculation process is as follows:
1.3.1) calculating the depth value of each visual characteristic point, thereby obtaining a sparse depth map of the left eye image:
z = f · b / d
wherein d is the disparity value, b is the baseline length of the binocular camera, f is the camera focal length, and z is the depth value;
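For illustration, a minimal sketch of this depth-from-disparity relation (function and variable names here are illustrative, not taken from the patent):

```python
import numpy as np

def disparity_to_depth(d, baseline, focal):
    """Depth from disparity on a rectified stereo pair: z = f * b / d.
    Non-positive disparities are mapped to infinite depth."""
    d = np.asarray(d, dtype=np.float64)
    z = np.full(d.shape, np.inf)
    valid = d > 0
    z[valid] = focal * baseline / d[valid]
    return z
```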
1.3.2) calculating the three-dimensional position of the visual characteristic point of the left eye image in a camera coordinate system:
X = (u − c_x) · z / f_x
Y = (v − c_y) · z / f_y
Z = z
P_w = z · K⁻¹ · P_uv, where P_uv = (u, v, 1)ᵀ and K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
wherein K is the internal parameter matrix of the binocular camera; P_uv is the homogeneous coordinate of the visual feature point in the image pixel coordinate system, a two-dimensional coordinate system whose origin is the upper-left corner of the image; P_w is the coordinate of the visual feature point in the camera coordinate system; u and v are the positions of the visual feature point on the horizontal and vertical coordinate axes of the image pixel coordinate system; and X, Y and Z are the positions of the visual feature point on the x, y and z axes of the camera coordinate system;
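The back-projection P_w = z · K⁻¹ · P_uv can be sketched in the same spirit, assuming a standard pinhole intrinsic matrix K (the helper name is hypothetical):

```python
import numpy as np

def backproject(u, v, z, K):
    """Back-project pixel (u, v) with depth z into the camera frame:
    P_w = z * K^-1 * P_uv, with P_uv the homogeneous pixel coordinate."""
    P_uv = np.array([u, v, 1.0])
    return z * (np.linalg.inv(K) @ P_uv)
```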
1.4) Triangulation stage: the interior of the visual feature point set, i.e. the set of visual feature points on the left-eye image, is triangulated by a Delaunay triangulation method to obtain a triangular mesh structure;
1.5) Mesh subdivision stage: depth interpolation is performed over the current frame's triangular mesh structure according to the mesh generated in step 1.4) and the depth value of each mesh vertex; the visual feature points on the left-eye image in the top 5% of depth error over the whole mesh, measured with Census descriptors, are iteratively selected as feature points to be updated, re-matched according to the Hamming distance, and added to the visual feature point set; Delaunay triangulation is then performed again inside the enlarged set, yielding the subdivided scene gridding structure.
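The triangulation and re-triangulation of the feature point set can be sketched with an off-the-shelf Delaunay routine; the snippet below assumes SciPy is available and is only a sketch of the idea, not the disclosed embodiment:

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate(points_2d):
    """Delaunay triangulation of the visual feature points on the left image.
    points_2d: (N, 2) array of (u, v) pixel coordinates.
    Returns an (M, 3) array of vertex indices, one row per triangular patch."""
    return Delaunay(np.asarray(points_2d, dtype=np.float64)).simplices

# subdivision then amounts to appending the re-matched points and triangulating again:
# tri = triangulate(np.vstack([points_2d, new_points_2d]))
```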
2) The positions of the mesh "spiking" phenomenon in the scene gridding structure are found using the mesh structure features, the spiking is eliminated, and the smoothness of the whole scene gridding structure is improved by an approximate Laplacian smoothing method.
The "spiking" phenomenon refers to a phenomenon that the depths of a single or few grid vertexes appearing at local positions of the scene gridding structure are excessively different from those of surrounding grids, and is one of the most main reasons for inaccurate estimation of the whole scene gridding structure.
The step 2) is specifically as follows:
2.1) grid "spiking" removal stage: processing each mesh vertex on the scene meshing structure in a traversal mode, and replacing the depth of each mesh vertex by the average depth of each adjacent vertex when the depth of each mesh vertex is larger than or smaller than the depth of each adjacent mesh vertex and the absolute value of the difference between the average depth of each adjacent vertex and the average depth of each adjacent vertex is larger than a threshold value;
2.2) Mesh structure smoothing stage: the mesh vertices in the scene gridding structure are smoothed one by one with an approximate Laplacian smoothing method, computed as follows:
P_o = (((1 − α) · Z_c + α · Z_n) / Z_c) · P_w
wherein Z_c is the depth value of the mesh vertex to be processed, Z_n is the average depth value of all adjacent vertices of that vertex, alpha is a manually set damping parameter, P_w is the position of the vertex in the camera coordinate system, and P_o is the optimized position of the vertex in the camera coordinate system.
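Read this way, the smoothing relaxes the vertex depth toward the neighbour average while moving the vertex only along its viewing ray, so its pixel position is preserved; a one-function sketch under that reading:

```python
import numpy as np

def smooth_vertex(P_w, Z_c, Z_n, alpha):
    """Approximate Laplacian smoothing of one vertex (P_w: 3-vector in the
    camera frame). The damped depth update keeps the projection fixed."""
    Z_o = (1.0 - alpha) * Z_c + alpha * Z_n  # damped depth toward neighbour mean
    return np.asarray(P_w) * (Z_o / Z_c)     # scale along the viewing ray
```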
3) Position information of the scene is acquired by the vehicle-mounted satellite navigation equipment; the motion between the previous frame and the current frame is computed from this position information, giving a transformation matrix describing the continuous inter-frame motion, and a new visual area is then detected. The detection is: a virtual image is constructed from the transformation matrix and the previous frame's scene gridding structure, and the new visual area of the current frame relative to the previous frame is determined from the visual difference between the virtual image and the current frame.
The method for detecting the new visual area in step 3) is to detect the new visual area in the current frame in a virtual image-based manner, and specifically includes the following steps:
each pixel point of a left eye image in a previous single-frame binocular video frame is processed as follows to construct a virtual image:
z_c · P_c = K · T · (z_p · K⁻¹ · P_p)
wherein P_p is the homogeneous coordinate of the pixel in the pixel coordinate system of the left-eye image of the previous single-frame binocular video frame, z_p is the depth value of the pixel at the previous-frame moment, P_c is the homogeneous coordinate of the pixel in the pixel coordinate system of the left-eye image of the current single-frame binocular video frame, z_c is the depth value of the pixel at the current-frame moment, T is the motion matrix corresponding to the inter-frame motion information, and K is the internal parameter matrix of the binocular camera;
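A sketch of this forward warp for a single pixel, assuming T is the 4x4 homogeneous motion matrix from the previous to the current camera (the matrix layout is an assumption):

```python
import numpy as np

def warp_to_current(P_p_uv, z_p, T, K):
    """Forward-warp one pixel of the previous left image into the current
    frame, following z_c * P_c = K * T * (z_p * K^-1 * P_p)."""
    ray = z_p * (np.linalg.inv(K) @ np.array([P_p_uv[0], P_p_uv[1], 1.0]))
    X = T @ np.append(ray, 1.0)      # 3D point carried into the current frame
    z_c = X[2]
    P_c = (K @ X[:3]) / z_c          # homogeneous pixel coordinate (u, v, 1)
    return P_c[:2], z_c
```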
the method comprises the steps of utilizing a Navier-Stokes equation to carry out image restoration on a virtual image, then comparing visual differences on each triangular patch area in the restored virtual image and a current frame left-eye image to obtain a visual difference value, wherein the specific calculation mode of the visual difference value is as follows:
D = (1/n) · Σ_{i=1..n} |g_pi − g_ci|
wherein n is the total number of pixels in the triangular area, g_pi is the gray value at the i-th pixel position in the virtual image, g_ci is the gray value at the i-th pixel position in the current frame's left-eye image, and D is the visual difference value;
finally, the triangular patches whose visual difference value is higher than the average visual difference value are selected as component patches of the new visual area, completing its detection; the average visual difference value is the mean of the visual difference values of all triangular patch areas compared between the restored virtual image and the current frame's left-eye image.
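A sketch of the per-patch visual difference D, assuming 8-bit grayscale images and OpenCV for rasterizing the triangle mask (helper names are hypothetical):

```python
import numpy as np
import cv2

def patch_difference(virtual_gray, current_gray, triangle):
    """Mean absolute gray-level difference D over one triangular patch.
    triangle: (3, 2) array of the patch's pixel-coordinate vertices."""
    mask = np.zeros(current_gray.shape, dtype=np.uint8)
    cv2.fillConvexPoly(mask, triangle.astype(np.int32), 1)
    px = mask.astype(bool)
    diff = np.abs(virtual_gray[px].astype(np.float64) - current_gray[px])
    return diff.mean() if diff.size else 0.0

# patches whose D exceeds the mean D over all patches form the new visual area
```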
4) The position of the local scene gridding structure to be updated is determined from the accumulated motion of the current frame's camera position relative to the initial frame's camera position and the scene patches in the constructed overall scene gridding structure, combined with the new visual area from step 3).
The constructed overall scene gridding structure is the scene gridding structure built from all frames before the current frame; the overall scene gridding structure is the one built from the current frame together with all frames before it.
The step 4) is as follows: projecting a triangular patch in the constructed overall scene gridding structure onto a current frame by adopting the following calculation method:
z · P_uv = K · T · P_w
wherein P_w is the position of the triangular patch vertex in the world coordinate system with the first frame as origin, z is the depth, T is the accumulated motion information of the current frame's camera position relative to the initial frame's camera position, K is the internal parameter matrix of the binocular camera, and P_uv is the homogeneous coordinate, in the pixel coordinate system, of the patch vertex projected onto the current frame;
in the constructed overall scene gridding structure, the triangular patches that overlap or partially overlap the new visual area in the current frame and lie within 5 meters' spatial distance of that area are regarded as the positions of the local scene gridding structure to be updated.
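A vectorized sketch of this projection, under the same assumption that the accumulated motion T is a 4x4 homogeneous matrix:

```python
import numpy as np

def project_patches(V_w, T, K):
    """Project global-map patch vertices into the current frame at once,
    following z * P_uv = K * T * P_w; V_w is an (N, 3) array of world points."""
    Vh = np.hstack([V_w, np.ones((len(V_w), 1))])  # homogeneous world points
    Xc = (T @ Vh.T).T[:, :3]                       # points in the current camera frame
    uvw = (K @ Xc.T).T
    return uvw[:, :2] / uvw[:, 2:3], Xc[:, 2]      # pixel coordinates and depth z
```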
5) Incremental splicing of the scene gridding structure: after the new visual area is associated with the positions of the local scene gridding structure to be updated, the overall scene gridding structure is stitched by a Delaunay triangulation method.
The incremental splicing in step 5) is specifically: a connected-region search is performed over all triangular patches in the current frame's new visual area and all connected regions are sorted out; each connected region, together with the positions of the local scene gridding structure to be updated projected onto the current frame's left image, forms a position set over which Delaunay triangulation is performed again to update the overall scene gridding structure.
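The connected-region search can be sketched as a breadth-first traversal over patches that share an edge; this is one plausible reading of the step, not the patent's exact procedure:

```python
from collections import defaultdict, deque

def connected_regions(triangles):
    """Group new-visual-area patches into connected regions; two patches are
    connected when they share an edge (triangles: list of vertex-index triples)."""
    edge_owner = defaultdict(list)
    for t, (a, b, c) in enumerate(triangles):
        for e in ((a, b), (b, c), (a, c)):
            edge_owner[tuple(sorted(e))].append(t)
    adj = defaultdict(set)                          # patch -> edge-sharing patches
    for owners in edge_owner.values():
        for x in owners:
            adj[x].update(o for o in owners if o != x)
    seen, regions = set(), []
    for start in range(len(triangles)):
        if start in seen:
            continue
        region, queue = [], deque([start])
        seen.add(start)
        while queue:
            cur = queue.popleft()
            region.append(cur)
            for nxt in adj[cur] - seen:
                seen.add(nxt)
                queue.append(nxt)
        regions.append(region)
    return regions
```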
6) Steps 1)-5) are iterated until all single-frame binocular video frames obtained by the vehicle-mounted binocular camera have been processed, finally yielding an overall scene gridding structure reconstruction result that meets the requirements of unmanned driving.
The invention has the beneficial effects that:
the invention creatively utilizes the visual difference between two frames of images to find a new visual area and incrementally updates the global scene reconstruction result in a gridding structure form, thereby obtaining the three-dimensional scene reconstruction method with better performance and higher robustness under various environments.
The invention significantly improves on the following three defects of conventional scene structure reconstruction schemes:
1. Discontinuous reconstruction results: the scene structure constructed by the invention is a strictly continuous mesh structure without structural holes, satisfying the scene-map requirements of unmanned vehicles for tasks such as automatic navigation and path planning.
2. Excessive reconstruction cost: the method does not depend on expensive multi-beam LiDAR equipment at all, significantly reducing the hardware cost required by the reconstruction system.
3. Insufficient accuracy of reconstruction results: the invention addresses accuracy from two angles: 1) the precision of scene depth calculation is improved by the mesh subdivision and mesh optimization techniques; 2) the inter-frame motion information obtained from the satellite navigation equipment is more accurate than that obtained by motion-information solving techniques.
Drawings
FIG. 1 is an overall frame diagram of the present invention;
FIG. 2 illustrates a scene data collection module;
FIG. 3 shows a single frame scene gridding structure solution module;
FIG. 4 is a mapping of a single frame mesh structure onto a single frame image;
FIG. 5 illustrates a single frame scene gridding structure optimization module;
FIG. 6 illustrates the scene gridding structure optimization effect reflected by a depth map; (a) gridding the scene before optimization reflected by the depth map; (b) gridding the optimized scene reflected by the depth map;
FIG. 7 illustrates a scene gridding structure incremental update module;
FIG. 8 illustrates a generated virtual image; (a) the image is a virtual image before image restoration processing; (b) the virtual image is obtained after image restoration processing;
fig. 9 shows new visual area detection results;
FIG. 10 illustrates a partial map of a new visual area; (a) (c) is the part with inaccurate depth value or incomplete visual information in the previous frame; (b) (d) new visual area detection results;
FIG. 11 shows scene gridding structure reconstruction results; (a) in the form of a grid; (b) in RGB rendering form.
Detailed Description
As shown in fig. 1, the present invention mainly includes the following four technical modules, and each technical module is further described with reference to the accompanying drawings:
first, scene data acquisition module (as shown in fig. 2)
The first stage, video frame acquisition stage based on binocular camera:
the method is characterized in that equipment such as a vehicle-mounted binocular camera, an unmanned aerial vehicle-mounted binocular camera or a handheld binocular camera is utilized to perform mobile exploration in a scene needing to construct a continuous gridding structure and acquire scene continuous binocular video frames, and particularly, the used binocular camera is strictly calibrated. In this stage, it is necessary to ensure that the binocular camera is kept horizontal during the acquisition process and the frame rate is kept stable, so as to ensure the availability of the acquired scene visual information.
And a second stage, namely a motion information acquisition stage based on the satellite navigation equipment:
and synchronously recording position information in a world coordinate system in the process of recording continuous binocular video frames of a scene by utilizing various satellite navigation equipment, and resolving the position information to obtain interframe motion information. It should be noted that, due to different frame numbers or other errors that are difficult to completely eliminate, scene visual information recorded by a binocular camera and position information recorded by satellite navigation equipment cannot be completely synchronized in a strict sense, and the solution adopted by the present invention is: assuming that the motion of the mobile carrier is a uniform motion within a short time (<0.1s), then performing uniform position compensation according to the difference between the corresponding scene visual information timestamp and the position information timestamp. Experiments have shown that this solution is effective.
Two, single frame scene gridding structure resolving module (as shown in figure 3)
The first stage, visual feature point extraction stage:
in the invention, the structure is presented in a grid form, so that the grid vertex is expected to more efficiently reflect the remarkable structural characteristics of the scene, and the accurate pixel point matching between the two images is realized by adopting a visual characteristic point extraction and matching mode. There may be various visual feature point selection schemes, such as FAST feature points, ORB feature points, BRIEF feature points, etc., and the selection may be flexible according to different environment characteristics, different system frame rate requirements, different hardware devices, etc., and experiments prove that the feature point types with the best effect may be different in different environments.
The second stage, visual feature point matching stage:
since the basic idea of scene depth estimation in the present invention is a stereoscopic vision idea, after the visual feature points on two images in a single frame of binocular video frame are extracted, the visual feature points on the two images need to be accurately matched. The process can be realized by various schemes, and as the scene structure reconstruction of the invention is not on-line reconstruction but off-line reconstruction, the invention has no strict requirement on the real-time performance of the matching scheme, so that the invention adopts a Brute-Force matching (Brute-Force mather) mode, directly calculates the Euclidean distance or the Hamming distance between each visual feature point between two images according to the type of the selected visual feature point, and then performs the visual feature point matching work between the two images, and experiments prove that the matching strategy is effective.
The third stage, a visual feature point depth estimation stage:
after the accurate matching of the visual feature points between the two images in the single-frame binocular video frame is completed, the parallax of the corresponding position can be obtained according to the pixel difference in the horizontal direction between the matched visual feature point pairs. In the invention, the left camera in the binocular camera is used as the basis of scene structure perception, so that the sparse disparity map of the left eye image can be obtained through the processing of the stage, and the position with disparity information is the position of the visual feature point on the left eye image. After the disparity value and the parameters of the binocular camera are obtained, the accurate depth value of the corresponding point can be obtained according to the principle of a binocular camera model, and the calculation mode is as follows:
z = f · b / d
wherein d is a parallax value, b is a binocular camera baseline length, f is a camera focal length, and z is a depth result;
thus, a sparse depth map of the left eye image is obtained.
After the sparse depth map of the left-eye image is obtained, the accurate position in three-dimensional space of each pixel carrying a depth value is still needed; it is computed from the pinhole camera model as follows:
X = (u − c_x) · z / f_x
Y = (v − c_y) · z / f_y
Z = z
P_w = z · K⁻¹ · P_uv, where P_uv = (u, v, 1)ᵀ and K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
wherein K is the internal parameter matrix of the binocular camera, P_uv is the homogeneous coordinate of the visual feature point in the image pixel coordinate system, and P_w is the coordinate of the visual feature point in three-dimensional space.
Therefore, the accurate position of each visual characteristic point in the left eye image in the three-dimensional space is obtained.
A fourth stage, a triangulation stage:
the invention needs to reconstruct a scene gridding structure with complete continuity, and the accurate three-dimensional position of the sparse scene characteristic points is not enough. Here, the positions of the visual feature points representing the salient features of the scene in the left eye image in the three-dimensional space are obtained, and the scattered points need to form a continuous mesh through the connection between the points through a triangulation scheme. The invention is realized by adopting a Delaunay triangulation scheme, and the main reason for selecting the triangulation scheme is that the Delaunay triangulation scheme maximizes a minimum angle and can form a triangulation network which is closest to regularization, which is very helpful for truly embodying a scene structure.
The fifth stage, the mesh subdivision stage:
the scene gridding structure obtained through Delaunay triangulation is an initial scene gridding structure, the accuracy is not enough to support practical use, and the mesh needs to be further subdivided through a mesh subdivision scheme, so that the mesh achieves higher accuracy. Determining the mesh vertices newly added to the existing mesh is the main driving force of the mesh refinement scheme. In the third stage of the module, a sparse disparity map of the left eye image is obtained, in the fourth stage of the module, an initial scene gridding structure corresponding to the left eye image is obtained, based on the data bases, a dense disparity map of the left eye image can be obtained through a triangle internal interpolation method, according to the dense disparity map of the left eye image, any pixel position in the left eye image can be corresponding to the right eye image, and the error degree of the corresponding disparity is judged by measuring the hamming distance between Census descriptors (a descriptor calculation mode based on pixel neighborhood statistics) of paired pixels. After a dense error degree graph of a left eye image is obtained, the image is uniformly divided into a plurality of parts, a plurality of pixel positions with the highest error degree are taken in each part as positions of grid vertexes newly added into an initial scene grid, but the points do not have corresponding accurate positions in a three-dimensional space, for the points, as the whole system is based on a binocular camera system and the binocular camera is kept horizontal when scene visual information is recorded, only the corresponding positions in the right eye image with the highest matching degree are searched on a horizontal line according to Census descriptors, then parallax is calculated, the depth values of the points can be obtained, and the accurate positions of the points in the three-dimensional space can be obtained according to the depth values and a parameter matrix in the camera. After the mesh vertex position of a newly added mesh is determined, the original Delaunay triangulation result is emptied, and the Delaunay triangulation is carried out again according to all the mesh vertices to obtain the subdivided scene meshing structure. According to the actual precision requirement, the grid subdivision process is iterated for multiple times, and the scene grid structure reconstruction result which meets the precision requirement and is shown in fig. 4 can be obtained.
Three, single frame scene gridding structure optimization module (as shown in figure 5)
The first stage, the grid nail removing stage:
after the reconstruction result of the grid structure of the single frame scene is obtained, because the grid vertexes are only added to the grid structure iteratively and are not deleted according to strict logic in the grid subdivision stage, the grid structure of the scene still has errors, wherein the most significant error is the nail sticking error. The expression form of the error is that a few triangular mesh vertexes with very high prominence suddenly exist in certain parts of the mesh, the vertexes form a visual effect similar to 'nailing', the vertex mainly comes from matching errors in the process of resolving depth based on stereoscopic vision, and the phenomenon of 'nailing' can be caused when the parallax obtained according to the matching result is too large or too small. In the invention, the scheme of removing the grid nail prick is direct, whether a grid vertex is a key vertex causing the nail prick phenomenon is judged, the absolute value of the difference between the depth of the grid vertex and the average depth of all adjacent vertexes is mainly calculated, the depth of the grid vertex is compared with the depth of all adjacent vertexes, when the absolute value of the depth difference is larger than a certain threshold set according to practical experience and the depth of the grid vertex is larger than or smaller than the depth of all adjacent vertexes, the point is considered to be the key vertex causing the nail prick phenomenon, and the average depth of all adjacent vertexes of the point is used for replacing the original depth of the point, so that the single nail prick phenomenon can be solved. Experiments prove that after the process is applied to the whole grid area, all 'nailing' phenomena in the scene grid structure can be solved.
And a second stage, namely a grid structure smoothing stage:
in a conventional mesh structure smoothing scheme, a laplacian smoothing scheme is often used to smooth the mesh structure, which can significantly improve the smoothness of the mesh structure and also change the positions of a plurality of mesh vertices inside the mesh. However, in the present invention, in order to prevent the positions of the mesh vertices whose positions are changed from being changed on the image pixel coordinate system, an approximately laplacian smoothing scheme is adopted, which is still processed mesh-vertex-by-mesh vertex, according to the following formula:
P_o = (((1 − α) · Z_c + α · Z_n) / Z_c) · P_w
wherein Z_c is the depth value of the mesh vertex to be processed, Z_n is the average depth value of all adjacent vertices of that vertex, alpha is a manually set damping parameter, P_w is the position of the vertex in three-dimensional space, and P_o is the optimized position of the vertex in three-dimensional space.
Experiments prove that after the process is applied to the whole grid area, the smoothness of the scene gridding structure can be obviously improved to be closer to the real scene structure, as shown in fig. 6, the optimization effect of the scene gridding structure reflected by a depth map is shown, and fig. 6(a) and 6(b) are the scene gridding structures before and after optimization reflected by the depth map respectively.
Fourth, incremental updating module of scene gridding structure (as shown in FIG. 7)
The first stage, scene gridding structure updating detection stage:
after the single-frame scene gridding structure is processed by the first-time single-frame scene gridding structure optimizing module, a single-frame scene gridding structure reconstruction result is obtained, and all three-dimensional triangular surface patches contained in the reconstruction result are stored in a scene gridding structure library in a unified mode. Firstly, after a new frame of binocular image is input into the system, the system can construct a scene gridding structure corresponding to a new binocular image frame according to the three modules, then judge which areas on a left eye image in the new binocular image belong to new visual areas, and take the new visual areas and related triangular patches as new triangular patches which need to be added into an original scene gridding structure library. In the invention, a virtual image scheme is adopted to complete the selection work of a new triangular patch. The system obtains a scene gridding structure corresponding to a previous frame of binocular image, a scene gridding structure corresponding to a current frame of binocular image and motion information between the two frames of binocular images, and according to the computer vision principle, the position of each point on a left eye image in the previous frame of binocular image projected on the left eye image in the current frame of binocular image after the action of the motion information can be obtained, and a specific calculation mode aiming at a single point on the left eye image in the previous frame of binocular image is as follows:
setting the homogeneous coordinate of the point on the pixel coordinate system in the left eye image in the previous frame of binocular image as PpThe depth value of the point at the time of the last frame is zpThe homogeneous coordinate of the point on the pixel coordinate system in the left eye image in the current frame binocular image is PcThe depth value of the point at the current frame time is zcThe motion matrix corresponding to the interframe motion information is T, the intrinsic parameter matrix of the camera is K, and the calculation is carried out according to the following formula:
z_c · P_c = K · T · (z_p · K⁻¹ · P_p)
after the process is applied to all pixel points with depth values in a left-eye image in a previous frame of binocular image, a virtual image generated according to the depth values and inter-frame motion information can be obtained, so that the virtual image is an image with holes and needs to be restored in an interpolation mode. After the restored virtual image is obtained, comparing the gray value of the image with the gray value of the left eye image in the current frame binocular image, counting the average gray value error, selecting a triangular patch with the gray value error higher than the average gray value error as a component patch of a new visual area (for example, a white patch area shown in fig. 9 shows a new visual area detection result), and adding the triangular patch into a scene gridding structure library to prepare for the later scene gridding structure updating.
Fig. 10 shows a partial correspondence map of a new visual region, wherein (a) (c) is a portion of a previous frame where depth values are inaccurate or visual information is incomplete; (b) and (d) the new visual region detection results corresponding to (a) and (c).
And a second stage, namely a scene gridding structure updating stage:
in order to make the incrementally generated scene grid structure be a continuous grid structure without redundant patches, after the triangular patches representing the new visual area obtained by the previous stage processing are obtained, it is also necessary to select which scene triangular patches in the original scene grid structure library need to be optimized. In the invention, the scheme based on the interframe mapping is still adopted to complete the work. The specific implementation process comprises the following steps: mapping a triangular patch in a scene gridding structure library corresponding to a plurality of frames (ten frames are set in the specific embodiment) before the current frame to the current frame according to the motion information, and regarding the triangular patch which is overlapped or partially overlapped with a new visual area in the current frame as the triangular patch to be optimized; then, searching communicated areas of all triangular patches in the new visual area in the current frame, and sorting out all the communicated areas; secondly, according to the image coordinate vertexes of the triangular patches to be optimized after being mapped to the current frame and the image coordinate vertexes corresponding to all triangular patches in the new visual area in the current frame, the triangular patches are gathered together, and through a Delaunay triangulation technology, each connected area is taken as a reference to carry out re-triangulation work; and finally, deleting the triangular patches to be optimized from the scene gridding structure library, and adding the triangular patches corresponding to the new Delaunay triangulation area into the scene gridding structure library. The operation is carried out frame by frame, and the reconstruction result of the gridding structure of the continuous scene on the global scope can be obtained. Fig. 11 shows a reconstruction result of a gridding structure of a scene, where fig. 11(a) is in a grid form, a structure above a certain depth is replaced by scattered points for visualization, and fig. 11(b) is in an RGB rendering form.
Finally, it should be noted that the above embodiments are merely representative examples of the present invention. Obviously, the technical solution of the present invention is not limited to the above-described embodiments, and many variations are possible. A person skilled in the art may make modifications or changes to the embodiments described above without departing from the inventive idea of the present invention, and therefore the scope of protection of the present invention is not limited by the embodiments described above, but should be accorded the widest scope of the innovative features set forth in the claims.

Claims (5)

1. A binocular camera-based unmanned scene incremental gridding structure reconstruction method is characterized by comprising the following steps:
1) inputting a single-frame binocular video frame in a scene collected by an on-board binocular camera, and carrying out scene gridding structure reconstruction on the input single-frame binocular video frame;
the step 1) is specifically as follows:
1.1) respectively extracting visual feature points on two images in a single-frame binocular video frame acquired by a binocular camera;
1.2) visual feature point matching stage: matching the visual feature points obtained in step 1.1) by a brute-force matching method to obtain visual feature point pairs, and calculating the disparity value of each pair of visual feature points;
1.3) visual feature point depth estimation stage: establishing a camera coordinate system by taking the initial position of the left eye camera as a coordinate origin, and calculating the three-dimensional position of the visual feature point of the left eye image in the visual feature point pair in the camera coordinate system according to the parallax value of the visual feature point pair in the step 1.2), wherein the specific calculation process is as follows:
1.3.1) calculating the depth value of each visual characteristic point, thereby obtaining a sparse depth map of the left eye image:
z = f · b / d
wherein d is the disparity value, b is the baseline length of the binocular camera, f is the camera focal length, and z is the depth value;
1.3.2) calculating the three-dimensional position of the visual characteristic point of the left eye image in a camera coordinate system:
X = (u − c_x) · z / f_x
Y = (v − c_y) · z / f_y
Z = z
P_w = z · K⁻¹ · P_uv, where P_uv = (u, v, 1)ᵀ and K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
wherein K is the internal parameter matrix of the binocular camera; P_uv is the homogeneous coordinate of the visual feature point in the image pixel coordinate system; P_w is the coordinate of the visual feature point in the camera coordinate system; u and v are the positions of the visual feature point on the horizontal and vertical coordinate axes of the image pixel coordinate system; and X, Y and Z are the positions of the visual feature point on the x, y and z axes of the camera coordinate system;
1.4) triangulation stage: adopting a Delaunay triangulation method to subdivide the interior of a visual characteristic point set to obtain a triangular mesh structure, wherein the visual characteristic point set is a set of visual characteristic points on a left eye image;
1.5) mesh subdivision stage: performing depth interpolation on the current frame's triangular mesh structure according to the triangular mesh structure generated in step 1.4) and the depth value of each mesh vertex, iteratively selecting the visual feature points on the left-eye image in the top 5% of depth error over the whole mesh, measured with Census descriptors, as feature points to be updated, re-matching them according to the Hamming distance, adding them to the visual feature point set, and performing Delaunay triangulation again inside the set, thereby obtaining the subdivided scene gridding structure;
2) searching for the position of a grid spiking phenomenon in the scene gridding structure by using the network structure characteristics, eliminating the grid spiking phenomenon, and improving the smoothness of the whole scene gridding structure by using an approximate Laplace smoothing method;
3) acquiring position information of a scene by adopting vehicle-mounted satellite navigation equipment, calculating motion information between a previous frame and a current frame through the position information to obtain a transformation matrix for describing continuous motion between frames, and then detecting a new visual area; the new visual area is detected as: constructing a virtual image according to the transformation matrix and the scene gridding structure of the previous frame, and determining a new visual area of the previous frame compared with the current frame by using the visual difference between the virtual image and the current frame;
4) determining the position of a local scene gridding structure to be updated by utilizing the accumulated motion information of the camera position of the current frame relative to the camera position of the initial frame and the scene facet in the constructed overall scene gridding structure and combining the new visual area in the step 3);
5) the incremental splicing method of the scene gridding structure comprises the following steps: after the new visual area is associated with the position of the grid structure of the local scene to be updated, splicing the grid structure of the whole scene by a Delaunay triangulation method;
6) repeating steps 1)-5) until all single-frame binocular video frames obtained by the vehicle-mounted binocular camera have been processed, finally obtaining the overall scene gridding structure reconstruction result.
2. The binocular camera-based unmanned scene incremental gridding structure reconstruction method according to claim 1, wherein the binocular camera-based unmanned scene incremental gridding structure reconstruction method comprises the following steps:
the step 2) is specifically as follows:
2.1) grid "spiking" removal stage: processing each mesh vertex on the scene meshing structure in a traversal mode, and replacing the depth of each mesh vertex by the average depth of each adjacent vertex when the depth of each mesh vertex is larger than or smaller than the depth of each adjacent mesh vertex and the absolute value of the difference between the average depth of each adjacent vertex and the average depth of each adjacent vertex is larger than a threshold value;
2.2) mesh structure smoothing stage: smoothing the mesh vertices in the scene gridding structure one by one with an approximate Laplacian smoothing method, computed as follows:
P_o = (((1 − α) · Z_c + α · Z_n) / Z_c) · P_w
wherein Z_c is the depth value of the mesh vertex to be processed, Z_n is the average depth value of all adjacent vertices of that vertex, alpha is a manually set damping parameter, P_w is the position of the vertex in the camera coordinate system, and P_o is the optimized position of the vertex in the camera coordinate system.
3. The binocular camera-based unmanned scene incremental gridding structure reconstruction method according to claim 1, wherein the binocular camera-based unmanned scene incremental gridding structure reconstruction method comprises the following steps: the method for detecting the new visual area in step 3) is to detect the new visual area in the current frame in a virtual image-based manner, and specifically includes the following steps:
each pixel point of a left eye image in a previous single-frame binocular video frame is processed as follows to construct a virtual image:
z_c · P_c = K · T · (z_p · K⁻¹ · P_p)
wherein P_p is the homogeneous coordinate of the pixel in the pixel coordinate system of the left-eye image of the previous single-frame binocular video frame, z_p is the depth value of the pixel at the previous-frame moment, P_c is the homogeneous coordinate of the pixel in the pixel coordinate system of the left-eye image of the current single-frame binocular video frame, z_c is the depth value of the pixel at the current-frame moment, T is the motion matrix corresponding to the inter-frame motion information, and K is the internal parameter matrix of the binocular camera;
the method comprises the steps of utilizing a Navier-Stokes equation to carry out image restoration on a virtual image, then comparing visual differences on each triangular patch area in the restored virtual image and a current frame left-eye image to obtain a visual difference value, wherein the specific calculation mode of the visual difference value is as follows:
D = (1/n) · Σ_{i=1..n} |g_pi − g_ci|
wherein n is the total number of pixels in the triangular area, g_pi is the gray value at the i-th pixel position in the virtual image, g_ci is the gray value at the i-th pixel position in the current frame's left-eye image, and D is the visual difference value;
and finally, selecting a triangular patch with the visual difference value higher than the average visual difference value as a composition patch of the new visual area to finish the detection of the new visual area.
4. The binocular camera-based unmanned scene incremental gridding structure reconstruction method according to claim 1,
the step 4) is as follows: projecting a triangular patch in the constructed overall scene gridding structure onto a current frame by adopting the following calculation method:
z · P_uv = K · T · P_w
wherein P_w is the position of the triangular patch vertex in the world coordinate system with the first frame as origin, z is the depth, T is the accumulated motion information of the current frame's camera position relative to the initial frame's camera position, K is the internal parameter matrix of the binocular camera, and P_uv is the homogeneous coordinate, in the pixel coordinate system, of the patch vertex projected onto the current frame;
in the constructed overall scene gridding structure, the triangular patches that overlap or partially overlap the new visual area in the current frame and lie within 5 meters' spatial distance of that area are regarded as the positions of the local scene gridding structure to be updated.
5. The binocular camera-based unmanned scene incremental gridding structure reconstruction method according to claim 1, wherein the binocular camera-based unmanned scene incremental gridding structure reconstruction method comprises the following steps:
the incremental splicing method for the scene gridding structure in the step 5) specifically comprises the following steps: and performing connected region search on all triangular patches in the new visual region of the current frame, sorting out all connected regions, projecting each connected region and the position of the local scene gridding structure to be updated to a position set on the left image of the current frame, and then performing Delaunay triangulation again to update the whole scene gridding structure.
CN201910156872.XA 2019-03-01 2019-03-01 Unmanned scene incremental gridding structure reconstruction method based on binocular camera Active CN110021041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910156872.XA CN110021041B (en) 2019-03-01 2019-03-01 Unmanned scene incremental gridding structure reconstruction method based on binocular camera


Publications (2)

Publication Number Publication Date
CN110021041A CN110021041A (en) 2019-07-16
CN110021041B true CN110021041B (en) 2021-02-12

Family

ID=67189161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910156872.XA Active CN110021041B (en) 2019-03-01 2019-03-01 Unmanned scene incremental gridding structure reconstruction method based on binocular camera

Country Status (1)

Country Link
CN (1) CN110021041B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100621A (en) * 2022-08-25 2022-09-23 北京中科慧眼科技有限公司 Ground scene detection method and system based on deep learning network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102074052A (en) * 2011-01-20 2011-05-25 山东理工大学 Sampling point topological neighbor-based method for reconstructing surface topology of scattered point cloud
US9256496B1 (en) * 2008-12-15 2016-02-09 Open Invention Network, Llc System and method for hybrid kernel—and user-space incremental and full checkpointing
CN106780735A (en) * 2016-12-29 2017-05-31 深圳先进技术研究院 A kind of semantic map constructing method, device and a kind of robot

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1235185B1 (en) * 2001-02-21 2011-11-23 Boly Media Communications Inc. Method of compressing digital images
CN100533484C (en) * 2007-05-25 2009-08-26 同济大学 System and method for converting disordered point cloud to triangular net based on adaptive flatness
CN101581575B (en) * 2009-06-19 2010-11-03 南昌航空大学 Three-dimensional rebuilding method based on laser and camera data fusion
CN101866497A (en) * 2010-06-18 2010-10-20 北京交通大学 Binocular stereo vision based intelligent three-dimensional human face rebuilding method and system
CN102496184B (en) * 2011-12-12 2013-07-31 南京大学 Increment three-dimensional reconstruction method based on bayes and facial model
CN107610228A (en) * 2017-07-05 2018-01-19 山东理工大学 Curved surface increment topology rebuilding method based on massive point cloud
CN108876909A (en) * 2018-06-08 2018-11-23 桂林电子科技大学 A kind of three-dimensional rebuilding method based on more image mosaics


Also Published As

Publication number Publication date
CN110021041A (en) 2019-07-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant