CN116258817A - Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction

Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction

Info

Publication number
CN116258817A
CN116258817A (application CN202310123079.6A)
Authority
CN
China
Prior art keywords
image
scene
point cloud
dimensional
model
Prior art date
Legal status
Granted
Application number
CN202310123079.6A
Other languages
Chinese (zh)
Other versions
CN116258817B (en)
Inventor
李涛
李睿航
潘之杰
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310123079.6A priority Critical patent/CN116258817B/en
Publication of CN116258817A publication Critical patent/CN116258817A/en
Application granted granted Critical
Publication of CN116258817B publication Critical patent/CN116258817B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Computer Graphics (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction. The system comprises six modules: data acquisition and processing, camera pose estimation, multi-view three-dimensional reconstruction, point cloud model scale correction, air-ground point cloud model fusion, and quantitative model accuracy evaluation. A multi-view three-dimensional reconstruction method for automatic driving scenes is also provided, which addresses shortcomings of the prior art such as insufficient feature extraction in texture-complex regions and sensitivity to noisy data, thereby improving the reconstruction accuracy of the three-dimensional automatic driving scene model while reducing GPU memory consumption. In addition, the invention provides an unmanned-aerial-vehicle-based image acquisition scheme for outdoor automatic driving scenes, which efficiently and comprehensively collects the image data required by the multi-view three-dimensional reconstruction method.

Description

Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
Technical Field
The invention relates to the fields of computer vision and automatic driving technology, and in particular to an automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction.
Background
In recent years, owing to the limited perception range and capability of a single vehicle, automatic driving has gradually evolved towards intelligent vehicle-road networking with digital twinning at its core. Automatic driving digital twin technology constructs a high-precision traffic scene in a three-dimensional virtual environment based on sensing data from the real scene. The reconstructed three-dimensional virtual scene not only provides a complete, sufficient and editable environment for automatic driving simulation tests, but also supplies large amounts of driving data for automatic driving algorithm optimization and offers an efficient scheme for the automatic production of high-precision maps. Automatic driving digital twin technology establishes a link between the physical world and the virtual world by mapping real traffic scene elements into virtual space. In the virtual world, every link of automatic driving can be controlled across the whole process and all elements, time and space can be switched freely according to defined logic and rules, and key automatic driving technologies can be studied at low cost and high efficiency, which in turn drives the development of automatic driving technology in the real world.
The essence of automatic driving three-dimensional virtual scene reconstruction is the perception and restoration of the three-dimensional geometry of the traffic scene. Compared with lidar scanning schemes, vision-based methods recover the multi-view geometric relationships and their intrinsic connections from two-dimensional projections, and offer advantages such as low cost, convenient data acquisition, dense reconstructed models and rich textures. The traditional Colmap multi-view three-dimensional reconstruction algorithm is based on classical computational geometry: it measures photometric consistency between views with normalized cross-correlation matching and propagates depth with PatchMatch. The Colmap algorithm is highly interpretable, but its depth map computation is slow, which makes it difficult to apply to large-scale outdoor scenes such as automatic driving. In recent years, with the development of deep learning, deep neural networks have shown a strong capability for extracting image features. Many studies feed multi-view images and the corresponding camera poses into a deep neural network to achieve end-to-end multi-view stereo matching and depth map estimation. MVSNet proposes a differentiable homography transformation to construct the cost volume, applies three-dimensional convolutions with a multi-scale UNet-like structure to the aggregated cost volume to obtain a smoothed probability volume, and estimates the depth map of each image from it. CVP-MVSNet refines depth map quality from coarse to fine with a pyramid structure; Fast-MVSNet adopts a sparse cost volume construction to speed up depth map estimation. Compared with the traditional Colmap algorithm, these methods greatly reduce depth map estimation time.
Although the prior art achieves good results on indoor scene datasets, it faces major challenges in large-scale outdoor automatic driving scenes, mainly including:
(1) In the prior art, feature extraction is insufficient in automatic driving scenes with complex structures, leading to insufficient reconstruction accuracy. Outdoor automatic driving scenes are characterized by a wide reconstruction range, complex scene structure and large illumination changes, and low-texture and texture-rich regions coexist in the scene. The prior art extracts features with a fixed convolution kernel size, so the features of texture-complex regions are ignored; in addition, the prior art assigns equal weights to the feature channels and cannot filter out noisy data well, so the reconstruction accuracy of the model is insufficient.
(2) In the prior art, GPU memory consumption is excessive when reconstructing large-scale outdoor automatic driving scenes. In such scenes, the distances between different objects and the camera vary greatly, so the assumed depth range of the depth map must be set large; in addition, to obtain a better reconstruction result, the image resolution is also set large, which directly enlarges the constructed cost volume, and the network model occupies a large amount of GPU memory during inference. For example, with an image width of 1200 pixels, a height of 800 pixels and a depth range of 1024, the inference stage of the algorithm occupies about 29 GB of GPU memory, which is difficult to run on an ordinary consumer-grade graphics card.
(3) The quality of the input multi-view images directly affects the reconstruction result. Most datasets used in the prior art are collected around a single object or building, and this acquisition mode is not suitable for large-scale outdoor automatic driving scenes. A scheme for acquiring and processing outdoor image data of automatic driving scenes therefore needs to be proposed.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction. The feature extraction module of the multi-view three-dimensional reconstruction method for automatic driving scenes further extracts features of texture-complex regions by adapting the convolution kernel shape, and assigns different weights to different feature channels. In addition, the extracted feature maps are transformed by homography into several feature volumes and aggregated into a cost volume; the cost volume is sliced along the depth dimension, and a spatio-temporal recurrent neural network regularizes the slice sequence, which preserves the correlation between slices while reducing the GPU memory occupied during inference on large-scale outdoor automatic driving scenes. In addition, the invention provides an unmanned-aerial-vehicle-based image acquisition scheme for outdoor automatic driving scenes, which efficiently and comprehensively collects the image data required by the multi-view three-dimensional reconstruction method. Finally, the invention provides an automatic driving digital twin scene construction system based on multi-view three-dimensional reconstruction, comprising six modules: data acquisition and processing, camera pose estimation, multi-view three-dimensional reconstruction, point cloud model scale correction, air-ground point cloud model fusion, and quantitative model accuracy evaluation.
The aim of the invention is realized by the following technical scheme: an automatic driving digital twin scene construction system based on multi-view three-dimensional reconstruction, the system comprising:
a data acquisition and processing module (M100), used for acquiring and preprocessing multi-view images of the automatic driving scene and dividing the processed image data into several groups;
a camera pose estimation module (M200), used for taking the collected multi-view images as input and outputting the position and pose of the camera that captured each image, thereby obtaining the intrinsic and extrinsic camera parameter sequences;
a multi-view three-dimensional reconstruction module (M300), used for constructing a network model, extracting the feature map sequence of the multi-view images with the network model, building a cost volume in combination with the intrinsic and extrinsic camera parameter sequences, slicing the cost volume along the depth dimension, processing the sliced cost volume to obtain a probability volume, estimating the depth map of each multi-view image from the probability volume, and finally fusing the depth maps into a three-dimensional dense point cloud of the scene;
a point cloud model scale correction module (M400), used for constructing an equal-proportion triangular patch in virtual three-dimensional space, taking the three feature points obtained by module (M100) and the side lengths of the triangle they form as input parameters, locating the three corresponding feature points in the three-dimensional dense point cloud of the scene obtained by the multi-view three-dimensional reconstruction module (M300), simultaneously registering the three feature points in the virtual point cloud model with the corresponding vertices of the triangular patch, and applying a scale transformation to the three-dimensional dense point cloud;
an air-ground point cloud model fusion module (M500), used for designating the three-dimensional dense point cloud reconstructed from the images acquired by the unmanned aerial vehicle in the data acquisition and processing module (M100) as the aerial point cloud model and the three-dimensional dense point clouds reconstructed from the other image groups as ground point cloud models; the module registers the several ground point cloud models to the aerial point cloud model to form the final automatic driving digital twin scene model;
and a model accuracy quantitative evaluation module (M600), used for quantitatively evaluating the accuracy of the three-dimensional automatic driving scene model and verifying that its accuracy meets the requirements of subsequent automatic driving tasks.
As a preferred embodiment of the present invention, the method for collecting and processing data by the data collecting and processing module (M100) includes the steps of:
S201, demarcating the automatic driving scene range to be reconstructed;
S202, presetting a data acquisition route, flying the unmanned aerial vehicle at a fixed flight height along the preset S-shaped route, and shooting scene images at the shooting points;
S203, lowering the flight height of the unmanned aerial vehicle and shooting around the buildings in the scene along a figure-eight flight path;
S204, for buildings beside the road and road sections where the road is completely blocked by trees, acquiring data with a handheld shooting device in a surround-shooting manner;
S205, preprocessing all collected image data: retaining the most central area of each image to adjust the image size to 3000 pixels wide and 2250 pixels high, and then downsampling the image to 1600 pixels wide and 1200 pixels high (a preprocessing sketch in Python is given after this list);
S206, dividing the preprocessed image data into several groups, wherein the images acquired in steps S202 and S203 form the first group and the images captured for each building or road section in step S204 are grouped individually;
S207, selecting the three most distinct feature points in the real scene covered by each group of images, and recording the positions of the feature points and the side lengths, measured to millimetre precision, of the triangle they form.
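The following is a minimal sketch of the preprocessing in step S205 (centre crop, then downsample), assuming OpenCV is used; the directory layout and file extension are illustrative assumptions only.

```python
# Illustrative sketch of step S205: keep the central 3000x2250 window, then
# downsample to the 1600x1200 working resolution. Paths are hypothetical.
import cv2
import glob
import os

CROP_W, CROP_H = 3000, 2250      # central region kept from the original frame
OUT_W, OUT_H = 1600, 1200        # resolution fed to the reconstruction network

def preprocess(src_dir: str, dst_dir: str) -> None:
    os.makedirs(dst_dir, exist_ok=True)
    for path in glob.glob(os.path.join(src_dir, "*.jpg")):
        img = cv2.imread(path)
        h, w = img.shape[:2]
        # keep the most central CROP_W x CROP_H window
        x0 = max((w - CROP_W) // 2, 0)
        y0 = max((h - CROP_H) // 2, 0)
        crop = img[y0:y0 + CROP_H, x0:x0 + CROP_W]
        # downsample to the working resolution
        small = cv2.resize(crop, (OUT_W, OUT_H), interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(dst_dir, os.path.basename(path)), small)

# preprocess("raw_uav_images", "processed_images")   # hypothetical directories
```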
As a preferred scheme of the invention, the camera pose estimation module (M200) comprises a retrieval matching unit and an incremental reconstruction unit; the retrieval matching unit takes the multi-view images as input, searches for geometrically verified image pairs with overlapping regions, and computes the projections of the same spatial point onto the two images of each pair; the incremental reconstruction unit outputs the position and pose of the camera that captured each image.
More preferably, the incremental reconstruction unit outputs the position and pose of the camera that captured each image as follows: first, an initial image pair is selected at a location where the multi-view images are dense and registered; then the image with the largest number of registration points with respect to the currently registered images is selected; the newly added view is registered against the image set with already determined poses, and the pose of the camera that captured it is estimated with a PnP solving algorithm; next, the spatial points covered by the newly registered image that have not yet been reconstructed are triangulated and the new spatial points are added to the reconstructed spatial point set; finally, one round of bundle adjustment optimization is performed on all currently estimated three-dimensional spatial points and camera poses.
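A minimal sketch of the PnP registration step for a newly added view is shown below. OpenCV's solvePnPRansac is used here as a convenient stand-in; the embodiment later in the text mentions a DLT-plus-SVD solution, so this is an illustrative assumption rather than the patent's exact solver, and the variable names are hypothetical.

```python
# Sketch of registering a newly added view by PnP, assuming OpenCV.
# object_points are already-reconstructed 3D points seen by the new image;
# image_points are their 2D detections in that image (names are illustrative).
import cv2
import numpy as np

def register_new_view(object_points: np.ndarray,   # (N, 3) world coordinates
                      image_points: np.ndarray,    # (N, 2) pixel coordinates
                      K: np.ndarray):              # (3, 3) camera intrinsics
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points.astype(np.float64),
        image_points.astype(np.float64),
        K, distCoeffs=None,
        reprojectionError=4.0)
    if not ok:
        raise RuntimeError("PnP failed for this view")
    R, _ = cv2.Rodrigues(rvec)          # rotation matrix of the new camera
    return R, tvec, inliers
```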
The invention also provides a scene construction method of the automatic driving digital twin scene construction system, which comprises the following steps:
S501: the data acquisition and processing module (M100) performs omnidirectional perception of the automatic driving scene and processes the acquired multi-view image data;
S502: the acquired multi-view image data are input into the camera pose estimation module (M200), the position and pose of the camera that captured each image are estimated by retrieval matching and incremental reconstruction, and the intrinsic and extrinsic camera parameter sequence {P_i}, i = 1, ..., N, is obtained;
S503: the intrinsic and extrinsic camera parameter sequence {P_i} and the image data acquired by the data acquisition and processing module (M100) are input into the network model constructed by the multi-view three-dimensional reconstruction module (M300); the network model extracts the feature map sequence {F_i}, i = 1, ..., N, of the image sequence {I_i}, and, according to the camera parameter sequence {P_i} and the feature map sequence {F_i}, constructs the feature volumes and aggregates them into a cost volume; the cost volume is sliced along the depth dimension, and each slice is processed together with its two neighbouring slices by the network model to obtain a probability volume describing, for each pixel, the probability distribution over the different depths;
S504: the valid depth values of the ground-truth depth map are converted into a ground-truth volume by one-hot encoding, and this ground-truth volume is used as the label for supervised learning; the probability volume and the ground-truth volume are input into the original network model, and the trained network model is obtained through multiple rounds of training that minimize the cross-entropy loss between the probability volume and the ground-truth volume;
S505: each image of the input multi-view image sequence is processed by the trained network model to obtain a probability volume, which is converted into a depth map; the depth map sequence is then filtered and fused to obtain the reconstructed three-dimensional dense point cloud of the scene;
S506: an equal-proportion triangular patch is constructed in virtual three-dimensional space by the point cloud model scale correction module (M400), the three feature points in the reconstructed three-dimensional dense point cloud are registered with the corresponding vertices of the triangular patch, and a scale transformation is applied to the three-dimensional dense point cloud of the scene;
S507: the three-dimensional dense point clouds are divided into an aerial point cloud model and ground point cloud models by the air-ground point cloud model fusion module (M500); the several ground point cloud models are registered to the aerial point cloud model to form the final automatic driving digital twin scene model; and the accuracy of the reconstructed three-dimensional automatic driving scene model is quantitatively evaluated to ensure that it meets the requirements of subsequent automatic driving tasks.
As a preferred embodiment of the present invention, the feature map sequence {F_i}, i = 1, ..., N, in step S503 is obtained as follows: first, the offsets of the convolution kernel direction vectors are learned by the network model, so that the convolution kernel adapts to regions with different texture structures and extracts finer features; second, the feature maps of different sizes are upsampled to the original input image size and concatenated to form a feature map with 32 channels; then the two-dimensional information u_c(i, j) of each feature channel is compressed into a one-dimensional real number z_c, and a two-stage fully connected transform is applied; finally, a sigmoid function limits each real number z_c to the range [0, 1], so that each channel of the feature map carries a different weight, which weakens noisy data and irrelevant features during matching. The above steps are repeated for each input image to obtain the feature map sequence {F_i}, i = 1, ..., N.
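The PyTorch sketch below illustrates this kind of feature extractor: learned-offset (deformable) convolutions for texture-adaptive receptive fields, followed by squeeze-and-excitation style channel weighting. The layer counts, channel sizes and the use of torchvision's DeformConv2d are illustrative assumptions and not the patent's exact architecture (Fig. 2).

```python
# A minimal PyTorch sketch of the feature extractor idea described above.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """3x3 deformable convolution whose sampling offsets are predicted per pixel."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.offset = nn.Conv2d(cin, 2 * 3 * 3, 3, stride=stride, padding=1)
        self.conv = DeformConv2d(cin, cout, 3, stride=stride, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x, self.offset(x)))

class ChannelWeight(nn.Module):
    """Squeeze-and-excitation style per-channel weights in [0, 1]."""
    def __init__(self, c, r=4):
        super().__init__()
        self.fc1 = nn.Linear(c, c // r)
        self.fc2 = nn.Linear(c // r, c)

    def forward(self, x):
        z = x.mean(dim=(2, 3))                       # squeeze: (B, C)
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(z))))
        return x * s[:, :, None, None]               # re-weight each channel

class FeatureNet(nn.Module):
    def __init__(self, out_channels=32):
        super().__init__()
        self.s1 = DeformBlock(3, 8)                  # full resolution (assumed scales)
        self.s2 = DeformBlock(8, 16, stride=2)       # 1/2 resolution
        self.s3 = DeformBlock(16, 32, stride=2)      # 1/4 resolution
        self.merge = nn.Conv2d(8 + 16 + 32, out_channels, 1)
        self.weight = ChannelWeight(out_channels)

    def forward(self, img):                          # img: (B, 3, H, W)
        f1 = self.s1(img)
        f2 = self.s2(f1)
        f3 = self.s3(f2)
        size = img.shape[-2:]                        # upsample back to H x W
        f = torch.cat([f1,
                       F.interpolate(f2, size, mode="bilinear", align_corners=False),
                       F.interpolate(f3, size, mode="bilinear", align_corners=False)], dim=1)
        return self.weight(self.merge(f))            # (B, 32, H, W) feature map
```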
As a preferred embodiment of the present invention, the cost volume aggregation in step S503 is specifically: one image of the feature map sequence is selected as the reference feature map F_1 and the remaining feature maps are used as source feature maps F_2, ..., F_N; then, according to the intrinsic and extrinsic camera parameter sequence {P_i}, all feature maps are projected by homography transformation onto several parallel planes under the reference image to form N-1 feature volumes V_2, ..., V_N; finally, the feature volumes are aggregated into a cost volume by an aggregation function.
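A sketch of this construction is given below: differentiable homography warping of source feature maps onto fronto-parallel planes of the reference view, followed by variance-based aggregation across views. The tensor layout, the fronto-parallel plane assumption and the choice of variance as the aggregation function follow common MVSNet-style implementations and are assumptions here, not a verbatim transcription of the patent.

```python
# Sketch (PyTorch, CPU tensors assumed) of homography warping and aggregation
# of per-view feature volumes into a single cost volume.
import torch
import torch.nn.functional as F

def homography(K_src, K_ref, R, t, depth):
    """H(d) = K_src (R + t n^T / d) K_ref^-1 for the plane n = [0, 0, 1] at depth d."""
    n = torch.tensor([0.0, 0.0, 1.0], device=R.device).view(1, 3)
    return K_src @ (R + t.view(3, 1) @ n / depth) @ torch.inverse(K_ref)

def build_cost_volume(feats, K, R, t, depths):
    """feats: list of (C, H, W) feature maps, feats[0] is the reference view.
    K, R, t: per-view intrinsics and poses relative to the reference view.
    depths: 1-D tensor of D depth hypotheses. Returns a cost volume (C, D, H, W)."""
    C, H, W = feats[0].shape
    D = len(depths)
    # pixel grid of the reference view in homogeneous coordinates, shape (3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)

    volumes = [feats[0].unsqueeze(1).expand(C, D, H, W)]   # reference feature volume
    for i in range(1, len(feats)):
        warped = []
        for d in depths:
            p = homography(K[i], K[0], R[i], t[i], d) @ grid  # project to source view
            p = p[:2] / p[2:].clamp(min=1e-6)
            gx = 2.0 * p[0] / (W - 1) - 1.0                   # normalise for grid_sample
            gy = 2.0 * p[1] / (H - 1) - 1.0
            g = torch.stack([gx, gy], -1).view(1, H, W, 2)
            warped.append(F.grid_sample(feats[i].unsqueeze(0), g,
                                        align_corners=True).squeeze(0))
        volumes.append(torch.stack(warped, 1))                # (C, D, H, W)
    vol = torch.stack(volumes, 0)                             # (N, C, D, H, W)
    return vol.var(dim=0, unbiased=False)                     # variance across views
```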
As a preferred embodiment of the present invention, the probability volume in step S503 is formed as follows: the cost volume is sliced into D pieces, where D is the depth prior and the depth value may take any value from 0 to D; the slice sequence of the cost volume is then treated as a temporal sequence and fed into the spatio-temporal recurrent neural network of the network model for regularization. The spatio-temporal recurrent network uses ST-LSTM to propagate its memory state along the temporal sequence (horizontal direction) and the spatial domain (vertical direction), preserving the connection between slices and reducing the occurrence of multiple peaks in the probability volume; in the horizontal direction, the first-layer unit at a given moment receives the hidden state and memory state of the last-layer unit at the previous moment and passes them layer by layer in the vertical direction; finally, a softmax normalization outputs, for each pixel, the probability value at each depth d ∈ [0, D], forming the probability volume.
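The sketch below shows the slicing-plus-recurrent-regularization idea in simplified form. A plain convolutional GRU cell is used here as a stand-in for the ST-LSTM units, and the channel sizes are illustrative; the point it demonstrates is that slices are visited sequentially, so the memory footprint does not grow with the number of depth hypotheses D.

```python
# Simplified sketch: slice the cost volume along depth, regularise the slice
# sequence recurrently, then softmax over depth to obtain the probability volume.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, cin, ch):
        super().__init__()
        self.zr = nn.Conv2d(cin + ch, 2 * ch, 3, padding=1)
        self.h = nn.Conv2d(cin + ch, ch, 3, padding=1)
        self.ch = ch

    def forward(self, x, h):
        if h is None:
            h = x.new_zeros(x.shape[0], self.ch, *x.shape[2:])
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new

class SliceRegularizer(nn.Module):
    def __init__(self, feat_ch=32, hidden=8):
        super().__init__()
        self.cell = ConvGRUCell(feat_ch, hidden)
        self.out = nn.Conv2d(hidden, 1, 3, padding=1)

    def forward(self, cost_volume):                 # (B, C, D, H, W)
        h, scores = None, []
        for d in range(cost_volume.shape[2]):       # walk through depth slices
            h = self.cell(cost_volume[:, :, d], h)
            scores.append(self.out(h))              # one matching score per slice
        scores = torch.cat(scores, dim=1)           # (B, D, H, W)
        return torch.softmax(scores, dim=1)         # probability volume
```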
As a preferred embodiment of the present invention, step S505 specifically includes: applying an argmax operation to the probability volume inferred by the trained network model for each image to obtain the depth map sequence, filtering out low-confidence depth values based on the photometric consistency criterion and the geometric consistency criterion, and finally fusing the depth maps into a three-dimensional dense point cloud through the formula P = d·M⁻¹K⁻¹p, where p is the pixel coordinate, d is the depth value inferred by the network model, K and M are the camera intrinsic and extrinsic matrices, and P is the three-dimensional coordinate in the world coordinate system.
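A small sketch of the back-projection P = d·M⁻¹K⁻¹p used in this fusion step is given below; the 4 × 4 world-to-camera extrinsic convention is an assumption for illustration, and the 0.8 confidence threshold follows the embodiment described later in the text.

```python
# Sketch of back-projecting a filtered depth map into world-space points.
import numpy as np

def depth_to_points(depth: np.ndarray,      # (H, W) estimated depths
                    prob: np.ndarray,       # (H, W) max probability per pixel
                    K: np.ndarray,          # (3, 3) intrinsics
                    M: np.ndarray,          # (4, 4) world-to-camera extrinsics
                    prob_thresh: float = 0.8) -> np.ndarray:
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    mask = prob >= prob_thresh                                # photometric confidence filter
    p = np.stack([xs[mask], ys[mask], np.ones(mask.sum())])   # (3, n) homogeneous pixels
    cam = np.linalg.inv(K) @ p * depth[mask]                  # rays scaled by depth
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    world = np.linalg.inv(M) @ cam_h                          # camera -> world
    return world[:3].T                                        # (n, 3) point cloud
```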
Compared with the prior art, the invention has the following beneficial technical effects:
(1) The invention provides a multi-view three-dimensional reconstruction method for large-scale outdoor automatic driving scenes. The method learns the offsets of the convolution kernel direction vectors through a sub-network model, so that different convolution kernel shapes are used for different image regions and different receptive fields are obtained for regions of different texture complexity, making the method better suited to texture-complex regions. The method assigns different weights to the channels of the extracted feature maps, which strengthens the features important for matching, weakens noisy data and irrelevant features, and improves the accuracy and robustness of feature extraction. Aiming at outdoor automatic driving scenes in which complex structures, low-texture regions and texture-rich regions coexist, the invention proposes a new feature extraction module that overcomes the insufficient feature extraction in texture-complex regions and the sensitivity to noisy data of the prior art, thereby improving the reconstruction accuracy of the automatic driving scene model.
(2) The multi-view three-dimensional reconstruction method for automatic driving scenes slices the cost volume along the depth dimension and then feeds the slice sequence into a spatio-temporal recurrent neural network, so that the GPU memory occupied in the inference stage is independent of the assumed depth range d, and the network model can reconstruct large-scale outdoor automatic driving scenes with wide assumed depth ranges. The invention uses ST-LSTM to propagate the memory state in the horizontal and vertical directions, which reduces GPU memory usage while preserving the connections between slices, reduces the multiple probability peaks that occur in the prior art, and improves the accuracy of depth prediction.
(3) The invention provides an image data collection and processing scheme for automatic driving scenes, which presets the flight route and shooting points of the unmanned aerial vehicle by computing image overlap rates, and performs flight photography by combining an S-shaped route with figure-eight orbits, ensuring the quality of the collected multi-view images. In addition, for road sections that are blocked by trees when viewed from the air, the scheme also uses a handheld shooting device to acquire image data of near-ground scenes and groups the captured images, providing a new scheme for effectively and comprehensively acquiring and processing outdoor image data of automatic driving scenes.
(4) The invention provides an automatic driving digital twin scene construction system based on multi-view three-dimensional reconstruction, covering the entire process of automatic driving digital twin scene reconstruction and accuracy evaluation, and comprising a data acquisition and processing module, a camera pose estimation module, a multi-view three-dimensional reconstruction module, a point cloud model scale correction module, an air-ground point cloud model fusion module and a model accuracy quantitative evaluation module. The system can effectively and comprehensively collect image data of outdoor automatic driving scenes and estimate the positions and poses of the cameras that capture the images. Moreover, the system can reconstruct the three-dimensional automatic driving scene model with a smaller GPU memory footprint and a stronger capability for extracting features in texture-complex regions, improves the completeness of the scene model through air-ground point cloud model fusion, and also provides a quantitative evaluation method for assessing the accuracy of the reconstructed scene model.
Drawings
Fig. 1 is the deep neural network model structure of the multi-view three-dimensional reconstruction method for large-scale outdoor automatic driving scenes according to the invention.
Fig. 2 is a feature extraction module of fig. 1.
Fig. 3 is a flowchart of an automatic driving scene data collection method according to the present invention.
FIG. 4 is a schematic diagram of an automatic driving scene data collection method according to the present invention; in the figure, S202, S203, S204 are the corresponding operation steps in the embodiment.
Fig. 5 is a schematic diagram of processing and grouping of image data in steps S205 and S206 in the embodiment.
Fig. 6 is a block diagram of an autopilot digital twin scene building system according to the present invention.
Fig. 7 is a schematic diagram of the addition of new spatial points by triangulation in the camera pose estimation module M200 in the system.
Fig. 8 is a schematic diagram of a spatial triangle patch constructed in the system point cloud model scale correction module M400.
Fig. 9 is a model fusion effect diagram of the air-ground point cloud model fusion module M500 of the system.
Fig. 10 is a diagram of the quantitative evaluation result of the model accuracy quantitative evaluation module M600 of the system.
FIG. 11 is a scene effect map reconstructed by the autopilot digital twin scene construction system according to the present invention.
FIG. 12 is a scene effect map reconstructed by the autopilot digital twin scene construction system according to the present invention.
Detailed Description
The invention is further illustrated and described below with reference to specific embodiments. The described embodiments are merely exemplary and do not limit the scope of the invention. The technical features of the embodiments of the invention can be combined with one another provided they do not conflict.
The structure of the automatic driving digital twin scene construction system based on multi-view three-dimensional reconstruction provided by the invention is shown in Fig. 6. The system consists of six modules, from which the automatic driving digital twin scene is constructed:
the data acquisition and processing module (M100) is used for acquiring and preprocessing the multi-view images of the automatic driving scene and dividing the processed image data into a plurality of groups;
The camera pose estimation module M200: in this embodiment, the images obtained by the M100 module are input into this module, and the poses of the cameras that captured these images are obtained through the retrieval matching unit M201 and the incremental reconstruction unit M202. Specifically, this embodiment inputs the images into the retrieval matching unit, which first detects feature points in the image of each view and extracts their SIFT feature descriptors, then selects a set of potentially matching image pairs C = {(I_a, I_b) | I_a, I_b ∈ I, a < b} and computes the correspondence matrix M_ab ∈ F_a × F_b between the features of the two images. Finally, this embodiment estimates the fundamental matrix of the camera with the eight-point method, removes the correspondences judged to be outliers during fundamental matrix estimation with RANSAC, and constructs the scene graph. The scene graph is input into the incremental reconstruction unit; this embodiment selects the image with the largest number of registration points with respect to the currently registered images and registers the newly added view against the image set with already determined poses. Optionally, this embodiment solves the PnP problem with the DLT algorithm: 2n constraint equations are obtained from the world coordinates of n points and their pixel coordinates in the image, and the overdetermined system is solved by SVD to estimate the camera pose matrix. As shown in Fig. 7, according to the projections of the same spatial point in several images with different views, this embodiment searches for the unreconstructed spatial points covered by the newly registered image and adds them to the reconstructed spatial point set. Finally, this embodiment performs one round of bundle adjustment optimization on all estimated three-dimensional spatial points and camera poses. Specifically, the embodiment minimizes the re-projection error of Equation 1 by adjusting the camera poses and the estimated scene point positions:
$\min_{R,t,P} \sum_{i=1}^{m} \sum_{j=1}^{n} \lVert e_{ij} \rVert^{2}$    (1)
where P_i is the i-th three-dimensional point in space, m is the number of three-dimensional points in space, n is the number of cameras, R and t are the rotation and translation matrices of the n cameras, and e_ij = π(P_i, R_j, t_j) - p_ij is the projection error of the i-th three-dimensional point in the j-th camera.
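The following sketch expresses the re-projection residual of Equation 1 and hands it to a generic least-squares solver. The pinhole projection π(P, R, t) = dehomogenize(K(RP + t)), the rotation-vector parameterisation and the use of scipy's least_squares are illustrative assumptions; production SfM systems use dedicated bundle adjustment solvers.

```python
# Sketch of the re-projection error of Equation 1 minimised by least squares.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_cams, n_pts, K, observations):
    """observations: list of (cam_index, point_index, observed_xy)."""
    cam = params[: n_cams * 6].reshape(n_cams, 6)         # [rotvec | t] per camera
    pts = params[n_cams * 6:].reshape(n_pts, 3)           # 3D points P_i
    res = []
    for j, i, xy in observations:
        R = Rotation.from_rotvec(cam[j, :3]).as_matrix()
        p_cam = R @ pts[i] + cam[j, 3:]
        proj = K @ p_cam
        res.append(proj[:2] / proj[2] - xy)               # e_ij
    return np.concatenate(res)

def bundle_adjust(init_params, n_cams, n_pts, K, observations):
    return least_squares(reprojection_residuals, init_params,
                         args=(n_cams, n_pts, K, observations)).x
```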
Multi-view three-dimensional reconstruction module M300: this module constructs a network model, extracts the feature map sequence of the multi-view images with the network model, builds a cost volume in combination with the intrinsic and extrinsic camera parameter sequences, slices the cost volume along the depth dimension, processes the sliced cost volume to obtain a probability volume, estimates the depth map of each multi-view image from the probability volume, and finally fuses the depth maps into a three-dimensional dense point cloud of the scene;
point cloud model scale correction module M400: as shown in fig. 8, the method is used for constructing a triangle patch with a medium proportion in a virtual three-dimensional space by using three feature points obtained by processing in a module (M100) and side lengths of triangles formed by the three feature points as input parameters, finding positions of three corresponding feature points in a scene three-dimensional dense point cloud obtained by a multi-view three-dimensional reconstruction module (M300), registering the three feature points in a virtual point cloud model with the corresponding three points in the triangle patch simultaneously, and performing scale transformation on the three-dimensional dense point cloud; in this embodiment, three feature points obtained in the module M100 and the side lengths of the triangle formed by the feature points are used as input parameters to construct an equal-proportion triangle patch R in a virtual three-dimensional space 0 R 1 T 2 . Then, in the reconstructed three-dimensional point cloud model of the scene, the embodiment finds the positions of three corresponding feature points, registers the three feature points in the virtual point cloud model with the corresponding three points in the triangular patch at the same time, and performs scale transformation on the reconstructed three-dimensional point cloud model. Specifically, the embodiment expands the scale of the reconstructed three-dimensional point cloud model by 1.1 times.
Air-ground point cloud model fusion module M500: this module designates the three-dimensional dense point cloud reconstructed from the images acquired by the unmanned aerial vehicle in the data acquisition and processing module (M100) as the aerial point cloud model, and the three-dimensional dense point clouds reconstructed from the other image groups as ground point cloud models; the module registers the several ground point cloud models to the aerial point cloud model to form the final automatic driving digital twin scene model. In this embodiment, three feature points that also exist in the aerial point cloud model are found in each of the two ground point cloud models; these feature points are distributed at the intersections of the edges and vertical surfaces of the buildings. According to the matched feature points, this embodiment applies a translation matrix transformation and a rotation matrix transformation to each of the two ground point cloud models and fuses them into the aerial point cloud. The fusion result is shown in Fig. 9, where the darker portions (the buildings and the tree-blocked roads at the bottom of the figure) are the ground point cloud models before fusion.
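A minimal sketch of registering a ground point cloud to the aerial model from three matched feature points is given below, using the standard SVD (Kabsch) rigid alignment; this is one reasonable way to realise the translation-plus-rotation registration described above, not necessarily the patent's exact procedure.

```python
# Sketch: rigid alignment of a ground cloud to the aerial cloud from 3 matches.
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """src, dst: (3, 3) matched points (rows). Returns R (3x3) and t (3,)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def fuse_ground_into_aerial(ground_pts, ground_feats, aerial_feats):
    """Transform all ground points into the aerial model's frame."""
    R, t = rigid_transform(ground_feats, aerial_feats)
    return ground_pts @ R.T + t
```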
Model accuracy quantitative evaluation module M600: this module quantitatively evaluates the accuracy of the three-dimensional automatic driving scene model and verifies that it meets the requirements of subsequent automatic driving tasks. As shown in Fig. 10, according to the positions and distances of the 40 lane-related feature point pairs recorded by the M100 module, this embodiment finds the corresponding feature point pairs in the reconstructed three-dimensional point cloud model of the scene, measures their distances, and compares, for each point pair, the absolute error and percentage error between the distance in the virtual three-dimensional point cloud model and the distance in the real scene. The evaluation result of this embodiment is shown in Fig. 10: over the 40 feature point pairs of the scene, the maximum absolute error of the automatic driving digital twin scene point cloud model constructed in this embodiment is 6.08 cm, the mean absolute error is 2.5 cm, the maximum percentage error is 2.361% and the mean percentage error is 0.549%, where the percentage metric is the ratio of the error to the point-pair distance in the real scene. The automatic driving digital twin scene constructed in this embodiment is shown in Fig. 11 and Fig. 12.
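A sketch of the point-pair accuracy metric described above follows: absolute and percentage errors between surveyed distances and the corresponding distances in the reconstructed model. The array layout and the metre-to-centimetre conversion are assumptions for illustration.

```python
# Sketch of the quantitative accuracy evaluation over K feature point pairs.
import numpy as np

def evaluate_accuracy(real_pairs: np.ndarray,    # (K, 2, 3) surveyed endpoints in metres
                      model_pairs: np.ndarray):  # (K, 2, 3) same pairs in the model
    d_real = np.linalg.norm(real_pairs[:, 0] - real_pairs[:, 1], axis=1)
    d_model = np.linalg.norm(model_pairs[:, 0] - model_pairs[:, 1], axis=1)
    abs_err = np.abs(d_model - d_real)
    pct_err = 100.0 * abs_err / d_real           # error relative to the real distance
    return {"max_abs_cm": abs_err.max() * 100.0,
            "mean_abs_cm": abs_err.mean() * 100.0,
            "max_pct": pct_err.max(),
            "mean_pct": pct_err.mean()}
```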
According to the multi-view three-dimensional reconstruction method for large-scale outdoor automatic driving scenes provided by the first aspect of the invention, this embodiment comprises the following steps:
S1: the data acquisition and processing module (M100) performs omnidirectional perception of the automatic driving scene and processes the acquired multi-view image data;
S2: the acquired multi-view image data are input into the camera pose estimation module (M200), the position and pose of the camera that captured each image are estimated by retrieval matching and incremental reconstruction, and the intrinsic and extrinsic camera parameter sequence {P_i}, i = 1, ..., N, is obtained;
S3: the intrinsic and extrinsic camera parameter sequence {P_i} and the image data acquired by the data acquisition and processing module (M100) are input into the network model constructed by the multi-view three-dimensional reconstruction module (M300); the network model extracts the feature map sequence {F_i}, i = 1, ..., N, of the image sequence {I_i}, and, according to the camera parameter sequence {P_i} and the feature map sequence {F_i}, constructs the feature volumes and aggregates them into a cost volume; the cost volume is sliced along the depth dimension, and each slice is processed together with its two neighbouring slices by the network model to obtain a probability volume describing, for each pixel, the probability distribution over the different depths;
S4: the valid depth values of the ground-truth depth map are converted into a ground-truth volume by one-hot encoding, and this ground-truth volume is used as the label for supervised learning; the probability volume and the ground-truth volume are input into the original network model, and the trained network model is obtained through multiple rounds of training that minimize the cross-entropy loss between the probability volume and the ground-truth volume;
S5: for the input multi-view image sequence, each image is processed by the trained network model to obtain its probability volume, which is converted into a depth map; the depth map sequence is then filtered and fused to obtain the reconstructed three-dimensional dense point cloud of the scene;
S6: an equal-proportion triangular patch is constructed in virtual three-dimensional space by the point cloud model scale correction module (M400), the three feature points in the reconstructed three-dimensional dense point cloud are registered with the corresponding vertices of the triangular patch, and a scale transformation is applied to the three-dimensional dense point cloud of the scene;
S7: the three-dimensional dense point clouds are divided into an aerial point cloud model and ground point cloud models by the air-ground point cloud model fusion module (M500); the several ground point cloud models are registered to the aerial point cloud model to form the final automatic driving digital twin scene model; and the accuracy of the reconstructed three-dimensional automatic driving scene model is quantitatively evaluated to ensure that it meets the requirements of subsequent automatic driving tasks.
Step S3 in this embodiment is specifically as follows. The size of each input image is w = 1600 pixels wide and h = 1200 pixels high. As shown in Fig. 2, this embodiment applies three conventional convolution operations to each input image to obtain feature maps at three different sizes. Since the convolution kernel of a conventional convolution has a fixed size, the receptive field is the same everywhere, and many important features are ignored in texture-complex regions. Therefore, this embodiment learns the offsets of the convolution kernel direction vectors through a sub-network model to further extract the features of texture-complex regions. The extracted feature map X is then processed into a feature map X' with individual channel weights by Equation 2. Specifically, this embodiment upsamples the feature maps of different sizes to the resolution w × h by interpolation and concatenates them into a feature map with 32 channels; the two-dimensional features u_c(i, j) of each channel are then compressed into a one-dimensional real number z_c according to Equation 3; the 32 real numbers z_c are passed through a two-stage fully connected transform W_1, W_2 according to Equation 4; finally, a sigmoid function limits each real number to the range [0, 1], which further strengthens the features important for matching, weakens irrelevant features and improves the accuracy of feature extraction.
$X' = F_{scale}\big(F_{sq}\big[f(\mathrm{upsample}(X_C))\big],\ s_c\big)$    (2)
$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$    (3)
$s_c = F_{ex}(z, W) = \sigma\big(g(z, W)\big) = \sigma\big(W_2\,\delta(W_1 z)\big)$    (4)
A cost volume is then constructed from the input intrinsic and extrinsic camera parameter sequence {P_i} and the feature map sequence {F_i}. In this embodiment, one of the input feature maps is selected as the reference feature map and the remaining feature maps are regarded as source feature maps, with the assumed depth range set to d. The embodiment maps the N-1 source feature maps onto d parallel planes under the reference feature map through a differentiable homography transformation to obtain N-1 feature volumes {V_i}, i = 2, ..., N; these feature volumes encode, on the plane at depth d, the homography relation between the i-th feature map F_i and the reference feature map F_1. For details of the homography transformation, reference may be made to Yao Y. et al., "MVSNet: Depth inference for unstructured multi-view stereo", European Conference on Computer Vision (ECCV), Springer, Munich, 2018: 785-801. These feature volumes are aggregated into a cost volume C according to Equation 5, where V̄ is the average of the feature volume sequence:
$C = \frac{1}{N} \sum_{i=1}^{N} \left(V_i - \bar{V}\right)^{2}$    (5)
The resulting cost volume of size V = W·H·D·F is sliced into D pieces of size W·H·F, where W and H are the dimensions of the feature map, F is the number of feature channels and D is the depth prior. The sliced cost volume is not a set of isolated individuals; in this embodiment the slice sequence is treated as a temporal sequence and fed into a spatio-temporal recurrent neural network for regularization. The spatio-temporal recurrent network used in this implementation consists of several ST-LSTM memory cells arranged over three moments in the horizontal direction, representing the cost volume slices input at the previous, current and next moments; each moment consists of four layers in the vertical direction, each layer connected by one ST-LSTM memory cell, and the first-layer cell at the current moment receives the hidden state and memory state of the last-layer cell at the previous moment. For the specific structure of the ST-LSTM cell, reference may be made to Y. Wang et al., "PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning", IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2022.3165153. In this way, this embodiment establishes the connection between the slices of the cost volume, reduces the occurrence of multiple peaks in the resulting probability volume, and improves the accuracy of depth prediction. In addition, the network model processes only three slices at a time, which reduces GPU memory usage during model inference. Specifically, this embodiment uses the PyTorch deep learning framework and runs model inference on a graphics card of model NVIDIA GeForce GTX 3060 (6 GB of video memory) to implement the above procedure.
The valid depth values of the ground-truth depth map are converted into a ground-truth volume by one-hot encoding. The optimization objective used during model training in this embodiment is Equation 6, where G^(d)(p) and P^(d)(p) are the ground-truth and estimated values at depth hypothesis d and pixel p, respectively; since the depth values in the ground-truth depth map are incomplete, only the pixels with valid depth values in the depth map are used for supervision in this embodiment.
$Loss = \sum_{p} \sum_{d=0}^{D-1} -\,G^{(d)}(p)\,\log P^{(d)}(p)$    (6)
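A small sketch of building the one-hot ground-truth volume and of the masked cross-entropy loss of Equation 6 follows; the depth-to-bin discretisation details are illustrative assumptions.

```python
# Sketch: one-hot ground-truth volume and masked cross-entropy supervision.
import torch
import torch.nn.functional as F

def gt_volume(gt_depth, depth_min, depth_interval, D):
    """gt_depth: (B, H, W) metric depths, 0 where invalid.
    Returns a one-hot volume (B, D, H, W) and a validity mask (B, H, W)."""
    valid = gt_depth > 0
    idx = ((gt_depth - depth_min) / depth_interval).long().clamp(0, D - 1)
    onehot = F.one_hot(idx, D).permute(0, 3, 1, 2).float()
    return onehot * valid.unsqueeze(1), valid

def masked_cross_entropy(prob_volume, onehot, valid, eps=1e-8):
    """prob_volume: (B, D, H, W) softmax output; sums -G log P over valid pixels."""
    ce = -(onehot * torch.log(prob_volume + eps)).sum(dim=1)   # (B, H, W)
    return ce[valid].mean()
```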
The depth maps are then filtered and fused. This embodiment filters the depth maps according to the photometric consistency criterion and the geometric consistency criterion. Specifically, this embodiment regards a point whose probability value at the estimated depth is lower than 0.8 as an outlier, and requires a valid pixel to satisfy the re-projection consistency inequality described in Equation 7, which relates the re-projection p_reproj and its depth d_reproj to the reference pixel p_1 and its depth d_1, where p_i is the projection of the reference image pixel p_1 with depth value d_1 into the neighbouring view, d_i is the corresponding depth value at p_i, and p_reproj is the re-projection of the point p_i with depth value d_i back onto the reference image, with corresponding depth value d_reproj.
As shown in Fig. 3, the method for acquiring and processing data of an automatic driving scene according to the second aspect of the present invention includes the following steps:
S201, demarcating the range of the automatic driving scene to be reconstructed.
S202, presetting a data acquisition route, flying the unmanned aerial vehicle at a fixed flight height along the preset route, and shooting scene images at the shooting points.
S203, lowering the flight height of the unmanned aerial vehicle and shooting around the buildings in the scene along a figure-eight flight path.
S204, for buildings beside the road and road sections where the road is completely blocked by trees, acquiring data with a handheld shooting device in a surround-shooting manner.
S205, preprocessing all the collected image data.
S206, dividing the image data acquired and processed in the above steps into several groups.
S207, selecting feature points in the real scene covered by each group of images.
According to step S201, this embodiment selects an area of about forty thousand square meters in which consumer unmanned aerial vehicles are allowed to fly and which contains no large amount of highly reflective material, and chooses a midday period with sufficient lighting and little traffic for the unmanned aerial vehicle photography.
As shown in Fig. 4, according to step S202, this embodiment fixes the flight height of the unmanned aerial vehicle at 90 m, uses an onboard camera with a 35 mm-equivalent focal length of 24 mm and a frame size of 24 × 36 mm, sets both the heading overlap rate and the side overlap rate to 85%, and calculates the shooting point positions according to Equations 8 and 9:
$d_{fw} = (1 - fw_{overlap}) \times \frac{height \times frame_{w}}{focal}$    (8)
$d_{side} = (1 - side_{overlap}) \times \frac{height \times frame_{l}}{focal}$    (9)
where fw_overlap is the heading overlap rate, side_overlap is the side overlap rate, frame_w and frame_l are the width and height of the frame size, respectively, focal is the equivalent focal length, height is the flight height, and d_fw and d_side are the shooting-point spacing along the flight direction and the lateral spacing of the route, respectively. In this embodiment it is calculated that a shooting point is set every 13.5 meters along the flight direction of the unmanned aerial vehicle, and the lateral spacing of the S-shaped flight route is 20 meters. At each shooting point, this embodiment captures one image in each of five directions (forward oblique, backward oblique, left oblique, right oblique and straight down), collecting 380 valid images in total.
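A quick numeric check of Equations 8 and 9 under the stated parameters (90 m flight height, 24 mm equivalent focal length, 24 × 36 mm frame, 85% overlap) is given below; the helper function and its variable names are illustrative.

```python
# Verifying the shooting-point spacing implied by Equations 8 and 9.
def shot_spacing(height_m, focal_mm, frame_mm, overlap):
    """Ground distance between shots so that neighbouring frames overlap as required."""
    ground_footprint = height_m * frame_mm / focal_mm   # ground coverage of one frame
    return (1.0 - overlap) * ground_footprint

if __name__ == "__main__":
    along_track = shot_spacing(90.0, 24.0, 24.0, 0.85)   # -> 13.5 m between shots
    cross_track = shot_spacing(90.0, 24.0, 36.0, 0.85)   # -> 20.25 m between S-route legs
    print(along_track, cross_track)
```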
According to step S203, this embodiment lowers the flight height of the unmanned aerial vehicle to 35 meters and photographs the prominent buildings in the scene along the figure-eight orbit shown in Fig. 4, where each point of the figure-eight path is a shooting point; 307 valid images are collected in total.
According to step S204, this embodiment uses a handheld single-camera shooting device, in the surround-shooting manner of Fig. 4, to collect data of buildings beside the road and of road sections completely blocked by trees, collecting 181 valid images in total.
According to step S205, this embodiment batch-processes the image data: as shown on the left of Fig. 5, the most central area of each original image is retained, the image size is adjusted to 3000 pixels wide and 2250 pixels high, and the image is then downsampled to 1600 pixels wide and 1200 pixels high. According to step S206, the processed image data are divided into 3 groups, as shown on the right of Fig. 5: the images collected by the unmanned aerial vehicle form one group, the roads completely blocked by trees form one group, and the buildings form one group. According to step S207, after the above steps are completed and in order to meet the requirements of the automatic driving task, this embodiment selects 3 cuboid stone pier vertices and lane line corner points as feature points in the scene for the point cloud scale correction module of the system, and uniformly selects 40 feature point pairs in the scene for the quantitative assessment of the point cloud model accuracy by the system.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail, but they are not to be construed as limiting the scope of the invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.

Claims (9)

1. An automatic driving digital twin scene construction system based on multi-view three-dimensional reconstruction, characterized in that the automatic driving digital twin scene construction system comprises:
a data acquisition and processing module (M100) for acquiring and preprocessing multi-view images of the automatic driving scene and dividing the processed image data into a plurality of groups;
a camera pose estimation module (M200) for taking the collected multi-view images as input and outputting the position and pose of the camera that captured each image, thereby obtaining the intrinsic and extrinsic parameter sequence of the cameras;
a multi-view three-dimensional reconstruction module (M300) for constructing a network model, extracting a feature map sequence of the multi-view images through the network model, constructing a cost volume by combining the intrinsic and extrinsic parameter sequence of the cameras, slicing the cost volume along the depth dimension, processing the sliced cost volume to obtain a probability volume, estimating depth maps of the multi-view images from the probability volume, and finally fusing the depth maps to obtain a three-dimensional dense point cloud of the scene;
a point cloud model scale correction module (M400) for constructing a true-scale (equal-proportion) triangular patch in a virtual three-dimensional space, taking as input the three feature points obtained by the module (M100) and the side lengths of the triangle they form, finding the positions of the three corresponding feature points in the scene three-dimensional dense point cloud obtained by the multi-view three-dimensional reconstruction module (M300), simultaneously registering these three feature points in the point cloud model with the corresponding three vertices of the triangular patch, and performing a scale transformation on the three-dimensional dense point cloud;
an air-ground point cloud model fusion module (M500) for taking the three-dimensional dense point cloud reconstructed from the images acquired by the unmanned aerial vehicle in the data acquisition and processing module (M100) as the aerial point cloud model and the three-dimensional dense point clouds reconstructed from the other groups of images as ground point cloud models; the module registers the several ground point cloud models to the aerial point cloud model to form the final automatic driving digital twin scene model;
and a model precision quantitative evaluation module (M600) for quantitatively evaluating the precision of the three-dimensional model of the automatic driving scene and judging whether the precision of the three-dimensional model of the automatic driving scene meets the requirements of subsequent automatic driving tasks.
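As an illustration of how the scale correction performed by module (M400) might look in practice, the following minimal Python sketch estimates a metric scale factor from the three feature points and the millimetre-accurate side lengths recorded in claim 2, then rescales the dense point cloud. The registration of the triangular patch's pose is omitted, and all names are illustrative assumptions rather than the patent's implementation.

import numpy as np

def estimate_scale(cloud_pts, measured_lengths):
    """Estimate a metric scale factor for the reconstructed point cloud.

    cloud_pts:        (3, 3) array, the three feature points picked in the
                      reconstructed (up-to-scale) dense point cloud.
    measured_lengths: (3,) array, the millimetre-accurate side lengths of the
                      triangle formed by the same three points in the real
                      scene, ordered as |P0P1|, |P1P2|, |P2P0|.
    """
    d01 = np.linalg.norm(cloud_pts[0] - cloud_pts[1])
    d12 = np.linalg.norm(cloud_pts[1] - cloud_pts[2])
    d20 = np.linalg.norm(cloud_pts[2] - cloud_pts[0])
    cloud_lengths = np.array([d01, d12, d20])
    # Average the per-edge ratios so measurement noise on one edge is damped.
    return float(np.mean(measured_lengths / cloud_lengths))

def apply_scale(point_cloud, scale):
    """Apply the estimated scale factor to every point of the dense cloud."""
    return point_cloud * scale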
2. The automatic driving digital twin scene construction system according to claim 1, wherein the data acquisition and processing method of the data acquisition and processing module (M100) comprises the following steps:
S201, demarcating the range of the automatic driving scene to be reconstructed;
S202, presetting a data acquisition route, flying an unmanned aerial vehicle at a fixed flight height along the preset S-shaped route, and shooting scene images at the shooting points;
S203, lowering the flight height of the unmanned aerial vehicle and shooting around the buildings in the scene in a figure-eight circling flight pattern;
S204, for buildings beside the road and for road sections where the road is completely covered by trees, acquiring data with a handheld shooting device in a surround-shooting manner;
S205, preprocessing all collected image data: retaining the most central area of each image, adjusting the image size to 3000 pixels wide by 2250 pixels high, and then downsampling the image to 1600 pixels wide by 1200 pixels high;
S206, dividing the preprocessed image data into a plurality of groups, wherein the images acquired in step S202 and step S203 form one group as the first group of images, and the images photographed for each building or road section in step S204 are each grouped separately;
S207, selecting the three most distinctive feature points in the real scene covered by each group of images, and recording the positions of the feature points and the millimeter-accurate side lengths of the triangle they form.
3. The automatic driving digital twin scene construction system according to claim 1, wherein the camera pose estimation module (M200) comprises a search-and-match unit and an incremental reconstruction unit; the search-and-match unit takes the multi-view images as input, searches for geometrically verified image pairs with overlapping areas, and calculates the projections of the same spatial point onto the two images of each image pair; the incremental reconstruction unit is used for outputting the position and pose of the camera that captured each image.
4. The automatic driving digital twin scene construction system according to claim 3, wherein the specific process by which the incremental reconstruction unit outputs the position and pose of the camera that captured each image is: first, selecting and registering an initial image pair at a location where the multi-view images are dense; then selecting the image that shares the largest number of registered points with the currently registered images; registering the newly added view against the image set whose poses have been determined, and estimating the pose of the camera that captured the image with a PnP (Perspective-n-Point) solving algorithm; next, for the unreconstructed spatial points covered by the newly registered image, triangulating them and adding the new spatial points to the reconstructed spatial point set; and finally, performing one round of bundle adjustment optimization on all currently estimated three-dimensional spatial points and camera poses.
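A minimal sketch of the PnP registration and triangulation steps of the incremental reconstruction unit, using OpenCV, is given below. The initial-pair selection and the bundle adjustment step are omitted (the latter is typically delegated to a solver such as Ceres or a least-squares routine), and the function names and matrix conventions are assumptions for illustration.

import cv2
import numpy as np

def register_new_image(K, obj_pts, img_pts):
    """Estimate the pose of a newly added image from 2D-3D correspondences (PnP).

    K:       (3, 3) camera intrinsic matrix.
    obj_pts: (N, 3) already-reconstructed space points visible in the new image.
    img_pts: (N, 2) their pixel positions in the new image.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts.astype(np.float32), img_pts.astype(np.float32), K, None)
    if not ok:
        raise RuntimeError("PnP registration failed")
    R, _ = cv2.Rodrigues(rvec)                       # rotation vector -> matrix
    return R, tvec, inliers

def triangulate_new_points(K, R1, t1, R2, t2, pts1, pts2):
    """Triangulate the space points covered by the newly registered image."""
    P1 = K @ np.hstack([R1, t1.reshape(3, 1)])       # 3x4 projection matrices
    P2 = K @ np.hstack([R2, t2.reshape(3, 1)])
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float32),
                                pts2.T.astype(np.float32))
    return (X_h[:3] / X_h[3]).T                      # homogeneous -> Euclidean, (M, 3)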
5. A scene construction method of the automatic driving digital twin scene construction system according to claim 1, comprising the following steps:
S501: performing all-around perception of the automatic driving scene through the data acquisition and processing module (M100) and processing the acquired multi-view image data;
S502: inputting the acquired multi-view image data into the camera pose estimation module (M200), estimating the position and pose of the camera that captured each image by search-and-match and incremental reconstruction methods, and obtaining the camera intrinsic and extrinsic parameter sequence {P_i} (i = 1, ..., N);
S503: inputting the camera intrinsic and extrinsic parameter sequence {P_i} and the image data acquired by the data acquisition and processing module (M100) into the network model constructed by the multi-view three-dimensional reconstruction module (M300); extracting the feature map sequence {F_i} (i = 1, ..., N) of the image sequence {I_i} (i = 1, ..., N) with the network model, constructing a sequence of feature volumes according to the camera intrinsic and extrinsic parameter sequence {P_i} and the feature map sequence {F_i}, and aggregating them into a cost volume; slicing the cost volume along the depth dimension and processing each slice together with its preceding and following slices through the network model to obtain a probability volume describing the probability distribution of each pixel over different depths;
S504: converting the valid depth values in the ground-truth depth map into a ground-truth volume by one-hot encoding, the ground-truth volume serving as the label for supervised learning; feeding the probability volume and the ground-truth volume into the initial network model and training for multiple rounds so as to minimize the cross-entropy loss between the probability volume and the ground-truth volume, thereby obtaining the trained network model;
S505: processing each image of the input multi-view image sequence with the trained network model to obtain a probability volume and converting the probability volume into a depth map; then filtering and fusing the depth map sequence to obtain the reconstructed three-dimensional dense point cloud of the scene;
S506: constructing a true-scale (equal-proportion) triangular patch in a virtual three-dimensional space through the point cloud model scale correction module (M400), registering the three feature points in the reconstructed three-dimensional dense point cloud with the corresponding vertices of the triangular patch, and performing a scale transformation on the three-dimensional dense point cloud of the scene;
S507: dividing the three-dimensional dense point cloud into an aerial point cloud model and ground point cloud models through the air-ground point cloud model fusion module (M500); registering the several ground point cloud models to the aerial point cloud model to form the final automatic driving digital twin scene model; and quantitatively evaluating the precision of the reconstructed three-dimensional model of the automatic driving scene to ensure that the precision meets the requirements of subsequent automatic driving tasks.
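Step S504's one-hot encoding of the ground-truth depth map and the cross-entropy supervision could be sketched as below in PyTorch. The tensor shapes, the invalid-pixel convention (zero depth) and the function names are assumptions made for illustration, not the patent's implementation.

import torch
import torch.nn.functional as F

def depth_to_onehot(gt_depth, depth_values):
    """Convert a ground-truth depth map into a one-hot ground-truth volume.

    gt_depth:     (B, H, W) ground-truth depth map; zeros mark invalid pixels.
    depth_values: (D,) depth hypotheses sampled along the depth dimension.
    Returns a (B, D, H, W) volume with a single 1 at the nearest hypothesis.
    """
    B, H, W = gt_depth.shape
    D = depth_values.shape[0]
    # Index of the closest depth hypothesis for every pixel.
    idx = torch.argmin((gt_depth.unsqueeze(1) - depth_values.view(1, D, 1, 1)).abs(), dim=1)
    return F.one_hot(idx, num_classes=D).permute(0, 3, 1, 2).float()

def depth_classification_loss(prob_volume, gt_volume, valid_mask):
    """Cross-entropy between the predicted probability volume and the one-hot volume,
    averaged over valid pixels only."""
    ce = -(gt_volume * torch.log(prob_volume.clamp(min=1e-6))).sum(dim=1)  # (B, H, W)
    return (ce * valid_mask).sum() / valid_mask.sum().clamp(min=1)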
6. The scene construction method according to claim 5, wherein the feature map sequence {F_i} (i = 1, ..., N) in step S503 is obtained specifically as follows: first, offsets of the convolution kernel sampling directions are learned by the network model, so that the convolution kernel can adapt to regions with different texture structures and extract finer features; second, feature maps of different sizes are up-sampled to the original input image size and concatenated to form a 32-channel feature map; then, the two-dimensional information u_c(i, j) of each feature channel is compressed into a one-dimensional real number z_c, and two stages of full connection are applied; finally, a sigmoid function limits each real number z_c to the range [0, 1], so that each channel of the feature map carries a different weight, weakening noise data and irrelevant features in the matching process; the above steps are repeated for each input image to obtain the feature map sequence {F_i} (i = 1, ..., N).
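The channel re-weighting part of claim 6 (compressing u_c(i, j) to z_c, two fully connected stages, and a sigmoid limiting the weights to [0, 1]) corresponds closely to a squeeze-and-excitation block. A minimal PyTorch sketch is given below; the reduction ratio is chosen arbitrarily, and the deformable-offset learning and multi-scale up-sampling steps are left out.

import torch
import torch.nn as nn

class ChannelReweighting(nn.Module):
    """Squeeze-and-excitation style re-weighting of a 32-channel feature map."""

    def __init__(self, channels=32, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # u_c(i, j) -> z_c
        self.fc = nn.Sequential(                     # two-stage full connection
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # limit each weight to [0, 1]
        )

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # per-channel weighting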
7. The scene construction method according to claim 5, wherein the aggregation of the cost volume in step S503 is specifically: selecting one image of the feature map sequence as the reference feature map F_1, with the remaining feature maps serving as the source feature maps {F_i} (i = 2, ..., N); then, according to the camera intrinsic and extrinsic parameter sequence {P_i} (i = 1, ..., N), projecting all feature maps onto a plurality of parallel planes under the reference image through homography transformations to form N-1 feature volumes {V_i} (i = 2, ..., N); finally, aggregating the feature volumes into a cost volume through an aggregation function.
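A sketch of the plane-sweep homography warping and cost aggregation of claim 7 in PyTorch is shown below. Since the claim does not name the aggregation function, a variance over views (common in MVSNet-style pipelines) is used here as an assumption, and the projection-matrix convention and function names are likewise assumed.

import torch
import torch.nn.functional as F

def homography_warp(src_feat, src_proj, ref_proj, depth_values):
    """Warp a source feature map onto the fronto-parallel depth planes of the
    reference view (plane sweep).

    src_feat:     (B, C, H, W) source feature map F_i.
    src_proj:     (B, 4, 4) source projection matrix (intrinsics times extrinsics, padded to 4x4).
    ref_proj:     (B, 4, 4) reference projection matrix.
    depth_values: (B, D) depth hypotheses of the parallel planes.
    Returns a feature volume of shape (B, C, D, H, W).
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]
    proj = src_proj @ torch.inverse(ref_proj)         # reference -> source
    rot, trans = proj[:, :3, :3], proj[:, :3, 3:4]

    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    xyz = torch.stack([x.reshape(-1), y.reshape(-1), torch.ones(H * W)])      # (3, HW)
    xyz = xyz.unsqueeze(0).repeat(B, 1, 1).to(src_feat.device)                # (B, 3, HW)

    # Back-project reference pixels to every depth plane, then project into the source view.
    pts = rot @ xyz                                                           # (B, 3, HW)
    pts = pts.unsqueeze(2) * depth_values.view(B, 1, D, 1)                    # (B, 3, D, HW)
    pts = pts + trans.view(B, 3, 1, 1)
    xy = pts[:, :2] / pts[:, 2:3].clamp(min=1e-6)                             # (B, 2, D, HW)

    grid_x = 2 * xy[:, 0] / (W - 1) - 1                                       # normalise to [-1, 1]
    grid_y = 2 * xy[:, 1] / (H - 1) - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).view(B, D * H, W, 2)

    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(B, C, D, H, W)

def aggregate_cost(ref_feat, warped_srcs):
    """Aggregate the reference feature and the N-1 warped feature volumes into a
    cost volume using a variance-based aggregation function (assumed)."""
    B, C, D, H, W = warped_srcs[0].shape
    vols = [ref_feat.unsqueeze(2).expand(B, C, D, H, W)] + list(warped_srcs)
    stack = torch.stack(vols, dim=0)                  # (N, B, C, D, H, W)
    return stack.var(dim=0, unbiased=False)           # variance over the N views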
8. The scene construction method according to claim 5, wherein the formation of the probability volume in step S503 is specifically: slicing the cost volume into D slices, where D is the depth prior and the depth value can take any value from 0 to D; then treating the cost volume slice sequence as a temporal sequence and feeding it into the spatio-temporal recurrent neural network in the network model for regularization, the spatio-temporal recurrent neural network using ST-LSTM units to transfer memory states in the temporal (horizontal) direction and the spatial (vertical) direction, preserving the relations between slices and reducing multi-peak situations in the probability volume; in the horizontal direction, the first-layer unit at a given moment receives the hidden state and memory state of the last-layer unit at the previous moment and passes them layer by layer in the vertical direction; finally, a softmax normalization operation outputs the probability of each pixel at depth d ∈ [0, D] to form the probability volume.
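The slice-wise regularization and softmax normalization of claim 8 can be illustrated with the following stand-in module. The ST-LSTM cell itself is not reproduced here (a shared 2D convolution is used as a placeholder), so this sketch only mirrors the data flow from cost-volume slices to a probability volume.

import torch
import torch.nn as nn

class SliceRegularizer(nn.Module):
    """Stand-in for the spatio-temporal recurrent regularization of claim 8:
    the cost volume is processed slice by slice along the depth dimension and
    the per-slice scores are normalised into a probability volume with softmax."""

    def __init__(self, in_channels=32):
        super().__init__()
        self.per_slice = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, cost_volume):                  # (B, C, D, H, W)
        B, C, D, H, W = cost_volume.shape
        scores = []
        for d in range(D):                           # walk the slices like a time sequence
            scores.append(self.per_slice(cost_volume[:, :, d]))   # (B, 1, H, W)
        scores = torch.cat(scores, dim=1)            # (B, D, H, W)
        # Probability of each pixel lying at each depth hypothesis d.
        return torch.softmax(scores, dim=1)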
9. The scene construction method according to claim 5, wherein step S505 is specifically: performing an argmax operation on the probability volume obtained by inference of the trained network model on each image to obtain a depth map sequence, filtering out low-confidence depth estimates based on the photometric consistency criterion and the geometric consistency criterion, and finally fusing the depth maps into a three-dimensional dense point cloud through the formula P = d·M⁻¹·K⁻¹·p, where p is the pixel coordinate, d is the depth value inferred by the network model, K and M are the camera intrinsic and extrinsic matrices, and P is the three-dimensional coordinate in the world coordinate system.
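The back-projection formula P = d·M⁻¹·K⁻¹·p of claim 9 can be applied per depth map as in the sketch below. The photometric/geometric filtering and the merging of points from different views are assumed to happen elsewhere, and the extrinsic matrix M is assumed to map world coordinates to camera coordinates; the function name is illustrative.

import numpy as np

def depth_map_to_points(depth, K, M, valid_mask=None):
    """Back-project a filtered depth map into world-space points with
    P = d * M^-1 * K^-1 * p.

    depth: (H, W) depth map inferred by the network.
    K:     (3, 3) camera intrinsic matrix.
    M:     (4, 4) camera extrinsic matrix (world -> camera, assumed).
    """
    H, W = depth.shape
    if valid_mask is None:
        valid_mask = depth > 0
    v, u = np.nonzero(valid_mask)
    d = depth[v, u]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)  # p, homogeneous pixels
    cam = d * (np.linalg.inv(K) @ pix)                                   # camera-frame points
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])                 # homogeneous
    world = np.linalg.inv(M) @ cam_h                                     # M^-1 maps camera -> world
    return world[:3].T                                                   # (N, 3) point cloud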
CN202310123079.6A 2023-02-16 2023-02-16 Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction Active CN116258817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310123079.6A CN116258817B (en) 2023-02-16 2023-02-16 Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction

Publications (2)

Publication Number Publication Date
CN116258817A (en) 2023-06-13
CN116258817B (en) 2024-01-30

Family

ID=86682139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310123079.6A Active CN116258817B (en) 2023-02-16 2023-02-16 Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction

Country Status (1)

Country Link
CN (1) CN116258817B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210110599A1 (en) * 2018-03-05 2021-04-15 Tsinghua University Depth camera-based three-dimensional reconstruction method and apparatus, device, and storage medium
US20210215481A1 (en) * 2018-11-09 2021-07-15 Wuyi University Method for measuring antenna downtilt angle based on multi-scale deep semantic segmentation network
CN110349250A (en) * 2019-06-28 2019-10-18 浙江大学 A kind of three-dimensional rebuilding method of the indoor dynamic scene based on RGBD camera
US20220414911A1 (en) * 2020-03-04 2022-12-29 Huawei Technologies Co., Ltd. Three-dimensional reconstruction method and three-dimensional reconstruction apparatus
US20210358206A1 (en) * 2020-05-14 2021-11-18 Star Institute Of Intelligent Systems Unmanned aerial vehicle navigation map construction system and method based on three-dimensional image reconstruction technology
WO2022165876A1 (en) * 2021-02-06 2022-08-11 湖南大学 Wgan-based unsupervised multi-view three-dimensional point cloud joint registration method
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
CN114155353A (en) * 2021-12-03 2022-03-08 武汉工程大学 Point cloud three-dimensional reconstruction method and device based on liquid crystal micro-lens array
CN114359509A (en) * 2021-12-03 2022-04-15 三峡大学 Multi-view natural scene reconstruction method based on deep learning
CN114527676A (en) * 2022-01-13 2022-05-24 浙江大学 Automatic driving networking multi-vehicle testing method and system based on digital twinning
CN115205489A (en) * 2022-06-06 2022-10-18 广州中思人工智能科技有限公司 Three-dimensional reconstruction method, system and device in large scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WANG SHENGBI et al.: "Three Dimensional Light Detection and Ranging Decoder Design", 2021 3rd International Conference on Computer Communication and the Internet (ICCCI) *
WU QIANQIAN et al.: "Three-dimensional depth image reconstruction algorithm based on Gaussian process regression and Markov random fields", Applied Laser, no. 06
SONG SHIDE et al.: "Denoising method for three-dimensional plant point clouds obtained by multi-view stereo matching", Journal of Computer Applications, no. 2

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036448A (en) * 2023-10-10 2023-11-10 深圳纷来智能有限公司 Scene construction method and system of multi-view camera
CN117036448B (en) * 2023-10-10 2024-04-02 深圳纷来智能有限公司 Scene construction method and system of multi-view camera
CN117333627A (en) * 2023-12-01 2024-01-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene
CN117333627B (en) * 2023-12-01 2024-04-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene
CN117690095A (en) * 2024-02-03 2024-03-12 成都坤舆空间科技有限公司 Intelligent community management system based on three-dimensional scene
CN117690095B (en) * 2024-02-03 2024-05-03 成都坤舆空间科技有限公司 Intelligent community management system based on three-dimensional scene
CN117830538A (en) * 2024-03-05 2024-04-05 南京中网卫星通信股份有限公司 Multi-view stereo matching three-dimensional reconstruction method based on trans-dimension converter

Also Published As

Publication number Publication date
CN116258817B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN116258817B (en) Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN112435325B (en) VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN109034018B (en) Low-altitude small unmanned aerial vehicle obstacle sensing method based on binocular vision
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
US20210390329A1 (en) Image processing method, device, movable platform, unmanned aerial vehicle, and storage medium
CN110675418B (en) Target track optimization method based on DS evidence theory
CN107506711B (en) Convolutional neural network-based binocular vision barrier detection system and method
CN108648161B (en) Binocular vision obstacle detection system and method of asymmetric kernel convolution neural network
CN111242041B (en) Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN113936139A (en) Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
CN110197505B (en) Remote sensing image binocular stereo matching method based on depth network and semantic information
CN115032651A (en) Target detection method based on fusion of laser radar and machine vision
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN111998862B (en) BNN-based dense binocular SLAM method
CN107560592A (en) A kind of precision ranging method for optronic tracker linkage target
CN113256699B (en) Image processing method, image processing device, computer equipment and storage medium
CN117058646B (en) Complex road target detection method based on multi-mode fusion aerial view
CN111914615A (en) Fire-fighting area passability analysis system based on stereoscopic vision
CN114359503A (en) Oblique photography modeling method based on unmanned aerial vehicle
CN115115859A (en) Long linear engineering construction progress intelligent identification and analysis method based on unmanned aerial vehicle aerial photography
CN117409339A (en) Unmanned aerial vehicle crop state visual identification method for air-ground coordination
CN108830890B (en) Method for estimating scene geometric information from single image by using generative countermeasure network
CN111950524B (en) Orchard local sparse mapping method and system based on binocular vision and RTK

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant