CN117456136A - Digital twin scene intelligent generation method based on multi-mode visual recognition - Google Patents

Digital twin scene intelligent generation method based on multi-mode visual recognition

Info

Publication number
CN117456136A
Authority
CN
China
Prior art keywords
scene
target
dynamic
information
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311466377.1A
Other languages
Chinese (zh)
Inventor
周旺
黎涛
雷奇奇
孔凡伟
郑月玲
莫洪源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dataojo Technology Co ltd
Original Assignee
Beijing Dataojo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dataojo Technology Co ltd filed Critical Beijing Dataojo Technology Co ltd
Priority to CN202311466377.1A
Publication of CN117456136A
Legal status: Pending


Classifications

    • G06T 19/00 — Manipulating 3D models or images for computer graphics
    • G06F 16/29 — Information retrieval; database structures for structured data; geographical information databases
    • G06T 5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06V 20/176 — Scenes; terrestrial scenes; urban or other man-made structures
    • G06V 20/64 — Scenes; type of objects; three-dimensional objects
    • G06T 2207/20221 — Indexing scheme for image analysis; image combination; image fusion, image merging
    • G06T 2210/61 — Indexing scheme for image generation or computer graphics; scene description

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Graphics (AREA)
  • Remote Sensing (AREA)
  • Data Mining & Analysis (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a digital twin scene intelligent generation method based on multi-modal visual recognition. An EM-ICP-based dynamic object tracking algorithm realizes real-time modeling and pose tracking of dynamic objects, and the dynamic objects are fused with static scene information in real time to obtain the three-dimensional spatial information of the complete target scene. By fusing monocular and multi-view visual information, an implicit dynamic neural radiance field model is established, and rapid three-dimensional modeling and dynamic updating of physical entities in the digital twin scene are achieved through semantic analysis of visual and geometric data. The motion and changes of objects in the scene can be accurately computed, and a highly accurate, highly realistic dynamic three-dimensional scene supporting real-time tracking and historical playback is synthesized.

Description

Digital twin scene intelligent generation method based on multi-mode visual recognition
Technical Field
The application relates to the technical field of intelligent generation of digital twin scenes based on multi-modal visual recognition, in particular to an intelligent generation method of digital twin scenes based on multi-modal visual recognition.
Background
Traditional digital twin base modeling mostly relies on oblique photography, semi-automatic modeling from GIS data, and manual modeling with professional modeling software. These approaches require considerable labor cost and long time periods, make low-cost and rapid updating of real scenes difficult, cannot provide real-time tracking of dynamic targets in the scene, and cannot meet the digital twin application requirements of precise mapping, synchronous growth and full life-cycle management, which restricts the deep application of digital twins in fields such as smart cities, public safety, intelligent manufacturing, national defense and military industry, and architectural design.
Disclosure of Invention
The application provides a digital twin scene intelligent generation method based on multi-modal visual recognition, which aims to solve the problems that the prior art makes low-cost and rapid updating of real scenes difficult, cannot provide real-time tracking of dynamic targets in the scene, and cannot meet the digital twin application requirements of precise mapping, synchronous growth and full life-cycle management.
A digital twin scene intelligent generation method based on multi-modal visual recognition, the method comprising:
collecting target scene data and GPS information of the target scene, rapidly stitching the digital twin three-dimensional scene of the target, introducing a local map of joint visual-geographic information by means of a third-party geographic map, generating local orthographic images, and fusing the local orthographic images in real time to obtain a global orthographic image of the target;
extracting information of different levels in the target global orthographic image through a hierarchical attention mechanism network, while extracting local features and global context information of the target global orthographic image;
extracting a three-dimensional point cloud of the target scene from multi-view images, estimating point cloud plane parameters with the RANSAC algorithm, extracting edge features of the building point cloud of the target scene, and introducing determination conditions during 3D feature-line extraction to obtain effective feature lines and a building surface model with definite semantics;
reconstructing static objects and dynamic targets of the target scene through fusion of multi-view information of the target scene and regression of pose parameters;
and acquiring static scene data in the target scene through a motion mask, realizing three-dimensional reconstruction of the static space through a neural radiance field model, realizing real-time modeling and pose tracking of dynamic objects through an EM-ICP-based dynamic object tracking algorithm, and fusing the dynamic objects with the static scene information in real time to obtain the three-dimensional spatial information of the complete target scene.
In the above scheme, optionally, the collecting of target scene data and target scene GPS information, rapid stitching of the digital twin three-dimensional scene of the target, introduction of a local map of joint visual-geographic information by means of a third-party geographic map, generation of local orthographic images, and real-time fusion of the local orthographic images to obtain a target global orthographic image includes:
acquiring three-dimensional scene data of the target city with equipment such as oblique photography and a three-dimensional laser scanner to obtain single-image or video data;
using a pose estimation module based on geographic information to realize rapid stitching of the digital twin three-dimensional scene on the basis of GPS information, wherein the pose estimation module comprises system initialization, frame pose estimation and key frame screening, and the key frame screening is realized through loop detection;
using the established geographic map, introducing a local map of joint visual-geographic information, generating local orthographic images in an image stitching module based on orthographic preservation, and screening out unnecessarily quality-degraded key frames through image quality judgment; and fusing the local orthographic images entering the stitching module in real time and incrementally to obtain the complete target global orthographic image.
In the above solution, optionally, the extracting of information of different levels in the target global orthographic image through a hierarchical attention mechanism network, while extracting local features and global context information of the target global orthographic image, specifically includes:
for three-dimensional scene semantic recognition and segmentation based on a hierarchical attention mechanism, dividing the urban scene image into a plurality of small blocks for processing, and performing feature extraction and joint processing on each small block;
extracting the overall information inside each small block with an attention mechanism module, applying attention weighting to the feature map in both the channel and spatial dimensions using a global pooling operation, completing the spatial attention weighting with fully connected operations, and treating the rows, columns and channels of the feature map equally with a completely symmetrical operation;
expanding each stripe weighting result to the size of the original feature map, adding the expanded feature maps together, and multiplying the result with the original feature map to complete the weighting of the original feature map;
using the bottleneck structure in ResNet as a convolution block to extract local features of the target, performing convolution, BatchNorm normalization and ReLU activation in each convolution block, weighting the image with the self-attention mechanism of a Transformer, dividing the original feature map into small patches, expanding the patches into one-dimensional tokens, adding a class embedding for the classification task, and obtaining the semantic recognition and segmentation result.
In the above solution, optionally, the extracting of the three-dimensional point cloud of the target scene from multi-view images, estimating point cloud plane parameters with the RANSAC algorithm, extracting edge features of the building point cloud of the target scene, introducing determination conditions during 3D feature-line extraction to obtain effective feature lines, and obtaining a building surface model with definite semantics includes:
segmenting and recognizing the digital twin scene with a semantic segmentation algorithm based on a convolutional neural network;
extracting three-dimensional point cloud information from the multi-view images, and estimating point cloud plane parameters with the RANSAC algorithm; extracting 3D feature lines from the point cloud plane parameters and the intersection line segments formed between two intersecting planes, and introducing determination conditions during 3D feature-line extraction to obtain effective feature lines;
and introducing a geometry-driven boundary optimization algorithm to generate a closed building surface model from the extracted 3D feature lines, which reduces the storage cost of fine-grained semantics and improves the degree of freedom of semantic representation.
In the above solution, optionally, the introducing of a geometry-driven boundary optimization algorithm to generate a closed building surface model from the extracted 3D feature lines, reducing the storage cost of fine-grained semantics and improving the degree of freedom of semantic representation, specifically includes:
combining the generated 3D feature lines into polygonal boundaries to delimit the boundary of the scattered point cloud data;
and simplifying the boundaries and fitting the simplified boundaries to obtain the closed building surface model.
In the above scheme, optionally, the reconstructing of the static objects and dynamic targets of the target scene through fusion of multi-view information of the target scene and regression of pose parameters includes:
performing multi-view static object reconstruction of the static objects in the target scene and multi-view dynamic target reconstruction of the dynamic targets; performing mixed training on independently constructed multi-view vehicle and personnel dynamic target data together with various open-source data sets; designing and constructing a multi-view, end-to-end three-dimensional reconstruction network for vehicle and personnel poses; and, combined with the constructed multi-view vehicle and personnel data sets, realizing high-precision dynamic three-dimensional parameterized vehicle and personnel model reconstruction through the training of a purely visual depth model. When the multi-view three-dimensional vehicle and personnel model reconstruction is generated, multi-view video frames are taken as input, coarse vehicle and personnel parameters are predicted from multi-layer grid features of the images, and the vehicle or personnel parameter output is optimized by iteratively correcting the alignment between the vertices and features of the vehicle and personnel model.
In the above solution, optionally, the multi-view static object reconstruction of the static objects in the target scene includes: using the image semantic analysis result as a prior to guide the construction of geometric relations between entity targets of different types; realizing the matching of corresponding points between images at different angles by extracting robust and efficient local descriptors from the images; introducing a feature-level domain adaptation loss to penalize inconsistency between the high-level feature distributions of different images, and compensating descriptor inconsistency at pixel-level key points with a pixel-level cross-domain consistency loss; and meanwhile, adopting a triplet loss and the cross-domain consistency loss for descriptor supervision, so as to ensure good discriminative ability of the descriptors.
In the above solution, optionally, the multi-view dynamic target reconstruction includes: adopting two-stage regression of vehicle and personnel dynamic model parameters, rapidly initializing the constraint range of the dynamic target parameters with coarse-grained features, and iteratively refining the vehicle and personnel dynamic target parameter model with fine-grained features; and, through master-slave view coupled training, nonlinearly coupling the master-view and slave-view vehicle and personnel dynamic target parameters to improve the robustness of the supervision data of the master-view image; and, through rapid coarse parameter prediction of the multi-view dynamic targets, constraining and refining the master-view dynamic target parameters with multiple slave views, thereby realizing prediction for complex poses and occluded vehicles and personnel.
In the above scheme, optionally, the acquiring of static scene data in the target scene through a motion mask, realizing three-dimensional reconstruction of the static space through a neural radiance field model, realizing real-time modeling and pose tracking of dynamic objects through an EM-ICP-based dynamic object tracking algorithm, and fusing the dynamic objects with the static scene information in real time to obtain the three-dimensional spatial information of the complete target scene is specifically:
generating binary motion masks with a Mask R-CNN method and an optical flow method, and superposing the video data of multiple views to obtain the final motion mask, wherein the motion mask is used to exclude dynamic objects from the video;
with the dynamic objects excluded, constructing a neural radiance field model of the static space from the RGBD images of multiple views; specifically, projecting all images into a scene-centered coordinate system, performing depth estimation with the RGBD information to generate a point cloud set, and extracting and segmenting features with algorithms such as PointNet to obtain a point cloud map of the static scene;
realizing real-time modeling and pose tracking of dynamic objects with an EM-ICP-based dynamic object tracking algorithm, which builds a model from the previous trajectory and point cloud data, matches the model with the target point cloud in the current frame, optimizes the matching function through iterations of the EM algorithm, and tracks the pose between adjacent frames with the ICP algorithm to obtain accurate pose information of the current object;
and fusing the result of the static-space three-dimensional reconstruction with the result of the real-time modeling and pose tracking of the dynamic objects to obtain the three-dimensional spatial information of the complete target scene.
Compared with the prior art, the application has the following beneficial effects:
Based on further analysis and research of the problems of the prior art, the application recognizes that the prior art makes low-cost and rapid updating of real scenes difficult, cannot provide real-time tracking of dynamic targets in the scene, and cannot meet the digital twin application requirements of precise mapping, synchronous growth and full life-cycle management.
According to the invention, target scene data and target scene GPS information are collected, the digital twin three-dimensional scene of the target is rapidly stitched, a local map of joint visual-geographic information is introduced by means of a third-party geographic map, local orthographic images are generated, and the local orthographic images are fused in real time to obtain a target global orthographic image; information of different levels in the target global orthographic image is extracted through a hierarchical attention mechanism network, while local features and global context information of the target global orthographic image are extracted; the three-dimensional point cloud of the target scene is extracted from multi-view images, point cloud plane parameters are estimated with the RANSAC algorithm, edge features of the building point cloud of the target scene are extracted, and determination conditions are introduced during 3D feature-line extraction to obtain effective feature lines and a building surface model with definite semantics; static objects and dynamic targets of the target scene are reconstructed through fusion of multi-view information of the target scene and regression of pose parameters; static scene data in the target scene are acquired through a motion mask, and three-dimensional reconstruction of the static space is realized through a neural radiance field model. On this basis, real-time modeling and pose tracking of dynamic objects are realized through an EM-ICP-based dynamic object tracking algorithm, and the dynamic objects are fused with the static scene information in real time to obtain the three-dimensional spatial information of the complete target scene. By fusing monocular and multi-view visual information, an implicit dynamic neural radiance field model is established, and rapid three-dimensional modeling and dynamic updating of physical entities in the digital twin scene are achieved through semantic analysis of visual and geometric data. On this basis, RGB and depth sensor information is used, a suitable network structure and deep learning algorithm are designed, the motion and changes of objects in the scene are accurately computed, and a highly accurate, highly realistic dynamic three-dimensional scene supporting real-time tracking and historical playback is synthesized.
Drawings
Fig. 1 is a schematic flow chart of a digital twin scene intelligent generation method based on multi-modal visual recognition according to an embodiment of the present application;
FIG. 2 is a schematic illustration of performing scene orthographic imaging based on geographic information according to an embodiment of the present application;
FIG. 3 is a flow chart of three-dimensional scene semantic recognition and segmentation based on hierarchical attention mechanisms according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a training process of a three-dimensional vehicle and personnel reconstruction neural network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a neural network training process for optimizing parameters of vehicles and personnel according to an embodiment of the present application;
Fig. 6 is a diagram of a dynamic neural radiance field representation method based on the separation of dynamic and static scenes according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a digital twin scene intelligent generation method based on multi-modal visual recognition, including the steps of:
Collecting target scene data and GPS information of the target scene, rapidly stitching the digital twin three-dimensional scene of the target, introducing a local map of joint visual-geographic information by means of a third-party geographic map, generating local orthographic images, and fusing the local orthographic images in real time to obtain a global orthographic image of the target. In this embodiment, equipment such as an unmanned aerial vehicle is first used to collect urban scene data to obtain single-image or video data. The purpose of this step is to obtain the raw material to be stitched into the digital twin three-dimensional scene.
A pose estimation module based on geographic information is then used to realize rapid stitching of the digital twin three-dimensional scene on the basis of GPS information, providing the necessary data support for the subsequent steps.
The pose estimation module comprises system initialization, frame pose estimation and key frame screening, and the key frame screening is realized through loop detection. The main purpose of this step is to track the images and screen the key frames, so as to determine the key frames that need to be added to the map.
A local map of joint visual-geographic information is introduced on the basis of the established geographic map to increase the efficiency of exploring surrounding areas. The goal of this step is to combine visual and geographic information, improve exploration efficiency, and further optimize the geographic map.
In the image stitching module based on orthographic preservation, local orthographic images are first generated, and unnecessarily quality-degraded key frames are screened out through image quality judgment. The purpose of this step is to ensure that the image data input to the subsequent steps are complete and of high quality.
The local orthographic images entering the stitching module are fused in real time and incrementally to finally obtain a complete global orthographic image. The purpose of this step is to fuse the individual local orthographic images into a global orthographic image for later use.
The sequence of this embodiment is as follows: first, the raw material is obtained by collecting urban scene data; second, a pose estimation module based on geographic information realizes rapid stitching of the digital twin three-dimensional scene and provides the necessary data support for the subsequent steps; then, image tracking and key frame screening are performed in the pose estimation module to determine the key frames that need to be added to the map; next, a local map of joint visual-geographic information is introduced using the established geographic map, improving exploration efficiency and optimizing the geographic map; then, local orthographic images are generated, and unnecessarily quality-degraded key frames are screened out through image quality judgment; finally, the local orthographic images entering the stitching module are fused in real time and incrementally to construct the global orthographic image.
In summary, urban scene data are collected with equipment such as unmanned aerial vehicles, and the digital twin three-dimensional scene is rapidly stitched to obtain a complete global orthographic image. Achieving this objective depends on the orderly execution of the above steps.
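The following is a minimal sketch of the incremental fusion step described above, assuming that each local orthographic image carries a per-pixel quality weight map (consistent with the BGRA-plus-weight representation described in the detailed embodiment later); the function names and the simple weighted-average blend are illustrative assumptions, not the patented implementation.
```python
import numpy as np

def fuse_local_ortho(canvas, canvas_w, local, local_w, offset):
    """Blend one local orthographic image into the global canvas.

    canvas   : (H, W, 4) float32 global orthomosaic (e.g. BGRA)
    canvas_w : (H, W)    float32 accumulated weights
    local    : (h, w, 4) float32 local orthographic image
    local_w  : (h, w)    float32 per-pixel quality weights
    offset   : (row, col) placement of the local image in the canvas
    """
    r, c = offset
    h, w = local_w.shape
    # Incremental update: accumulate weighted colours and weights.
    canvas[r:r+h, c:c+w] += local * local_w[..., None]
    canvas_w[r:r+h, c:c+w] += local_w
    return canvas, canvas_w

def render_global(canvas, canvas_w):
    """Normalise the accumulated canvas into the final global ortho image."""
    w = np.maximum(canvas_w, 1e-6)[..., None]
    return canvas / w
```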
Extracting information of different levels in the target global orthographic image through a hierarchical attention mechanism network, and simultaneously extracting local features and global context information of the target global orthographic image;
the embodiment is three-dimensional scene semantic recognition and segmentation based on a hierarchical attention mechanism. The following are the purposes of each step and the bearing relationship of the previous and subsequent steps:
data preprocessing: the urban scene image is divided into a plurality of small blocks for processing, and preparation is made for feature extraction.
Feature extraction: and extracting the whole information in each small block by using a hierarchical attention mechanism module, and carrying out attention weighting on the feature map from two aspects of a channel and a space by adopting global pooling operation so as to avoid the influence of scaling on the accuracy of the image information. Meanwhile, the space attention weighting is completed by using the full connection operation, so that the defect that the convolution ignores global information is avoided. Finally, the weighting result of each strip (row, column and channel) is expanded to the size of the original characteristic diagram, and added with the original characteristic diagram to finish the weighting operation of the original characteristic diagram.
Extracting local features: the bottleneck structure in ResNet is used as a Convolution Block (CB) to extract local features.
Global context information extraction: the image is weighted by a transform self-attention mechanism, the original feature map is divided into small tiles (patches), expanded into one-dimensional labels, and category embedding is added for classification tasks. Then, the position of each mark in the original image is embedded by using the position, and the full connection operation is performed on the marks to obtain keys (k), queries (q) and values (v). Then, the k and q are transposed multiplied and normalized to obtain the correlation between the labels, and the correlation is multiplied by value to obtain the final output.
The three-dimensional point cloud of the target scene is extracted from multi-view images, point cloud plane parameters are estimated with the RANSAC algorithm, edge features of the building point cloud of the target scene are extracted, and determination conditions are introduced during 3D feature-line extraction to obtain effective feature lines and a building surface model with definite semantics. The overall aim of the previous embodiment is to realize recognition and segmentation of urban scene semantics: the hierarchical attention mechanism network effectively extracts information of different levels in the urban scene and thereby realizes semantic recognition of the scene, while local feature extraction and global context information extraction allow important information to be extracted in a more targeted way, enhancing the judgment ability and accuracy of the model.
The purpose of each step of this embodiment, the relationships between the steps, and the overall goal are as follows.
The purpose of digital twin scene segmentation and recognition is to perform semantic segmentation of the digital twin scene and provide basic data for subsequent processing. The result of this step is a two-dimensional pixel label map. Extracting the three-dimensional point cloud from the multi-view images requires this pixel label map, so the two steps are sequentially dependent.
The purpose of extracting the three-dimensional point cloud from the multi-view images is to extract three-dimensional point cloud information of the buildings from the multi-view images and to estimate plane parameters with the RANSAC algorithm to obtain the edge features of the building point cloud. The result of this step is a point cloud model containing the building point cloud edge features, which serves as the input for 3D feature-line extraction. The preceding digital twin scene segmentation and recognition step provides the pixel label map and semantic segmentation result used when extracting the three-dimensional point cloud from the multi-view images.
The purpose of 3D feature-line extraction is to extract 3D feature lines with definite start and end points from the point cloud plane parameters and the intersection line segments they form, so as to obtain smooth boundary information of the buildings. The result of this step is the extracted 3D feature lines, which provide the input for the subsequent boundary optimization algorithm.
The boundary optimization algorithm aims to generate a closed building surface model from the extracted 3D feature lines, reducing the storage cost of fine-grained semantics and improving the degree of freedom of semantic representation. The result of this step is an optimized building surface model.
In general, the objective of this embodiment is to implement lightweight, fine-grained semantic modeling of the digital twin three-dimensional scene, including the steps of digital twin scene segmentation and recognition, three-dimensional point cloud extraction from multi-view images, 3D feature-line extraction, and boundary optimization. Through these steps, a building surface model with definite semantics is obtained, improving data storage efficiency and the ability to extract fine-grained scene information.
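A simplified sketch, under stated assumptions, of two operations named in this embodiment: RANSAC plane-parameter estimation on a building point cloud and the intersection line of two fitted planes used as a candidate 3D feature line. Thresholds, iteration counts and helper names (ransac_plane, plane_intersection) are illustrative, not taken from the patent.
```python
import numpy as np

def ransac_plane(points, iters=500, thresh=0.05, rng=None):
    """Fit a plane (n, d) with n.p + d = 0 to an (N, 3) point cloud."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers, best_plane = 0, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                              # degenerate sample
        n = n / norm
        d = -n @ p0
        inliers = np.abs(points @ n + d) < thresh
        if inliers.sum() > best_inliers:
            best_inliers, best_plane = inliers.sum(), (n, d)
    return best_plane

def plane_intersection(plane_a, plane_b):
    """Return (point, direction) of the line where two non-parallel planes meet."""
    (na, da), (nb, db) = plane_a, plane_b
    direction = np.cross(na, nb)
    # Solve for one point on the line: on plane a, on plane b, and with
    # zero component along the line direction.
    A = np.stack([na, nb, direction])
    point = np.linalg.solve(A, np.array([-da, -db, 0.0]))
    return point, direction / np.linalg.norm(direction)
```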
Static objects and dynamic targets of the target scene are reconstructed through fusion of multi-view information of the target scene and regression of pose parameters. The purpose of the individual steps of this embodiment is as follows:
Multi-view static object reconstruction: the image semantic analysis result guides the construction of geometric relations between entity targets of different types. Efficient local descriptors are used to match corresponding points of the images at different angles in order to reconstruct the geometric information of the static objects. Meanwhile, a feature-level domain adaptation loss and a pixel-level cross-domain consistency loss are introduced to enhance the robustness and consistency of the descriptors.
Multi-view dynamic target reconstruction: for dynamic targets, a two-stage parameter regression method is adopted. First, the constraint range of the dynamic target parameters is rapidly initialized from coarse-grained features; then, the vehicle and personnel dynamic target parameter models are iteratively refined with fine-grained features, yielding more accurate dynamic target pose information.
Neural network training process: multi-view vehicle and personnel dynamic target data are mixed with other open-source data sets for training, improving the performance of the network on multi-view three-dimensional vehicle and personnel pose reconstruction. A multi-view, end-to-end three-dimensional reconstruction network for vehicle and personnel poses is designed and constructed, and high-precision dynamic three-dimensional parameterized vehicle and personnel model reconstruction is realized through the training of a purely visual depth model.
The steps build on one another; each step refines the reconstruction result produced by the previous one. Specifically:
Multi-view static object reconstruction provides the geometric information of the static objects as a prior for multi-view dynamic target reconstruction and provides more accurate image correspondence matching for the subsequent steps.
Multi-view dynamic target reconstruction depends on the result of the multi-view static object reconstruction and obtains accurate pose information of the dynamic targets through pose parameter regression.
The neural network training process is carried out on the basis of the two preceding steps; training the network improves the ability to reconstruct multi-view three-dimensional vehicle and personnel poses and achieves a more accurate dynamic three-dimensional model reconstruction result.
Overall, this embodiment aims to realize accurate reconstruction of static objects and dynamic targets through fusion of multi-view information and pose parameter regression. The sequential operation of these steps improves the accuracy and completeness of the reconstruction, realizing more accurate three-dimensional object modeling.
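A conceptual sketch of the two-stage, coarse-to-fine pose-parameter regression described above, assuming a PyTorch-style model. The module names (coarse_head, refine_head), the parameter dimension and the fixed number of refinement iterations are illustrative assumptions, not the patented network.
```python
import torch
import torch.nn as nn

class TwoStagePoseRegressor(nn.Module):
    def __init__(self, feat_dim=256, param_dim=85, iters=3):
        super().__init__()
        # param_dim: size of the vehicle/person parameter vector (illustrative).
        self.coarse_head = nn.Linear(feat_dim, param_dim)            # stage 1: init
        self.refine_head = nn.Linear(feat_dim + param_dim, param_dim)
        self.iters = iters

    def forward(self, coarse_feat, fine_feat):
        # Stage 1: coarse features give an initial estimate that constrains
        # the search range for the refinement stage.
        params = self.coarse_head(coarse_feat)
        # Stage 2: iteratively predict residual corrections from
        # fine-grained features conditioned on the current estimate.
        for _ in range(self.iters):
            params = params + self.refine_head(
                torch.cat([fine_feat, params], dim=-1))
        return params

# Usage: per-frame coarse/fine image features of a tracked vehicle or person.
model = TwoStagePoseRegressor()
coarse = torch.randn(4, 256)   # e.g. pooled low-resolution grid features
fine = torch.randn(4, 256)     # e.g. features sampled near projected vertices
pose_params = model(coarse, fine)
```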
Static scene data in the target scene are acquired through a motion mask, three-dimensional reconstruction of the static space is realized through a neural radiance field model, real-time modeling and pose tracking of dynamic objects are realized through an EM-ICP-based dynamic object tracking algorithm, and the dynamic objects are fused with the static scene information in real time to obtain the three-dimensional spatial information of the complete target scene. The purpose of each step in this embodiment and the relationships between the steps are as follows:
The purpose of motion mask acquisition is to exclude dynamic regions from the video. To ensure the accuracy of the three-dimensional reconstruction of the static space, dynamic objects must first be excluded. The generation of the motion mask involves the Mask R-CNN and optical flow methods, whose results are combined to obtain the final motion mask.
The purpose of the static-space three-dimensional reconstruction is to generate a three-dimensional model of the entire scene. It mainly uses RGBD images of multiple views to construct a neural radiance field model of the static space and thereby obtains a point cloud set of the scene. The point cloud data can be further used in fields such as 3D modeling and virtual reality.
The purpose of real-time modeling and pose tracking of dynamic objects is to model and track the dynamic objects in real time. Since the motion mask has already separated the dynamic objects from the static scene, this step models and tracks only the dynamic objects. An EM-ICP-based dynamic object tracking algorithm iteratively optimizes the matching function in real time, and the ICP algorithm tracks the pose between adjacent frames. Through these steps, accurate pose information of the current object is obtained.
The purpose of dynamic and static space fusion is to combine the result of the static-space three-dimensional reconstruction with the result of the real-time modeling and pose tracking of the dynamic objects. The dynamic object point cloud data are converted into object representations in the neural radiance field model and fused with the neural radiance field model of the static space to obtain the complete three-dimensional spatial information.
The relationships between the steps of this embodiment are as follows:
Motion mask acquisition: the result of this step is a binary motion mask that excludes dynamic regions from the video.
Static-space three-dimensional reconstruction: this step uses the motion mask to exclude dynamic objects, so that the three-dimensional reconstruction is performed only for the static space.
Real-time modeling and pose tracking of dynamic objects: this step models and tracks only the dynamic objects, excluding the influence of the static space.
Dynamic and static space fusion: this step fuses the dynamic objects with the static scene information to obtain the complete three-dimensional spatial information.
The overall purpose of the steps of this embodiment is to realize real-time modeling and pose tracking of dynamic objects while ensuring accurate three-dimensional reconstruction of the static scene, and to fuse the dynamic objects with the static scene information to obtain complete three-dimensional spatial information. The method can be applied in fields such as 3D modeling and virtual reality.
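A minimal sketch of how the binary motion mask might be assembled from an instance-segmentation mask (e.g. a Mask R-CNN output) and an optical-flow magnitude map, and then combined over several views. The flow threshold, the AND/OR combination rules and the assumption that the per-view masks have already been warped into a common reference view are all illustrative assumptions.
```python
import numpy as np

def motion_mask(instance_mask, flow, flow_thresh=1.0):
    """instance_mask: (H, W) bool for movable classes; flow: (H, W, 2)."""
    moving = np.linalg.norm(flow, axis=-1) > flow_thresh
    # A pixel is treated as dynamic if it belongs to a movable instance
    # AND exhibits significant optical-flow motion.
    return instance_mask & moving

def fuse_views(aligned_masks):
    """Superpose per-view masks (already aligned): dynamic if dynamic anywhere."""
    fused = np.zeros_like(aligned_masks[0], dtype=bool)
    for m in aligned_masks:
        fused |= m
    return fused
```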
In this embodiment, the collecting of target scene data and target scene GPS information, rapid stitching of the digital twin three-dimensional scene of the target, introduction of a local map of joint visual-geographic information by means of a third-party geographic map, generation of local orthographic images, and real-time fusion of the local orthographic images to obtain a global orthographic image of the target includes:
acquiring three-dimensional scene data of the target city with equipment such as oblique photography and a three-dimensional laser scanner to obtain a plurality of single images or point cloud data, and thereby single-image or video data; using a pose estimation module based on geographic information to realize rapid stitching of the digital twin three-dimensional scene on the basis of GPS information, wherein the pose estimation module comprises system initialization, frame pose estimation and key frame screening, and the key frame screening is realized through loop detection;
using the established geographic map, introducing a local map of joint visual-geographic information, generating local orthographic images in an image stitching module based on orthographic preservation, and screening out unnecessarily quality-degraded key frames through image quality judgment; and fusing the local orthographic images entering the stitching module in real time and incrementally to obtain the complete target global orthographic image.
In this embodiment, the extracting of information of different levels in the target global orthographic image through a hierarchical attention mechanism network, while extracting local features and global context information of the target global orthographic image, specifically includes:
for three-dimensional scene semantic recognition and segmentation based on a hierarchical attention mechanism, dividing the urban scene image into a plurality of small blocks for processing, and performing feature extraction and joint processing on each small block;
extracting the overall information inside each small block with an attention mechanism module, applying attention weighting to the feature map in both the channel and spatial dimensions using a global pooling operation, completing the spatial attention weighting with fully connected operations, and treating the rows, columns and channels of the feature map equally with a completely symmetrical operation;
expanding each stripe weighting result to the size of the original feature map, adding the expanded feature maps together, and multiplying the result with the original feature map to complete the weighting of the original feature map;
using the bottleneck structure in ResNet as a convolution block to extract local features of the target, performing convolution, BatchNorm normalization and ReLU activation in each convolution block, weighting the image with the self-attention mechanism of a Transformer, dividing the original feature map into small patches, expanding the patches into one-dimensional tokens, adding a class embedding for the classification task, and obtaining the semantic recognition and segmentation result.
In this embodiment, the extracting of the three-dimensional point cloud of the target scene from multi-view images, estimating point cloud plane parameters with the RANSAC algorithm, extracting edge features of the building point cloud of the target scene, introducing determination conditions during 3D feature-line extraction to obtain effective feature lines, and obtaining a building surface model with definite semantics includes:
segmenting and recognizing the digital twin scene with a semantic segmentation algorithm based on a convolutional neural network;
extracting three-dimensional point cloud information from the multi-view images, and estimating point cloud plane parameters with the RANSAC algorithm; extracting 3D feature lines from the point cloud plane parameters and the intersection line segments formed between two intersecting planes, and introducing determination conditions during 3D feature-line extraction to obtain effective feature lines;
and introducing a geometry-driven boundary optimization algorithm to generate a closed building surface model from the extracted 3D feature lines, which reduces the storage cost of fine-grained semantics and improves the degree of freedom of semantic representation.
In this embodiment, the introducing of a geometry-driven boundary optimization algorithm to generate a closed building surface model from the extracted 3D feature lines, reducing the storage cost of fine-grained semantics and improving the degree of freedom of semantic representation, specifically includes:
combining the generated 3D feature lines into polygonal boundaries to delimit the boundary of the scattered point cloud data;
and simplifying the boundaries and fitting the simplified boundaries to obtain the closed building surface model.
In this embodiment, the reconstructing of the static objects and dynamic targets of the target scene through fusion of multi-view information of the target scene and regression of pose parameters includes:
performing multi-view static object reconstruction of the static objects in the target scene and multi-view dynamic target reconstruction of the dynamic targets; performing mixed training on independently constructed multi-view vehicle and personnel dynamic target data together with various open-source data sets; designing and constructing a multi-view, end-to-end three-dimensional reconstruction network for vehicle and personnel poses; and, combined with the constructed multi-view vehicle and personnel data sets, realizing high-precision dynamic three-dimensional parameterized vehicle and personnel model reconstruction through the training of a purely visual depth model. When the multi-view three-dimensional vehicle and personnel model reconstruction is generated, multi-view video frames are taken as input, coarse vehicle and personnel parameters are predicted from multi-layer grid features of the images, and the vehicle or personnel parameter output is optimized by iteratively correcting the alignment between the vertices and features of the vehicle and personnel model.
In this embodiment, the multi-view static object reconstruction of the static objects in the target scene includes: using the image semantic analysis result as a prior to guide the construction of geometric relations between entity targets of different types; realizing the matching of corresponding points between images at different angles by extracting robust and efficient local descriptors from the images; introducing a feature-level domain adaptation loss to penalize inconsistency between the high-level feature distributions of different images, and compensating descriptor inconsistency at pixel-level key points with a pixel-level cross-domain consistency loss; and meanwhile, adopting a triplet loss and the cross-domain consistency loss for descriptor supervision, so as to ensure good discriminative ability of the descriptors.
In this embodiment, the multi-view dynamic target reconstruction includes: adopting two-stage regression of vehicle and personnel dynamic model parameters, rapidly initializing the constraint range of the dynamic target parameters with coarse-grained features, and iteratively refining the vehicle and personnel dynamic target parameter model with fine-grained features; and, through master-slave view coupled training, nonlinearly coupling the master-view and slave-view vehicle and personnel dynamic target parameters to improve the robustness of the supervision data of the master-view image; and, through rapid coarse parameter prediction of the multi-view dynamic targets, constraining and refining the master-view dynamic target parameters with multiple slave views, thereby realizing prediction for complex poses and occluded vehicles and personnel.
In this embodiment, the acquiring of static scene data in the target scene through a motion mask, realizing three-dimensional reconstruction of the static space through a neural radiance field model, realizing real-time modeling and pose tracking of dynamic objects through an EM-ICP-based dynamic object tracking algorithm, and fusing the dynamic objects with the static scene information in real time to obtain the three-dimensional spatial information of the complete target scene is specifically as follows:
generating binary motion masks with a Mask R-CNN method and an optical flow method, and superposing the video data of multiple views to obtain the final motion mask, wherein the motion mask is used to exclude dynamic objects from the video;
with the dynamic objects excluded, constructing a neural radiance field model of the static space from the RGBD images of multiple views; specifically, projecting all images into a scene-centered coordinate system, performing depth estimation with the RGBD information to generate a point cloud set, and extracting and segmenting features with algorithms such as PointNet to obtain a point cloud map of the static scene;
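A sketch, under stated assumptions, of the back-projection step described above: every RGBD pixel is lifted through the camera intrinsics and transformed into a scene-centered world frame, producing the static point cloud that is later segmented (e.g. with PointNet). The pinhole model and variable names are illustrative.
```python
import numpy as np

def backproject_rgbd(depth, rgb, K, T_world_cam):
    """depth: (H, W) metres; rgb: (H, W, 3); K: 3x3 intrinsics; T_world_cam: 4x4."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0
    pix = np.stack([u.reshape(-1), v.reshape(-1), np.ones(H * W)], axis=0)
    cam = np.linalg.inv(K) @ pix * z            # rays scaled by depth
    cam_h = np.vstack([cam, np.ones(H * W)])    # homogeneous camera points
    world = (T_world_cam @ cam_h)[:3].T         # scene-centered coordinates
    colors = rgb.reshape(-1, 3)
    return world[valid], colors[valid]
```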
realizing real-time modeling and pose tracking of dynamic objects with an EM-ICP-based dynamic object tracking algorithm, which builds a model from the previous trajectory and point cloud data, matches the model with the target point cloud in the current frame, optimizes the matching function through iterations of the EM algorithm, and tracks the pose between adjacent frames with the ICP algorithm to obtain accurate pose information of the current object;
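A compact sketch of an EM-ICP style alignment, illustrating the tracking step above: the E-step assigns soft correspondences between the object model and the current-frame point cloud, and the M-step solves a rigid (Procrustes) update against the expected targets. The fixed iteration count and the bandwidth sigma are assumptions; the patent does not specify them.
```python
import numpy as np

def em_icp(model_pts, frame_pts, iters=20, sigma=0.1):
    """Estimate R, t aligning model_pts (M, 3) to frame_pts (N, 3)."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = model_pts @ R.T + t
        # E-step: soft assignment weights between every model/frame point pair.
        d2 = ((moved[:, None, :] - frame_pts[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True) + 1e-12
        # Expected target for each model point under the soft assignment.
        targets = w @ frame_pts
        # M-step: closed-form rigid update (Procrustes) against the targets.
        mu_m, mu_t = model_pts.mean(0), targets.mean(0)
        H = (model_pts - mu_m).T @ (targets - mu_t)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = mu_t - R @ mu_m
    return R, t
```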
and fusing the result of the static-space three-dimensional reconstruction with the result of the real-time modeling and pose tracking of the dynamic objects to obtain the three-dimensional spatial information of the complete target scene.
In one embodiment, a method for efficiently collecting cross-view scene information and fusing multi-modal information is provided, including:
Rapid stitching of the digital twin three-dimensional scene based on multi-view information: to achieve effective acquisition of urban scene data, an unmanned aerial vehicle can be used for data collection. However, since the unmanned aerial vehicle can only capture single images or video data, a large-scale scene map cannot be formed directly. To solve this problem, the invention combines GPS information to realize rapid stitched construction of urban scene orthographic images; the specific technical process is shown in fig. 2.
This embodiment uses two key modules to achieve this goal. The first is a pose estimation module based on geographic information, which adopts a structure similar to the front end and back end of a monocular SLAM system to ensure real-time performance. The second is an image stitching module based on orthographic preservation, which is responsible for generating real-time orthographic images. In a SLAM system, the loop detection module is an important component that can significantly reduce previously accumulated error by detecting and optimizing loops in the motion trajectory. However, in unmanned aerial vehicle navigation it is usually desirable to cover the largest area in the shortest time, so the actual flight trajectory approximates a zigzag path, making loops difficult to detect. Therefore, to keep the system compact, the loop detection module is omitted.
The pose estimation module based on geographic information is divided into a geographic-information-based tracking module and a geographic-information-based map module. The geographic information tracking module receives the incremental image sequence and the corresponding GPS information; its aim is to calculate the pose of each frame in real time and to screen out the key frames. The module mainly comprises three parts: system initialization, frame pose estimation and key frame screening. Each image input to the system is processed as a frame, and each frame contains the corresponding GPS information and local feature information of the image. The optional local features in the system are ORB features or GPU-based RootSIFT features. To speed up feature matching, the invention clusters the ORB and RootSIFT features with k-means++ and trains a bag of words with a TF-IDF index. During feature extraction, the extracted features are homogenized to avoid the problem of feature point clustering. The tracking module can be divided into an initialization part, a tracking part and a relocalization part. Two consecutive images with a high degree of overlap and a moderate baseline length are used for the preliminary reconstruction, and the world coordinate system is determined. The similarity transformation from world coordinates to geographic coordinates is then solved using the GPS information attached to the images.
The input to the geographic-information-based map module is the key frames from the tracking module; its aim is to update and optimize an overall geographic map in real time, including the camera poses in the current system, the locations of the map points, and the similarity transformation from the world coordinate system to the geographic coordinate system. The invention introduces a local map of joint visual-geographic information to increase the efficiency of exploring surrounding areas; this local map retains the efficiency of GPS-based local map search while taking accuracy into account.
For the image stitching module based on orthographic preservation, local orthographic image generation and multi-band-fusion-based image stitching must be performed. The generation of the orthographic image may be pose-based or point-cloud-based, and the inputs include the key frames from the pose estimation module, the corresponding poses and sparse point cloud, and the corresponding GPS information. The output of the module is local orthographic images, which are fused in the stitching module to finally obtain the complete global orthographic image. Each key frame is first projected onto the plane to be projected, typically a plane of fixed elevation parallel to the ground plane; the specific elevation is obtained by the pose estimation module, and this plane is called the ground plane. The similarity transformation from the camera coordinate system of a key frame to the geographic coordinate system consists of a scale s, a rotation matrix R and a translation vector t. The camera intrinsic matrix is K, a point in the camera coordinate system is denoted p_c, the corresponding point in the geographic coordinate system is denoted p_g, and its homogeneous coordinates in the image coordinate system are denoted p_i. The normal vector of the ground plane in the geographic coordinate system is n, and a point on the plane is p_t. The geographic coordinates corresponding to the four corners of the image can therefore be calculated from the following two equations:
p_g = s_0 · sRK^{-1} p_i + t;
n^T (p_g - p_t) = 0;
where the second equation, the ground-plane constraint, determines the depth factor s_0. Before the projective transformation, a simple judgment of key frame quality is required to prevent unnecessary quality degradation. The stitching module fuses the local orthographic images entering the module into an overall mosaic in real time and incrementally. Each local orthographic image entering the module corresponds to a four-channel visual picture in BGRA format and a corresponding single-channel weight picture. To improve the storage and loading efficiency of the map, the weight picture and the four-channel visual picture are split into tiles, which are then updated tile by tile. Finally, smooth stitching is achieved while more detail is retained.
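A sketch implementing the two relations above: an image corner p_i is back-projected through K^{-1}, mapped by the similarity transform (s, R, t) into geographic coordinates, and the depth factor s_0 is chosen so that the result lies on the ground plane defined by normal n and point p_t. Function and variable names are illustrative.
```python
import numpy as np

def corner_to_ground(p_i, K, s, R, t, n, p_t):
    """p_i: homogeneous pixel (3,); returns the geographic point on the ground plane."""
    ray = s * R @ np.linalg.inv(K) @ p_i          # s * R * K^-1 * p_i
    s0 = n @ (p_t - t) / (n @ ray)                # from n.(p_g - p_t) = 0
    return s0 * ray + t                           # p_g = s0 * s R K^-1 p_i + t

def footprint(w, h, K, s, R, t, n, p_t):
    """Project the four image corners of a (w x h) key frame onto the ground plane."""
    corners = [np.array([0, 0, 1.0]), np.array([w, 0, 1.0]),
               np.array([w, h, 1.0]), np.array([0, h, 1.0])]
    return [corner_to_ground(c, K, s, R, t, n, p_t) for c in corners]
```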
Three-dimensional scene semantic recognition and segmentation based on a hierarchical attention mechanism: this embodiment adopts a hierarchical attention mechanism network to realize recognition and segmentation of the semantics in the urban scene.
The urban scene contains information at many different levels, so a multi-level approach is needed to extract it effectively. Because the overall scene information is huge, scaling the image would affect the accuracy of the image information. To solve this problem, the image is divided into a plurality of patches, which are processed block-wise and jointly.
To extract the overall information inside each patch, the invention adopts an attention mechanism module. This module is plug-and-play and has a small number of parameters. Since attention weighting needs to be performed in both the channel and spatial dimensions, a global pooling operation is used, which greatly reduces the number of parameters. In addition, the invention uses fully connected layers instead of convolution to complete the spatial attention weighting, avoiding the drawback that convolution ignores global information. Meanwhile, a completely symmetrical operation is adopted, treating the rows, columns and channels of the feature map equally. The specific calculation process is as follows:
T_F = σ(T_WF + T_HF + T_CF);
where σ denotes a nonlinear activation function (the sigmoid function), and T_WF, T_HF and T_CF denote the three different stripes (width, height and channel). For example, the invention globally pools the rows and channels of the feature map to eliminate their effect on the column information when weighting the columns. The column vector is then weighted using fully connected layers. To reduce the parameter overhead, the hidden activation size is set to 1/r of the stripe length, where r is the reduction ratio:
T_HF = BN(MLP(ReLU(BN_1(MLP(s_H))))) = BN(W_1(ReLU(BN_1(W_0 s_H + b_0))) + b_1);
wherein,the above-described operations relating to columns are equally applicable to rows and channels in the process of weighted attention on the rows, columns and channels. The weighted result of each stripe is extended to the size of its original feature map. Since initially averaging pooling is employed, the expanded results are orders of magnitude identical to the original signature. Finally, the feature images obtained by expansion are added and then multiplied by the original feature images, so that the weighting operation on the original feature images is completed. When using full concatenation for attention weighting, the present invention uses the same compression and expansion methods as SE, and the number of parameters can be greatly reduced.
To extract local features, this embodiment uses the bottleneck structure from ResNet as the convolution block (Convolution Block, CB). A 1×1 convolution first reduces the feature map to 1/2 of its original channel size, two 3×3 convolutions follow, and a final 1×1 convolution restores the original size. In this way, local feature information is extracted more efficiently while the number of parameters is greatly reduced. In each convolution block, every convolution is followed by BatchNorm normalization and then a ReLU activation, so the output of the block is already normalized and can be passed directly to the next operation.
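A minimal PyTorch sketch of such a bottleneck convolution block is shown below; the channel counts and names are illustrative assumptions.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Sketch of the bottleneck convolution block (CB): 1x1 halves the channels,
    two 3x3 convolutions follow, and a final 1x1 restores the channel count;
    each convolution is followed by BatchNorm and ReLU."""

    def __init__(self, channels):
        super().__init__()
        mid = channels // 2

        def conv_bn_relu(cin, cout, k):
            return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        self.block = nn.Sequential(conv_bn_relu(channels, mid, 1),
                                   conv_bn_relu(mid, mid, 3),
                                   conv_bn_relu(mid, mid, 3),
                                   conv_bn_relu(mid, channels, 1))

    def forward(self, x):
        return self.block(x)
```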
To extract global context information in the urban scene, the invention weights the images using the Transformer's self-attention mechanism. First, the original feature map is divided into small blocks, i.e. patches; each patch is flattened into a one-dimensional token, and a class embedding is added for the classification task. In addition, position embeddings are used to mark the position of each token in the original image. In the self-attention model, these tokens are first passed through a fully connected layer along the feature dimension and then split into three tokens of the same size, named key (k), query (q) and value (v). From the point of view of the attention mechanism, this is in fact a preliminary channel-wise attention weighting of the tokens. Next, k and q are multiplied (with a transpose) and the result is normalized, yielding the correlation between tokens, which is then multiplied by v as the final output. The essence of the self-attention mechanism is to replace the query with a weighted value; the whole process is shown in fig. 3.
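The core of this step can be sketched in a few lines of PyTorch; this is a single-head version with assumed dimensions and names, not the exact network of the patent.

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    """Minimal single-head self-attention over patch tokens."""

    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3)   # fully connect, then split into q, k, v
        self.scale = dim ** -0.5

    def forward(self, tokens):                  # tokens: (B, N, dim), incl. class token
        q, k, v = self.to_qkv(tokens).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                         # tokens replaced by attention-weighted values
```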
Light-weight fine-grained semantic modeling of the digital twin three-dimensional scene: the aim of this embodiment is to realize light-weight, fine-grained semantic modeling of the digital twin three-dimensional scene. After the digital twin three-dimensional scene is segmented and recognized, a three-dimensional point cloud is extracted from multi-view images and a vector building model is generated. By combining the multi-view images with a geometric boundary optimization algorithm, a fine-grained vector semantic representation can be generated, enhancing the semantic richness of the building model. Compared with a point-cloud-based representation, the vector model significantly reduces storage space and is better suited to practical applications.
In this embodiment, the RANSAC algorithm is used to estimate the parameters of point cloud planes in the digital twin city scene, and the edge features of the building point cloud are then extracted from these plane parameters. Buildings are typically composed of several planar faces, so the intersections between planes are critical for describing building boundaries. The intersection segment formed between two intersecting planes is defined as a 3D feature line with definite start and end points. Since a building usually contains multiple planes, not every pair of planes yields a valid 3D feature line; to obtain valid results, certain conditions are introduced when solving for the 3D feature lines.
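A simple sketch of how two RANSAC plane estimates can be intersected into a candidate 3D feature line is given below, with a minimum-angle check as one example of the validity conditions mentioned above. The plane parameterization (n·x + d = 0), threshold and names are assumptions for illustration.

```python
import numpy as np

def plane_intersection_line(n1, d1, n2, d2, min_angle_deg=10.0):
    """Intersect two planes n.x + d = 0 into a candidate 3D feature line.

    Returns (point_on_line, unit_direction) or None if the planes are
    nearly parallel and therefore do not form a valid feature line.
    """
    direction = np.cross(n1, n2)
    norm = np.linalg.norm(direction)
    sin_angle = norm / (np.linalg.norm(n1) * np.linalg.norm(n2))
    if np.degrees(np.arcsin(np.clip(sin_angle, 0.0, 1.0))) < min_angle_deg:
        return None
    # Point on the line: satisfy both plane equations and be closest to the origin.
    A = np.stack([n1, n2, direction])
    b = np.array([-d1, -d2, 0.0])
    point = np.linalg.solve(A, b)
    return point, direction / norm
```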
For the three-dimensional point cloud obtained through multi-view reconstruction, the flight height and shooting angle of the unmanned aerial vehicle mean that the building roof occupies most of each image, while data on the building facades are relatively scarce. As a result, the reconstructed facade point clouds may be very sparse or even missing. This lack of geometric information generally makes the 3D feature lines of the roof edges impossible to compute reliably.
To solve these problems, the invention combines the information of the multi-view images with the point cloud plane parameters to extract the three-dimensional feature lines, obtaining smooth building boundary information and removing noise introduced during reconstruction. Meanwhile, a geometry-driven boundary optimization algorithm is designed to generate a closed building surface model, which reduces the storage cost of the fine-grained semantics and increases the freedom of the semantic representation.
Semantic modeling of the three-dimensional scene and fine modeling of three-dimensional objects: this embodiment studies multi-view dense depth estimation and uses the image semantic analysis result as a prior to guide the construction of geometric relations between different types of entity targets, so as to realize efficient three-dimensional object modeling. In particular, the invention focuses on the reconstruction of static targets (e.g., facilities, equipment, pipelines) and dynamic targets (e.g., vehicles, personnel) in three-dimensional space, and studies parametric modeling methods for them.
For three-dimensional static object reconstruction, the focus is on robust and efficient local descriptors in the images, so that corresponding points between images taken from different angles can be matched reliably. To this end, the invention proposes a feature-level domain adaptation loss that improves descriptor robustness by penalizing inconsistencies in the high-level feature distributions of different images. Meanwhile, descriptor inconsistencies at pixel-level keypoints are compensated by a pixel-level cross-domain consistency loss. Descriptor supervision combines a triplet loss with the cross-domain consistency loss to ensure that the descriptors have good discriminative power. In addition, the invention rapidly constructs training data sets oriented to different objects to adapt to complex application scenarios.
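A minimal sketch of how the triplet and pixel-level consistency terms might be combined during training is shown below. The feature-level domain adaptation term is omitted, and the loss weights, margin and names are assumptions rather than values from the patent.

```python
import torch.nn.functional as F

def descriptor_loss(anchor, positive, negative, desc_view_a, desc_view_b,
                    margin=1.0, lambda_cons=0.1):
    """Combined descriptor supervision (sketch).

    anchor/positive/negative: descriptors for triplet supervision.
    desc_view_a/desc_view_b: descriptors of the same keypoints seen in two views,
    penalized for inconsistency by a pixel-level consistency term.
    """
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    consistency = F.mse_loss(desc_view_a, desc_view_b)
    return triplet + lambda_cons * consistency
```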
For three-dimensional dynamic target reconstruction, traditional optimization-based methods are time consuming and sensitive to initial values, while existing end-to-end networks use only coarse-grained or only fine-grained features to predict vehicle and personnel parameters, making it difficult to balance prediction accuracy and run-time efficiency. The invention therefore adopts a two-stage regression of vehicle and personnel dynamic model parameters: coarse-grained features quickly initialize the constraint range of the dynamic target parameters, and fine-grained features iteratively refine the vehicle and personnel dynamic target parameter model. In addition, through master-slave view coupled training, the dynamic target parameters of the master and slave views are coupled nonlinearly, improving the robustness of the supervision data for the master-view image. Finally, fast coarse parameter prediction for the multi-view dynamic target, followed by refinement of the master-view dynamic target parameters under constraints from several slave views, enables prediction for vehicles and personnel with complex poses and occlusions.
In a specific implementation, this embodiment takes multi-view video data as input and performs mixed training with independently constructed multi-view vehicle and personnel dynamic target data together with several open-source data sets, in order to improve the network's performance in reconstructing multi-view three-dimensional vehicle and personnel poses. A multi-view end-to-end three-dimensional reconstruction network for vehicle and personnel poses is designed and, combined with the constructed multi-view vehicle and personnel data set, high-precision dynamic three-dimensional parameterized vehicle and personnel model reconstruction is achieved by training a purely visual depth model. The input image sequence consists of three-view video frames; IUV data and 2D and 3D vehicle and personnel pose data serve as training constraints, and the model parameters of the feature encoder and the vehicle and personnel parameter regressor are optimized simultaneously. The neural network training process is shown in fig. 4.
At generation time, multi-view video frames are taken as input, coarse vehicle and personnel parameters are predicted from the multi-layer grid features of the images, and the vehicle and personnel model vertices and features are aligned through iterative correction to optimize the vehicle and personnel parameter output, as shown in fig. 5, which is a schematic diagram of the network for optimizing the vehicle and personnel parameters.
This embodiment uses a deep residual network as the hierarchical image feature encoder to extract coarse-grained depth features from the image. First, grid points of the coarse-grained depth feature map are sampled uniformly, and a multi-layer perceptron (MLP) reduces the feature dimension to form the regressor input. The regressor is constrained by prior vehicle and personnel parameters and predicts the initial vehicle and personnel parameters. On this basis, vehicle and personnel models are generated from the initial parameters, the models are sparsely projected onto the feature map, and finer-grained vertex projection features are obtained as the feature input for subsequent iterative optimization. During the iterative optimization, the vehicle and personnel model parameters are refined with the multi-layer fine-grained features to produce the optimal vehicle and personnel model parameters.
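The coarse-to-fine regression loop can be sketched as follows. The encoder, the parametric vehicle/personnel model and the vertex feature sampling are represented only by placeholders, and all dimensions and names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TwoStageRegressor(nn.Module):
    """Sketch of two-stage parameter regression: coarse initialisation from
    grid-sampled global features, then iterative refinement from vertex-aligned
    fine-grained features."""

    def __init__(self, feat_dim=256, param_dim=85, iters=3):
        super().__init__()
        self.coarse_head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                         nn.Linear(256, param_dim))
        self.refine_head = nn.Sequential(nn.Linear(feat_dim + param_dim, 256), nn.ReLU(),
                                         nn.Linear(256, param_dim))
        self.iters = iters

    def forward(self, grid_feat, sample_vertex_feat):
        # grid_feat: (B, feat_dim) pooled coarse-grained feature
        # sample_vertex_feat: callable mapping current params -> (B, feat_dim)
        #                     vertex-projection feature (placeholder for the real sampler)
        params = self.coarse_head(grid_feat)            # stage 1: coarse initialisation
        for _ in range(self.iters):                     # stage 2: iterative refinement
            vert_feat = sample_vertex_feat(params)
            params = params + self.refine_head(torch.cat([vert_feat, params], dim=-1))
        return params
```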
In summary, this technical route enables multi-modal three-dimensional object fusion modeling under both monocular and multi-view settings, integrating data from different viewpoints into a consistent three-dimensional object model and thereby improving the accuracy and completeness of the reconstruction.
Implicit dynamic neural radiance field model and real-time object tracking: in this embodiment, the input data are multi-view scene videos acquired with an RGBD camera. To process such video data efficiently, a series of techniques is employed. First, dynamic regions in the video are removed using a motion mask: the Mask R-CNN method masks common movable objects, the optical flow of consecutive frames yields a binary motion mask, and the Mask R-CNN and optical flow results are combined into the final motion mask. Then, during training of the neural radiance field, the static space and the dynamic space are separated for training and reconstruction. In the dynamic space, real-time modeling of dynamic objects and tracking of their pose is achieved; in the static space, three-dimensional reconstruction of the whole scene is carried out. Finally, the dynamic and static spatial information is fused to obtain a complete representation of the spatial information.
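As a small illustration of the mask-fusion step, the following Python sketch OR-combines a semantic mask of movable classes with a mask derived from the optical flow magnitude; the threshold and names are assumptions for illustration.

```python
import numpy as np

def combine_motion_mask(semantic_mask, flow, flow_thresh=1.0):
    """Fuse a Mask R-CNN mask of movable classes with an optical-flow mask.

    semantic_mask: (H, W) boolean or 0/1 mask of common movable objects.
    flow: (H, W, 2) optical flow between consecutive frames.
    """
    flow_mag = np.linalg.norm(flow, axis=-1)     # per-pixel flow magnitude
    motion_from_flow = flow_mag > flow_thresh    # binary motion mask from flow
    return np.logical_or(semantic_mask.astype(bool), motion_from_flow)
```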
As shown in fig. 6, the dynamic neural radiance field representation based on separating dynamic and static scenes can efficiently process multi-view scene videos, accurately distinguish the dynamic and static parts, and model and reconstruct them separately in dynamic space and static space using the techniques above. The fused spatial information provides a comprehensive scene representation, giving the method clear advantages in scene understanding and reconstruction.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be regarded as falling within the scope of this description.

Claims (9)

1. A digital twin scene intelligent generation method based on multi-modal visual recognition, which is characterized by comprising the following steps:
collecting target scene data, collecting GPS information of the target scene, quickly splicing digital twin three-dimensional scenes of the target, introducing a local map of joint visual geographic information by using a third-party geographic map, generating a local orthographic image, and fusing the local orthographic image in real time to obtain a global orthographic image of the target;
extracting information of different levels in the target global orthographic image through a hierarchical attention mechanism network, and simultaneously extracting local features and global context information of the target global orthographic image;
extracting the three-dimensional point cloud of the target scene through the multi-view image, estimating the point cloud plane parameters by using a RANSAC algorithm, extracting the edge characteristics of the point cloud of the building of the target scene, and introducing determined conditions in the process of extracting the 3D characteristic lines to obtain effective characteristic lines and obtain a surface model of the building with definite semantics;
reconstructing a static object and a dynamic object of the target scene through fusion of multi-view information of the target scene and regression of attitude parameters;
and acquiring static scene data in the target scene through a motion mask, realizing three-dimensional reconstruction of the static space through a neural radiance field model, realizing real-time modeling and pose tracking of dynamic objects through a dynamic object tracking algorithm based on EM-ICP, and fusing the dynamic object and static scene information in real time to obtain the three-dimensional spatial information of the complete target scene.
2. The method of claim 1, wherein the collecting the target scene data, collecting the target scene GPS information, quickly splicing the digital twin three-dimensional scene of the target, introducing a local map of joint visual geographic information by using a third-party geographic map, generating a local orthographic image, and fusing the local orthographic image in real time to obtain a target global orthographic image, comprises:
acquiring three-dimensional scene data of the target city by using oblique photography and a three-dimensional laser scanner to obtain single-image or video data;
utilizing a pose estimation module based on geographic information to realize rapid splicing of the digital twin three-dimensional scene based on the GPS information, wherein the pose estimation module comprises system initialization, frame pose estimation and key frame screening, and the key frame screening is realized through loop detection;
utilizing the established geographic map, introducing a local map of joint visual geographic information, generating a local orthographic image in the image splicing module based on orthographic maintenance, and screening out key frames with unnecessary quality degradation through image quality judgment; and fusing the local orthographic images entering the splicing module in real time and incrementally to obtain a complete target global orthographic image.
3. The method according to claim 1, wherein the extracting, through the hierarchical attention mechanism network, information of different levels in the target global orthographic image, and simultaneously performing local feature extraction and global context information extraction on the target global orthographic image, specifically includes:
three-dimensional scene semantic recognition and segmentation based on a hierarchical attention mechanism divide an urban scene image into a plurality of small blocks for processing, and respectively perform feature extraction and joint processing on each small block;
extracting the whole information in each small block by using an attention mechanism module, carrying out attention weighting on the feature images from two aspects of a channel and a space by adopting global pooling operation, completing the space attention weighting by using full-connection operation, and equally treating the rows, the columns and the channels of the feature images by adopting a completely symmetrical operation mode;
expanding each stripe weighting result to the size of the original feature map, adding the expanded feature maps together, and multiplying the sum by the original feature map to complete the weighting operation on the original feature map;
the bottleneck structure in ResNet is used as a convolution block to extract target local features, with convolution, BatchNorm normalization and ReLU activation operations carried out in each convolution block; the self-attention mechanism of a Transformer is used to weight the image, the original feature map is divided into small blocks, the blocks are flattened into one-dimensional tokens, and a class embedding is added for the classification task, so that semantic recognition and segmentation are obtained.
4. The method according to claim 1, wherein the extracting the three-dimensional point cloud of the target scene through the multi-view image, estimating the point cloud plane parameters by using the RANSAC algorithm, extracting the edge characteristics of the point cloud of the building of the target scene, introducing the determined conditions in the process of extracting the 3D characteristic line, obtaining the effective characteristic line, and obtaining the building surface model with definite semantics, includes:
the digital twin scene is segmented and identified by a semantic segmentation algorithm based on a convolutional neural network,
Extracting three-dimensional space point cloud information from the multi-view image, and carrying out parameter estimation on a point cloud plane by using a RANSAC algorithm; extracting 3D characteristic lines by utilizing point cloud plane parameters and intersection line segments formed between two intersection planes, and introducing determined conditions in the process of extracting the 3D characteristic lines so as to obtain effective characteristic lines;
and a geometric driven boundary optimization algorithm is introduced, a closed building surface model is generated according to the extracted 3D characteristic lines, the storage cost of fine-grained semantics is reduced, and the degree of freedom of semantic representation is improved.
5. The method according to claim 4, wherein the introducing a geometry-driven boundary optimization algorithm generates a closed building surface model according to the extracted 3D feature line, reduces storage cost of fine-grained semantics, and improves freedom of semantic representation, and specifically comprises:
combining the generated 3D characteristic lines into a polygonal boundary, and defining the boundary of scattered point cloud data;
simplifying the boundary, and fitting the simplified boundary to obtain the closed building surface model.
6. The method according to claim 1, wherein the reconstruction of the static object and the dynamic object of the target scene through fusion of multi-view information of the target scene and regression of attitude parameters comprises:
performing multi-view static object reconstruction for static objects of the target scene and multi-view dynamic object reconstruction for dynamic objects; performing mixed training with independently constructed multi-view vehicle and personnel dynamic target data and several open-source data sets; designing and constructing a multi-view end-to-end vehicle and personnel pose three-dimensional reconstruction network and, combined with the constructed multi-view vehicle and personnel data set, realizing high-precision dynamic three-dimensional parameterized vehicle and personnel model reconstruction through the training mode of a purely visual depth model; and, at generation time, taking multi-view video frames as input, predicting coarse vehicle and personnel parameters using the multi-layer grid features of the images, and optimizing the vehicle or personnel parameter output by iteratively correcting the alignment of the vehicle and personnel model vertices and features.
7. The method of claim 6, wherein the multi-view static object reconstruction of the static objects of the target scene comprises: using the image semantic analysis result as a prior to guide the construction of geometric relations between different types of entity targets; matching corresponding points of images taken from different angles by extracting robust and efficient local descriptors from the images; introducing a feature-level domain adaptation loss to penalize inconsistencies in the high-level feature distributions of different images, and compensating descriptor inconsistencies at pixel-level keypoints through a pixel-level cross-domain consistency loss; and performing descriptor supervision with a triplet loss and the cross-domain consistency loss to ensure good discriminative power of the descriptors.
8. The method of claim 6, wherein the multi-view dynamic object reconstruction of the dynamic objects comprises: adopting two-stage regression of vehicle and personnel dynamic model parameters, quickly initializing the constraint range of the dynamic target parameters with coarse-grained features, and iteratively refining the vehicle and personnel dynamic target parameter model with fine-grained features; coupling the master-view and slave-view vehicle and personnel dynamic target parameters nonlinearly through master-slave view coupled training to improve the robustness of the supervision data for the master-view image; and, through fast coarse parameter prediction of the multi-view dynamic target and refinement of the master-view dynamic target parameters under constraints from several slave views, realizing prediction for vehicles and personnel with complex poses and occlusions.
9. The method according to claim 1, wherein the static scene data in the target scene are obtained through a motion mask, static-space three-dimensional reconstruction is realized through a neural radiance field model, real-time modeling and pose tracking of dynamic objects are realized through a dynamic object tracking algorithm based on EM-ICP, and the dynamic object and static scene information are fused in real time to obtain the three-dimensional spatial information of the complete target scene, specifically:
generating a binary motion mask by using the Mask R-CNN method and an optical flow method, and superposing the video data of a plurality of view angles to obtain the final motion mask, wherein the motion mask is used to remove dynamic objects from the video;
with the dynamic objects removed, constructing a neural radiance field model of the static space using RGBD images from a plurality of view angles; specifically, all images are projected into a scene-centered coordinate system, depth estimation is performed using the RGBD information to generate a point cloud set, and features are extracted and segmented with algorithms such as PointNet to obtain a point cloud map of the static scene;
realizing real-time modeling and pose tracking of dynamic objects with the EM-ICP-based dynamic object tracking algorithm, wherein the algorithm builds a model from the previous trajectory and point cloud data, matches the model with the target point cloud in the current frame, optimizes the matching function through iterations of the EM algorithm, and tracks the pose between adjacent frames using the ICP algorithm, thereby obtaining accurate pose information of the current object;
and fusing the result of the static-space three-dimensional reconstruction with the results of real-time modeling and pose tracking of the dynamic objects to obtain the three-dimensional spatial information of the complete target scene.
CN202311466377.1A 2023-11-06 2023-11-06 Digital twin scene intelligent generation method based on multi-mode visual recognition Pending CN117456136A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311466377.1A CN117456136A (en) 2023-11-06 2023-11-06 Digital twin scene intelligent generation method based on multi-mode visual recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311466377.1A CN117456136A (en) 2023-11-06 2023-11-06 Digital twin scene intelligent generation method based on multi-mode visual recognition

Publications (1)

Publication Number Publication Date
CN117456136A true CN117456136A (en) 2024-01-26

Family

ID=89579686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311466377.1A Pending CN117456136A (en) 2023-11-06 2023-11-06 Digital twin scene intelligent generation method based on multi-mode visual recognition

Country Status (1)

Country Link
CN (1) CN117456136A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117824624A (en) * 2024-03-05 2024-04-05 深圳市瀚晖威视科技有限公司 Indoor tracking and positioning method, system and storage medium based on face recognition
CN117824624B (en) * 2024-03-05 2024-05-14 深圳市瀚晖威视科技有限公司 Indoor tracking and positioning method, system and storage medium based on face recognition
CN117893693A (en) * 2024-03-15 2024-04-16 南昌航空大学 Dense SLAM three-dimensional scene reconstruction method and device
CN117893693B (en) * 2024-03-15 2024-05-28 南昌航空大学 Dense SLAM three-dimensional scene reconstruction method and device
CN117953544A (en) * 2024-03-26 2024-04-30 安徽农业大学 Target behavior monitoring method and system

Similar Documents

Publication Publication Date Title
CN111968129B (en) Instant positioning and map construction system and method with semantic perception
CN110458939B (en) Indoor scene modeling method based on visual angle generation
Tian et al. UAV-satellite view synthesis for cross-view geo-localization
US20210142095A1 (en) Image disparity estimation
Zhang et al. Semantic segmentation of urban scenes using dense depth maps
Matzen et al. Nyc3dcars: A dataset of 3d vehicles in geographic context
CN107481279B (en) Monocular video depth map calculation method
CN102142153B (en) Based on the reconstruction modeling method of the three-dimensional model of image
CN107204010A (en) A kind of monocular image depth estimation method and system
CN110827415A (en) All-weather unknown environment unmanned autonomous working platform
CN117456136A (en) Digital twin scene intelligent generation method based on multi-mode visual recognition
Luo et al. Real-time dense monocular SLAM with online adapted depth prediction network
Zhao et al. RTSfM: Real-time structure from motion for mosaicing and DSM mapping of sequential aerial images with low overlap
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
Liu et al. Unsupervised monocular visual odometry based on confidence evaluation
Lentsch et al. Slicematch: Geometry-guided aggregation for cross-view pose estimation
Ali et al. Single image Façade segmentation and computational rephotography of House images using deep learning
CN113780389A (en) Deep learning semi-supervised dense matching method and system based on consistency constraint
CN103646397A (en) Real-time synthetic aperture perspective imaging method based on multi-source data fusion
Huang et al. Overview of LiDAR point cloud target detection methods based on deep learning
Huang et al. Life: Lighting invariant flow estimation
CN114120095A (en) Mobile robot autonomous positioning system and method based on aerial three-dimensional model
Wang et al. 3D object detection algorithm for panoramic images with multi-scale convolutional neural network
Wilson et al. Image and object Geo-localization
Amirkolaee et al. Convolutional neural network architecture for digital surface model estimation from single remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination