CN111311663B - Real-time large-scene three-dimensional semantic modeling method - Google Patents

Real-time large-scene three-dimensional semantic modeling method

Info

Publication number
CN111311663B
CN111311663B · CN202010095361.4A · CN202010095361A
Authority
CN
China
Prior art keywords
dimensional
semantic
convolution
real
geometric model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010095361.4A
Other languages
Chinese (zh)
Other versions
CN111311663A (en
Inventor
方璐
韩磊
郑添
王好谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202010095361.4A priority Critical patent/CN111311663B/en
Publication of CN111311663A publication Critical patent/CN111311663A/en
Application granted granted Critical
Publication of CN111311663B publication Critical patent/CN111311663B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20Finite element generation, e.g. wire-frame surface description, tesselation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/24Indexing scheme for image data processing or generation, in general involving graphical user interfaces [GUIs]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a real-time large-scene three-dimensional semantic modeling method, which comprises the following steps: S1: constructing a three-dimensional geometric model from RGB (red, green and blue) images and depth images obtained by scanning a scene with a sensor; S2: inputting the three-dimensional geometric model into a three-dimensional convolutional neural network to complete semantic segmentation; S3: integrating the semantic labels output by the three-dimensional convolutional neural network into the three-dimensional geometric model to complete semantic modeling; wherein the construction of the three-dimensional geometric model and the semantic segmentation are combined in a multi-threaded manner and performed simultaneously. Joint real-time three-dimensional geometric reconstruction and semantic reconstruction is thus realized; by adopting a sparse convolutional neural network and accelerating the convolution computation, real-time performance can be achieved; three-dimensional convolution replaces the two-dimensional convolution in UNet and the depth of the UNet network is increased, so the capacity of the convolutional network is larger and the semantic segmentation accuracy is improved.

Description

Real-time large-scene three-dimensional semantic modeling method
Technical Field
The invention relates to the technical field of computer vision, in particular to a real-time large-scene three-dimensional semantic modeling method.
Background
Three-dimensional reconstruction of large scenes is an important problem in the field of computer vision. Technologies such as autonomous driving, indoor robot navigation, AR and VR all rely on large-scene three-dimensional reconstruction. Its goal is to dynamically scan a scene with a portable device and, through algorithmic processing, conveniently generate a three-dimensional model of the whole scene.
In addition to the geometric information of a scene, another kind of information of interest in practical applications is semantic information, i.e. we also want to know into which parts a scene can be divided and what object each part is. This is the goal of three-dimensional semantic segmentation. Good semantic information is very important for complex interaction of intelligent robots and for intelligent AR/VR applications.
At present, research has been carried out in both fields, three-dimensional reconstruction of large scenes and three-dimensional semantic segmentation, but existing three-dimensional semantic segmentation techniques are not computationally efficient enough for real-time computation, so the two have not been integrated in the prior art. A popular method for large-scene three-dimensional modeling is based on RGB-D simultaneous localization and mapping. This technique can generate a dense point cloud in real time, and the quality of the reconstructed model is far better than that of traditional vision-based simultaneous localization and mapping or SfM techniques, but all RGB-D simultaneous localization and mapping systems generate only geometric models without semantic information. In the field of three-dimensional semantic segmentation, models based on three-dimensional convolutional neural networks have developed greatly, but they are limited to common three-dimensional datasets and cannot be used directly on real scenes. Moreover, these models often involve a huge amount of computation, can only be executed offline, and cannot achieve real-time computation.
The goal of three-dimensional geometric modeling is to reconstruct a three-dimensional model of space using sensor data (typically images, depth maps). The main methods for three-dimensional reconstruction are as follows:
Method based on structure from motion (SfM)
The SfM-based method takes as input a group of pictures of the same scene from different viewing angles and estimates a three-dimensional model of the scene. The principle is as follows: by finding corresponding feature points across the pictures, the camera parameters and poses are computed, so the position of each point in space can be calculated from the geometric relations and the three-dimensional model is generated. Similar to binocular vision, if the camera parameters, the camera pose and the corresponding image points are known, the positions of the corresponding points in three-dimensional space can be computed using epipolar geometry. The core of SfM is that it only needs to find corresponding points in the images and can compute the camera parameters and camera poses from a large number of input images. Thus, the images may be unstructured (discontinuous in capture position), captured with entirely different cameras, or captured under different lighting conditions. For this reason, the SfM technique is often used for low-cost three-dimensional reconstruction.
Method based on simultaneous localization and mapping (SLAM)
SLAM stands for simultaneous localization and mapping. SLAM addresses the problem of obtaining a three-dimensional model of a scene through dynamic scanning with a sensor in an unfamiliar environment while simultaneously estimating the sensor's own pose. SLAM typically has several modules: a visual odometry module, which computes the change in camera position from the input image sequence to obtain the camera pose; an optimization module, which, because visual odometry inevitably accumulates error, reduces the accumulation of error by associating the current state with all previous states during optimization; and a mapping module, which fuses the captured images into a three-dimensional model using the camera poses.
Three-dimensional semantic segmentation
In the field of semantic segmentation, the method commonly adopted in academia is the convolutional neural network. According to the underlying principle, these methods fall into two categories. One category uses a two-dimensional convolutional neural network to perform semantic prediction on single-frame RGB-D images and then projects the two-dimensional semantic labels into three-dimensional space to form a three-dimensional semantic model. However, a two-dimensional image is only a sample of the three-dimensional world, and understanding a three-dimensional scene from local two-dimensional information alone necessarily loses much global information and three-dimensional geometric information. Therefore, some researchers have transferred the structure of two-dimensional semantic segmentation networks to three dimensions, performing convolution operations directly on voxels to realize an end-to-end semantic segmentation network that outputs three-dimensional semantic labels. However, the computation and parameter counts of a three-dimensional convolutional neural network are far larger than those of a traditional two-dimensional convolutional neural network. Therefore, existing three-dimensional semantic segmentation methods cannot run in real time and thus cannot be used directly for real-time semantic modeling.
Existing real-time semantic modeling systems usually combine a two-dimensional semantic segmentation method with a SLAM system, integrating two-dimensional semantic labels into the three-dimensional model to realize real-time three-dimensional modeling and semantic segmentation. Because of the limitations of two-dimensional semantic segmentation in understanding three-dimensional space, semantic SLAM systems based on two-dimensional semantic segmentation often perform poorly.
The prior art therefore does not solve the problem of real-time three-dimensional semantic reconstruction.
The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.
Disclosure of Invention
The invention provides a real-time large-scene three-dimensional semantic modeling method for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
a real-time large scene three-dimensional semantic modeling method comprises the following steps: s1: constructing a three-dimensional geometric model through an RGB (red, green and blue) image and a depth image obtained by scanning a scene through a sensor; s2: inputting the three-dimensional geometric model into a three-dimensional convolution neural network to complete semantic segmentation; s3: integrating semantic labels output by the three-dimensional convolutional neural network into the three-dimensional geometric model to complete semantic modeling; wherein the building of the three-dimensional geometric model and the semantic segmentation are combined in a multi-threaded manner and performed simultaneously.
Preferably, the sensor is an RGB-D depth sensor, and a three-dimensional geometric model is constructed through a three-dimensional reconstruction system based on simultaneous localization and mapping.
Preferably, constructing the three-dimensional geometric model comprises the following steps: S11: calculating the relative displacement between frames of the RGB images through a tracking thread to estimate the pose of the sensor; S12: further optimizing the pose of the sensor through an optimization thread; S13: fusing the point clouds of the depth maps into a signed distance field; S14: extracting, by a mapping thread, a mesh from the signed distance field to generate the three-dimensional geometric model.
Preferably, inputting the three-dimensional geometric model into a three-dimensional convolutional neural network to complete semantic segmentation comprises the following steps: S21: constructing a sparse convolution layer; S22: constructing the three-dimensional convolutional neural network from the sparse convolution layers.
Preferably, constructing the sparse convolution layer comprises the following steps: S211: dividing the point cloud of the sensor into a plurality of cubic blocks of side length M according to the three-dimensional coordinates; S212: judging whether each block contains point cloud data; if so, the block is a valid block and is kept; if not, it is an empty block and is discarded; S213: performing sparse convolution in parallel on all the valid blocks.
Preferably, performing sparse convolution in parallel on all the valid blocks using a graphics processor comprises: supposing the valid blocks contain N three-dimensional points in total, the number of input channels is I, the number of output channels is O, and V is the spatial volume of the convolution kernel, the parameter matrix required by one sparse convolution layer has size I×O×V; the input channels and output channels are split into groups of size K, and each graphics processor thread is responsible for K input channels, convolving them with a K×V parameter matrix to obtain K output channels.
Preferably, the three-dimensional convolutional neural network comprises: a downsampling part comprising a series of convolution layers, batch normalization layers, activation layers and downsampling layers, wherein the stride of the downsampling layers is 2 and the convolution kernel size is 3; an upsampling part comprising a series of convolution layers, batch normalization layers, activation layers and upsampling layers, wherein the stride of the upsampling layers is 2 and the convolution kernel size is 3; the upsampling part and the downsampling part are symmetric and cascaded; at each resolution level, the features of the downsampling part are concatenated with the features of the upsampling part.
Preferably, the three-dimensional convolutional neural network is pre-trained on the public dataset S3DIS.
Preferably, the semantic modeling comprises: S31: inputting the three-dimensional geometric model, after data preprocessing, into the three-dimensional convolutional neural network; S32: integrating the output semantic labels into the three-dimensional geometric model for rendering.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method of any one of the above.
The beneficial effects of the invention are as follows: in the real-time large-scene three-dimensional semantic modeling method, a three-dimensional geometric model is obtained through a three-dimensional geometric modeling system while a three-dimensional semantic segmentation convolutional neural network runs in a separate thread, so that joint real-time three-dimensional geometric reconstruction and semantic reconstruction is realized.
Furthermore, by using a sparse convolutional neural network and accelerating the computation of the convolution network, real-time performance can be achieved.
Still further, three-dimensional convolution replaces the two-dimensional convolution in UNet and the depth of the UNet network is increased; the capacity of the convolution network is thus larger, and the semantic segmentation accuracy is improved.
Drawings
FIG. 1 is a schematic diagram of a real-time large scene three-dimensional semantic modeling method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of building a three-dimensional geometric model according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a method for constructing a three-dimensional geometric model according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a method for inputting a three-dimensional geometric model into a three-dimensional convolutional neural network to perform semantic segmentation according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of a method for constructing a sparse convolution layer according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a sparse convolution acceleration method according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of a three-dimensional convolutional neural network in an embodiment of the present invention.
FIG. 8 is a schematic flow chart of real-time large scene three-dimensional semantic modeling in an embodiment of the present invention.
FIG. 9 is a schematic diagram of a semantic modeling method in an embodiment of the invention.
Fig. 10 is a schematic diagram of hardware in an embodiment of the invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing or a circuit communication.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings to facilitate the description of the embodiments of the invention and to simplify the description, and are not intended to indicate or imply that the device or element so referred to must have a particular orientation, be constructed in a particular orientation, and be constructed in a particular manner of operation, and are not to be construed as limiting the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
As shown in FIG. 1, the invention provides a real-time large scene three-dimensional semantic modeling method, which comprises the following steps:
S1: constructing a three-dimensional geometric model from RGB (red, green and blue) images and depth images obtained by scanning a scene with a sensor;
S2: inputting the three-dimensional geometric model into a three-dimensional convolutional neural network to complete semantic segmentation;
S3: integrating the semantic labels output by the three-dimensional convolutional neural network into the three-dimensional geometric model to complete semantic modeling;
wherein the building of the three-dimensional geometric model and the semantic segmentation are combined in a multi-threaded manner and performed simultaneously.
The method can simultaneously obtain the geometric and semantic models of the large scene.
In an embodiment of the invention, the sensor is an RGB-D depth sensor, which can obtain an accurate depth map, thereby obtaining a higher modeling quality.
In another embodiment of the present invention, the sensor is a binocular depth sensor, a structured light sensor, a ToF sensor, a lidar or the like.
In one embodiment of the invention, three-dimensional geometric modeling employs other simultaneous localization and mapping based reconstruction systems.
As shown in fig. 2, the present invention can construct a three-dimensional geometric model by a three-dimensional reconstruction system based on simultaneous localization and mapping.
As shown in fig. 3, constructing a three-dimensional geometric model includes the following steps:
S11: calculating the relative displacement between frames of the RGB images through a tracking thread to estimate the pose of the sensor;
The tracking thread specifically comprises: acquiring the input of the RGB-D camera, computing the association between the current RGB frame and the current key frame, and judging whether the displacement is greater than a threshold; if so, the frame is marked as a new key frame and loop closure detection is performed, so that the trajectory of the RGB-D camera is tracked.
S12: further optimizing the pose of the sensor through an optimization thread;
S13: fusing the point clouds of the depth maps into a signed distance field;
The optimization thread specifically includes: performing global registration of the RGB-D camera poses, followed by bundle adjustment, and then fusing the point clouds of the depth maps into the signed distance field. The optimization thread optimizes the overall camera poses and reduces the accumulated error.
S14: extracting, by a mapping thread, a mesh from the signed distance field to generate the three-dimensional geometric model.
The mapping thread specifically includes extracting the mesh and then performing GUI rendering.
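For illustration only (a generic sketch, not the claimed implementation), the following Python code shows the standard weighted-average fusion of a depth map into a signed distance field, which is what step S13 performs before the mesh of step S14 is extracted (for example by marching cubes). The voxel size, grid extent and truncation distance below are assumed values.

```python
import numpy as np

# Assumed parameters: voxel size and truncation distance in metres, cubic grid extent.
VOXEL, TRUNC, GRID = 0.05, 0.15, 128
tsdf = np.ones(GRID ** 3, np.float32)       # truncated signed distance per voxel
weight = np.zeros(GRID ** 3, np.float32)    # integration weight per voxel

# Voxel centres in world coordinates (flat index = x*GRID*GRID + y*GRID + z).
_idx = np.indices((GRID, GRID, GRID)).reshape(3, -1).T
CENTERS = (_idx + 0.5) * VOXEL

def integrate(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> None:
    """Fuse one depth map (H x W, metres) captured at camera pose T_wc (4x4, camera-to-world)."""
    T_cw = np.linalg.inv(T_wc)
    pts = CENTERS @ T_cw[:3, :3].T + T_cw[:3, 3]        # voxel centres in the camera frame
    z = np.maximum(pts[:, 2], 1e-6)
    u = np.round(pts[:, 0] / z * K[0, 0] + K[0, 2]).astype(int)
    v = np.round(pts[:, 1] / z * K[1, 1] + K[1, 2]).astype(int)
    ok = (pts[:, 2] > 0) & (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0])
    d = np.zeros_like(z)
    d[ok] = depth[v[ok], u[ok]]
    sdf = np.clip((d - z) / TRUNC, -1.0, 1.0)           # truncated signed distance to the surface
    upd = ok & (d > 0) & (sdf > -1.0)                   # only update voxels near or in front of the surface
    # Weighted running average, the usual signed-distance-field update rule.
    tsdf[upd] = (tsdf[upd] * weight[upd] + sdf[upd]) / (weight[upd] + 1.0)
    weight[upd] += 1.0
```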
In another embodiment of the invention, in order to realize real-time performance, a convolution acceleration method applied to sparse point cloud data is adopted.
As shown in fig. 4, inputting the three-dimensional geometric model into a three-dimensional convolutional neural network to complete semantic segmentation includes the following steps:
S21: constructing a sparse convolution layer;
S22: constructing the three-dimensional convolutional neural network from the sparse convolution layers.
In the invention, the three-dimensional convolutional neural network is built on sparse convolution, and all convolution layers in the network use sparse convolution. In deep learning, for data of three or more dimensions, the total amount of data grows exponentially with the number of dimensions; in this case, the sparsity of the data must be exploited to reduce the required computational resources. Spatial three-dimensional data, such as point clouds captured by an RGB-D camera or polygonal mesh models reconstructed from a three-dimensional scene, are very sparse: only a small part of the space contains data, and most of it is empty. The invention uses the sparse convolution library SSCN (Submanifold Sparse Convolutional Networks), whose characteristic is that convolution is computed only on voxels that hold values while empty voxels are ignored, which greatly saves memory and reduces computational complexity. However, the existing technique is still not computationally efficient enough to meet real-time requirements, so the following technical improvements are made to sparse convolution to further speed up the computation.
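As a minimal sketch of the sparse convolution idea (assumptions: a hand-rolled hash-map voxel layout rather than the SSCN library itself), the code below computes a 3×3×3 convolution only at occupied voxels and skips empty neighbours entirely, so empty space costs neither memory nor computation:

```python
import numpy as np
from itertools import product

def sparse_conv3d(features: dict, weights: np.ndarray) -> dict:
    """Submanifold-style sparse 3D convolution.

    `features` maps an occupied voxel coordinate (x, y, z) to a feature vector of
    length C_in; `weights` has shape (3, 3, 3, C_in, C_out). Outputs are produced
    only at already-occupied voxels, so the active set never grows."""
    out = {}
    for (x, y, z) in features:
        acc = np.zeros(weights.shape[-1])
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            nb = features.get((x + dx, y + dy, z + dz))
            if nb is not None:                       # empty voxels are skipped entirely
                acc += nb @ weights[dx + 1, dy + 1, dz + 1]
        out[(x, y, z)] = acc
    return out

# Usage: two occupied voxels, 4 input channels, 8 output channels.
feats = {(0, 0, 0): np.random.randn(4), (0, 0, 1): np.random.randn(4)}
w = np.random.randn(3, 3, 3, 4, 8)
print(sparse_conv3d(feats, w)[(0, 0, 0)].shape)      # (8,)
```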
In one embodiment of the invention, the algorithm is further optimized using a spatial blocking based approach. As shown in fig. 5, constructing a sparse convolution layer includes the following steps:
S211: dividing the point cloud of the sensor into a plurality of cubic blocks of side length M according to the three-dimensional coordinates;
S212: judging whether each block contains point cloud data; if so, the block is a valid block and is kept; if not, it is an empty block and is discarded;
S213: performing sparse convolution in parallel on all the valid blocks.
In one specific embodiment of the invention, M is 0.05 meters.
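A minimal sketch of this spatial blocking step (assuming the point cloud is given as an N×3 array of coordinates in metres) is shown below; empty blocks are simply never created, so only the valid blocks are kept:

```python
import numpy as np

M = 0.05  # block side length in metres, as in this embodiment

def split_into_blocks(points: np.ndarray) -> dict:
    """Group a point cloud (N x 3) into cubic blocks of side M and keep only the
    non-empty (valid) ones, keyed by their integer block coordinates."""
    cells = np.floor(points / M).astype(np.int64)
    blocks = {}
    for cell, point in zip(map(tuple, cells), points):
        blocks.setdefault(cell, []).append(point)
    return {cell: np.stack(pts) for cell, pts in blocks.items()}

# Usage: random points in a 0.2 m cube fall into a handful of valid blocks.
cloud = np.random.rand(1000, 3) * 0.2
valid = split_into_blocks(cloud)
print(len(valid), "valid blocks")
```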
As shown in fig. 6, a graphics processor is used to perform sparse convolution in parallel on all the valid blocks, which specifically includes: supposing the valid blocks contain N three-dimensional points in total, the number of input channels is I, the number of output channels is O, and V is the spatial volume of the convolution kernel, the parameter matrix required by one sparse convolution layer has size I×O×V; the input channels and output channels are split into groups of size K, and each graphics processor thread is responsible for K input channels, convolving them with a K×V parameter matrix to obtain K output channels.
In one embodiment of the invention, K = 16; different hardware may have different optimal parameters. The advantage of this splitting is not only that parallelism is maximized, but also that the memory access efficiency of the GPU is improved: because the convolution computation within each block is independent, the input point cloud and the convolution parameter matrix can be stored in shared memory, which improves the memory read efficiency of the GPU.
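The following CPU emulation illustrates only the channel-splitting scheme, not the GPU kernel itself: the I input channels and O output channels are partitioned into tiles of size K, each tile is processed independently (on the GPU, by one thread with the tile held in shared memory), and the partial results are accumulated, which reproduces the full product:

```python
import numpy as np

def blocked_channels(features: np.ndarray, params: np.ndarray, K: int = 16) -> np.ndarray:
    """Channel-blocked matrix product emulating the split described above.

    `features` is (N, I) for the points gathered at one kernel offset and
    `params` is (I, O); each K x K parameter tile is an independent unit of work."""
    N, I = features.shape
    O = params.shape[1]
    out = np.zeros((N, O))
    for i0 in range(0, I, K):            # one "GPU thread" per (input tile, output tile) pair
        for o0 in range(0, O, K):
            out[:, o0:o0 + K] += features[:, i0:i0 + K] @ params[i0:i0 + K, o0:o0 + K]
    return out

x, w = np.random.randn(32, 64), np.random.randn(64, 48)
assert np.allclose(blocked_channels(x, w), x @ w)   # tiling does not change the result
```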
As shown in fig. 7, the improved sparse convolution layers are connected according to the structure in the figure to construct the convolutional neural network model. The dashed arrows denote concatenation and the solid arrows denote addition; Input is the input layer, SSC a sparse convolution layer, SC a downsampling layer, Deconv an upsampling layer and LN a linear layer; K is the convolution kernel size and S the stride.
The convolutional neural network structure adopted in the invention is a variant of the UNet network common in the field of two-dimensional image segmentation. The model replaces the two-dimensional convolution in UNet with three-dimensional convolution and increases the depth of the UNet network. The advantage of this design is a larger model capacity, which improves the accuracy of semantic segmentation. The three-dimensional convolutional neural network includes:
a downsampling part comprising a series of convolution layers, batch normalization layers, activation layers and downsampling layers, wherein the stride of the downsampling layers is 2 and the convolution kernel size is 3;
an upsampling part comprising a series of convolution layers, batch normalization layers, activation layers and upsampling layers, wherein the stride of the upsampling layers is 2 and the convolution kernel size is 3;
the upsampling part and the downsampling part are symmetric and cascaded; at each resolution level, the features of the downsampling part are concatenated with the features of the upsampling part.
In one embodiment of the present invention, the three-dimensional convolutional neural network comprises 6 downsampling layers and 6 upsampling layers in total, with 3 convolution layers and batch normalization layers between every two adjacent downsampling or upsampling layers. The number of output channels of these layers corresponds to the resolution level. The network contains 7 resolution levels in total, and the corresponding numbers of convolution-layer output channels are 16, 32, 48, 64, 80, 96 and 112.
After the input data pass through the input layer, the downsampling part and the upsampling part, the semantic prediction of the network is finally output through a linear layer. The number of output channels corresponds to the number of semantic labels. In this embodiment, 20 labels are predicted in total, so the number of final output channels is 20.
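For reference only, the sketch below assembles a dense PyTorch stand-in for this architecture: 7 resolution levels with 16 to 112 output channels, 6 stride-2 downsampling and 6 stride-2 upsampling layers, 3 convolution + batch normalization blocks per level, concatenation skip connections, and a final linear layer with 20 output channels. Dense nn.Conv3d layers replace the sparse convolutions purely for illustration, and the residual additions of fig. 7 are omitted.

```python
import torch
import torch.nn as nn

CHANNELS = [16, 32, 48, 64, 80, 96, 112]   # output channels, one entry per resolution level
NUM_CLASSES = 20

def conv_block(c_in, c_out, n=3):
    """n x (conv 3x3x3 + batch norm + ReLU) at a fixed resolution."""
    layers = []
    for i in range(n):
        layers += [nn.Conv3d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.BatchNorm3d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class UNet3D(nn.Module):
    """Dense stand-in for the sparse 3D UNet variant described above."""
    def __init__(self):
        super().__init__()
        self.enc, self.down = nn.ModuleList(), nn.ModuleList()
        for i, c in enumerate(CHANNELS):
            self.enc.append(conv_block(1 if i == 0 else c, c))
            if i < len(CHANNELS) - 1:       # 6 downsampling layers, kernel 3, stride 2
                self.down.append(nn.Conv3d(c, CHANNELS[i + 1], 3, stride=2, padding=1))
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        for i in range(len(CHANNELS) - 1, 0, -1):   # 6 upsampling layers, kernel 3, stride 2
            self.up.append(nn.ConvTranspose3d(CHANNELS[i], CHANNELS[i - 1], 3,
                                              stride=2, padding=1, output_padding=1))
            # encoder features are concatenated with decoder features (skip connection)
            self.dec.append(conv_block(2 * CHANNELS[i - 1], CHANNELS[i - 1]))
        self.head = nn.Linear(CHANNELS[0], NUM_CLASSES)   # per-voxel semantic logits

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.down):
                skips.append(x)
                x = self.down[i](x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x.permute(0, 2, 3, 4, 1))

model = UNet3D().eval()                      # eval mode so batch norm works on a single sample
with torch.no_grad():
    logits = model(torch.zeros(1, 1, 64, 64, 64))
print(logits.shape)                          # torch.Size([1, 64, 64, 64, 20])
```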
In another embodiment of the present invention, the same type of neural network structure can be used instead, and different network parameters can be used, such as changing the number of upsampling and downsampling layers, changing the size of a convolution kernel, changing the number of convolution layer channels, and the like.
In one embodiment of the invention, the building of the three-dimensional geometric model and the semantic segmentation are combined in a multi-threaded manner, i.e. the three-dimensional geometric modeling and the three-dimensional semantic segmentation are performed simultaneously and in parallel in different threads.
As shown in fig. 8, in the mapping thread, the geometric model obtained after mesh extraction undergoes a series of data preprocessing steps and is then input into the semantic segmentation network. The output semantic labels are then integrated into the three-dimensional model for real-time rendering.
As shown in fig. 9, the semantic modeling includes:
S31: inputting the three-dimensional geometric model, after data preprocessing, into the three-dimensional convolutional neural network;
S32: integrating the output semantic labels into the three-dimensional geometric model for rendering.
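A minimal sketch of step S32, assuming the network outputs one logit vector per mesh vertex (or per voxel that a vertex falls into): the label is the per-vertex argmax, and a colour lookup provides the vertex colours used by the renderer. The palette below is hypothetical.

```python
import numpy as np

# Hypothetical palette: one RGB colour per semantic label (20 classes assumed).
PALETTE = (np.random.RandomState(0).rand(20, 3) * 255).astype(np.uint8)

def fuse_labels(logits: np.ndarray):
    """Turn per-vertex logits (N x 20) into labels and display colours for the mesh."""
    labels = logits.argmax(axis=1)    # most likely class per vertex
    colors = PALETTE[labels]          # colour attached to each vertex for rendering
    return labels, colors

labels, colors = fuse_labels(np.random.randn(100, 20))
print(labels.shape, colors.shape)     # (100,) (100, 3)
```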
In this embodiment, the three-dimensional convolutional neural network is pre-trained on the public dataset S3DIS and then fine-tuned with a small amount of collected real-scene data.
By accelerating the convolution computation, the invention achieves real-time performance; at the same time, the acceleration algorithm reduces the amount of computation, so the system can run on portable devices with limited hardware resources (such as a Microsoft Surface Pro).
As shown in fig. 10, which is a schematic diagram of the hardware in an embodiment of the invention, the hardware includes a notebook computer 1 and an RGB-D camera 2; the specific devices used are a Surface Book and a depth camera. The method performs semantic reconstruction in real time with a semantic label update rate of 2 Hz. The semantic reconstruction results are of high quality, and objects such as the floor, walls, desks and chairs are well distinguished. This experiment demonstrates the effectiveness of the invention and verifies that it can run on a portable device in real time.
It will be appreciated that the invention may also be used with other computing devices such as desktop computers and the like.
To compare the quality of the semantic model generated by the method with existing methods, the semantic segmentation accuracy of the method was tested on the public dataset S3DIS. S3DIS covers many types of scenes and contains ground-truth semantic labels, and is widely used to evaluate three-dimensional semantic segmentation results. The three-dimensional neural network model is trained on the S3DIS training set, and the accuracy of the semantic model generated by the method is computed on the test set. The results are evaluated with the IoU, i.e. the ratio of the intersection to the union of the predicted results and the ground truth, averaged over all object classes. The method of the invention obtains a mean IoU of 68.34%, higher than the previous best of 65.4% obtained by MinkowskiEngine, which demonstrates the effectiveness of the method.
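The mean IoU metric used here can be computed as follows; S3DIS defines 13 semantic classes, and the random labels are only a usage example:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 13) -> float:
    """Mean IoU: per class, intersection over union of predicted and ground-truth
    points, averaged over the classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                      # ignore classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))

pred, gt = np.random.randint(0, 13, 10000), np.random.randint(0, 13, 10000)
print(f"mIoU = {mean_iou(pred, gt):.2%}")
```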
The effectiveness of the sparse convolution acceleration method of the invention was also evaluated. On the same set of real large-scene input data, one prediction with a convolutional neural network built from existing sparse convolution takes 1267 milliseconds, while one prediction with the convolutional neural network of the method takes 242.1 milliseconds, so the computational efficiency is greatly improved.
It will be understood by those skilled in the art that all or part of the steps of the embodiments described above may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the various method embodiments described above are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
The method can be used for real-time semantic reconstruction to provide effective scene information for applications such as AR/VR and the like; alternatively, a sparse convolution acceleration algorithm is used to increase the training speed of other networks.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. It will be apparent to those skilled in the art that various equivalent substitutions and obvious modifications can be made without departing from the spirit of the invention, and all changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (8)

1. A real-time large scene three-dimensional semantic modeling method is characterized by comprising the following steps:
S1: constructing a three-dimensional geometric model from RGB (red, green and blue) images and depth images obtained by scanning a scene with a sensor;
the sensor is an RGB-D depth sensor, and the three-dimensional geometric model is constructed through a three-dimensional reconstruction system based on simultaneous localization and mapping;
the method comprises the following steps:
S11: calculating the relative displacement between frames of the RGB images through a tracking thread to estimate the pose of the sensor;
S12: further optimizing the pose of the sensor through an optimization thread;
S13: fusing the point clouds of the depth maps into a signed distance field;
S14: extracting, by a mapping thread, a mesh from the signed distance field to generate the three-dimensional geometric model;
S2: inputting the three-dimensional geometric model into a three-dimensional convolutional neural network to complete semantic segmentation;
S3: integrating the semantic labels output by the three-dimensional convolutional neural network into the three-dimensional geometric model to complete semantic modeling;
wherein the constructing of the three-dimensional geometric model and the semantic segmentation are combined in a multithreading manner and performed simultaneously.
2. The method for real-time large scene three-dimensional semantic modeling according to claim 1, wherein the step of inputting the three-dimensional geometric model into a three-dimensional convolutional neural network to perform semantic segmentation comprises the following steps:
S21: constructing a sparse convolution layer;
S22: constructing the three-dimensional convolutional neural network from the sparse convolution layers.
3. The method for real-time large scene three-dimensional semantic modeling according to claim 2, wherein constructing the sparse convolution layer comprises the steps of:
S211: dividing the point cloud of the sensor into a plurality of cubic blocks of side length M according to the three-dimensional coordinates;
S212: judging whether each block contains point cloud data; if so, the block is a valid block and is kept; if not, it is an empty block and is discarded;
S213: performing sparse convolution in parallel on all the valid blocks.
4. The method according to claim 3, wherein performing sparse convolution in parallel on all the valid blocks using a graphics processor comprises: supposing the valid blocks contain N three-dimensional points in total, the number of input channels is I, the number of output channels is O, and V is the spatial volume of the convolution kernel, the parameter matrix required by one sparse convolution layer has size I×O×V; the input channels and output channels are split into groups of size K, and each graphics processor thread is responsible for K input channels, convolving them with a K×V parameter matrix to obtain K output channels.
5. The method for real-time large scene three-dimensional semantic modeling according to claim 2, wherein the three-dimensional convolutional neural network comprises:
a downsampling part comprising a series of convolution layers, batch normalization layers, activation layers and downsampling layers, wherein the stride of the downsampling layers is 2 and the convolution kernel size is 3;
an upsampling part comprising a series of convolution layers, batch normalization layers, activation layers and upsampling layers, wherein the stride of the upsampling layers is 2 and the convolution kernel size is 3;
the upsampling part and the downsampling part are symmetric and cascaded; at each resolution level, the features of the downsampling part are concatenated with the features of the upsampling part.
6. The method for real-time large-scene three-dimensional semantic modeling according to claim 5, wherein the three-dimensional convolutional neural network is pre-trained on the public dataset S3DIS, a large-scene indoor 3D point cloud dataset.
7. The method for real-time large scene three-dimensional semantic modeling according to claim 5, wherein the semantic modeling comprises:
S31: inputting the three-dimensional geometric model, after data preprocessing, into the three-dimensional convolutional neural network;
S32: integrating the output semantic labels into the three-dimensional geometric model for real-time rendering.
8. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-7.
CN202010095361.4A 2020-02-17 2020-02-17 Real-time large-scene three-dimensional semantic modeling method Active CN111311663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010095361.4A CN111311663B (en) 2020-02-17 2020-02-17 Real-time large-scene three-dimensional semantic modeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010095361.4A CN111311663B (en) 2020-02-17 2020-02-17 Real-time large-scene three-dimensional semantic modeling method

Publications (2)

Publication Number Publication Date
CN111311663A CN111311663A (en) 2020-06-19
CN111311663B true CN111311663B (en) 2023-04-18

Family

ID=71147093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010095361.4A Active CN111311663B (en) 2020-02-17 2020-02-17 Real-time large-scene three-dimensional semantic modeling method

Country Status (1)

Country Link
CN (1) CN111311663B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968121B (en) * 2020-08-03 2021-12-03 电子科技大学 Three-dimensional point cloud scene segmentation method based on instance embedding and semantic fusion
CN112070893B (en) * 2020-09-15 2024-04-02 大连理工大学 Dynamic sea surface three-dimensional modeling method based on deep learning and storage medium
CN112598783B (en) * 2020-12-17 2023-08-25 中国城市规划设计研究院 Three-dimensional geometric data calculation method and three-dimensional geometric data calculation network architecture
CN112330536B (en) * 2021-01-04 2021-04-09 航天宏图信息技术股份有限公司 Sensor data processing method and device, electronic equipment and storage medium
CN114331827B (en) * 2022-03-07 2022-06-07 深圳市其域创新科技有限公司 Style migration method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106803256A (en) * 2017-01-13 2017-06-06 深圳市唯特视科技有限公司 A kind of 3D shape based on projection convolutional network is split and semantic marker method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180300531A1 (en) * 2017-04-17 2018-10-18 Htc Corporation Computer-implemented 3d model analysis method, electronic device, and non-transitory computer readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106803256A (en) * 2017-01-13 2017-06-06 深圳市唯特视科技有限公司 A kind of 3D shape based on projection convolutional network is split and semantic marker method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林金花; 王延杰. 三维语义场景复原网络 [Three-dimensional semantic scene reconstruction network]. 光学精密工程 [Optics and Precision Engineering], 2018, (05), full text. *

Also Published As

Publication number Publication date
CN111311663A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111311663B (en) Real-time large-scene three-dimensional semantic modeling method
US11798173B1 (en) Moving point detection
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
Zhong et al. A survey of LiDAR and camera fusion enhancement
Jeong et al. Multimodal sensor-based semantic 3D mapping for a large-scale environment
CN111311611B (en) Real-time three-dimensional large-scene multi-object instance segmentation method
Li et al. ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion
Shreyas et al. 3D object detection and tracking methods using deep learning for computer vision applications
CN111797836A (en) Extraterrestrial celestial body patrolling device obstacle segmentation method based on deep learning
Ku et al. SHREC 2020: 3D point cloud semantic segmentation for street scenes
Yang et al. A semantic SLAM-based dense mapping approach for large-scale dynamic outdoor environment
Corral-Soto et al. 3D town: the automatic urban awareness project
Bullinger et al. 3d vehicle trajectory reconstruction in monocular video data using environment structure constraints
Wang et al. AMDCNet: An attentional multi-directional convolutional network for stereo matching
EP4001965A1 (en) Lidar localization using optical flow
Cholakkal et al. LiDAR-stereo camera fusion for accurate depth estimation
Ouyang et al. Semantic slam for mobile robot with human-in-the-loop
Tanner et al. DENSER cities: A system for dense efficient reconstructions of cities
Hou et al. Octree-Based Approach for Real-Time 3D Indoor Mapping Using RGB-D Video Data
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN116503602A (en) Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement
Zhao et al. DHA: Lidar and vision data fusion-based on road object classifier
US20230267615A1 (en) Systems and methods for generating a road surface semantic segmentation map from a sequence of point clouds
US20220237402A1 (en) Static occupancy tracking
CN114693744A (en) Optical flow unsupervised estimation method based on improved cycle generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant