CN111462324B - Online spatiotemporal semantic fusion method and system - Google Patents

Online spatiotemporal semantic fusion method and system

Info

Publication number
CN111462324B
CN111462324B (application CN202010418823.1A)
Authority
CN
China
Prior art keywords
voxel
semantic
data
fusion
network
Prior art date
Legal status
Active
Application number
CN202010418823.1A
Other languages
Chinese (zh)
Other versions
CN111462324A (en)
Inventor
于耀
骆润豪
周余
都思丹
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010418823.1A priority Critical patent/CN111462324B/en
Publication of CN111462324A publication Critical patent/CN111462324A/en
Application granted granted Critical
Publication of CN111462324B publication Critical patent/CN111462324B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05 Geographic models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/08 Indexing scheme for image data processing or generation, in general involving all processing steps from image acquisition to 3D model generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds

Abstract

The invention relates to an online spatiotemporal semantic fusion method and system. The method comprises the following steps: acquiring initial data of an object to be semantically fused and a 2D semantic segmentation network; with the initial data as input, determining the information of each data point in single-frame point cloud data using the 2D semantic segmentation network; transforming the single-frame point cloud data into a three-dimensional world coordinate system with the voxel as the basic unit, and building a three-dimensional voxel grid map with a dictionary data structure; generating a voxel data set from the voxels in the three-dimensional voxel grid map; acquiring an online spatiotemporal semantic fusion network; and, with the image feature vectors of the voxels in the voxel data set as input, determining a three-dimensional semantic fusion map of the object to be semantically fused using the online spatiotemporal semantic fusion network. By innovatively building the network around an attention mechanism, the method and system overcome the prior-art limitation of looking back over only a limited number of frames and failing to fully exploit historical information, and achieve efficient and complete semantic fusion.

Description

Online spatiotemporal semantic fusion method and system
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to an online spatiotemporal semantic fusion method and system.
Background
As artificial intelligence has developed, intelligent robotics has drawn increasing attention, and obtaining accurate 3D structure and semantic information has always been of central importance in robotics.
Currently, in the field of semantic segmentation, 2D semantic segmentation processes single pictures directly, which is computationally cheap but exploits neither temporal nor spatial information. Other methods use 3D spatial structure information directly for semantic segmentation, but they often run into problems in practice. PointNet-related methods usually require well-registered, stitched point clouds as input, which entails a large amount of computation and precludes real-time operation. Still other methods convolve directly in 3D space, but because 3D space is sparse, 3D convolution generally cannot match the effectiveness of 2D convolution.
The field of robotics often works with time-sequenced 3D scans and image data. To obtain 3D structure and semantic information from such data while avoiding some of the drawbacks above, some methods project the per-pixel semantic probabilities produced by 2D methods into three-dimensional space and perform Bayesian probability fusion. Although these hidden Markov model (HMM) based methods achieve good real-time performance, they can usually only look back over a limited number of frames and cannot fully exploit historical information. Moreover, fusing probabilities directly discards a large amount of information.
Disclosure of Invention
The invention aims to provide an online spatiotemporal semantic fusion method and system that overcome the prior-art problems of tracing only a limited number of frames and failing to fully exploit historical information, while achieving efficient and complete semantic fusion.
In order to achieve the purpose, the invention provides the following scheme:
an online spatiotemporal semantic fusion method, comprising:
acquiring initial data of an object to be semantically fused and a 2D semantic segmentation network; the initial data comprises point cloud data and an RGB picture;
determining the information of each data point in single-frame point cloud data by using the 2D semantic segmentation network by taking the initial data as input; the information for each of the data points includes: coordinates of the data points in a world coordinate system, image semantic feature vectors corresponding to the data points and semantic label values of the data points;
transforming the single-frame point cloud data into a three-dimensional world coordinate system by taking the voxel as a basic unit, and establishing a three-dimensional voxel grid map by using a dictionary data structure; each voxel in the three-dimensional voxel grid map comprises an image characteristic vector and a label corresponding to the data point;
generating a voxel data set from the voxels in the three-dimensional voxel grid map;
acquiring an online spatiotemporal semantic fusion network;
determining a three-dimensional semantic fusion map of the object to be semantically fused by using the online space-time semantic fusion network by taking the image feature vector of the voxel in the voxel data set as input; the three-dimensional semantic fusion map comprises voxel type probability.
Preferably, the transforming the single-frame point cloud data into a three-dimensional world coordinate system by using the voxel as a basic unit and establishing a three-dimensional voxel grid map by using a dictionary data structure specifically include:
taking a key of each element in the dictionary data structure as a center coordinate of a voxel, taking a value of each element as a list, and storing data of data points of historical frames falling in the voxel;
after new single-frame point cloud data are obtained, indexing the dictionary data structure using as key the center coordinate of the voxel in which each data point falls, and obtaining an index result;
if the index result is that the center coordinate is indexed, adding data of the data point to the tail part of the list of the corresponding element;
if the index result is that the center coordinate cannot be indexed, creating a new element taking the voxel center coordinate as a key, and creating an empty list to add the data point into the dictionary data structure;
and returning to the step of indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key after new single-frame point cloud data is obtained, and obtaining an index result until all frames of point cloud data are accumulated in the dictionary data structure, thereby obtaining the three-dimensional voxel grid map.
Preferably, the construction process of the online spatiotemporal semantic fusion network includes:
acquiring an observation self-adaptive semantic state updating network and a self-attention information fusion network; the input of the observation self-adaptive semantic state updating network is an image feature vector in the voxel data set, and the output is an observation self-adaptive semantic fusion state of each voxel in time; the input of the self-attention information fusion network is the observation self-adaptive semantic fusion state, and the output is a voxel type prediction vector;
and constructing the online spatiotemporal semantic fusion network according to the observation self-adaptive semantic state updating network and the self-attention information fusion network.
Preferably, the process for constructing the observation adaptive semantic state updating network includes:
acquiring the normal vector of each voxel in the voxel data set and the image semantic features corresponding to the historical frame point cloud data in the voxel data set, and taking the sensor pose expressed with the voxel center as the coordinate origin;
adopting a gated recurrent unit, with the normal vector and the sensor pose as input, to obtain an observation validity state:

a_i^t = GRU(concatenate(V_i^t, n_i^t), a_i^{t-1})

where a_i^t is the observation validity state, V_i^t is the sensor pose of the i-th voxel at time t, n_i^t is the normal vector of the i-th voxel at time t, and GRU denotes the gated recurrent unit;
according to the image semantic features and the observation validity state, determining with the gated recurrent unit the observation-adaptive semantic fusion state of each voxel over time:

F_i^t = GRU(concatenate(s_i^t, a_i^t), F_i^{t-1})

where F_i^t is the observation-adaptive semantic fusion state, GRU is the gated recurrent unit, concatenate is the concatenation operation, s_i^t is the image semantic feature, and a_i^t is the observation validity state.
Preferably, the process of constructing the self-attention information fusion network includes:
taking a cube centered on the current voxel as the search range, searching for the neighborhood voxels of the current voxel, and adding them to a neighborhood voxel list of the current voxel; the neighborhood voxel list contains the current voxel;
acquiring the observation self-adaptive semantic fusion state and the offset vector from the neighborhood voxel to the current voxel;
determining a normalized attention weight according to the observation adaptive semantic fusion state and the offset vector;
determining the semantic hidden layer output of the voxel after space-time fusion according to the normalized attention weight; and obtaining a semantic probability prediction vector through a full connection layer and a softmax layer after the hidden layer is output.
An online spatiotemporal semantic fusion system, comprising:
a first acquisition module, used for acquiring initial data of an object to be semantically fused and a 2D semantic segmentation network; the initial data comprises point cloud data and an RGB picture;
a data point information determining module, configured to determine information of each data point in the single-frame point cloud data by using the 2D semantic segmentation network with the initial data as input; the information for each of the data points includes: coordinates of the data points in a world coordinate system, image semantic feature vectors corresponding to the data points and semantic label values of the data points;
the three-dimensional voxel grid map building module is used for transforming the single-frame point cloud data into a three-dimensional world coordinate system by taking a voxel as a basic unit and building a three-dimensional voxel grid map by using a dictionary data structure; each voxel in the three-dimensional voxel grid map comprises an image characteristic vector and a label corresponding to the data point;
a voxel data set generating module for generating a voxel data set according to the voxels in the three-dimensional voxel grid map;
the second acquisition module is used for acquiring an online spatiotemporal semantic fusion network;
the semantic fusion module is used for determining a three-dimensional semantic fusion map of the object to be subjected to semantic fusion by using the online spatiotemporal semantic fusion network by taking the image feature vector of the voxel in the voxel data set as input so as to complete the semantic fusion of the object to be subjected to semantic fusion; the three-dimensional semantic fusion map comprises voxel type probability.
Preferably, the three-dimensional voxel grid map building module specifically includes:
a data point data storage unit, configured to store data of data points of a history frame falling in a voxel, where a key of each element in the dictionary data structure is used as a center coordinate of the voxel, and a value of each element is used as a list;
the index result determining unit is used for indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key after acquiring new single-frame point cloud data to obtain an index result;
the data point data adding unit is used for adding data of data points to the tail part of the list of the corresponding elements when the index result is that the center coordinate is indexed;
a data point adding unit, configured to create a new element using the voxel center coordinate as a key if the index result is that the index does not reach the center coordinate, and create an empty list to add the data point to the dictionary data structure;
and the three-dimensional voxel grid map determining unit is used for returning to the step of obtaining new single-frame point cloud data, indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key, and obtaining an index result until the point cloud data of all the frames are accumulated in the dictionary data structure, so as to obtain the three-dimensional voxel grid map.
Preferably, the system comprises an online spatiotemporal semantic fusion network construction module; the online space-time semantic fusion network construction module comprises:
the acquisition network unit is used for acquiring an observation self-adaptive semantic state updating network and a self-attention information fusion network; the input of the observation self-adaptive semantic state updating network is an image feature vector in the voxel data set, and the output is an observation self-adaptive semantic fusion state of each voxel in time; the input of the self-attention information fusion network is the observation self-adaptive semantic fusion state, and the output is a voxel type prediction vector;
and the online space-time semantic fusion network construction unit is used for constructing the online space-time semantic fusion network according to the observation self-adaptive semantic state updating network and the self-attention information fusion network.
Preferably, the online spatiotemporal semantic fusion network construction module further includes: an observation self-adaptive semantic state updating network construction unit; the observation adaptive semantic state updating network construction unit specifically comprises:
the first acquisition subunit is used for acquiring the normal vector of each voxel in the voxel data set and the image semantic features corresponding to the historical frame point cloud data in the voxel data set, and for taking the sensor pose expressed with the voxel center as the coordinate origin;
the observation validity state determining subunit is used for obtaining the observation validity state through a gated recurrent unit, using the normal vector and the sensor pose as input; the observation validity state is:

a_i^t = GRU(concatenate(V_i^t, n_i^t), a_i^{t-1})

where a_i^t is the observation validity state, V_i^t is the sensor pose of the i-th voxel at time t, n_i^t is the normal vector of the i-th voxel at time t, and GRU denotes the gated recurrent unit;
the observation-adaptive semantic fusion state determining subunit is used for determining the observation-adaptive semantic fusion state of each voxel over time with the gated recurrent unit, according to the image semantic features and the observation validity state; the observation-adaptive semantic fusion state is:

F_i^t = GRU(concatenate(s_i^t, a_i^t), F_i^{t-1})

where F_i^t is the observation-adaptive semantic fusion state, GRU is the gated recurrent unit, concatenate is the concatenation operation, s_i^t is the image semantic feature, and a_i^t is the observation validity state.
Preferably, the online spatiotemporal semantic fusion network construction module further includes: a self-attention information fusion network construction unit; the self-attention information fusion network construction unit specifically comprises:
a neighborhood voxel searching subunit, configured to search a neighborhood voxel of a current voxel by using a cube with the current voxel as a center as a search range, and add the neighborhood voxel to a neighborhood voxel list of the current voxel; the neighborhood voxel list contains the current voxel;
the second acquisition subunit is used for acquiring the observation self-adaptive semantic fusion state and the offset vector from the neighborhood voxel to the current voxel;
a normalized attention weight determination subunit, configured to determine a normalized attention weight according to the observation adaptive semantic fusion state and the offset vector;
a semantic hidden layer output determining subunit, configured to determine, according to the normalized attention weight, a semantic hidden layer output after the voxel is subjected to spatio-temporal fusion; and obtaining a semantic probability prediction vector through a full connection layer and a softmax layer after the hidden layer is output.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the online space-time semantic fusion method and system provided by the invention, the high-dimensional semantic features are fused online in space and time by establishing the three-dimensional voxel structure of spatial hash and adopting the online space-time semantic fusion network, so that a better semantic map result is obtained.
In addition, in the online space-time semantic fusion method and system provided by the invention, a networking method based on an attention mechanism is innovatively used, so that the problem that only limited frame data can be traced and historical information cannot be fully utilized in the prior art can be solved, and the online space-time semantic fusion method and system have the characteristics of high efficiency and complete semantic fusion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of an online spatiotemporal semantic fusion method provided by the present invention;
FIG. 2 is a diagram of a data transmission process in semantic fusion using the online spatiotemporal semantic fusion method provided by the present invention;
FIG. 3 is a schematic diagram of a specific network structure of an online spatiotemporal semantic fusion network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an online spatiotemporal semantic fusion system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide an online space-time semantic fusion method and system, which can solve the problems that only limited frame data can be traced and historical information cannot be fully utilized in the prior art, and have the characteristics of high efficiency and complete semantic fusion.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides an online space-time semantic fusion method which is an attention-based online space-time semantic fusion method. Fig. 1 is a flowchart of an online spatiotemporal semantic fusion method provided by the present invention, and as shown in fig. 1, the online spatiotemporal semantic fusion method includes:
step 100: and acquiring initial data of an object to be semantically fused and a 2D semantic segmentation network. The initial data includes point cloud data and an RGB picture. Wherein the obtained 2D semantic segmentation network is a trained 2D semantic segmentation network.
The invention aims to solve the problem of spatiotemporal semantic fusion when building a real-time semantic map from 3D scans (such as lidar) and RGB video sequence data. Such datasets typically contain point cloud data, RGB pictures with their semantic labels, and optionally sensor poses. The 2D semantic segmentation network is trained to extract the semantic feature information of each frame for the downstream semantic fusion network.
Step 101: and determining the information of each data point in the single-frame point cloud data by using the initial data as input and adopting a 2D semantic segmentation network. The information for each data point includes: coordinates of the data points in the world coordinate system, image semantic feature vectors corresponding to the data points and semantic label values of the data points.
The RGB image in the initial data is processed by the trained 2D semantic segmentation network, and the feature map preceding the network's prediction layer is upsampled to the original image resolution by bilinear interpolation, yielding a feature map at the same resolution as the original image, called the image semantic feature map.
The point cloud is projected into the camera's pixel coordinate system to obtain pixel coordinates, and bilinear interpolation of the image semantic feature map at those pixel coordinates yields the image semantic features of the point cloud. Nearest-neighbor interpolation of the picture labels at the pixel coordinates yields the semantic labels of the point cloud.
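For illustration, the following minimal Python sketch shows one way such a projection and bilinear sampling could be done; the pinhole intrinsics K, the camera-frame point array, and all names are assumptions, not the patent's implementation.

    import numpy as np

    def sample_point_features(points_cam, feat_map, K):
        # Sketch under an assumed pinhole model: project camera-frame 3D
        # points with intrinsics K, then bilinearly sample the image
        # semantic feature map (H x W x C) at the projected pixel coords.
        z = points_cam[:, 2]
        u = K[0, 0] * points_cam[:, 0] / z + K[0, 2]
        v = K[1, 1] * points_cam[:, 1] / z + K[1, 2]

        h, w, _ = feat_map.shape
        u0 = np.clip(np.floor(u).astype(int), 0, w - 2)
        v0 = np.clip(np.floor(v).astype(int), 0, h - 2)
        du, dv = u - u0, v - v0

        # Bilinear blend of the four surrounding feature vectors per point
        f = (feat_map[v0, u0].T * (1 - du) * (1 - dv)
             + feat_map[v0, u0 + 1].T * du * (1 - dv)
             + feat_map[v0 + 1, u0].T * (1 - du) * dv
             + feat_map[v0 + 1, u0 + 1].T * du * dv)
        return f.T  # (N, C): one image semantic feature per point

The nearest-neighbor label lookup is correspondingly simple: round the pixel coordinates and index the label map directly.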
Step 102: the single-frame point cloud data are transformed into a three-dimensional world coordinate system with the voxel as the basic unit, and a three-dimensional voxel grid map is built with a dictionary data structure. Each voxel in the three-dimensional voxel grid map contains the image feature vectors and labels of its data points, and the class that occurs most often among those labels is taken as the label of the voxel.
Step 102 specifically includes:
and taking the key of each element in the dictionary data structure as the center coordinate of the voxel, taking the value of each element as a list, and storing the data of the data points of the historical frame falling in the voxel.
And after new single-frame point cloud data is obtained, indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key to obtain an indexing result.
And if the index result is that the center coordinate is indexed, adding the data of the data point to the tail part of the list of the corresponding element.
And if the index result is that the center coordinate cannot be indexed, creating a new element taking the voxel center coordinate as a key, and creating an empty list to add the data point into a dictionary data structure.
And returning to the step of indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key after acquiring new single-frame point cloud data to obtain an indexing result, and obtaining the three-dimensional voxel grid map after the point cloud data of all frames are accumulated to the dictionary data structure.
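A minimal Python sketch of this accumulation follows; the voxel size and all names are assumptions, and a defaultdict collapses the "key found" and "key missing" branches above into one lookup:

    from collections import defaultdict
    import numpy as np

    VOXEL_SIZE = 0.1  # assumed voxel edge length in metres

    # Key: voxel centre coordinate; value: list of (feature, label) data
    # for the historical-frame points that fell in that voxel.
    voxel_map = defaultdict(list)

    def insert_frame(points_world, features, labels):
        # Accumulate one frame of world-frame points into the grid map.
        for p, f, l in zip(points_world, features, labels):
            centre = tuple((np.floor(p / VOXEL_SIZE) + 0.5) * VOXEL_SIZE)
            voxel_map[centre].append((f, l))  # append at the list tail

    def voxel_label(centre):
        # The most frequent point label becomes the voxel's label.
        labels = [l for _, l in voxel_map[centre]]
        return max(set(labels), key=labels.count)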
Step 103: a voxel data set is generated from voxels in the three-dimensional voxel grid map. Namely, storing the image semantic features and semantic labels of the point clouds of the historical frames of all the voxels so as to generate a spatio-temporal semantic fused voxel data set.
Step 104: and acquiring an online space-time semantic fusion network.
Step 105: and determining a three-dimensional semantic fusion map of an object to be semantically fused by using an online space-time semantic fusion network by taking the image feature vector of the voxel in the voxel data set as input. The three-dimensional semantic fusion map comprises voxel class probability.
The data flow of semantic fusion with the online spatiotemporal semantic fusion method provided by the invention is shown in FIG. 2. As FIG. 2 shows, the resulting three-dimensional semantic fusion map uses a distinct gray level for each voxel class probability (in practical applications, distinct colors may be used instead). The voxel classes are chosen to suit the application at hand; for example, for a three-dimensional semantic fusion map of an outdoor scene, the voxel classes may be: buildings, people, greenery, vehicles, and the like.
The specific structure of the online spatiotemporal semantic fusion network is shown in FIG. 3. It mainly comprises two parts: a temporal observation-adaptive semantic state updating network and a spatial self-attention information fusion network. The construction process comprises the following steps:
and acquiring an observation self-adaptive semantic state updating network and a self-attention information fusion network. The input of the observation self-adaptive semantic state updating network is an image feature vector in a voxel data set, and the output is an observation self-adaptive semantic fusion state of each voxel on time. The input of the self-attention information fusion network is an observation self-adaptive semantic fusion state, and the output is a voxel type prediction vector.
And establishing an online space-time semantic fusion network according to the observation self-adaptive semantic state updating network and the self-attention information fusion network.
The function of the temporal observation-adaptive semantic state updating network is to fuse the historical frame semantic feature information of each voxel. Because voxel observations change non-uniformly during data collection, a large amount of redundant or invalid observations easily arises. Inspired by the idea of attention, the invention evaluates the validity of each frame's observation with a small network, so that the state update concentrates on informative observations rather than redundant data. The concrete structure of the temporal observation-adaptive semantic state updating network is shown in part (a) of FIG. 3; it consists of two sub-networks:
1) Observation validity evaluation network. The present invention assumes that two main factors bear on the validity of an observation: the sensor pose expressed with the current voxel as the coordinate center, and the normal vector of the current voxel. Together, the normal and the position represent the validity of the observation from a geometric perspective. The observation validity evaluation network takes these two variable factors as input and uses a Gated Recurrent Unit (GRU) to obtain the observation validity state. Specifically:
The normal vector of each voxel in the voxel data set and the image semantic features corresponding to the historical frame point cloud data are acquired, and the sensor pose is expressed with the voxel center as the coordinate origin.
A gated recurrent unit (GRU) is adopted, taking the normal vector and the sensor pose as input, to obtain the observation validity state:

a_i^t = GRU(concatenate(V_i^t, n_i^t), a_i^{t-1})

where a_i^t is the observation validity state, V_i^t is the sensor pose of the i-th voxel at time t, n_i^t is the normal vector of the i-th voxel at time t, and GRU denotes the gated recurrent unit.
2) Semantic state updating network. The image semantic features and the observation validity state are concatenated, and the observation-adaptive semantic fusion state of each voxel over time is obtained online through the GRU:

F_i^t = GRU(concatenate(s_i^t, a_i^t), F_i^{t-1})

where F_i^t is the observation-adaptive semantic fusion state, GRU is the gated recurrent unit, concatenate is the concatenation operation, s_i^t is the image semantic feature, and a_i^t is the observation validity state.
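As a hedged PyTorch-style sketch of this two-GRU update (the dimensions, the 3-D pose encoding, and the use of GRUCell are assumptions rather than the patent's exact configuration):

    import torch
    import torch.nn as nn

    class ObservationAdaptiveUpdate(nn.Module):
        # Sketch: one GRU scores observation validity from (pose, normal);
        # a second GRU fuses image semantic features gated by that score.
        def __init__(self, feat_dim=64, state_dim=64, valid_dim=16):
            super().__init__()
            self.validity_gru = nn.GRUCell(6, valid_dim)  # assumed 3-D pose + 3-D normal
            self.fusion_gru = nn.GRUCell(feat_dim + valid_dim, state_dim)

        def forward(self, pose, normal, feat, a_prev, F_prev):
            # a_i^t = GRU(concatenate(V_i^t, n_i^t), a_i^{t-1})
            a = self.validity_gru(torch.cat([pose, normal], dim=-1), a_prev)
            # F_i^t = GRU(concatenate(s_i^t, a_i^t), F_i^{t-1})
            F = self.fusion_gru(torch.cat([feat, a], dim=-1), F_prev)
            return a, F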
Performing 3D convolution directly on sparse 3D data is inefficient, and the 3D structure may change over time, which makes fixed-grid networks hard to learn. The present invention therefore uses a self-attention mechanism to explicitly measure the correlation between the current voxel and its neighborhood voxels. The structure of the spatial self-attention information fusion network is shown in part (b) of FIG. 3 and includes two key parts:
1) Neighborhood search:
A cube centered on the current voxel is taken as the search range; the neighborhood voxels of the current voxel are found and added to the current voxel's neighborhood voxel list. The neighborhood voxel list contains the current voxel itself.
2) Fusion network based on spatial self-attention:
the present invention assumes that there are two main factors that determine the correlation between the current voxel and its vicinity. One is the hidden state stored in each voxel and the other is the offset vector of the neighborhood voxels to the current voxel. In consideration of the inconsistency of feature space between the offset vector and the hidden state in the voxel, the invention designs a lightweight encoder to realize space transformation and embed the offset vector. And the offset vectors are subjected to dimensionality enhancement and series connection through a layer of full connection layer, and then are embedded through a layer of full connection layer. The embedded output is split into two branches: one containing only the information of the target voxel to generate the query vector. The other one contains all neighborhood voxels and is input into SENET, generating a weighted feature dictionary of neighborhood voxels. The key uses the same output as the value. In the correlation calculation, the dot product is selected to reduce the computational complexity.
Therefore, after the observation-adaptive semantic fusion state and the offset vectors from the neighborhood voxels to the current voxel are obtained, the normalized attention weight is determined from them:

w_ij^t = softmax_j( f(F_i^t, d_ii^t)^T · S(f(F_j^t, d_ij^t)) )

where F_i^t is the observation-adaptive semantic fusion state, d_ij^t is the offset vector from the j-th neighborhood voxel to the current voxel i at time t, f(·) is the lightweight encoder, f(·)^T is its transpose, S(·) denotes SENet, and w_ij^t is the normalized attention weight of the j-th neighborhood voxel of the current voxel i at time t.
After the normalized attention weights are obtained, the semantic hidden-layer output of the voxel after spatiotemporal fusion is computed as

H_i^t = Σ_j w_ij^t · S(f(F_j^t, d_ij^t))

where H_i^t is the semantic hidden-layer output. Further, in FIG. 3, the SENet output S(f(F_j^t, d_ij^t)) serves as both the key and the value of each element.
The hidden-layer output then passes through a fully connected layer and a softmax layer to obtain the semantic probability prediction.
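Putting the pieces together, a hedged PyTorch sketch of the spatial fusion head follows; the layer sizes, the SE-style gate standing in for SENet, and all names are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialSelfAttentionFusion(nn.Module):
        # Sketch: offsets are lifted by one FC layer, concatenated with the
        # voxel states and embedded by a second FC layer; the target branch
        # yields the query, the neighbourhood branch passes an SE-style
        # gate to yield keys (= values); dot-product attention plus softmax
        # gives the fused hidden output, then FC + softmax predicts classes.
        def __init__(self, state_dim=64, off_dim=16, embed_dim=64, n_classes=20):
            super().__init__()
            self.lift = nn.Linear(3, off_dim)             # raise offset dimension
            self.embed = nn.Linear(state_dim + off_dim, embed_dim)
            self.se = nn.Sequential(                      # SE-style channel gate
                nn.Linear(embed_dim, embed_dim // 4), nn.ReLU(),
                nn.Linear(embed_dim // 4, embed_dim), nn.Sigmoid())
            self.head = nn.Linear(embed_dim, n_classes)

        def forward(self, states, offsets):
            # states: (K, state_dim) fusion states F_j^t of the K neighbours,
            # index 0 being the current voxel; offsets: (K, 3) vectors d_ij^t.
            e = self.embed(torch.cat([states, self.lift(offsets)], dim=-1))
            query = e[0]                        # target-voxel branch
            keys = e * self.se(e)               # gated branch (keys = values)
            w = F.softmax(keys @ query, dim=0)  # normalised weights w_ij^t
            hidden = (w.unsqueeze(-1) * keys).sum(dim=0)   # H_i^t
            return F.softmax(self.head(hidden), dim=-1)    # class probabilities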
In order to improve the accuracy of semantic fusion, the constructed online spatiotemporal semantic fusion network is trained and tested on the constructed voxel data set, with training and test data split 3:1. The loss function is the classical cross-entropy used in the semantic segmentation field, with an L2 regularization loss added to prevent overfitting. Training proceeds until the loss converges; mIoU (mean intersection over union) is the evaluation metric, and the model that performs best on the test set is selected as the final online spatiotemporal semantic fusion network.
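A brief Python sketch of this objective and metric (the regularization weight is an assumption):

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    l2_lambda = 1e-4  # assumed regularization weight

    def loss_fn(model, logits, targets):
        # Cross-entropy plus an L2 penalty on the weights against overfitting
        l2 = sum(p.pow(2).sum() for p in model.parameters())
        return criterion(logits, targets) + l2_lambda * l2

    def miou(pred, target, n_classes):
        # Mean intersection-over-union over the classes present
        ious = []
        for c in range(n_classes):
            inter = ((pred == c) & (target == c)).sum().item()
            union = ((pred == c) | (target == c)).sum().item()
            if union > 0:
                ious.append(inter / union)
        return sum(ious) / len(ious)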
The online spatiotemporal semantic fusion network can then be applied: during semantic mapping, the spatial information and RGB semantic features of the historical frames' point clouds are accumulated, and the online spatiotemporal semantic fusion method provided by the invention fuses the spatiotemporal semantic features of the point cloud, obtaining a better semantic fusion result online.
Corresponding to the online spatiotemporal semantic fusion method provided above, the invention also provides an online spatiotemporal semantic fusion system, as shown in FIG. 4. The system comprises: a first acquisition module 400, a data point information determination module 401, a three-dimensional voxel grid map creation module 402, a voxel data set generation module 403, a second acquisition module 404, and a semantic fusion module 405.
The first obtaining module 400 is configured to obtain initial data of an object to be semantically fused and a 2D semantic segmentation network. The initial data includes point cloud data and an RGB picture.
The data point information determining module 401 is configured to determine information of each data point in a single frame of point cloud data by using 2D semantic segmentation network with initial data as input. The information for each data point includes: coordinates of the data points in the world coordinate system, image semantic feature vectors corresponding to the data points and semantic label values of the data points.
The three-dimensional voxel grid map building module 402 is configured to transform the single-frame point cloud data into a three-dimensional world coordinate system with voxels as basic units, and build a three-dimensional voxel grid map using a dictionary data structure. Each voxel in the three-dimensional voxel grid map comprises an image feature vector and a label corresponding to the data point.
The voxel data set generating module 403 is used for generating a voxel data set from voxels in the three-dimensional voxel grid map.
The second obtaining module 404 is configured to obtain an online spatiotemporal semantic fusion network.
The semantic fusion module 405 is configured to determine a three-dimensional semantic fusion map of an object to be semantically fused by using an online spatiotemporal semantic fusion network, with the image feature vector of a voxel in the voxel data set as input, so as to complete semantic fusion of the object to be semantically fused. The three-dimensional semantic fusion map comprises voxel class probability.
As another embodiment of the present invention, the three-dimensional voxel grid map building module 402 specifically includes: the device comprises a data point data storage unit, an index result determination unit, a data point data adding unit, a data point adding unit and a three-dimensional voxel grid map determination unit.
The data point data storage unit is used for storing data of data points of historical frames falling in a voxel by taking a key of each element in the dictionary data structure as the center coordinate of the voxel and taking the value of each element as a list.
And the index result determining unit is used for indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key after acquiring the new single-frame point cloud data to obtain an index result.
The data point data adding unit is used for adding the data of the data point to the tail part of the list of the corresponding element when the index result is that the index is to the center coordinate.
And the data point adding unit is used for creating a new element taking the voxel center coordinate as a key if the index result is that the center coordinate is not obtained, and creating an empty list to add the data point into the dictionary data structure.
And the three-dimensional voxel grid map determining unit is used for returning to the step of obtaining new single-frame point cloud data, indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key to obtain an indexing result until the point cloud data of all frames are accumulated to the dictionary data structure, and obtaining the three-dimensional voxel grid map.
As another embodiment of the invention, the system comprises an online spatiotemporal semantic fusion network construction module. The online space-time semantic fusion network construction module comprises: and acquiring a network unit and an online spatiotemporal semantic fusion network construction unit.
The acquisition network unit is used for acquiring an observation self-adaptive semantic state updating network and a self-attention information fusion network. The input of the observation self-adaptive semantic state updating network is an image feature vector in a voxel data set, and the output is an observation self-adaptive semantic fusion state of each voxel in time. The input of the self-attention information fusion network is an observation self-adaptive semantic fusion state, and the output is a voxel type prediction vector.
The online space-time semantic fusion network construction unit is used for constructing an online space-time semantic fusion network according to the observation self-adaptive semantic state updating network and the self-attention information fusion network.
As another embodiment of the invention, the online spatiotemporal semantic fusion network construction module further comprises: and the observation self-adaptive semantic state updating network construction unit. The observation adaptive semantic state updating network construction unit specifically comprises: the device comprises a first acquisition subunit, an observation validity state determination subunit and an observation adaptive semantic fusion state determination subunit.
The first acquisition subunit is used for acquiring the normal vector of each voxel in the voxel data set and the image semantic features corresponding to the historical frame point cloud data in the voxel data set, and for taking the sensor pose expressed with the voxel center as the coordinate origin.
The observation validity state determining subunit is used for obtaining the observation validity state through a gated recurrent unit, taking the normal vector and the sensor pose as input. The observation validity state is:

a_i^t = GRU(concatenate(V_i^t, n_i^t), a_i^{t-1})

where a_i^t is the observation validity state, V_i^t is the sensor pose of the i-th voxel at time t, n_i^t is the normal vector of the i-th voxel at time t, and GRU denotes the gated recurrent unit.
The observation-adaptive semantic fusion state determining subunit is used for determining the observation-adaptive semantic fusion state of each voxel over time with the gated recurrent unit, according to the image semantic features and the observation validity state. The observation-adaptive semantic fusion state is:

F_i^t = GRU(concatenate(s_i^t, a_i^t), F_i^{t-1})

where F_i^t is the observation-adaptive semantic fusion state, GRU is the gated recurrent unit, concatenate is the concatenation operation, s_i^t is the image semantic feature, and a_i^t is the observation validity state.
As another embodiment of the present invention, the above online spatiotemporal semantic fusion network building module may further include: and the self-attention information fusion network construction unit. The self-attention information fusion network construction unit specifically comprises: the system comprises a neighborhood voxel searching subunit, a second acquiring subunit, a normalized attention weight determining subunit and a semantic hidden layer output determining subunit.
The neighborhood voxel searching subunit is used for searching a neighborhood voxel of the current voxel by taking a cube with the current voxel as a center as a searching range and adding the neighborhood voxel into a neighborhood voxel list of the current voxel. The neighborhood voxel list contains the current voxel.
The second obtaining subunit is used for obtaining an observation self-adaptive semantic fusion state and a shift vector from a neighborhood voxel to a current voxel.
The normalized attention weight determination subunit is used for determining a normalized attention weight according to the observation adaptive semantic fusion state and the offset vector.
And the semantic hidden layer output determining subunit is used for determining the semantic hidden layer output of the voxel after space-time fusion according to the normalized attention weight. And obtaining a semantic probability prediction vector through a full connection layer and a softmax layer after the hidden layer is output.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (2)

1. An online spatiotemporal semantic fusion method, comprising:
acquiring initial data of an object to be semantically fused and a 2D semantic segmentation network; the initial data comprises point cloud data and an RGB picture;
determining the information of each data point in single-frame point cloud data by using the 2D semantic segmentation network by taking the initial data as input; the information for each of the data points includes: coordinates of the data points in a world coordinate system, image semantic feature vectors corresponding to the data points and semantic label values of the data points;
transforming the single-frame point cloud data into a three-dimensional world coordinate system by taking the voxel as a basic unit, and establishing a three-dimensional voxel grid map by using a dictionary data structure; each voxel in the three-dimensional voxel grid map comprises an image characteristic vector and a label corresponding to the data point;
generating a voxel data set from the voxels in the three-dimensional voxel grid map;
acquiring an online spatiotemporal semantic fusion network;
determining a three-dimensional semantic fusion map of the object to be semantically fused by using the online space-time semantic fusion network by taking the image feature vector of the voxel in the voxel data set as input; the three-dimensional semantic fusion map comprises voxel type probability;
the method comprises the following steps of transforming single-frame point cloud data into a three-dimensional world coordinate system by taking a voxel as a basic unit, and establishing a three-dimensional voxel grid map by using a dictionary data structure, wherein the method specifically comprises the following steps:
taking a key of each element in the dictionary data structure as a center coordinate of a voxel, taking a value of each element as a list, and storing data of data points of historical frames falling in the voxel;
after new single-frame point cloud data are obtained, the dictionary data structure is indexed using as key the center coordinate of the voxel in which each data point falls, and an index result is obtained;
if the index result is that the center coordinate is indexed, adding data of the data point to the tail part of the list of the corresponding element;
if the index result is that the center coordinate cannot be indexed, creating a new element taking the voxel center coordinate as a key, and creating an empty list to add the data point into the dictionary data structure;
returning to the step of obtaining new single-frame point cloud data, indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key to obtain an index result, and obtaining a three-dimensional voxel grid map after the point cloud data of all frames are accumulated to the dictionary data structure;
the construction process of the online spatiotemporal semantic fusion network comprises the following steps:
acquiring an observation self-adaptive semantic state updating network and a self-attention information fusion network; the input of the observation self-adaptive semantic state updating network is an image feature vector in the voxel data set, and the output is an observation self-adaptive semantic fusion state of each voxel in time; the input of the self-attention information fusion network is the observation self-adaptive semantic fusion state, and the output is a voxel type prediction vector;
constructing the online spatiotemporal semantic fusion network according to the observation adaptive semantic state updating network and the self-attention information fusion network;
the construction process of the observation self-adaptive semantic state updating network comprises the following steps:
acquiring the normal vector of each voxel in the voxel data set and the image semantic features corresponding to the historical frame point cloud data in the voxel data set, and taking the sensor pose expressed with the voxel center as the coordinate origin;
adopting a gated recurrent unit, with the normal vector and the sensor pose as input, to obtain an observation validity state:

a_i^t = GRU(concatenate(V_i^t, n_i^t), a_i^{t-1})

where a_i^t is the observation validity state, V_i^t is the sensor pose of the i-th voxel at time t, n_i^t is the normal vector of the i-th voxel at time t, and GRU denotes the gated recurrent unit;
according to the image semantic features and the observation validity state, determining with the gated recurrent unit the observation-adaptive semantic fusion state of each voxel over time:

F_i^t = GRU(concatenate(s_i^t, a_i^t), F_i^{t-1})

where F_i^t is the observation-adaptive semantic fusion state, GRU is the gated recurrent unit, concatenate is the concatenation operation, s_i^t is the image semantic feature, and a_i^t is the observation validity state;
the construction process of the self-attention information fusion network comprises the following steps:
taking a cube centered on the current voxel as the search range, searching for the neighborhood voxels of the current voxel, and adding them to a neighborhood voxel list of the current voxel; the neighborhood voxel list contains the current voxel;
acquiring the observation self-adaptive semantic fusion state and the offset vector from the neighborhood voxel to the current voxel;
determining a normalized attention weight according to the observation adaptive semantic fusion state and the offset vector;
determining the semantic hidden layer output of the voxel after space-time fusion according to the normalized attention weight; and obtaining a semantic probability prediction vector through a full connection layer and a softmax layer after the hidden layer is output.
2. An online spatiotemporal semantic fusion system, comprising:
a first acquisition module, used for acquiring initial data of an object to be semantically fused and a 2D semantic segmentation network; the initial data comprises point cloud data and an RGB picture;
a data point information determining module, configured to determine information of each data point in the single-frame point cloud data by using the 2D semantic segmentation network with the initial data as input; the information for each of the data points includes: coordinates of the data points in a world coordinate system, image semantic feature vectors corresponding to the data points and semantic label values of the data points;
the three-dimensional voxel grid map building module is used for transforming the single-frame point cloud data into a three-dimensional world coordinate system by taking a voxel as a basic unit and building a three-dimensional voxel grid map by using a dictionary data structure; each voxel in the three-dimensional voxel grid map comprises an image characteristic vector and a label corresponding to the data point;
a voxel data set generating module for generating a voxel data set according to the voxels in the three-dimensional voxel grid map;
the second acquisition module is used for acquiring an online spatiotemporal semantic fusion network;
the semantic fusion module is used for determining a three-dimensional semantic fusion map of the object to be subjected to semantic fusion by using the online spatiotemporal semantic fusion network by taking the image feature vector of the voxel in the voxel data set as input so as to complete the semantic fusion of the object to be subjected to semantic fusion; the three-dimensional semantic fusion map comprises voxel type probability;
the three-dimensional voxel grid map building module specifically comprises:
a data point data storage unit, configured to store data of data points of a history frame falling in a voxel, where a key of each element in the dictionary data structure is used as a center coordinate of the voxel, and a value of each element is used as a list;
the index result determining unit is used for indexing the dictionary data structure by taking the central coordinate of the voxel of each data point as a key after acquiring new single-frame point cloud data to obtain an index result;
the data point data adding unit is used for adding data of data points to the tail part of the list of the corresponding elements when the index result is that the center coordinate is indexed;
a data point adding unit, configured to create a new element using the voxel center coordinate as a key if the index result is that the index does not reach the center coordinate, and create an empty list to add the data point to the dictionary data structure;
the three-dimensional voxel grid map determining unit is used for returning to the step that after new single-frame point cloud data are obtained, the center coordinates of the voxels where each data point falls are used as keys to index the dictionary data structure, and index results are obtained until all frames of point cloud data are accumulated in the dictionary data structure, and then the three-dimensional voxel grid map is obtained;
the system comprises an online time-space semantic fusion network construction module; the online space-time semantic fusion network construction module comprises:
the acquisition network unit is used for acquiring an observation self-adaptive semantic state updating network and a self-attention information fusion network; the input of the observation self-adaptive semantic state updating network is an image feature vector in the voxel data set, and the output is an observation self-adaptive semantic fusion state of each voxel in time; the input of the self-attention information fusion network is the observation self-adaptive semantic fusion state, and the output is a voxel type prediction vector;
the online space-time semantic fusion network construction unit is used for constructing the online space-time semantic fusion network according to the observation self-adaptive semantic state updating network and the self-attention information fusion network;
The online spatiotemporal semantic fusion network construction module further comprises an observation-adaptive semantic state updating network construction unit, which specifically comprises:
a first acquisition subunit, configured to acquire the normal vector of each voxel in the voxel data set and the image semantic features corresponding to the historical-frame point cloud data in the voxel data set, and to take the center coordinate of the voxel as the sensor pose quantity;
an observation validity state determining subunit, configured to obtain the observation validity state from the normal vector and the sensor pose through a gated recurrent unit; the observation validity state is

$S_i^t = \mathrm{GRU}(p_i^t, n_i^t)$

where $S_i^t$ is the observation validity state, $p_i^t$ is the sensor pose of the $i$-th voxel at time $t$, $n_i^t$ is the normal vector of the $i$-th voxel at time $t$, and GRU denotes the gated recurrent unit;
an observation-adaptive semantic fusion state determining subunit, configured to determine the temporally observation-adaptive semantic fusion state of each voxel with the gated recurrent unit, according to the image semantic features and the observation validity state; the observation-adaptive semantic fusion state is

$F_i^t = \mathrm{GRU}\left(\mathrm{concatenate}(f_i^t, S_i^t)\right)$

where $F_i^t$ is the observation-adaptive semantic fusion state, GRU denotes the gated recurrent unit, concatenate denotes the concatenation operation, $f_i^t$ is the image semantic feature, and $S_i^t$ is the observation validity state;
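Read alongside the two formulas above, a minimal PyTorch sketch of the per-voxel recurrent update might look as follows; the layer sizes, class name, and method names are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ObservationAdaptiveUpdate(nn.Module):
    """One GRU turns the sensor pose and voxel normal into the observation
    validity state S; a second GRU folds the concatenation of the image
    semantic feature and S into the fusion state F."""

    def __init__(self, pose_dim=3, normal_dim=3, feat_dim=64, state_dim=32, fusion_dim=64):
        super().__init__()
        self.validity_gru = nn.GRUCell(pose_dim + normal_dim, state_dim)
        self.fusion_gru = nn.GRUCell(feat_dim + state_dim, fusion_dim)

    def step(self, pose, normal, feature, s_prev, f_prev):
        # S_i^t = GRU(p_i^t, n_i^t): pose and normal drive the validity state
        s = self.validity_gru(torch.cat([pose, normal], dim=-1), s_prev)
        # F_i^t = GRU(concatenate(f_i^t, S_i^t)): feature + validity drive the fusion state
        f = self.fusion_gru(torch.cat([feature, s], dim=-1), f_prev)
        return s, f

# Usage with a batch of 8 voxels observed at one time step.
net = ObservationAdaptiveUpdate()
pose, normal = torch.randn(8, 3), torch.randn(8, 3)
feature = torch.randn(8, 64)
s, f = net.step(pose, normal, feature, torch.zeros(8, 32), torch.zeros(8, 64))
```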
The online spatiotemporal semantic fusion network construction module further comprises a self-attention information fusion network construction unit, which specifically comprises:
a neighborhood voxel searching subunit, configured to search for the neighborhood voxels of the current voxel within a cube centered on the current voxel, and to add the neighborhood voxels to the current voxel's neighborhood voxel list; the neighborhood voxel list also contains the current voxel itself;
a second acquisition subunit, configured to acquire the observation-adaptive semantic fusion states and the offset vectors from the neighborhood voxels to the current voxel;
a normalized attention weight determining subunit, configured to determine the normalized attention weights from the observation-adaptive semantic fusion states and the offset vectors;
a semantic hidden layer output determining subunit, configured to determine, according to the normalized attention weights, the spatiotemporally fused semantic hidden layer output of the voxel; the hidden layer output is then passed through a fully connected layer and a softmax layer to obtain the semantic probability prediction vector.
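The following sketch, again under assumed layer sizes and names, shows one way to realize the neighborhood self-attention these subunits describe: scores are computed from each neighbor's fusion state concatenated with its offset vector, softmax-normalized, and the weighted sum is mapped to class probabilities by a fully connected layer and a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodSelfAttention(nn.Module):
    """Attention over a voxel's cubic neighborhood: each neighbor's fusion
    state plus its offset vector yields an unnormalized score; the normalized
    weights produce the fused hidden output, then a classifier head gives
    per-class probabilities. Layer sizes are assumptions."""

    def __init__(self, fusion_dim=64, num_classes=20):
        super().__init__()
        self.score = nn.Linear(fusion_dim + 3, 1)        # state + 3-dim offset -> raw weight
        self.classifier = nn.Linear(fusion_dim, num_classes)

    def forward(self, neighbor_states, offsets):
        # neighbor_states: (K, fusion_dim) for the K voxels in the cube around
        # the current voxel (current voxel included, with zero offset);
        # offsets: (K, 3) neighbor-to-center offset vectors.
        logits = self.score(torch.cat([neighbor_states, offsets], dim=-1))  # (K, 1)
        weights = F.softmax(logits, dim=0)                                  # normalized attention
        hidden = (weights * neighbor_states).sum(dim=0)                     # fused hidden output
        return F.softmax(self.classifier(hidden), dim=-1)                   # class probabilities

# Usage: 27 neighbors from a 3x3x3 cube, 64-dim fusion states.
attn = NeighborhoodSelfAttention()
probs = attn(torch.randn(27, 64), torch.randn(27, 3))
```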
CN202010418823.1A 2020-05-18 2020-05-18 Online spatiotemporal semantic fusion method and system Active CN111462324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010418823.1A CN111462324B (en) 2020-05-18 2020-05-18 Online spatiotemporal semantic fusion method and system

Publications (2)

Publication Number Publication Date
CN111462324A (en) 2020-07-28
CN111462324B (en) 2022-05-17

Family

ID=71682793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010418823.1A Active CN111462324B (en) 2020-05-18 2020-05-18 Online spatiotemporal semantic fusion method and system

Country Status (1)

Country Link
CN (1) CN111462324B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819080B (en) * 2021-02-05 2022-09-02 四川大学 High-precision universal three-dimensional point cloud identification method
CN112927234A (en) * 2021-02-25 2021-06-08 中国工商银行股份有限公司 Point cloud semantic segmentation method and device, electronic equipment and readable storage medium
CN112837372A (en) * 2021-03-02 2021-05-25 浙江商汤科技开发有限公司 Data generation method and device, electronic equipment and storage medium
CN113516750B (en) * 2021-06-30 2022-09-27 同济大学 Three-dimensional point cloud map construction method and system, electronic equipment and storage medium
CN114638954B (en) * 2022-02-22 2024-04-19 深圳元戎启行科技有限公司 Training method of point cloud segmentation model, point cloud data segmentation method and related device
CN117132727B (en) * 2023-10-23 2024-02-06 光轮智能(北京)科技有限公司 Map data acquisition method, computer readable storage medium and electronic device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107730503B (en) * 2017-09-12 2020-05-26 北京航空航天大学 Image object component level semantic segmentation method and device embedded with three-dimensional features
US11004202B2 (en) * 2017-10-09 2021-05-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for semantic segmentation of 3D point clouds

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110243370A (en) * 2019-05-16 2019-09-17 西安理工大学 A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning
CN110245709A (en) * 2019-06-18 2019-09-17 西安电子科技大学 Based on deep learning and from the 3D point cloud data semantic dividing method of attention
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Human Pose Estimation using Deep Structure Guided Learning; Baole Ai, Yu Zhou, Yao Yu, Sidan Du; 2017 IEEE Winter Conference on Applications of Computer Vision; 2017-12-31; full text *
Fine-grained object classification and detection based on attention mechanism and knowledge distillation; Guan Wenjie; China Master's Theses Full-text Database, Information Science and Technology; 2019-07-15; pp. I138-1106 *

Similar Documents

Publication Publication Date Title
CN111462324B (en) Online spatiotemporal semantic fusion method and system
US20220222920A1 (en) Content processing method and apparatus, computer device, and storage medium
CN110135319B (en) Abnormal behavior detection method and system
CN109086683B (en) Human hand posture regression method and system based on point cloud semantic enhancement
CN110781262B (en) Semantic map construction method based on visual SLAM
CN111242844B (en) Image processing method, device, server and storage medium
CN113870422B (en) Point cloud reconstruction method, device, equipment and medium
CN112364931A (en) Low-sample target detection method based on meta-feature and weight adjustment and network model
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN116524419B (en) Video prediction method and system based on space-time decoupling and self-attention difference LSTM
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN113554039A (en) Method and system for generating optical flow graph of dynamic image based on multi-attention machine system
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN112241802A (en) Interval prediction method for wind power
CN116519106B (en) Method, device, storage medium and equipment for determining weight of live pigs
CN116863241A (en) End-to-end semantic aerial view generation method, model and equipment based on computer vision under road scene
CN117197632A (en) Transformer-based electron microscope pollen image target detection method
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN115861384A (en) Optical flow estimation method and system based on generation of countermeasure and attention mechanism
Li et al. An enhanced squeezenet based network for real-time road-object segmentation
Lai et al. 3D semantic map construction system based on visual SLAM and CNNs
Ruan et al. A semantic octomap mapping method based on cbam-pspnet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant