CN113128348A - Laser radar target detection method and system fusing semantic information - Google Patents

Laser radar target detection method and system fusing semantic information

Info

Publication number
CN113128348A
CN113128348A
Authority
CN
China
Prior art keywords
point cloud
image
cloud data
frame
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110317542.1A
Other languages
Chinese (zh)
Other versions
CN113128348B (en)
Inventor
李燕
陈超
齐飞
王晓甜
石光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110317542.1A priority Critical patent/CN113128348B/en
Publication of CN113128348A publication Critical patent/CN113128348A/en
Application granted granted Critical
Publication of CN113128348B publication Critical patent/CN113128348B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 18/253 Fusion techniques of extracted features
    • G06T 7/10 Segmentation; Edge detection
    • G06V 10/56 Extraction of image or video features relating to colour
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T 2207/10028 Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a laser radar target detection method and a system fusing semantic information, wherein the method comprises the following steps: performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score; adding image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data; projecting the point cloud data with the image RGB features added to the output of a segmentation network and appending the semantic segmentation score to the point cloud data; and carrying out target classification and 3D frame regression based on graph convolution on the point cloud data added with the semantic segmentation scores and the image RGB features to obtain a target position frame and a target category. This solves the technical problem in the prior art that the features of a target are not detected accurately enough, so that the target detection of vehicles and pedestrians is neither sufficiently accurate nor efficient.

Description

Laser radar target detection method and system fusing semantic information
Technical Field
The invention relates to the field of computer vision, and in particular to a laser radar target detection method and system fusing semantic information.
Background
Environment perception technology is of great significance in fields such as intelligent transportation, intelligent wearable devices and smart cities. Acquiring and processing sensor information is the basis and technical premise of environment perception. Image data acquired by a camera has inherent depth ambiguity and is strongly affected by illumination and weather, but it can provide fine-grained texture and colour information; point cloud data acquired by a laser radar, on the other hand, provides very accurate spatial position information of a target, but its resolution and texture information are weak. To mitigate the poor detection performance caused by any single sensor, multi-sensor fusion methods are currently studied, as they can provide rich and accurate environmental information.
Existing multi-sensor fusion methods fall mainly into three types: feature-level fusion, decision-level fusion, and two-stage fusion that projects 2D target boxes onto the point cloud. Feature-level fusion, such as the MV3D network proposed by Xiaozhi Chen et al. and the AVOD network proposed by Jason Ku et al., mainly extracts image features and point cloud features in separate branches and then either directly concatenates them or performs multi-scale fusion at the feature level. The biggest disadvantage of this fusion mode is "feature blurring": on the one hand, one point of the point cloud corresponds to several pixels in the image view; on the other hand, the magnitudes of the features in the extracted image feature map and point cloud feature map differ greatly, so the small-magnitude information is poorly utilised in the feature map that actually takes effect. Decision-level fusion is a relatively simple fusion mode, exemplified by the CLOCs network proposed by Su Pang et al.: the features of the two modalities are not fused at the feature level or at the beginning; instead, each network is trained and performs inference separately to obtain proposals from the 2D and 3D detectors, the proposals of the two modalities are then encoded into sparse tensors, and two-dimensional convolution is applied to the non-empty elements for feature fusion. The advantage of decision-level fusion is that the network structures of the two modalities do not interfere with each other and can be trained and combined independently, but it also has a clear drawback: fusing at the decision level makes the least use of the raw sensor data and cannot exploit the complementary nature of multi-sensor data well. The two-stage method, represented by the F-PointNet structure proposed by Charles R. Qi et al., first obtains the image target detection result with a 2D detector and then projects it onto the 3D lidar data. However, this fusion mode depends excessively on the performance of the 2D detector, and after the two-dimensional box is projected onto the point cloud data, the sparsity of the point cloud makes it difficult to extract and identify features of the point set inside the projected viewing frustum.
However, in implementing the technical solution of the invention in the embodiments of the present application, the inventors of the present application found that the above technology has at least the following technical problem:
in the prior art, the features of a target are not detected accurately enough, so that the target detection of vehicles and pedestrians is neither sufficiently accurate nor efficient.
Disclosure of Invention
The embodiments of the present application provide a laser radar target detection method and system fusing semantic information, which solve the technical problem in the prior art that target features are not detected accurately enough, so that the target detection of vehicles and pedestrians is neither sufficiently accurate nor efficient; a visual-laser fusion target detection method based on image semantic segmentation and graph convolution features is thereby achieved, which notably improves the accuracy and efficiency of detecting vehicle and pedestrian targets on the road.
In view of the above problems, the present application provides a laser radar target detection method and system fusing semantic information.
In a first aspect, the present application provides a laser radar target detection method fusing semantic information, where the method includes: performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score; adding image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data; projecting the point cloud data with the image RGB features added to the output of a segmentation network and appending the semantic segmentation score to the point cloud data; and carrying out target classification and 3D frame regression based on graph convolution on the point cloud data added with the semantic segmentation scores and the image RGB features to obtain a target position frame and a target category.
In a second aspect, the present application further provides a laser radar target detection system fusing semantic information, the system including: a first obtaining unit, configured to perform semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score; a first adding unit, configured to add image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data; a first projection unit, configured to project the point cloud data to which the image RGB features are added into the output of a segmentation network and attach the semantic segmentation score to the point cloud data; and a second obtaining unit, configured to perform target classification and 3D frame regression based on graph convolution on the point cloud data to which the semantic segmentation scores and the image RGB features are added, to obtain a target position frame and a target category.
In a third aspect, the present invention provides a laser radar target detection system fusing semantic information, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of the first aspect when executing the program.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
the method has the advantages that the point cloud data are processed by adopting semantic segmentation and image convolution, the semantic segmentation adopts a coder-decoder structure, the high-level semantic information is obtained while the contour information is kept, the point cloud data characteristics are extracted through the image convolution structure, the state of the point is updated according to the relative coordinate coding of the adjacent points and the central point characteristics, the structural characteristics of the space point are well represented, the detection accuracy is improved, the method for extracting the visual laser fusion target detection based on the image semantic segmentation and the image convolution characteristics is further achieved, and the technical effects of accuracy and high efficiency of the road vehicle and pedestrian target detection are remarkably improved.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
Fig. 1 is a schematic flow chart of a laser radar target detection method fusing semantic information according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a laser radar target detection system fusing semantic information according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an exemplary electronic device according to an embodiment of the present application.
Description of reference numerals: a first obtaining unit 11, a first adding unit 12, a first projecting unit 13, a second obtaining unit 14, a bus 300, a receiver 301, a processor 302, a transmitter 303, a memory 304, and a bus interface 305.
Detailed Description
The embodiments of the present application provide a laser radar target detection method and system fusing semantic information, which solve the technical problem in the prior art that target features are not detected accurately enough, so that the target detection of vehicles and pedestrians is neither sufficiently accurate nor efficient; a visual-laser fusion target detection method based on image semantic segmentation and graph convolution features is thereby achieved, which notably improves the accuracy and efficiency of detecting vehicle and pedestrian targets on the road. Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are merely some, not all, embodiments of the present application, and it should be understood that the present application is not limited to the example embodiments described herein.
Summary of the application
Environment perception technology is of great significance in fields such as intelligent transportation, intelligent wearable devices and smart cities. Acquiring and processing sensor information is the basis and technical premise of environment perception. Image data acquired by a camera has inherent depth ambiguity and is strongly affected by illumination and weather, but can provide fine-grained texture and colour information; point cloud data acquired by a laser radar, on the other hand, provides very accurate spatial position information of a target, but its resolution and texture information are weak. To mitigate the poor detection performance caused by a single sensor, multi-sensor fusion methods are currently studied, as they can provide rich and accurate environmental information. However, the prior art has the technical problem that target features are not detected accurately enough, so that the target detection of vehicles and pedestrians is neither sufficiently accurate nor efficient.
In view of the above technical problems, the technical solution provided by the present application has the following general idea:
the embodiment of the application provides a laser radar target detection method fusing semantic information, wherein the method comprises the following steps: performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score; adding image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data; projecting the point cloud data with the image RGB features added to the output of a segmentation network and appending the semantic segmentation score to the point cloud data; and carrying out target classification and 3D frame regression based on graph convolution on the point cloud data added with the semantic segmentation scores and the image RGB features to obtain a target position frame and a target category.
Having thus described the general principles of the present application, various non-limiting embodiments thereof will now be described in detail with reference to the accompanying drawings.
Example one
As shown in fig. 1, an embodiment of the present application provides a laser radar target detection method fusing semantic information, where the method includes:
step S100: performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score;
specifically, the semantic segmentation refers to a process of grouping/segmenting pixels of an image according to different semantic meanings expressed in the image, and is an image obtained by performing semantic segmentation algorithm processing on an actually captured image, and performing semantic segmentation processing based on a codec for each frame of image. Further, the image processing is performed for each captured frame. Firstly, an encoder is used for extracting sampling features of an image, then a decoder is used for carrying out upsampling resolution recovery processing on a feature map to obtain a final prediction feature map, and based on the segmentation prediction map, the class scores representing different classes of images in the prediction map, namely the semantic segmentation scores, are obtained.
Step S200: adding image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data;
specifically, the point cloud data is a set of vectors in a three-dimensional coordinate system, the scanning data is recorded in the form of points, each point includes three-dimensional coordinates, some may include color information (RGB) or reflection Intensity information (Intensity), the semantic segmentation score after segmentation is added to the point cloud points, the method comprises the steps of obtaining point cloud data of each frame, further, adding RGB (red, green and blue) features of corresponding images in the point cloud data of each frame, projecting the point cloud data with the RGB attached to a semantic segmentation network for output, attaching semantic segmentation scores to each point, converting the position of the space point cloud to the position of a coordinate point of a camera coordinate according to a conversion matrix of a point cloud coordinate system and a camera coordinate system, loading the images of the frames corresponding to the point cloud, obtaining RGB channel data under each coordinate value, and then cascading the RGB data to point cloud feature dimensions.
Step S300: projecting the point cloud data with the image RGB features added to the output of a segmentation network and appending the semantic segmentation score to the point cloud data;
specifically, for each frame of image, after indexing each point image coordinate with category score output by the semantic segmentation network, the corresponding category is superimposed to each point of the point cloud which has been projected to the image in the corresponding frame.
Step S400: and carrying out target classification and 3D frame regression based on graph convolution on the point cloud data added with the semantic segmentation scores and the image RGB features to obtain a target position frame and a target category.
Specifically, the point cloud data to which the semantic segmentation scores and the image RGB features have been added is processed with graph-convolution-based target classification and 3D frame regression, i.e. point states are updated by a graph convolutional network to obtain the target position frame and target category information. The point cloud data is thus processed with both semantic segmentation and graph convolution: semantic segmentation adopts an encoder-decoder structure, obtaining high-level semantic information while retaining contour information; point cloud data features are extracted through a graph convolution structure, and the state of each point is updated from the relative-coordinate encoding of its neighbouring points and the centre-point features, which represents the structural characteristics of spatial points well and improves detection accuracy. A visual-laser fusion target detection method based on image semantic segmentation and graph convolution features is thereby achieved, which notably improves the accuracy and efficiency of detecting road vehicle and pedestrian targets.
Further, in step S100 of the embodiment of the present application, performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score further includes:
step S110: taking ResNet101 as a main network, and performing downsampling feature extraction on the image frames under each timestamp through an encoder;
step S120: performing resolution recovery processing of up-sampling on the image frame under each timestamp through a decoder to obtain a prediction characteristic map;
step S130: and obtaining the semantic segmentation score according to the prediction feature map.
Specifically, the encoder is first used to extract downsampled features from the image, with ResNet101 as the main network. The steps are as follows:
1) downsample the image 4 times for feature extraction, using 3 × 3 convolution kernels with stride 2, to obtain a feature map 1/16 the size of the original image;
2) apply a 1 × 1 convolution layer and three 3 × 3 atrous (hole) convolutions with rates (6, 12, 18) to the feature map, each with 256 output channels and followed by a BN layer;
3) perform global average pooling to obtain image-level features;
4) feed the result into a 1 × 1 convolution layer with 256 output channels and bilinearly interpolate it back to the original feature-map size;
5) concatenate the 4 obtained multi-scale features along the channel dimension and fuse them with a 1 × 1 convolution layer to obtain a new 256-channel feature;
Based on these new features the corresponding feature map is obtained, and the decoder then performs resolution recovery on it to obtain the final prediction feature map (a code sketch of this encoder-decoder is given after the list). The steps are as follows:
1) bilinearly interpolate the feature map obtained from the encoder into a 4× upsampled feature map;
2) reduce the channel number of the low-level features of corresponding size in the encoder with a 1 × 1 convolution layer;
3) concatenate the feature maps of the same resolution obtained in the previous two steps and further fuse the features with a 3 × 3 convolution layer;
4) bilinearly interpolate to obtain a segmentation prediction map of the same size as the original image. From this segmentation prediction map, the class scores representing the different classes in the image, i.e. the semantic segmentation scores, are obtained.
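For illustration only, the following PyTorch sketch arranges the encoder and decoder steps above in a DeepLabV3+-style layout; it is not the patented implementation. The 256-channel branches and the atrous rates (6, 12, 18) come from the text, whereas the module names, the 48-channel reduction of the low-level features and the concatenation of all five branches (including the pooled one) are assumptions filled in where the text is silent.

```python
# Illustrative DeepLabV3+-style encoder head and decoder (not the patented code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k, dilation=1):
    pad = dilation if k == 3 else 0
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ASPP(nn.Module):
    """Encoder steps 2)-5): 1x1 conv, three 3x3 atrous convs (rates 6/12/18),
    image-level pooling, channel-wise concatenation and 1x1 fusion to 256 channels."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.branches = nn.ModuleList(
            [conv_bn_relu(in_ch, out_ch, 1)]
            + [conv_bn_relu(in_ch, out_ch, 3, dilation=r) for r in (6, 12, 18)]
        )
        # Image-level pooling branch (no BN so the sketch also works with batch size 1).
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))
        self.fuse = conv_bn_relu(out_ch * 5, out_ch, 1)

    def forward(self, x):
        size = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.pool(x), size=size, mode="bilinear", align_corners=False)
        return self.fuse(torch.cat(feats + [pooled], dim=1))

class Decoder(nn.Module):
    """Decoder steps 1)-4): 4x upsampling, low-level channel reduction,
    concatenation, 3x3 fusion and final upsampling to per-pixel class scores."""
    def __init__(self, low_ch, num_classes, out_ch=256):
        super().__init__()
        self.reduce = conv_bn_relu(low_ch, 48, 1)            # assumed 48-channel reduction
        self.refine = conv_bn_relu(out_ch + 48, out_ch, 3)
        self.classify = nn.Conv2d(out_ch, num_classes, 1)

    def forward(self, aspp_out, low_level, image_size):
        x = F.interpolate(aspp_out, size=low_level.shape[-2:], mode="bilinear", align_corners=False)
        x = self.refine(torch.cat([x, self.reduce(low_level)], dim=1))
        # Final bilinear interpolation back to the original image size -> semantic scores.
        return F.interpolate(self.classify(x), size=image_size, mode="bilinear", align_corners=False)
```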
Further, in step S200 of the embodiment of the present application, adding the image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data further includes:
step S210: for each frame of point cloud data, converting the spatial point cloud position to a coordinate point position under a camera coordinate system according to a conversion matrix from the point cloud coordinate system to the camera coordinate system;
step S220: screening points of which the Z-axis coordinate value is greater than 0.1 in all camera coordinate points to obtain a first index position set;
step S230: obtaining coordinate values under the image coordinate system through a conversion matrix from the camera coordinate system to the image coordinate system according to the first index position set;
step S240: loading an image frame corresponding to the point cloud data to obtain RGB channel data under each coordinate value;
step S250: cascading the RGB channel data to a point cloud feature dimension.
Further, in step S300 of the embodiment of the present application, projecting the point cloud data with the image RGB features added to the output of a segmentation network and appending the semantic segmentation score to the point cloud data further includes:
step S310: for the image frame under each timestamp, indexing the image coordinates of each point with the semantic segmentation score output by the segmentation network;
step S320: superimposing the corresponding category into the point cloud data in the respective frame that has been projected into the image coordinate system.
Specifically, for each frame of image, the image coordinates of each point are used to index the class scores output by the semantic segmentation network; that is, every image frame under every timestamp is processed through the semantic segmentation network, which yields index coordinates with which the score of every point image can be retrieved quickly. The coordinates of each frame of point cloud data are then converted: the coordinate transformation matrix between the point cloud coordinate system and the camera coordinate system is obtained and used to convert the spatial point cloud positions to coordinate point positions in the camera coordinate system; the points whose Z-axis coordinate value is greater than 0.1 are selected among the camera coordinate points to obtain the first index position set; from the first index position set, the coordinate values in the image coordinate system are obtained through the transformation matrix from the camera coordinate system to the image coordinate system; the image of the frame corresponding to the point cloud is loaded and the RGB channel data at each coordinate value is obtained; the RGB data is concatenated to the point cloud feature dimensions, i.e. the three colour channels are concatenated onto the intensity dimension, and the corresponding classes are superimposed onto each point of the point cloud that has been projected onto the image in the corresponding frame.
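The per-frame decoration just described can be sketched as follows in NumPy. The sketch assumes KITTI-style homogeneous transformation matrices (lidar-to-camera and camera-to-image) and a per-pixel score map from the segmentation network; the Z > 0.1 threshold comes from the text, while the function and argument names are illustrative assumptions rather than the patented code.

```python
# Illustrative per-frame decoration of lidar points with RGB and segmentation scores.
import numpy as np

def decorate_point_cloud(points, T_velo_to_cam, P_cam_to_img, image, seg_scores, z_min=0.1):
    """points        : (N, 4) x, y, z, intensity in the lidar frame
    T_velo_to_cam : (4, 4) homogeneous lidar-to-camera transform
    P_cam_to_img  : (3, 4) camera-to-image projection matrix
    image         : (H, W, 3) RGB image of the same frame
    seg_scores    : (H, W, C) per-pixel class scores from the segmentation network
    Returns points in front of the camera with RGB and scores concatenated: (M, 4 + 3 + C)."""
    xyz1 = np.hstack([points[:, :3], np.ones((len(points), 1))])   # homogeneous coordinates
    cam = (T_velo_to_cam @ xyz1.T).T                               # camera coordinate system

    keep = cam[:, 2] > z_min                                       # first index set: Z > 0.1
    cam, pts = cam[keep], points[keep]

    img_pts = (P_cam_to_img @ cam.T).T                             # project to the image plane
    uv = np.round(img_pts[:, :2] / img_pts[:, 2:3]).astype(int)    # pixel coordinates (u, v)

    h, w = image.shape[:2]
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    uv, pts = uv[inside], pts[inside]

    rgb = image[uv[:, 1], uv[:, 0]]          # RGB channel data at each coordinate value
    scores = seg_scores[uv[:, 1], uv[:, 0]]  # semantic segmentation scores at each point
    return np.hstack([pts, rgb, scores])     # concatenate along the point feature dimension
```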
Further, in step S400 of the embodiment of the present application, carrying out target classification and 3D frame regression based on graph convolution on the point cloud data to which the semantic segmentation scores and the image RGB features are added, to obtain a target position frame and a target category, further includes:
step S410: performing downsampling-based graph construction on the point cloud data;
step S420: constructing a graph neural network to update and iterate the characteristics of each central point, and improving the state of the central point through the states of adjacent points;
step S430: positioning the boundary box of each category of branch prediction, and if one vertex is in one boundary box, calculating a predicted value and the Huber loss of the group route; if a vertex is not in the bounding box or is a non-interesting class, its position penalty is set to 0.
Specifically, the process of graph construction mainly includes the following steps (a code sketch of this construction is given after the list):
1) reducing the density of the point cloud by downsampling, and selecting the central points with the farthest-point sampling method;
2) for each central point, finding the neighbouring points within a given cutoff distance by using a cell list;
3) extracting point and point-to-edge features within each graph with a multilayer perceptron, and aggregating the features through a Max function to serve as the initial state value of the central point.
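A minimal NumPy sketch of this graph construction is given below, for illustration only. The brute-force neighbour search stands in for the cell-list search of step 2), and the MLP of step 3) is abstracted as a user-supplied function; none of the names here come from the patent.

```python
# Illustrative graph construction: farthest-point sampling plus fixed-radius neighbours.
import numpy as np

def farthest_point_sampling(xyz, num_centres):
    """Step 1): pick centre points that are maximally spread out."""
    dist = np.full(len(xyz), np.inf)
    centres = [0]
    for _ in range(num_centres - 1):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[centres[-1]], axis=1))
        centres.append(int(dist.argmax()))
    return np.asarray(centres)

def build_graph(xyz, features, num_centres, radius, point_mlp):
    """Steps 2)-3): connect each centre to neighbours within `radius` and
    max-aggregate MLP features of (neighbour attributes, relative coordinates)
    as the initial state of the centre point."""
    centre_idx = farthest_point_sampling(xyz, num_centres)
    edges, init_states = [], []
    for c in centre_idx:
        d = np.linalg.norm(xyz - xyz[c], axis=1)       # brute-force stand-in for a cell list
        nbr = np.flatnonzero(d < radius)
        edges.append((c, nbr))
        edge_feat = point_mlp(np.hstack([features[nbr], xyz[nbr] - xyz[c]]))
        init_states.append(edge_feat.max(axis=0))      # Max aggregation -> initial state value
    return centre_idx, edges, np.stack(init_states)
```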
The features of each central point are updated by constructing a graph neural network, and the state of the central point is refined with the states of its neighbouring points. The update formulas are:

$$\Delta x_i^t = h^t(s_i^t)$$

$$s_i^{t+1} = g^t\!\left(\rho\!\left(\{\,f^t(x_j - x_i + \Delta x_i^t,\; s_j^t) \mid (i,j) \in E\,\}\right),\; s_i^t\right) + s_i^t$$

A point cloud containing N points is defined as P = {p_1, ..., p_N}, where p_i = (x_i, s_i). x_i ∈ R^3 represents the spatial coordinates (X, Y, Z) of the original point cloud, and s_i ∈ R^k is a k-dimensional vector representing the attribute state of the original point. The functions f^t(·), g^t(·) and h^t(·) are all modelled by multilayer perceptrons (MLP); the ρ(·) function employs an edge-feature aggregation method based on an attention mechanism. The classification branch computes a multi-class probability distribution {\hat{p}_1^i, ..., \hat{p}_M^i} for each vertex, where M is the total number of target classes, including the background class; \hat{p}_c^i and y_c^i are respectively the prediction probability and the one-hot encoded class label of the i-th vertex; x_j is the three-dimensional point cloud coordinate of point j, and s_j^t is the feature of point j at layer t.
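The update formulas above can be realised, for example, by the following PyTorch-style sketch. The MLPs modelling f, g and h and the max aggregation (which stands in for the attention-based aggregation ρ mentioned in the text) are illustrative assumptions, not the patented network.

```python
# Illustrative single iteration of the vertex-state update (not the patented network).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

class GraphUpdateLayer(nn.Module):
    """One iteration t: dx_i = h(s_i); e_ij = f(x_j - x_i + dx_i, s_j);
    s_i <- g(max_j e_ij) + s_i (max stands in for the attention-based rho)."""
    def __init__(self, state_dim):
        super().__init__()
        self.h = mlp(state_dim, 3)               # h^t: coordinate offset of the centre point
        self.f = mlp(state_dim + 3, state_dim)   # f^t: edge features from relative coords + s_j
        self.g = mlp(state_dim, state_dim)       # g^t: new state from aggregated edge features

    def forward(self, xyz, states, edge_index):
        i, j = edge_index                         # (2, E) pairs of (centre i, neighbour j)
        dx = self.h(states)
        rel = xyz[j] - xyz[i] + dx[i]             # relative coordinate encoding
        e = self.f(torch.cat([rel, states[j]], dim=-1))
        # Max-aggregate edge features per centre vertex (requires PyTorch >= 1.12).
        agg = torch.zeros_like(states).index_reduce_(0, i, e, "amax", include_self=False)
        return states + self.g(agg)               # residual update s_i^{t+1}
```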
The Huber loss between predicted values and the ground truth is computed through a loss function. The classification loss adopts the average cross-entropy loss; the localization branch predicts a bounding box for each class, and if a vertex is inside a bounding box, the Huber loss between the predicted value and the ground truth is calculated; if a vertex is not inside any box or belongs to an uninteresting class, its localization loss is set to 0. The specific formulas are as follows:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_c^i \log \hat{p}_c^i$$

$$L_{loc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(v_i \in b_{interest})\sum_{\delta \in \delta_{b_i}} l_{huber}\!\left(\delta - \delta^{gt}\right)$$

The target bounding box is denoted in a 7-degree-of-freedom format b = (x, y, z, l, h, w, θ), where (x, y, z) is the centre position of the box, (l, h, w) are its length, height and width, and θ is the yaw angle. With y_c^i the one-hot encoding of the true class label at point i and \hat{p}_c^i the class prediction probability of point i, the bounding box is encoded with respect to the vertex coordinates (x_v, y_v, z_v):

$$\delta_x = \frac{x - x_v}{l_m},\quad \delta_y = \frac{y - y_v}{h_m},\quad \delta_z = \frac{z - z_v}{w_m}$$

$$\delta_l = \log\frac{l}{l_m},\quad \delta_h = \log\frac{h}{h_m},\quad \delta_w = \log\frac{w}{w_m}$$

$$\delta_\theta = \frac{\theta - \theta_0}{\theta_m}$$

where l_m, h_m, w_m, θ_0 and θ_m are constant scale factors, v_i is the predicted three-dimensional coordinate of vertex i, b_interest is the ground-truth box region of the category to be located, δ_{b_i} is the 7-dimensional bounding box encoding of predicted vertex i, l_huber is the Huber loss function, and δ^{gt} is the 7-dimensional bounding box encoding of the true class label. In this example, (l_m, h_m, w_m) are set to the median bounding box size of the class to be trained, and θ ∈ [π/4, 3π/4], θ_0 = π/2, θ_m = π/2, to ensure that objects in front of the detection line of sight are within the detection range. The localization branch network uses an MLP to predict, for each class, the bounding box encoding δ_b = (δ_x, δ_y, δ_z, δ_l, δ_h, δ_w, δ_θ).
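As a worked illustration of the box encoding above, the NumPy snippet below computes the 7-dimensional encoding δ_b of a box relative to a vertex; the scale factors and the example values are hypothetical placeholders, not values taken from the patent.

```python
# Worked example of the 7-DoF bounding-box encoding (hypothetical values).
import numpy as np

def encode_box(box, vertex, l_m, h_m, w_m, theta_0=np.pi / 2, theta_m=np.pi / 2):
    """Encode a box (x, y, z, l, h, w, theta) relative to a vertex (x_v, y_v, z_v)."""
    x, y, z, l, h, w, theta = box
    x_v, y_v, z_v = vertex
    return np.array([
        (x - x_v) / l_m, (y - y_v) / h_m, (z - z_v) / w_m,   # scaled centre offsets
        np.log(l / l_m), np.log(h / h_m), np.log(w / w_m),   # log-scaled dimensions
        (theta - theta_0) / theta_m,                          # yaw relative to theta_0
    ])

# A car-sized ground-truth box encoded against a nearby vertex (illustrative numbers only).
delta_b = encode_box(box=(10.0, 1.5, 0.8, 3.9, 1.6, 1.6, np.pi / 3),
                     vertex=(9.0, 1.0, 0.5), l_m=3.9, h_m=1.5, w_m=1.6)
```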
To sum up, the laser radar target detection method and system fusing the semantic information provided by the embodiment of the application have the following technical effects:
the method has the advantages that the point cloud data are processed by adopting semantic segmentation and image convolution, the semantic segmentation adopts a coder-decoder structure, the high-level semantic information is obtained while the contour information is kept, the point cloud data characteristics are extracted through the image convolution structure, the state of the point is updated according to the relative coordinate coding of the adjacent points and the central point characteristics, the structural characteristics of the space point are well represented, the detection accuracy is improved, the method for extracting the visual laser fusion target detection based on the image semantic segmentation and the image convolution characteristics is further achieved, and the technical effects of accuracy and high efficiency of the road vehicle and pedestrian target detection are remarkably improved.
Example two
Based on the same inventive concept as the laser radar target detection method fusing semantic information in the foregoing embodiment, the present invention further provides a laser radar target detection system fusing semantic information. As shown in fig. 2, the system includes:
a first obtaining unit 11, where the first obtaining unit 11 is configured to perform semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score;
a first adding unit 12, wherein the first adding unit 12 is used for adding image RGB features under corresponding camera coordinates in each frame of point cloud data;
a first projection unit 13, the first projection unit 13 being configured to project the point cloud data to which the image RGB features are added into an output of a segmentation network, and to attach the semantic segmentation score to the point cloud data;
a second obtaining unit 14, where the second obtaining unit 14 is configured to perform graph convolution-based target classification and 3D frame regression on the point cloud data to which the semantic segmentation score and the image RGB features are added, so as to obtain a target position frame and a target category.
Further, the system further comprises:
a first extraction unit, configured to perform downsampling feature extraction on the image frames at each timestamp through an encoder with the ResNet101 as a main network;
a third obtaining unit, configured to perform resolution recovery processing of upsampling on the image frame under each timestamp through a decoder, so as to obtain a prediction feature map;
a fourth obtaining unit, configured to obtain the semantic segmentation score according to the predicted feature map.
Further, the system further comprises:
the first conversion unit is used for converting the spatial point cloud position into a coordinate point position under a camera coordinate system according to a conversion matrix from a point cloud coordinate system to a camera coordinate system for each frame of point cloud data;
a fifth obtaining unit, configured to screen points, of which Z-axis coordinate values are greater than 0.1, from among the camera coordinate points, to obtain a first index position set;
a sixth obtaining unit configured to obtain coordinate values in an image coordinate system through a conversion matrix from a camera coordinate system to the image coordinate system according to the first index position set;
a seventh obtaining unit, configured to load an image frame corresponding to the point cloud data, and obtain RGB channel data under each coordinate value;
a first cascading unit to cascade the RGB channel data to a point cloud feature dimension.
Further, the system further comprises:
the first indexing unit is used for indexing image coordinates of each point with the semantic segmentation scores output by the segmentation network for the image frames under each timestamp;
a first superimposing unit for superimposing the corresponding category into the point cloud data that has been projected into the image coordinate system in the corresponding frame.
Further, the system further comprises:
a first construction unit for downsampling-based graph construction of the point cloud data;
a first improvement unit, configured to construct a graph neural network that iteratively updates the features of each central point, improving the state of the central point through the states of its adjacent points;
a first prediction unit, configured to predict a bounding box for each class with the localization branch; if a vertex is in a bounding box, the Huber loss between the predicted value and the ground truth is calculated; if a vertex is not in the bounding box or belongs to an uninteresting class, its localization loss is set to 0.
Various changes and specific examples of the laser radar target detection method with fusion of semantic information in the first embodiment of fig. 1 are also applicable to the laser radar target detection system with fusion of semantic information in the present embodiment, and through the foregoing detailed description of the laser radar target detection method with fusion of semantic information, those skilled in the art can clearly know the implementation method of the laser radar target detection system with fusion of semantic information in the present embodiment, so for the sake of brevity of the description, detailed description is not repeated here.
Exemplary electronic device
The electronic device of the embodiment of the present application is described below with reference to fig. 3.
Fig. 3 illustrates a schematic structural diagram of an electronic device according to an embodiment of the present application.
Based on the inventive concept of the laser radar target detection method with fusion of semantic information in the foregoing embodiments, the present invention further provides a laser radar target detection system with fusion of semantic information, on which a computer program is stored, and when the computer program is executed by a processor, the steps of any one of the foregoing laser radar target detection methods with fusion of semantic information are implemented.
In fig. 3, a bus architecture is represented by bus 300. Bus 300 may include any number of interconnected buses and bridges and links together various circuits including one or more processors, represented by processor 302, and memory, represented by memory 304. The bus 300 may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art and therefore will not be described further herein. A bus interface 305 provides an interface between the bus 300 and the receiver 301 and transmitter 303. The receiver 301 and the transmitter 303 may be the same element, i.e. a transceiver, providing a means for communicating with various other systems over a transmission medium.
The processor 302 is responsible for managing the bus 300 and general processing, and the memory 304 may be used for storing data used by the processor 302 in performing operations.
The embodiment of the invention provides a laser radar target detection method fusing semantic information, which comprises the following steps: performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score; adding image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data; projecting the point cloud data with the image RGB features added to the output of a segmentation network and appending the semantic segmentation score to the point cloud data; and carrying out target classification and 3D frame regression based on graph convolution on the point cloud data added with the semantic segmentation scores and the image RGB features to obtain a target position frame and a target category. The method solves the technical problem in the prior art that target features are not detected accurately enough, so that the target detection of vehicles and pedestrians is neither sufficiently accurate nor efficient; it thereby achieves a visual-laser fusion target detection method based on image semantic segmentation and graph convolution features and notably improves the accuracy and efficiency of detecting vehicle and pedestrian targets on the road.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (9)

1. A laser radar target detection method fusing semantic information, wherein the method comprises the following steps:
performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score;
adding image RGB characteristics under the corresponding camera coordinates in each frame of point cloud data;
projecting the point cloud data with the image RGB features added to the output of a segmentation network and appending the semantic segmentation score to the point cloud data;
and carrying out target classification and 3D frame regression based on graph convolution on the point cloud data added with the semantic segmentation scores and the image RGB features to obtain a target position frame and a target category.
2. The method as claimed in claim 1, wherein the semantic segmentation processing is performed on the image frame at each timestamp to obtain a semantic segmentation score, including:
taking ResNet101 as a main network, and performing downsampling feature extraction on the image frames under each timestamp through an encoder;
performing resolution recovery processing of up-sampling on the image frame under each timestamp through a decoder to obtain a prediction characteristic map;
and obtaining the semantic segmentation score according to the prediction feature map.
3. The method of claim 1, wherein the adding of image RGB features in respective camera coordinates to each frame of point cloud data comprises:
for each frame of point cloud data, converting the spatial point cloud position to a coordinate point position under a camera coordinate system according to a conversion matrix from the point cloud coordinate system to the camera coordinate system;
screening points of which the Z-axis coordinate value is greater than 0.1 in all camera coordinate points to obtain a first index position set;
obtaining coordinate values under the image coordinate system through a conversion matrix from the camera coordinate system to the image coordinate system according to the first index position set;
loading an image frame corresponding to the point cloud data to obtain RGB channel data under each coordinate value;
cascading the RGB channel data to a point cloud feature dimension.
4. The method of claim 3, wherein the projecting the point cloud data with the image RGB features added to the point cloud data into an output of a segmentation network and appending the semantic segmentation score to the point cloud data comprises:
for the image frame under each timestamp, indexing the image coordinates of each point with the semantic segmentation score output by the segmentation network;
superimposing the corresponding category into the point cloud data in the respective frame that has been projected into the image coordinate system.
5. The method of claim 1, wherein the performing graph convolution-based target classification and 3D frame regression on the point cloud data with the additional semantic segmentation scores and the image RGB features to obtain a target location frame and a target class comprises:
performing downsampling-based graph construction on the point cloud data;
constructing a graph neural network to update and iterate the characteristics of each central point, and improving the state of the central point through the states of adjacent points;
the localization branch predicting a bounding box for each category, and if a vertex is in a bounding box, calculating the Huber loss between the predicted value and the ground truth; if a vertex is not in the bounding box or is a non-interesting class, its localization loss is set to 0.
6. The method of claim 5, wherein the constructed graph neural network iteratively updates the features of each central point and refines the state of the central point with the states of its neighbouring points by the formulas:

$$\Delta x_i^t = h^t(s_i^t)$$

$$s_i^{t+1} = g^t\!\left(\rho\!\left(\{\,f^t(x_j - x_i + \Delta x_i^t,\; s_j^t) \mid (i,j) \in E\,\}\right),\; s_i^t\right) + s_i^t$$

wherein a point cloud containing N points is defined as P = {p_1, ..., p_N}, where p_i = (x_i, s_i), x_i ∈ R^3 represents the spatial coordinates (X, Y, Z) of the original point cloud, s_i ∈ R^k is a k-dimensional vector representing the attribute state of the original point, the functions f^t(·), g^t(·) and h^t(·) are all modelled by multilayer perceptrons (MLP), M represents the total number of target classes, x_j is the three-dimensional point cloud coordinate of point j, and s_j^t is the feature of point j at layer t.
7. The method of claim 5, wherein the localization branch predicts a bounding box for each class, and the Huber loss between the predicted value and the ground truth is computed if a vertex is in a bounding box; if a vertex is not in the bounding box or is a non-interesting class, its localization loss is set to 0, according to the formulas:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_c^i \log \hat{p}_c^i$$

$$L_{loc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(v_i \in b_{interest})\sum_{\delta \in \delta_{b_i}} l_{huber}\!\left(\delta - \delta^{gt}\right)$$

wherein y_c^i is the one-hot encoded true class label of the i-th vertex, \hat{p}_c^i is its class prediction probability, b denotes the 7-degree-of-freedom box format, v_i is the predicted three-dimensional coordinate of vertex i, b_interest is the ground-truth box region of the category to be located, δ_{b_i} is the 7-dimensional bounding box encoding of predicted vertex i, l_huber is the Huber loss function, and δ^{gt} is the 7-dimensional bounding box encoding of the true class label.
8. A lidar target detection system that fuses semantic information, wherein the system comprises:
the first obtaining unit is used for performing semantic segmentation processing on the image frame under each timestamp to obtain a semantic segmentation score;
the first adding unit is used for adding image RGB characteristics under corresponding camera coordinates in each frame of point cloud data;
a first projection unit for projecting the point cloud data to which the image RGB features are added into an output of a segmentation network and attaching the semantic segmentation score to the point cloud data;
and the second obtaining unit is used for carrying out target classification and 3D frame regression based on graph convolution on the point cloud data added with the semantic segmentation scores and the image RGB features to obtain a target position frame and a target category.
9. A lidar target detection system incorporating semantic information, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any of claims 1-7 when executing the program.
CN202110317542.1A 2021-03-25 2021-03-25 Laser radar target detection method and system integrating semantic information Active CN113128348B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110317542.1A CN113128348B (en) 2021-03-25 2021-03-25 Laser radar target detection method and system integrating semantic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110317542.1A CN113128348B (en) 2021-03-25 2021-03-25 Laser radar target detection method and system integrating semantic information

Publications (2)

Publication Number Publication Date
CN113128348A true CN113128348A (en) 2021-07-16
CN113128348B CN113128348B (en) 2023-11-24

Family

ID=76773893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110317542.1A Active CN113128348B (en) 2021-03-25 2021-03-25 Laser radar target detection method and system integrating semantic information

Country Status (1)

Country Link
CN (1) CN113128348B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658257A (en) * 2021-08-17 2021-11-16 广州文远知行科技有限公司 Unmanned equipment positioning method, device, equipment and storage medium
CN113705631A (en) * 2021-08-10 2021-11-26 重庆邮电大学 3D point cloud target detection method based on graph convolution
CN113963044A (en) * 2021-09-30 2022-01-21 北京工业大学 RGBD camera-based intelligent loading method and system for cargo box
CN113984037A (en) * 2021-09-30 2022-01-28 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate box in any direction
CN114140765A (en) * 2021-11-12 2022-03-04 北京航空航天大学 Obstacle sensing method and device and storage medium
CN114359902A (en) * 2021-12-03 2022-04-15 武汉大学 Three-dimensional point cloud semantic segmentation method based on multi-scale feature fusion
CN114429631A (en) * 2022-01-27 2022-05-03 北京百度网讯科技有限公司 Three-dimensional object detection method, device, equipment and storage medium
CN114445802A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Point cloud processing method and device and vehicle
CN114998890A (en) * 2022-05-27 2022-09-02 长春大学 Three-dimensional point cloud target detection algorithm based on graph neural network
CN115272493A (en) * 2022-09-20 2022-11-01 之江实验室 Abnormal target detection method and device based on continuous time sequence point cloud superposition
CN116265862A (en) * 2021-12-16 2023-06-20 动态Ad有限责任公司 Vehicle, system and method for a vehicle, and storage medium
CN117058380A (en) * 2023-08-15 2023-11-14 北京学图灵教育科技有限公司 Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention
CN117333676A (en) * 2023-12-01 2024-01-02 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Point cloud feature extraction method and point cloud visual detection method based on graph expression
CN117994504A (en) * 2024-04-03 2024-05-07 国网江苏省电力有限公司常州供电分公司 Target detection method and target detection device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN111027401A (en) * 2019-11-15 2020-04-17 电子科技大学 End-to-end target detection method with integration of camera and laser radar
CN111583337A (en) * 2020-04-25 2020-08-25 华南理工大学 Omnibearing obstacle detection method based on multi-sensor fusion
CN111709343A (en) * 2020-06-09 2020-09-25 广州文远知行科技有限公司 Point cloud detection method and device, computer equipment and storage medium
US10929694B1 (en) * 2020-01-22 2021-02-23 Tsinghua University Lane detection method and system based on vision and lidar multi-level fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN111027401A (en) * 2019-11-15 2020-04-17 电子科技大学 End-to-end target detection method with integration of camera and laser radar
US10929694B1 (en) * 2020-01-22 2021-02-23 Tsinghua University Lane detection method and system based on vision and lidar multi-level fusion
CN111583337A (en) * 2020-04-25 2020-08-25 华南理工大学 Omnibearing obstacle detection method based on multi-sensor fusion
CN111709343A (en) * 2020-06-09 2020-09-25 广州文远知行科技有限公司 Point cloud detection method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢波; 赵亚男; 高利; 高峰: "Small-target semantic segmentation enhancement method based on laser radar point clouds" (基于激光雷达点云的小目标语义分割增强方法), 激光杂志 (Laser Journal), no. 04

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705631B (en) * 2021-08-10 2024-01-23 大庆瑞昂环保科技有限公司 3D point cloud target detection method based on graph convolution
CN113705631A (en) * 2021-08-10 2021-11-26 重庆邮电大学 3D point cloud target detection method based on graph convolution
CN113658257A (en) * 2021-08-17 2021-11-16 广州文远知行科技有限公司 Unmanned equipment positioning method, device, equipment and storage medium
CN113658257B (en) * 2021-08-17 2022-05-27 广州文远知行科技有限公司 Unmanned equipment positioning method, device, equipment and storage medium
CN113984037B (en) * 2021-09-30 2023-09-12 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate frame in any direction
CN113963044A (en) * 2021-09-30 2022-01-21 北京工业大学 RGBD camera-based intelligent loading method and system for cargo box
CN113963044B (en) * 2021-09-30 2024-04-30 北京工业大学 Cargo box intelligent loading method and system based on RGBD camera
CN113984037A (en) * 2021-09-30 2022-01-28 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate box in any direction
CN114140765B (en) * 2021-11-12 2022-06-24 北京航空航天大学 Obstacle sensing method and device and storage medium
CN114140765A (en) * 2021-11-12 2022-03-04 北京航空航天大学 Obstacle sensing method and device and storage medium
CN114359902A (en) * 2021-12-03 2022-04-15 武汉大学 Three-dimensional point cloud semantic segmentation method based on multi-scale feature fusion
CN114359902B (en) * 2021-12-03 2024-04-26 武汉大学 Three-dimensional point cloud semantic segmentation method based on multi-scale feature fusion
CN116265862A (en) * 2021-12-16 2023-06-20 动态Ad有限责任公司 Vehicle, system and method for a vehicle, and storage medium
CN114429631A (en) * 2022-01-27 2022-05-03 北京百度网讯科技有限公司 Three-dimensional object detection method, device, equipment and storage medium
CN114429631B (en) * 2022-01-27 2023-11-14 北京百度网讯科技有限公司 Three-dimensional object detection method, device, equipment and storage medium
CN114445802A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Point cloud processing method and device and vehicle
CN114998890A (en) * 2022-05-27 2022-09-02 长春大学 Three-dimensional point cloud target detection algorithm based on graph neural network
CN114998890B (en) * 2022-05-27 2023-03-10 长春大学 Three-dimensional point cloud target detection algorithm based on graph neural network
CN115272493B (en) * 2022-09-20 2022-12-27 之江实验室 Abnormal target detection method and device based on continuous time sequence point cloud superposition
CN115272493A (en) * 2022-09-20 2022-11-01 之江实验室 Abnormal target detection method and device based on continuous time sequence point cloud superposition
CN117058380B (en) * 2023-08-15 2024-03-26 北京学图灵教育科技有限公司 Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention
CN117058380A (en) * 2023-08-15 2023-11-14 北京学图灵教育科技有限公司 Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention
CN117333676A (en) * 2023-12-01 2024-01-02 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Point cloud feature extraction method and point cloud visual detection method based on graph expression
CN117333676B (en) * 2023-12-01 2024-04-02 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Point cloud feature extraction method and point cloud visual detection method based on graph expression
CN117994504A (en) * 2024-04-03 2024-05-07 国网江苏省电力有限公司常州供电分公司 Target detection method and target detection device

Also Published As

Publication number Publication date
CN113128348B (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN113128348B (en) Laser radar target detection method and system integrating semantic information
Zamanakos et al. A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving
US10078790B2 (en) Systems for generating parking maps and methods thereof
Wen et al. Deep learning-based perception systems for autonomous driving: A comprehensive survey
US20120263346A1 (en) Video-based detection of multiple object types under varying poses
Liang et al. A survey of 3D object detection
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
Xu et al. HA U-Net: Improved model for building extraction from high resolution remote sensing imagery
CN112270694B (en) Method for detecting urban environment dynamic target based on laser radar scanning pattern
CN116129233A (en) Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
Park et al. Drivable dirt road region identification using image and point cloud semantic segmentation fusion
Ngo et al. Cooperative perception with V2V communication for autonomous vehicles
CN116486368A (en) Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene
CN112613392A (en) Lane line detection method, device and system based on semantic segmentation and storage medium
CN115965970A (en) Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction
CN115100741A (en) Point cloud pedestrian distance risk detection method, system, equipment and medium
Chen et al. Multitarget vehicle tracking and motion state estimation using a novel driving environment perception system of intelligent vehicles
CN114118247A (en) Anchor-frame-free 3D target detection method based on multi-sensor fusion
Gomez-Donoso et al. Three-dimensional reconstruction using SFM for actual pedestrian classification
Huang et al. Overview of LiDAR point cloud target detection methods based on deep learning
AU2023203583A1 (en) Method for training neural network model and method for generating image
CN116664851A (en) Automatic driving data extraction method based on artificial intelligence
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN116453205A (en) Method, device and system for identifying stay behavior of commercial vehicle
Yu et al. YOLOv5-Based Dense Small Target Detection Algorithm for Aerial Images Using DIOU-NMS.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant