CN114419617A - Target detection method, device, equipment and storage medium - Google Patents

Target detection method, device, equipment and storage medium Download PDF

Info

Publication number
CN114419617A
Authority
CN
China
Prior art keywords
target
point
voxel
coordinate
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210101130.9A
Other languages
Chinese (zh)
Inventor
董小瑜
吕颖
杨斯琦
韩佳琪
吕铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FAW Group Corp
Original Assignee
FAW Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FAW Group Corp filed Critical FAW Group Corp
Priority to CN202210101130.9A
Publication of CN114419617A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method, apparatus, device, and storage medium. The method includes: acquiring original point cloud data; and inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target includes: a center point heat map, the center point position of the target, the category of the target, the offset of the target's center point coordinates, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle, and the z coordinate of the target's center point.

Description

Target detection method, device, equipment and storage medium
Technical Field
Embodiments of the invention relate to the field of computer technology, and in particular to a target detection method, device, equipment, and storage medium.
Background
In an automatic driving system, the perception module is the most fundamental module: it perceives information about the surrounding environment and thereby provides services to the upper-layer planning and control modules. Within the perception module, the most basic function is to recognize the surrounding environment, i.e., to complete the target detection task. Two-dimensional target detection that relies only on images is not sufficient for perceiving complex surroundings and cannot accurately detect and localize the various targets in a road environment, whereas three-dimensional target detection can obtain parameters such as the spatial distance, position, size, and orientation of objects, which greatly improves the practicality of perception. Point clouds are the raw data closest to the sensor, contain rich spatial information, and are the most suitable data for representing a three-dimensional scene, so point-cloud-based three-dimensional target detection algorithms have gradually become a research hotspot in the field of automatic driving.
VoxelNet is the first end-to-end three-dimensional target detection algorithm based on laser point clouds, but because it uses 3D CNNs in its intermediate layers, its inference speed is very low: detection on a single frame of point cloud data takes 230 ms, so execution efficiency is poor. To address the problem that targets in three-dimensional space do not follow any particular direction and rotated bounding boxes are difficult to fit with axis-aligned anchor boxes, some methods represent the three-dimensional target as a point, use a detector to find the target's center, and then regress the other three-dimensional attributes, including the target's size and orientation. Moreover, when existing algorithms encode the features of the point cloud within each voxel, they only consider the relationships among points inside the voxel and ignore the spatial relationship between points and voxels; this relationship is also important for three-dimensional target detection and provides context information between different objects, for example, that a car must be on the ground.
Disclosure of Invention
Embodiments of the present invention provide a target detection method, apparatus, device, and storage medium, which can adaptively fuse high-level abstract semantic features with low-level spatial features and effectively improve detection performance while only moderately increasing model complexity.
In a first aspect, an embodiment of the present invention provides a target detection method, including: acquiring original point cloud data;
inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target includes: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
In a second aspect, an embodiment of the present invention further provides an object detection apparatus, where the object detection apparatus includes:
the acquisition module is used for acquiring original point cloud data;
the detection module is used for inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target comprises: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the object detection method according to any one of the embodiments of the present invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the object detection method according to any one of the embodiments of the present invention.
The embodiment of the invention acquires original point cloud data and inputs the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, where the three-dimensional frame detection result of the target includes: a center point heat map, the center point position of the target, the category of the target, the offset of the target's center point coordinates, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle, and the z coordinate of the target's center point. In this way, high-level abstract semantic features and low-level spatial features are fused adaptively, and detection performance is effectively improved while model complexity increases only moderately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of a method of target detection in an embodiment of the invention;
FIG. 1a is a schematic diagram of a network structure of an aggregation layer in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer-readable storage medium containing a computer program in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
The term "include" and variations thereof as used herein are intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment".
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is a flowchart of a target detection method provided in an embodiment of the present invention. This embodiment is applicable to target detection scenarios; the method may be executed by a target detection apparatus according to an embodiment of the present invention, and the apparatus may be implemented in software and/or hardware. As shown in fig. 1, the target detection method specifically includes the following steps:
and S110, acquiring original point cloud data.
The original point cloud data may be a single input frame of point cloud data.
S120, inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of the target, wherein the three-dimensional frame detection result of the target comprises: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
Specifically, inputting the original point cloud data into the target neural network model to obtain the three-dimensional frame detection result of the target may proceed as follows: inputting the original point cloud data into a first convolution layer to obtain a tensor feature map; converting the tensor feature map into a two-dimensional feature map; inputting the two-dimensional feature map into an aggregation layer to obtain a target feature map; and inputting the target feature map into a second convolution layer to obtain the three-dimensional frame detection result of the target. A minimal sketch of this pipeline is given after the list below.
Optionally, inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of the target, including:
inputting the original point cloud data into a first convolution layer to obtain a tensor feature map;
converting the tensor feature map into a two-dimensional feature map;
inputting the two-dimensional feature map into an aggregation layer to obtain a target feature map;
and inputting the target feature map into the second convolution layer to obtain a three-dimensional frame detection result of the target.
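A minimal PyTorch-style sketch of how these four stages could be chained; the class and module names are our own illustration, and the reshape assumes the 4D tensor feature map is laid out as (batch, channels, depth, height, width):

```python
import torch.nn as nn

class TargetDetectionNet(nn.Module):
    """Hypothetical wiring of the four stages described above (names are illustrative)."""
    def __init__(self, backbone_3d, aggregation_2d, detection_head):
        super().__init__()
        self.backbone_3d = backbone_3d        # "first convolution layer": 3D sparse backbone
        self.aggregation_2d = aggregation_2d  # spatial-semantic aggregation layer
        self.detection_head = detection_head  # "second convolution layer": anchor-free head

    def forward(self, voxel_features):
        tensor_map = self.backbone_3d(voxel_features)   # 4D tensor feature map
        n, c, d, h, w = tensor_map.shape
        bev_map = tensor_map.view(n, c * d, h, w)       # reshape to a 2D pseudo-image
        target_map = self.aggregation_2d(bev_map)       # fuse spatial and semantic features
        return self.detection_head(target_map)          # heat map, offsets, size, angle, z
```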
Specifically, inputting the original point cloud data into the first convolution layer to obtain the tensor feature map may proceed as follows: performing voxelization on the original point cloud data to obtain at least one voxel; performing feature coding on the at least one voxel to obtain feature information of each point in the voxel, wherein the feature information includes: the coordinates of each point in the voxel, the reflection intensity of each point, the offset of each point from the mean coordinates (centroid) of all points in the voxel, and the offset of each point from the voxel center point; and inputting the feature information of each point into the first convolution layer to obtain a tensor feature map.
Specifically, the tensor feature map may be converted into a two-dimensional feature map by reshaping: for example, the feature map, which has been downsampled 8 times relative to the original, is reshaped into a two-dimensional feature map, which facilitates the subsequent 2D convolution operations.
Specifically, inputting the two-dimensional feature map into the aggregation layer to obtain the target feature map may proceed as follows: determining a spatial feature part and a semantic feature part from the two-dimensional feature map, and processing the spatial feature part and the semantic feature part separately to obtain a fused feature map.
Optionally, inputting the original point cloud data into a first convolution layer to obtain a tensor feature map includes:
performing voxelization on the original point cloud data to obtain at least one voxel;
performing feature coding on the at least one voxel to obtain feature information of each point in the voxel, wherein the feature information includes: the coordinates of each point in the voxel, the reflection intensity of each point, the offset of each point from the mean coordinates (centroid) of all points in the voxel, and the offset of each point from the voxel center point;
and inputting the feature information of each point into the first convolution layer to obtain a tensor feature map.
The first convolution layer is constructed from sub-manifold convolution and sparse convolution. For example, the network may first use a sub-manifold convolution with a 3 × 3 × 3 kernel, a stride of 1, and a padding of 1 to raise the feature channels to 16 dimensions, denoted SubMConv3D(3,1,1), and then extract point cloud features by stacking 4 Sparse Basic Block modules and sparse convolutions SparseConv3D. The Sparse Basic Block module adopts the ResNet basic residual structure: for each input, sub-manifold sparse convolution SubMConv3D is used for feature extraction and a residual connection is used for feature combination, which ensures the flow of deep information, avoids vanishing gradients, and improves the expressive power of the features. The parameters of the sparse convolution layers are denoted SparseConv3D(16,32,3,2,1), SparseConv3D(32,64,3,2,1), SparseConv3D(64,128,3,2,1), and SparseConv3D(128,256,[3,1,1],[1,1,2],1), where the first number is the input feature dimension, the second number is the output feature dimension, and the next three entries are the convolution kernel size, the convolution stride, and the padding parameter, respectively. Since the last SparseConv3D layer downsamples only along the z-axis, the base network downsamples three times in total, yielding a feature map downsampled 8 times relative to the original.
Optionally, after performing voxelization on the original point cloud data to obtain at least one voxel, the method further includes:
acquiring the number of points in each voxel;
if the number of points in a voxel is greater than or equal to a number threshold, randomly sampling the points in the voxel;
and if the number of points in a voxel is less than the number threshold, performing a zero-padding operation on the points in the voxel.
Specifically, in order to divide the point cloud into voxels, the original point cloud space is first cropped to obtain a regular region of uniform size for target detection, and the point cloud is then voxelized at a certain resolution. Voxelization divides the original point cloud space into cubes of equal volume, each of which is called a voxel. Meanwhile, to save computation and balance the differing numbers of points in different voxels, a threshold is set on the number of points per voxel: when the number of points in a voxel exceeds the threshold, the points in the voxel are randomly sampled; if it is below the threshold, a zero-padding operation is performed. Assuming the spatial range of the point cloud to be detected is W × H × D and each voxel has size ΔW × ΔH × ΔD, (W/ΔW) × (H/ΔH) × (D/ΔD) voxels are finally obtained.
In the actual data, the lidar coordinate system has the x-axis pointing to the right, the y-axis pointing forward, and the z-axis pointing upward; the point cloud within x, y ∈ [-51.2 m, 51.2 m] and z ∈ [-5 m, 3 m] is selected for detection. The size of each voxel is set to [0.1 m, 0.1 m, 0.2 m], which yields 1024 × 1024 × 40 voxels, and each voxel contains at most 35 points.
Optionally, the first convolution layer includes: sub-manifold convolution layers and sparse convolution layers.
In a specific example, a frame of original point cloud data is input and preprocessed, and discrete voxel features are extracted and aggregated by a point cloud feature coding module to obtain a four-dimensional tensor feature map. A 3D sparse convolution module is then used as the base network. After the base network, the feature map is reshaped into a 2D pseudo-image; a spatial-semantic feature aggregation module fuses the feature map rich in high-level semantic information with the feature map rich in low-level spatial position information to obtain robust features, deformable convolution improves the model's generalization to features such as target size, shape, and rotation angle, and an anchor-free detection head then produces the final network output.
(1) Voxel division:
In order to divide the point cloud into voxels, the original point cloud space is first cropped to obtain a regular region of uniform size for target detection, and the point cloud is then voxelized (Voxelization) at a certain resolution. Voxelization divides the original point cloud space into cubes of equal volume, each of which is called a voxel. Meanwhile, to save computation and balance the differing numbers of points in different voxels, a threshold is set on the number of points per voxel: when the number of points in a voxel exceeds the threshold, the points in the voxel are randomly sampled; if it is below the threshold, a zero-padding operation is performed. Assuming the spatial range of the point cloud to be detected is W × H × D and each voxel has size ΔW × ΔH × ΔD, (W/ΔW) × (H/ΔH) × (D/ΔD) voxels are finally obtained.
In the actual data, the lidar coordinate system has the x-axis pointing to the right, the y-axis pointing forward, and the z-axis pointing upward; the point cloud within x, y ∈ [-51.2 m, 51.2 m] and z ∈ [-5 m, 3 m] is selected for detection. The size of each voxel is set to [0.1 m, 0.1 m, 0.2 m], which yields 1024 × 1024 × 40 voxels, and each voxel contains at most 35 points.
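A minimal NumPy sketch of this cropping, voxel bucketing, random sampling, and zero padding, using the range, voxel size, and 35-point cap stated above (function and variable names are our own; the per-voxel Python loop favors clarity over speed):

```python
import numpy as np

def voxelize(points, pc_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0),
             voxel_size=(0.1, 0.1, 0.2), max_points=35):
    """points: (N, 4) array of x, y, z, reflectance. Crops to pc_range, buckets points
    into voxels, and randomly samples / zero-pads each voxel to exactly max_points."""
    x0, y0, z0, x1, y1, z1 = pc_range
    keep = ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1) &
            (points[:, 2] >= z0) & (points[:, 2] < z1))
    points = points[keep]

    # integer voxel index of every point on the (W/dw) x (H/dh) x (D/dd) grid
    idx = ((points[:, :3] - np.array([x0, y0, z0])) / np.array(voxel_size)).astype(np.int64)

    voxels, coords = [], []
    for key in np.unique(idx, axis=0):
        pts = points[(idx == key).all(axis=1)]
        if len(pts) >= max_points:            # too many points: random sampling
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        else:                                 # too few points: zero padding
            pad = np.zeros((max_points - len(pts), points.shape[1]))
            pts = np.concatenate([pts, pad], axis=0)
        voxels.append(pts)
        coords.append(key)
    return np.stack(voxels), np.stack(coords)   # (V, 35, 4) and (V, 3)
```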
(2) Voxel feature encoding module:
and performing feature coding on each divided voxel. Specifically, the embodiment of the present invention uses a 10-channel feature to represent the feature of each point in a voxel, where the channel feature is, in turn, the coordinates of the point, the reflection intensity r, the offset value of the point compared with the centroid of the point in the voxel, i.e., the coordinate mean value of all points in the voxel, and the point compared with the coordinate offset point of the center point of the voxel, and such a coding manner reflects the positional relationship between the point and the voxel and also reflects the positional relationship between the point and the point in the voxel.
For each voxel, a voxel feature encoding (VFE) module is used for feature extraction. The VFE module is essentially a small PointNet structure: PointNet shows that, for unordered point cloud data, a network can learn features of the unordered points by following a function-approximating neural network with a max pooling layer. Accordingly, the VFE module uses a fully connected network comprising a fully connected layer (FC), a batch normalization layer (BN), and a nonlinear activation function (ReLU) to obtain the features of each point, then applies max pooling over the point-wise features to obtain the voxel's global feature, and finally concatenates the global feature with the point-wise features to obtain the voxel features.
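A minimal PyTorch sketch of this 10-channel encoding and the PointNet-style VFE layer follows; the module name, channel sizes, tensor layouts, and the assumption that voxel indices are ordered (ix, iy, iz) are ours, not the patent's:

```python
import torch
import torch.nn as nn

def encode_points(voxels, coords, voxel_size=(0.1, 0.1, 0.2), pc_min=(-51.2, -51.2, -5.0)):
    """voxels: (V, T, 4) padded points (x, y, z, r) per voxel; coords: (V, 3) integer voxel indices.
    Returns (V, T, 10): xyz, reflectance, offset from the voxel centroid, offset from the voxel center.
    (Zero-padded points are not masked here, purely for brevity.)"""
    voxel_size, pc_min = torch.tensor(voxel_size), torch.tensor(pc_min)
    xyz = voxels[..., :3]
    centroid = xyz.mean(dim=1, keepdim=True)               # mean coordinates of points in the voxel
    center = (coords.float() + 0.5) * voxel_size + pc_min  # geometric center of the voxel
    return torch.cat([voxels, xyz - centroid, xyz - center.unsqueeze(1)], dim=-1)

class VFE(nn.Module):
    """FC + BN + ReLU per point, max pooling to a voxel-global feature,
    then concatenation of the global and point-wise features."""
    def __init__(self, in_ch=10, out_ch=64):
        super().__init__()
        self.fc = nn.Linear(in_ch, out_ch // 2)
        self.bn = nn.BatchNorm1d(out_ch // 2)
        self.relu = nn.ReLU()

    def forward(self, feats):                                # feats: (V, T, 10)
        x = self.fc(feats).transpose(1, 2)                   # (V, C, T) for BatchNorm1d
        x = self.relu(self.bn(x)).transpose(1, 2)            # back to (V, T, C)
        global_feat = x.max(dim=1, keepdim=True).values      # (V, 1, C) voxel-global feature
        return torch.cat([x, global_feat.expand_as(x)], dim=-1)  # (V, T, 2C)
```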
(3) 3D base network:
The base network is built with sub-manifold convolution and sparse convolution as basic modules. The network first uses a sub-manifold sparse convolution with a 3 × 3 × 3 kernel, a stride of 1, and a padding of 1 to raise the feature channels to 16 dimensions, denoted SubMConv3D(3,1,1), and then extracts point cloud features by stacking 4 Sparse Basic Block modules and sparse convolutions SparseConv3D, where ×2 denotes that a Sparse Basic Block is repeated twice. The Sparse Basic Block module adopts the ResNet basic residual structure: for each input, sub-manifold sparse convolution SubMConv3D is used for feature extraction and a residual connection is used for feature combination, which ensures the flow of deep information, avoids vanishing gradients, and improves the expressive power of the features. The parameters of the sparse convolution layers are denoted SparseConv3D(16,32,3,2,1), SparseConv3D(32,64,3,2,1), SparseConv3D(64,128,3,2,1), and SparseConv3D(128,256,[3,1,1],[1,1,2],1), where the first number is the input feature dimension, the second number is the output feature dimension, and the next three entries are the convolution kernel size, the convolution stride, and the padding parameter, respectively. Since the last SparseConv3D layer downsamples only along the z-axis, the base network downsamples three times in total, yielding a feature map downsampled 8 times relative to the original, which is then reshaped into a two-dimensional feature map for the subsequent 2D convolution operations. Each convolutional layer is followed by a batch normalization (BN) layer and a nonlinear activation (ReLU) layer.
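The layer schedule can be summarized as below. For a self-contained sketch we substitute dense nn.Conv3d layers for the sub-manifold and sparse convolutions (which in practice come from a sparse convolution library) and omit the residual Sparse Basic Blocks; the input channel count and the assumption that z is the last spatial axis are ours:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, s, p):
    """Dense Conv3d stand-in for SubMConv3D / SparseConv3D, followed by BN and ReLU."""
    return nn.Sequential(nn.Conv3d(in_ch, out_ch, k, s, p),
                         nn.BatchNorm3d(out_ch), nn.ReLU())

def backbone_3d(in_ch=10):  # input channel count is illustrative
    return nn.Sequential(
        conv_bn_relu(in_ch, 16, 3, 1, 1),                         # SubMConv3D(3,1,1): lift to 16 channels
        conv_bn_relu(16, 32, 3, 2, 1),                            # SparseConv3D(16,32,3,2,1)  -> 1/2
        conv_bn_relu(32, 64, 3, 2, 1),                            # SparseConv3D(32,64,3,2,1)  -> 1/4
        conv_bn_relu(64, 128, 3, 2, 1),                           # SparseConv3D(64,128,3,2,1) -> 1/8
        conv_bn_relu(128, 256, (3, 1, 1), (1, 1, 2), (1, 0, 0)),  # last layer strides z only
    )
```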
(4) 2D spatial-semantic feature aggregation module:
After the two-dimensional feature map is obtained, the invention designs a 2D spatial-semantic feature aggregation module to fuse high-level semantic features with low-level spatial features; the network structure of this module is shown in FIG. 1a. In FIG. 1a, after the 2D feature map is obtained from the base network, spatial features are first extracted by layer-by-layer convolution while keeping the feature map size unchanged, so as to avoid losing spatial information. For the semantic feature branch, a convolution with a stride of 2 is applied to the spatial features to obtain a feature map with twice as many feature channels and half the spatial size, thereby capturing more high-level abstract semantic information. In addition, two deconvolution layers upsample the semantic features by a factor of 2: one path is added element-wise to the previous spatial feature map to obtain richer spatial features, and the other path is fused with the new spatial features. To let the network fuse spatial and semantic features adaptively, an attention-based fusion method is used: the two types of feature maps are each passed through a 1 × 1 convolution to reduce their feature channels to 1 dimension, the results are concatenated along the channel dimension, and a SoftMax function normalizes them to obtain a weight for each feature map; the weight coefficients are multiplied point-wise with the original features and the results are added element-wise, finally producing the fused feature map, which is sent to the detection head of the network.
It should be noted that, for the two-dimensional feature map, the more convolutions are applied, the richer the semantic features become. High-level semantics are the feature information obtained after many convolutions (feature extraction); their receptive field is larger and the extracted features are more abstract, which benefits object classification but harms accurate segmentation because detail information is lost. High-level semantic features are abstract features.
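A simplified PyTorch sketch of this attention-based spatial-semantic fusion; a single spatial convolution, a single stride-2 semantic convolution, and one deconvolution stand in for the full module of FIG. 1a, and all names and channel counts are illustrative:

```python
import torch
import torch.nn as nn

class SpatialSemanticFusion(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(ch, ch, 3, 1, 1),
                                     nn.BatchNorm2d(ch), nn.ReLU())        # size preserved
        self.semantic = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, 2, 1),
                                      nn.BatchNorm2d(2 * ch), nn.ReLU())   # channels x2, size /2
        self.up = nn.ConvTranspose2d(2 * ch, ch, 2, stride=2)              # upsample back x2
        self.attn_spatial = nn.Conv2d(ch, 1, 1)                            # 1x1 conv -> 1 channel
        self.attn_semantic = nn.Conv2d(ch, 1, 1)

    def forward(self, x):
        spatial = self.spatial(x)
        semantic = self.up(self.semantic(spatial))
        # 1x1-convolve each branch to one channel, concatenate, SoftMax over the two branches
        w = torch.softmax(torch.cat([self.attn_spatial(spatial),
                                     self.attn_semantic(semantic)], dim=1), dim=1)
        return w[:, 0:1] * spatial + w[:, 1:2] * semantic                  # weighted element-wise sum
```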
(5) Anchor-free detection head:
The invention adopts an anchor-free detection head to detect three-dimensional targets. The output of the network comprises five parts. The first part is the center point heat map predicted by the network on the downsampled feature map: the value of each point in the heat map lies between 0 and 1, peaks in the heat map are selected by max pooling, and each peak point represents an object present in the scene, so the heat map predicted by the network gives the position of the object's center point in the bird's-eye view together with the category of the target. The second part is the predicted offset of the center point coordinates; because the center point position in the feature map may be shifted by rounding during convolutional downsampling, regressing this offset corrects the center point position. The remaining parts are the size of the three-dimensional frame, the z coordinate of the object's center point in three-dimensional space, and, for the rotation angle θ, the regressed sine and cosine of the angle, from which θ is determined. At inference time, peaks are first extracted from the predicted center point heat map to obtain the center point position, category, and confidence of each target; the other attributes are then read from the bounding box regression vector at that point and decoded to obtain the seven-dimensional parameters of the three-dimensional frame, namely the center point (x, y, z), the size (w, h, l), and the rotation angle θ, thereby realizing three-dimensional bounding frame detection. The decoding formula (given in the original as an image) uses down_ratio, the network downsampling factor, which here is 8.
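The decoding formula survives only as an image in the source, so the sketch below is a reconstruction from the surrounding description (peak extraction by max pooling, offset correction, down_ratio = 8, angle recovered from its sine and cosine) combined with standard center-based decoding; the score threshold, voxel size, range origin, and all names are our assumptions:

```python
import torch
import torch.nn.functional as F

def decode(heatmap, offset, size, z, angle_sin, angle_cos,
           down_ratio=8, voxel_size=(0.1, 0.1), pc_min=(-51.2, -51.2), score_thr=0.3):
    """heatmap: (num_classes, H, W); regression maps: (*, H, W). Returns a list of
    (score, class, (x, y, z), (w, h, l), theta) tuples."""
    # a cell is a peak if it survives 3x3 max pooling and clears the score threshold
    pooled = F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0]
    peaks = (heatmap == pooled) & (heatmap > score_thr)
    boxes = []
    for cls, ys, xs in zip(*peaks.nonzero(as_tuple=True)):
        dx, dy = offset[:, ys, xs]                                # sub-cell center correction
        # grid index + offset, mapped back to metric coordinates (our reconstruction)
        x = (xs + dx) * down_ratio * voxel_size[0] + pc_min[0]
        y = (ys + dy) * down_ratio * voxel_size[1] + pc_min[1]
        w, h, l = size[:, ys, xs]
        theta = torch.atan2(angle_sin[0, ys, xs], angle_cos[0, ys, xs])
        boxes.append((heatmap[cls, ys, xs].item(), cls.item(),
                      (x.item(), y.item(), z[0, ys, xs].item()),
                      (w.item(), h.item(), l.item()), theta.item()))
    return boxes
```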
The embodiment of the invention thus provides an encoding scheme that captures spatial context information between points and voxels. Existing algorithms expand the receptive field simply by stacking multiple convolution layers in order to extract high-level semantic information from the input data, but in this process the low-level spatial features are gradually lost; because low-level features have higher resolution and can provide accurate localization information, considering low-level spatial features and high-level semantic features simultaneously is crucial.
According to the technical solution of this embodiment, original point cloud data are acquired and input into a target neural network model to obtain a three-dimensional frame detection result of a target, where the three-dimensional frame detection result of the target includes: a center point heat map, the center point position of the target, the category of the target, the offset of the target's center point coordinates, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle, and the z coordinate of the target's center point. In this way, high-level abstract semantic features and low-level spatial features are fused adaptively, and detection performance is effectively improved while model complexity increases only moderately.
Fig. 2 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present invention. This embodiment is applicable to target detection scenarios; the apparatus may be implemented in software and/or hardware and may be integrated into any device that provides a target detection function. As shown in fig. 2, the target detection apparatus specifically includes: an acquisition module 210 and a detection module 220.
The acquisition module is used for acquiring original point cloud data;
the detection module is used for inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target comprises: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
Optionally, the detection module is specifically configured to:
inputting the original point cloud data into a first convolution layer to obtain a tensor feature map;
converting the tensor feature map into a two-dimensional feature map;
inputting the two-dimensional feature map into an aggregation layer to obtain a target feature map;
and inputting the target feature map into the second convolution layer to obtain a three-dimensional frame detection result of the target.
Optionally, the detection module is further configured to:
performing voxelization on the original point cloud data to obtain at least one voxel;
performing feature coding on the at least one voxel to obtain feature information of each point in the voxel, wherein the feature information includes: the coordinates of each point in the voxel, the reflection intensity of each point, the offset of each point from the mean coordinates (centroid) of all points in the voxel, and the offset of each point from the voxel center point;
and inputting the feature information of each point into the first convolution layer to obtain a tensor feature map.
This product can execute the method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to the executed method.
According to the technical solution of this embodiment, original point cloud data are acquired and input into a target neural network model to obtain a three-dimensional frame detection result of a target, where the three-dimensional frame detection result of the target includes: a center point heat map, the center point position of the target, the category of the target, the offset of the target's center point coordinates, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle, and the z coordinate of the target's center point. In this way, high-level abstract semantic features and low-level spatial features are fused adaptively, and detection performance is effectively improved while model complexity increases only moderately.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. FIG. 3 illustrates a block diagram of an electronic device 312 suitable for use in implementing embodiments of the present invention. The electronic device 312 shown in fig. 3 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention. Device 312 is a typical computing device.
As shown in fig. 3, electronic device 312 is in the form of a general purpose computing device. The components of the electronic device 312 may include, but are not limited to: one or more processors 316, a storage device 328, and a bus 318 that couples the various system components including the storage device 328 and the processors 316.
Bus 318 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 312 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 312 and includes both volatile and nonvolatile media, removable and non-removable media.
Storage 328 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 330 and/or cache Memory 332. The electronic device 312 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 334 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, and commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk-Read Only Memory (CD-ROM), a Digital Video disk (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 318 by one or more data media interfaces. Storage 328 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program 336 having a set (at least one) of program modules 326 may be stored, for example, in storage 328, such program modules 326 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which may comprise an implementation of a network environment, or some combination thereof. Program modules 326 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
Electronic device 312 may also communicate with one or more external devices 314 (e.g., keyboard, pointing device, camera, display 324, etc.), with one or more devices that enable a user to interact with electronic device 312, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 312 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 322. Also, the electronic device 312 may communicate with one or more networks (e.g., a Local Area Network (LAN), Wide Area Network (WAN), and/or a public Network, such as the internet) via the Network adapter 320. As shown, a network adapter 320 communicates with the other modules of the electronic device 312 via the bus 318. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 312, including but not limited to: microcode, device drivers, Redundant processing units, external disk drive Arrays, disk array (RAID) systems, tape drives, and data backup storage systems, to name a few.
The processor 316 executes various functional applications and data processing by executing programs stored in the storage 328, for example, to implement the object detection method provided by the above-described embodiment of the present invention:
acquiring original point cloud data;
inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target comprises the following steps: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
Fig. 4 is a schematic structural diagram of a computer-readable storage medium containing a computer program according to an embodiment of the present invention. Embodiments of the present invention provide a computer-readable storage medium 61, on which a computer program 610 is stored, which when executed by one or more processors implements an object detection method as provided by all inventive embodiments of the present application:
acquiring original point cloud data;
inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target comprises the following steps: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method of object detection, comprising:
acquiring original point cloud data;
inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target comprises: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
2. The method of claim 1, wherein inputting the raw point cloud data into a target neural network model to obtain a three-dimensional box detection result of a target comprises:
inputting the original point cloud data into a first convolution layer to obtain a tensor feature map;
converting the tensor feature map into a two-dimensional feature map;
inputting the two-dimensional feature map into an aggregation layer to obtain a target feature map;
and inputting the target feature map into the second convolution layer to obtain a three-dimensional frame detection result of the target.
3. The method of claim 2, wherein inputting the raw point cloud data into a first convolution layer to obtain a tensor feature map comprises:
performing voxelization on the original point cloud data to obtain at least one voxel;
performing feature coding on the at least one voxel to obtain feature information of each point in the voxel, wherein the feature information includes: the coordinates of each point in the voxel, the reflection intensity of each point, the offset of each point from the mean coordinates (centroid) of all points in the voxel, and the offset of each point from the voxel center point;
and inputting the feature information of each point into the first convolution layer to obtain a tensor feature map.
4. The method of claim 2, further comprising, after voxelizing the raw point cloud data to obtain at least one voxel:
acquiring the number of points in each voxel;
if the number of points in a voxel is greater than or equal to a number threshold, randomly sampling the points in the voxel;
and if the number of points in a voxel is less than the number threshold, performing a zero-padding operation on the points in the voxel.
5. The method of claim 2, wherein the first convolution layer comprises: sub-manifold convolution layers and sparse convolution layers.
6. An object detection device, comprising:
the acquisition module is used for acquiring original point cloud data;
the detection module is used for inputting the original point cloud data into a target neural network model to obtain a three-dimensional frame detection result of a target, wherein the three-dimensional frame detection result of the target comprises: the center point heat map, the center point position of the target, the category of the target, the offset of the center point coordinate of the target, the three-dimensional frame size, the sine value of the regression angle, the cosine value of the regression angle and the center point z coordinate of the target.
7. The apparatus of claim 6, wherein the detection module is specifically configured to:
inputting the original point cloud data into a first convolution layer to obtain a tensor feature map;
converting the tensor feature map into a two-dimensional feature map;
inputting the two-dimensional feature map into an aggregation layer to obtain a target feature map;
and inputting the target feature map into the second convolution layer to obtain a three-dimensional frame detection result of the target.
8. The apparatus of claim 7, wherein the detection module is further configured to:
performing voxelization on the original point cloud data to obtain at least one voxel;
performing feature coding on the at least one voxel to obtain feature information of each point in the voxel, wherein the feature information includes: the coordinates of each point in the voxel, the reflection intensity of each point, the offset of each point from the mean coordinates (centroid) of all points in the voxel, and the offset of each point from the voxel center point;
and inputting the feature information of each point into the first convolution layer to obtain a tensor feature map.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the processors to implement the method of any of claims 1-5.
10. A computer-readable storage medium containing a computer program, on which the computer program is stored, characterized in that the program, when executed by one or more processors, implements the method according to any one of claims 1-5.
CN202210101130.9A 2022-01-27 2022-01-27 Target detection method, device, equipment and storage medium Pending CN114419617A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210101130.9A CN114419617A (en) 2022-01-27 2022-01-27 Target detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210101130.9A CN114419617A (en) 2022-01-27 2022-01-27 Target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114419617A true CN114419617A (en) 2022-04-29

Family

ID=81279266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210101130.9A Pending CN114419617A (en) 2022-01-27 2022-01-27 Target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114419617A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273064A (en) * 2022-07-14 2022-11-01 中国人民解放军国防科技大学 Sparse event point small target segmentation method under complex motion background
CN115273064B (en) * 2022-07-14 2023-05-09 中国人民解放军国防科技大学 Sparse event point small target segmentation method under complex motion background
CN116340807A (en) * 2023-01-10 2023-06-27 中国人民解放军国防科技大学 Broadband spectrum signal detection and classification network
CN116340807B (en) * 2023-01-10 2024-02-13 中国人民解放军国防科技大学 Broadband Spectrum Signal Detection and Classification Network
CN117368876A (en) * 2023-10-18 2024-01-09 广州易而达科技股份有限公司 Human body detection method, device, equipment and storage medium
CN117368876B (en) * 2023-10-18 2024-03-29 广州易而达科技股份有限公司 Human body detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6745328B2 (en) Method and apparatus for recovering point cloud data
US10817752B2 (en) Virtually boosted training
CN114419617A (en) Target detection method, device, equipment and storage medium
EP4181079A1 (en) Method and apparatus with multi-modal feature fusion
CN109087349A (en) A kind of monocular depth estimation method, device, terminal and storage medium
CN112699806B (en) Three-dimensional point cloud target detection method and device based on three-dimensional heat map
US11875424B2 (en) Point cloud data processing method and device, computer device, and storage medium
EP3767332B1 (en) Methods and systems for radar object detection
WO2022206414A1 (en) Three-dimensional target detection method and apparatus
CN113762003A (en) Target object detection method, device, equipment and storage medium
CN112764004A (en) Point cloud processing method, device, equipment and storage medium
Maslov et al. Fast depth reconstruction using deep convolutional neural networks
US20230105331A1 (en) Methods and systems for semantic scene completion for sparse 3d data
CN116030206A (en) Map generation method, training device, electronic equipment and storage medium
Meng et al. Multi‐vehicle multi‐sensor occupancy grid map fusion in vehicular networks
US20240161478A1 (en) Multimodal Weakly-Supervised Three-Dimensional (3D) Object Detection Method and System, and Device
Tung et al. MF3D: Model-free 3D semantic scene parsing
US12026954B2 (en) Static occupancy tracking
JP2024521816A (en) Unrestricted image stabilization
CN114612572A (en) Laser radar and camera external parameter calibration method and device based on deep learning
EP4211651A1 (en) Efficient three-dimensional object detection from point clouds
Le et al. Simple linear iterative clustering based low-cost pseudo-LiDAR for 3D object detection in autonomous driving
US20240221386A1 (en) Occupancy tracking based on depth information
CN113343979B (en) Method, apparatus, device, medium and program product for training a model
EP3944137A1 (en) Positioning method and positioning apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination