CN110879994A - Three-dimensional visual inspection detection method, system and device based on shape attention mechanism - Google Patents
Three-dimensional visual inspection detection method, system and device based on shape attention mechanism
- Publication number
- CN110879994A (application CN201911213392.9A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- attention
- target
- module
- top view
- Prior art date
- 2019-12-02
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/513—Sparse representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Length Measuring Devices By Optical Means (AREA)
Abstract
The invention belongs to the fields of deep reinforcement learning and pattern recognition, and particularly relates to a three-dimensional visual inspection detection method, system and device based on a shape attention mechanism, aiming at solving the problems that a single-stage detector is less accurate than a two-stage detector, while a two-stage detector is too time-consuming to suit a real-time system. The invention comprises the following steps: representing the point cloud data by three-dimensional grid voxels; extracting features and encoding a spatial sparse feature map; projecting to a top view and extracting features at different scales; merging the multi-scale features through deconvolution layers; extracting a shape attention feature map through an attention weight layer and a convolution coding layer; and acquiring the target category, position, size and direction through a target classification network and a regression positioning network. The invention uses a sampling strategy based on distance constraints and an attention mechanism based on shape priors, which alleviates the instability caused by uneven data distribution and remedies the single-stage detector's lack of shape priors; the method achieves high precision, short running time, strong real-time performance and good robustness.
Description
Technical Field
The invention belongs to the fields of deep reinforcement learning, computer vision, pattern recognition and machine learning, and particularly relates to a three-dimensional visual inspection detection method, system and device based on a shape attention mechanism.
Background
Three-dimensional object detectors need to output reliable spatial and semantic information, i.e. three-dimensional position, orientation, occupied volume and category. Compared with two-dimensional object detection, a three-dimensional target provides more detailed information but is harder to model. Three-dimensional object detection typically employs distance sensors, such as lidar, TOF cameras and stereo cameras, to predict more meaningful and accurate results, and has become a key technology in fields such as autonomous driving, UAVs and robotics. Most accurate three-dimensional object detection algorithms in traffic scenes are based on lidar, which has become the basic sensor for outdoor scene perception; target perception in traffic scenes, i.e. locating surrounding targets, is in turn a key capability of unmanned vehicles.
Lidar-based three-dimensional target detection involves two important issues. The first is how to generate descriptive low-level features for the sparse, non-uniform point clouds sampled by lidar sensors: sampling points are dense near the sensor and sparse far away, and this uneven distribution can reduce detection performance and destabilize the results. Many methods rely on hand-crafted features, but such features do not account for the unbalanced laser point distribution well, so the resulting detection algorithms are unstable. Object detection and segmentation play an extremely important role in both visual data understanding and perception. The second issue is how to efficiently encode three-dimensional shape information to obtain a more discriminative embedding. Three-dimensional object detection frameworks mainly comprise single-stage and two-stage detectors: single-stage detectors are more efficient, while two-stage detectors are more accurate. A two-stage detector is less efficient because its region proposal network outputs regions of interest (ROIs) that must be cropped; however, these cropped ROIs provide a shape prior for each detected object, and the subsequent refinement network can therefore reach higher accuracy. A single-stage detector performs worse because it lacks this shape prior and refinement stage, yet two-stage detectors are too time-consuming for real-time systems. Moreover, a three-dimensional shape prior is naturally suited to the detection of three-dimensional targets.
Disclosure of Invention
In order to solve the above problems in the prior art, namely, the problems that the precision of a single-stage three-dimensional target detector is lower than that of a two-stage detector, and the two-stage detector consumes much time and is not suitable for a real-time system, the invention provides a three-dimensional visual inspection detection method based on a shape attention mechanism, which comprises the following steps:
step S10, laser point cloud data containing a target object is obtained as the data to be detected, and the data to be detected is represented by voxels based on a three-dimensional grid;
step S20, the feature expression of the voxels is obtained through a feature extractor and sparse convolutional coding is performed to obtain a spatial sparse feature map corresponding to the data to be detected;
step S30, the spatial sparse feature map is projected to a two-dimensional top-view plane, features of different scales are obtained through a feature pyramid convolutional network, and the features of different scales are then merged through deconvolution layers to obtain a top-view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
In some preferred embodiments, in step S10, "the data to be detected is represented by voxels based on a three-dimensional grid" is performed as:
D = {d_i = [x_i, y_i, z_i, r_i]}
wherein D represents the voxel representation of the laser point cloud data, x_i, y_i, z_i represent the three-dimensional position of the i-th point in the laser point cloud data relative to the lidar, and r_i represents the reflectivity of the i-th point in the laser point cloud data.
In some preferred embodiments, in step S20, "obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional coding to obtain a spatial sparse feature map corresponding to the data to be detected" includes:
f_s(x, y, z) = SparseConv(F(D))
wherein F() represents the voxel feature expression obtained by the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
In some preferred embodiments, in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:
F_att(u, v) = Conv_att(F_FPN(u, v))
wherein F_att(u, v) represents the attention weight feature map corresponding to the top-view feature map, F_FPN(u, v) represents the top-view feature map, and Conv_att() represents the attention weight layer convolution operation.
In some preferred embodiments, in step S40, "obtaining the encoding feature map of the top view feature map through a convolutional encoding layer", the method includes:
F_en(u, v) = Conv_en(F_FPN(u, v))
wherein F_en(u, v) represents the coding feature map corresponding to the top-view feature map, F_FPN(u, v) represents the top-view feature map, and Conv_en() represents the convolution coding layer convolution operation.
In some preferred embodiments, in step S50, "multiplying the attention weight feature map to the corresponding region of the coding feature map, and performing feature concatenation to obtain the attention feature map", the method includes:
F_op(u, v) = F_en(u, v) ⊙ Repeat(Reshape(F_att(u, v)))
wherein Reshape() represents the deformation operation, Repeat() represents the copy operation, and ⊙ denotes element-wise multiplication;
the spliced result is the attention feature map, wherein [ ] represents the feature splicing operation.
In some preferred embodiments, the target classification network is trained through a cross entropy loss function; the cross entropy loss function is:
L_cls = −(1/N) Σ_{i=1}^{N} [y_i log x_i + (1 − y_i) log(1 − x_i)]
wherein N represents the number of samples over which the loss is calculated; y_i marks positive and negative samples, with 0 representing a negative sample and 1 representing a positive sample; and x_i represents the network output value of a sample.
In some preferred embodiments, the target regression positioning network is trained through a Smooth L1 loss function; the Smooth L1 loss function is:
SmoothL1(x) = 0.5·x², if |x| < 1; |x| − 0.5, otherwise
where x represents the residual requiring regression.
On the other hand, the invention provides a three-dimensional visual inspection detection system based on a shape attention mechanism, which comprises an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as the data to be detected, and to represent the data to be detected by voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxels through a feature extractor and perform sparse convolutional coding to obtain the spatial sparse feature map corresponding to the data to be detected;
the feature pyramid module is configured to project the spatial sparse feature map to a two-dimensional top-view plane, obtain features of different scales through a feature pyramid convolutional network, and then merge the features of different scales through deconvolution layers to obtain the top-view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned three-dimensional visual inspection method based on the shape attention mechanism.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described three-dimensional visual inspection method based on the shape attention mechanism.
The invention has the beneficial effects that:
the three-dimensional visual inspection detection method based on the shape attention mechanism uses a sampling strategy based on distance constraint, can effectively relieve unstable results caused by uneven distribution of radar sampling point cloud data, solves the problem that a single-stage detector lacks shape prior through the attention mechanism based on the shape prior, can improve the detection performance of the conventional single-stage three-dimensional target detector, particularly aims at targets with obvious shape characteristics, is high in detection precision, short in detection time consumption, suitable for a real-time system and good in model robustness.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a three-dimensional visual inspection method based on a shape attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of an algorithm structure of an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention;
FIG. 3 is a data set and an exemplary graph of the inspection results of one embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention;
FIG. 4 is a graph showing the comparison of the results of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention with other methods.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention discloses a three-dimensional visual inspection method based on a shape attention mechanism, which comprises the following steps:
step S10, laser point cloud data containing a target object is obtained as the data to be detected, and the data to be detected is represented by voxels based on a three-dimensional grid;
step S20, the feature expression of the voxels is obtained through a feature extractor and sparse convolutional coding is performed to obtain a spatial sparse feature map corresponding to the data to be detected;
step S30, the spatial sparse feature map is projected to a two-dimensional top-view plane, features of different scales are obtained through a feature pyramid convolutional network, and the features of different scales are then merged through deconvolution layers to obtain a top-view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
In order to more clearly illustrate the three-dimensional visual inspection method based on the shape attention mechanism of the present invention, the following describes the steps in the embodiment of the method of the present invention in detail with reference to fig. 1.
The three-dimensional visual inspection method based on the shape attention mechanism comprises the following steps of S10-S60, wherein the steps are described in detail as follows:
step S10, laser point cloud data containing a target object is obtained as data to be detected, and the data to be detected is represented by a voxel based on a three-dimensional network, as shown in formula (1):
wherein D represents the voxel representation of the laser point cloud data, xi、yi、ziRepresenting the three-dimensional position information of the ith point in the laser radar point cloud in the laser point cloud data, RiRepresenting the reflectivity of the ith point in the laser point cloud data.
Assume the lidar point cloud occupies a three-dimensional space of H × W × D, representing the height in the vertical direction, the extent in the horizontal direction, and the distance, respectively; the size of each voxel is ΔH × ΔW × ΔD, with ΔH = 0.4 m, ΔW = 0.2 m and ΔD = 0.2 m. The size of the voxel grid over the whole three-dimensional space is then (H/ΔH) × (W/ΔW) × (D/ΔD). Each voxel is then characterized by a voxel feature encoding (VFE) layer. In one embodiment of the invention, the feature extractor describes the sample points in each voxel using 7-dimensional vectors (the three-dimensional coordinates, the reflectivity, and the relative three-dimensional coordinates within the voxel), and adds to each sample point the coordinates (P_x, P_y) of the current pillar center, so that the description vector of each sample point becomes 9-dimensional. In one embodiment of the invention, the VFE layer comprises a linear layer, a batch normalization (BN) layer and a rectified linear unit (ReLU) layer to extract the vector features of the points, as sketched below.
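The following is a minimal PyTorch sketch of such a VFE layer; the output width, the max-pooling aggregation over the points of a voxel, and the tensor layout are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """Voxel feature encoding: linear -> BN -> ReLU, then per-voxel max-pool."""
    def __init__(self, in_dim=9, out_dim=64):  # 9-dim point descriptors as above
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, points):
        # points: (num_voxels, max_points_per_voxel, 9)
        v, p, c = points.shape
        x = torch.relu(self.bn(self.linear(points.reshape(v * p, c))))
        # aggregate the per-point features into one descriptor per voxel
        return x.reshape(v, p, -1).max(dim=1).values  # (num_voxels, out_dim)
```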
Step S20, the feature expression of the voxels is obtained through the feature extractor and sparse convolutional coding is performed to obtain the spatial sparse feature map corresponding to the data to be detected, as shown in formula (2):
f_s(x, y, z) = SparseConv(F(D))    formula (2)
wherein F() represents the voxel feature expression obtained by the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
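Since most voxels are empty, practical implementations run this step with a sparse-convolution library; the dense PyTorch stand-in below (an assumption, chosen only to make the tensor shapes concrete) illustrates the role of the encoder.

```python
import torch.nn as nn

# input:  voxel features scattered into a dense (B, 64, D, H, W) grid
# output: the spatial sparse feature map f_s(x, y, z), here (B, 128, D/2, H/2, W/2)
sparse_encoder = nn.Sequential(
    nn.Conv3d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm3d(64), nn.ReLU(),
    nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),  # downsample the grid
    nn.BatchNorm3d(128), nn.ReLU(),
)
```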
Step S30, the spatial sparse feature map is projected to a two-dimensional top-view plane, features of different scales are obtained through a feature pyramid convolutional network, and the features of different scales are then merged through deconvolution layers to obtain the top-view feature map.
The spatial sparse feature map f_s(x, y, z) is projected to the top view (i.e. the bird's-eye view); that is, the vertical dimension of f_s(x, y, z) is compressed to obtain the top-view feature map f_2D(u, v). Specifically, assuming the original feature has shape (C, D, H, W), the height features are folded into the feature channels to give (C × D, H, W), on which the 2D convolutions then operate. Features of different scales are obtained from f_2D(u, v) through a feature pyramid convolutional network, and the features of different scales are merged through deconvolution layers to obtain the feature map f_FPN(u, v). In one embodiment of the present invention, the feature pyramid network comprises three convolutional groups having (3, 5) convolutional layers, each convolutional layer followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) layer.
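A minimal sketch of the projection and the pyramid merge is given below; the two-scale pyramid, strides and channel counts are illustrative assumptions (the embodiment above uses three convolutional groups), and even-sized feature maps are assumed so the upsampled scales align.

```python
import torch
import torch.nn as nn

def to_bev(f3d):
    # (B, C, D, H, W) -> (B, C*D, H, W): fold the height features into the channels
    b, c, d, h, w = f3d.shape
    return f3d.reshape(b, c * d, h, w)

class BEVPyramid(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(in_ch, 128, 3, padding=1),
                                    nn.BatchNorm2d(128), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1),
                                    nn.BatchNorm2d(256), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)  # deconvolution merge

    def forward(self, f2d):
        f1 = self.block1(f2d)   # full-resolution features
        f2 = self.block2(f1)    # half-resolution features
        return torch.cat([f1, self.up2(f2)], dim=1)  # top-view feature map f_FPN
```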
Step S40, the attention weight feature map of the top-view feature map is obtained through an attention weight layer, and the coding feature map of the top-view feature map is obtained through a convolution coding layer.
The attention weight feature map of the top-view feature map is obtained through the attention weight layer, as shown in formula (3):
F_att(u, v) = Conv_att(F_FPN(u, v))    formula (3)
wherein F_att(u, v) represents the attention weight feature map corresponding to the top-view feature map, F_FPN(u, v) represents the top-view feature map, and Conv_att() represents the attention weight layer convolution operation.
The coding feature map of the top-view feature map is obtained through the convolution coding layer, as shown in formula (4):
F_en(u, v) = Conv_en(F_FPN(u, v))    formula (4)
wherein F_en(u, v) represents the coding feature map corresponding to the top-view feature map, F_FPN(u, v) represents the top-view feature map, and Conv_en() represents the convolution coding layer convolution operation.
Step S50, the attention weight feature map is multiplied onto the corresponding region of the coding feature map, and feature splicing is performed to obtain the attention feature map, as shown in formulas (5) and (6):
F_op(u, v) = F_en(u, v) ⊙ Repeat(Reshape(F_att(u, v)))    formula (5)
wherein Reshape() represents the deformation operation, Repeat() represents the copy operation, and ⊙ denotes element-wise multiplication;
the result of the feature splicing of formula (6) is the attention feature map F_hybrid(u, v), wherein [ ] represents the feature splicing operation. A sketch of this step is given below.
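A minimal sketch of steps S40-S50, under the assumptions that the attention weights are sigmoid-normalized, that Reshape/Repeat broadcast each weight map over a group of channels, and that the splicing concatenates F_op with F_FPN (the patent does not reproduce these details):

```python
import torch
import torch.nn as nn

class ShapeAttention(nn.Module):
    def __init__(self, ch=256, groups=4):  # ch must be divisible by groups
        super().__init__()
        self.conv_att = nn.Conv2d(ch, groups, 1)        # attention weight layer
        self.conv_en = nn.Conv2d(ch, ch, 3, padding=1)  # convolution coding layer
        self.groups = groups

    def forward(self, f_fpn):
        b, c, h, w = f_fpn.shape
        f_att = torch.sigmoid(self.conv_att(f_fpn))     # (B, G, H, W), formula (3)
        f_en = self.conv_en(f_fpn)                      # (B, C, H, W), formula (4)
        # Reshape/Repeat: spread each of the G weight maps over C/G channels
        w_map = f_att.reshape(b, self.groups, 1, h, w)
        w_map = w_map.repeat(1, 1, c // self.groups, 1, 1).reshape(b, c, h, w)
        f_op = f_en * w_map                             # formula (5)
        return torch.cat([f_fpn, f_op], dim=1)          # feature splicing, formula (6)
```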
Step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
As shown in fig. 2, the algorithm structure of one embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention is divided into three parts: the first part is a distance-based voxel generator, which transforms the input lidar point cloud into voxels; the second part is the feature extraction layers, which encode the voxel features and the three-dimensional spatial features; the third part is the attention region proposal network (Attention RPN), into which the attention mechanism is injected to output the detection results.
The target classification network is trained through a cross entropy loss function, as shown in formula (7):
L_cls = −(1/N) Σ_{i=1}^{N} [y_i log x_i + (1 − y_i) log(1 − x_i)]    formula (7)
wherein N represents the number of samples over which the loss is calculated; y_i marks positive and negative samples, with 0 representing a negative sample and 1 representing a positive sample; and x_i represents the network output value of a sample.
The target regression positioning network is trained through a Smooth L1 loss function, as shown in formula (8):
SmoothL1(x) = 0.5·x², if |x| < 1; |x| − 0.5, otherwise    formula (8)
where x represents the residual requiring regression. Minimal sketches of both losses are given below.
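Both losses in their standard forms, as a sketch (assuming the classification branch outputs raw logits and the regression branch outputs box-parameter residuals):

```python
import torch
import torch.nn.functional as F

def classification_loss(logits, labels):
    # cross entropy of formula (7); labels are 0 (negative) / 1 (positive)
    return F.binary_cross_entropy_with_logits(logits, labels.float())

def smooth_l1_loss(residual, beta=1.0):
    # formula (8): quadratic near zero, linear in the tails
    r = residual.abs()
    return torch.where(r < beta, 0.5 * r ** 2 / beta, r - 0.5 * beta).mean()
```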
The attention feature map F_hybrid(u, v) is fed to the target classification network and the target regression positioning network respectively: the target classification network judges whether a detected object is a target, and the target regression positioning network obtains the position, size and direction of the detected object.
In one embodiment of the invention, for the class car in the target classification task, an anchor is set as a positive sample when its intersection over union (IoU) with a target is greater than 0.6, and as a negative sample when the IoU is less than 0.45; for the classes pedestrian and cyclist, an anchor is a positive sample when the IoU is greater than 0.5 and a negative sample when it is less than 0.35. For the regression positioning task, the width × length × height of the predefined anchor for the target vehicle is set to (1.6 × 3.9 × 1.5) meters; for the target pedestrian, (0.6 × 0.8 × 1.73) meters; and for the target cyclist, (0.6 × 1.76 × 1.73) meters. A three-dimensional ground-truth bounding box is defined as (x_g, y_g, z_g, l_g, w_g, h_g, θ_g), wherein x, y, z are the center position of the bounding box, l, w, h represent the length, width and height of the three-dimensional target, and θ is the rotation angle of the target around the Z axis; the subscript g denotes the ground-truth value, the subscript a denotes the positive-sample anchor, and Δ denotes the corresponding residual, through which the network learns to predict the position, size and direction of the real three-dimensional target. The residuals of the bounding box center (Δx, Δy, Δz), of the three-dimensional target size (Δl, Δw, Δh), and of the rotation angle around the Z axis (Δθ) are given by formulas (9), (10) and (11) respectively:
Δx = (x_g − x_a)/d_a,  Δy = (y_g − y_a)/d_a,  Δz = (z_g − z_a)/h_a    formula (9)
Δl = log(l_g/l_a),  Δw = log(w_g/w_a),  Δh = log(h_g/h_a)    formula (10)
Δθ = sin(θ_g − θ_a)    formula (11)
wherein d_a = √(l_a² + w_a²) is the diagonal of the anchor on the ground plane. A sketch of this encoding is given below.
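A sketch of this anchor-relative target encoding; the normalization by the anchor's ground-plane diagonal d_a follows the common convention for this family of detectors and is an assumption where the patent figures are not reproduced:

```python
import torch

def encode_targets(gt, anchor):
    # gt, anchor: (..., 7) tensors of (x, y, z, l, w, h, theta)
    xg, yg, zg, lg, wg, hg, tg = gt.unbind(-1)
    xa, ya, za, la, wa, ha, ta = anchor.unbind(-1)
    da = torch.sqrt(la ** 2 + wa ** 2)  # anchor diagonal on the ground plane
    return torch.stack([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,              # formula (9)
        torch.log(lg / la), torch.log(wg / wa), torch.log(hg / ha),  # formula (10)
        torch.sin(tg - ta),                                          # formula (11)
    ], dim=-1)
```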
To illustrate the effectiveness of the invention in detail, the proposed method is applied to the public autonomous-driving data set KITTI, which contains three evaluated object classes. Fig. 3 shows examples of the data set and the detection results of one embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention: the first column (Car) shows vehicle detection results, the second column (Pedestrian) pedestrian detection results, and the third column (Cyclist) cyclist detection results. Each column contains three groups of experimental results; each group comprises an RGB image and a radar top view, with the detection results projected onto the images.
In one embodiment of the invention, for the KITTI data set, the train split is used for training and the test split for testing. Fig. 4 compares the detection results of the method of the present invention with those of other methods. For each class of test object, the data set is divided into three difficulty levels, easy, moderate and hard, according to the height of each target in the camera image, the occlusion level and the degree of truncation. An easy sample has a bounding-box height of at least 40 pixels, a maximum truncation of 15%, and is fully visible; a moderate sample has a bounding-box height of at least 25 pixels, a maximum truncation of 30%, and is partly occluded; a hard sample has a bounding-box height of at least 25 pixels, a maximum truncation of 50%, and is hard to see. BEV denotes the top-view detection results and 3D the detection results of the three-dimensional bounding box. The 3D target detection performance is evaluated with the PASCAL criterion (AP, average precision). In the comparison, ARPNET denotes the proposed method; MV3D is a multi-view 3D target detection method; ContFuse is a deep continuous-fusion multi-sensor 3D target detection method; AVOD aggregates multi-view data for real-time 3D object detection in driverless scenes; F-PointNet is a frustum point-cloud network for 3D object detection from RGB-D data; SECOND is a sparsely embedded convolutional detection method; and VoxelNet is an end-to-end-learned 3D target detection method on point cloud data.
The three-dimensional visual inspection detection system based on the shape attention mechanism comprises an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as the data to be detected, and to represent the data to be detected by voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxels through a feature extractor and perform sparse convolutional coding to obtain the spatial sparse feature map corresponding to the data to be detected;
the feature pyramid module is configured to project the spatial sparse feature map to a two-dimensional top-view plane, obtain features of different scales through a feature pyramid convolutional network, and then merge the features of different scales through deconvolution layers to obtain the top-view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the three-dimensional visual inspection system based on the shape attention mechanism provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the above embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the above-mentioned three-dimensional visual inspection method based on the shape attention mechanism.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described three-dimensional visual inspection method based on the shape attention mechanism.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
Claims (11)
1. A three-dimensional visual inspection method based on a shape attention mechanism is characterized by comprising the following steps:
step S10, laser point cloud data containing a target object is obtained as the data to be detected, and the data to be detected is represented by voxels based on a three-dimensional grid;
step S20, the feature expression of the voxels is obtained through a feature extractor and sparse convolutional coding is performed to obtain a spatial sparse feature map corresponding to the data to be detected;
step S30, the spatial sparse feature map is projected to a two-dimensional top-view plane, features of different scales are obtained through a feature pyramid convolutional network, and the features of different scales are then merged through deconvolution layers to obtain a top-view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
2. The three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S10, "the data to be detected is represented by voxels based on a three-dimensional grid" is performed as:
D = {d_i = [x_i, y_i, z_i, r_i]}
wherein D represents the voxel representation of the laser point cloud data, x_i, y_i, z_i represent the three-dimensional position of the i-th point in the laser point cloud data relative to the lidar, and r_i represents the reflectivity of the i-th point in the laser point cloud data.
3. The three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S20, "obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional coding to obtain a spatial sparse feature map corresponding to the data to be detected" includes:
f_s(x, y, z) = SparseConv(F(D))
wherein F() represents the voxel feature expression obtained by the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
4. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:
F_att(u, v) = Conv_att(F_FPN(u, v))
wherein F_att(u, v) represents the attention weight feature map corresponding to the top-view feature map, F_FPN(u, v) represents the top-view feature map, and Conv_att() represents the attention weight layer convolution operation.
5. A three-dimensional visual inspection method based on shape attention mechanism according to claim 1, wherein in step S40, "obtaining the encoding feature map of the top view feature map by convolution encoding layer" comprises:
F_en(u, v) = Conv_en(F_FPN(u, v))
wherein F_en(u, v) represents the coding feature map corresponding to the top-view feature map, F_FPN(u, v) represents the top-view feature map, and Conv_en() represents the convolution coding layer convolution operation.
6. The three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S50, "multiplying the attention weight feature map onto the corresponding region of the coding feature map and performing feature splicing to obtain the attention feature map" includes:
F_op(u, v) = F_en(u, v) ⊙ Repeat(Reshape(F_att(u, v)))
wherein Reshape() represents the deformation operation, Repeat() represents the copy operation, and ⊙ denotes element-wise multiplication;
the spliced result is the attention feature map, wherein [ ] represents the feature splicing operation.
7. The three-dimensional visual inspection method based on a shape attention mechanism according to any one of claims 1-6, wherein the target classification network is trained through a cross entropy loss function; the cross entropy loss function is:
L_cls = −(1/N) Σ_{i=1}^{N} [y_i log x_i + (1 − y_i) log(1 − x_i)]
wherein N represents the number of samples over which the loss is calculated; y_i marks positive and negative samples, with 0 representing a negative sample and 1 representing a positive sample; and x_i represents the network output value of a sample.
8. The three-dimensional visual inspection method based on a shape attention mechanism according to any one of claims 1-6, wherein the target regression positioning network is trained through a Smooth L1 loss function; the Smooth L1 loss function is:
SmoothL1(x) = 0.5·x², if |x| < 1; |x| − 0.5, otherwise
where x represents the residual requiring regression.
9. A three-dimensional visual inspection detection system based on a shape attention mechanism, comprising an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as the data to be detected, the data to be detected being represented by voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxels through a feature extractor and perform sparse convolutional coding to obtain the spatial sparse feature map corresponding to the data to be detected;
the feature pyramid module is configured to project the spatial sparse feature map to a two-dimensional top-view plane, obtain features of different scales through a feature pyramid convolutional network, and then merge the features of different scales through deconvolution layers to obtain the top-view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
10. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for three-dimensional visual inspection based on the shape attention mechanism of any one of claims 1 to 8.
11. A processing apparatus, comprising
A processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
wherein the program is adapted to be loaded and executed by a processor to perform:
the three-dimensional visual inspection method based on the shape attention mechanism as set forth in any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911213392.9A CN110879994A (en) | 2019-12-02 | 2019-12-02 | Three-dimensional visual inspection detection method, system and device based on shape attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911213392.9A CN110879994A (en) | 2019-12-02 | 2019-12-02 | Three-dimensional visual inspection detection method, system and device based on shape attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110879994A true CN110879994A (en) | 2020-03-13 |
Family
ID=69729811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911213392.9A Pending CN110879994A (en) | 2019-12-02 | 2019-12-02 | Three-dimensional visual inspection detection method, system and device based on shape attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110879994A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723719A (en) * | 2020-06-12 | 2020-09-29 | 中国科学院自动化研究所 | Video target detection method, system and device based on category external memory |
CN111862101A (en) * | 2020-07-15 | 2020-10-30 | 西安交通大学 | 3D point cloud semantic segmentation method under aerial view coding visual angle |
CN111985378A (en) * | 2020-08-13 | 2020-11-24 | 中国第一汽车股份有限公司 | Road target detection method, device and equipment and vehicle |
CN112257605A (en) * | 2020-10-23 | 2021-01-22 | 中国科学院自动化研究所 | Three-dimensional target detection method, system and device based on self-labeling training sample |
CN112347987A (en) * | 2020-11-30 | 2021-02-09 | 江南大学 | Multimode data fusion three-dimensional target detection method |
CN112418421A (en) * | 2020-11-06 | 2021-02-26 | 常州大学 | Roadside end pedestrian trajectory prediction algorithm based on graph attention self-coding model |
CN112464905A (en) * | 2020-12-17 | 2021-03-09 | 湖南大学 | 3D target detection method and device |
CN112668469A (en) * | 2020-12-28 | 2021-04-16 | 西安电子科技大学 | Multi-target detection and identification method based on deep learning |
CN112884723A (en) * | 2021-02-02 | 2021-06-01 | 贵州电网有限责任公司 | Insulator string detection method in three-dimensional laser point cloud data |
CN113095172A (en) * | 2021-03-29 | 2021-07-09 | 天津大学 | Point cloud three-dimensional object detection method based on deep learning |
CN113269147A (en) * | 2021-06-24 | 2021-08-17 | 浙江海康智联科技有限公司 | Three-dimensional detection method and system based on space and shape, and storage and processing device |
CN113807184A (en) * | 2021-08-17 | 2021-12-17 | 北京百度网讯科技有限公司 | Obstacle detection method and device, electronic equipment and automatic driving vehicle |
CN114663879A (en) * | 2022-02-09 | 2022-06-24 | 中国科学院自动化研究所 | Target detection method and device, electronic equipment and storage medium |
CN115082902A (en) * | 2022-07-22 | 2022-09-20 | 松立控股集团股份有限公司 | Vehicle target detection method based on laser radar point cloud |
CN115183782A (en) * | 2022-09-13 | 2022-10-14 | 毫末智行科技有限公司 | Multi-modal sensor fusion method and device based on joint space loss |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102896630A (en) * | 2011-07-25 | 2013-01-30 | 索尼公司 | Robot device, method of controlling the same, computer program, and robot system |
US20160063754A1 (en) * | 2014-08-26 | 2016-03-03 | The Boeing Company | System and Method for Detecting a Structural Opening in a Three Dimensional Point Cloud |
CN106778856A (en) * | 2016-12-08 | 2017-05-31 | 深圳大学 | A kind of object identification method and device |
CN108133191A (en) * | 2017-12-25 | 2018-06-08 | 燕山大学 | A kind of real-time object identification method suitable for indoor environment |
US20180210896A1 (en) * | 2015-07-22 | 2018-07-26 | Hangzhou Hikvision Digital Technology Co., Ltd. | Method and device for searching a target in an image |
US20190147245A1 (en) * | 2017-11-14 | 2019-05-16 | Nuro, Inc. | Three-dimensional object detection for autonomous robotic systems using image proposals |
CN110070025A (en) * | 2019-04-17 | 2019-07-30 | 上海交通大学 | Objective detection system and method based on monocular image |
CN110458112A (en) * | 2019-08-14 | 2019-11-15 | 上海眼控科技股份有限公司 | Vehicle checking method, device, computer equipment and readable storage medium storing program for executing |
- 2019
  - 2019-12-02 CN CN201911213392.9A patent/CN110879994A/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102896630A (en) * | 2011-07-25 | 2013-01-30 | 索尼公司 | Robot device, method of controlling the same, computer program, and robot system |
US20160063754A1 (en) * | 2014-08-26 | 2016-03-03 | The Boeing Company | System and Method for Detecting a Structural Opening in a Three Dimensional Point Cloud |
US20180210896A1 (en) * | 2015-07-22 | 2018-07-26 | Hangzhou Hikvision Digital Technology Co., Ltd. | Method and device for searching a target in an image |
CN106778856A (en) * | 2016-12-08 | 2017-05-31 | 深圳大学 | A kind of object identification method and device |
US20180165547A1 (en) * | 2016-12-08 | 2018-06-14 | Shenzhen University | Object Recognition Method and Device |
US20190147245A1 (en) * | 2017-11-14 | 2019-05-16 | Nuro, Inc. | Three-dimensional object detection for autonomous robotic systems using image proposals |
CN108133191A (en) * | 2017-12-25 | 2018-06-08 | 燕山大学 | A kind of real-time object identification method suitable for indoor environment |
CN110070025A (en) * | 2019-04-17 | 2019-07-30 | 上海交通大学 | Objective detection system and method based on monocular image |
CN110458112A (en) * | 2019-08-14 | 2019-11-15 | 上海眼控科技股份有限公司 | Vehicle checking method, device, computer equipment and readable storage medium storing program for executing |
Non-Patent Citations (5)
Title |
---|
YANGYANG YE ET AL: "ARPNET: attention region proposal network for 3D object detection", Science China Information Sciences *
YANGYANG YE ET AL: "SARPNET: shape attention regional proposal network for LiDAR-based 3D object detection", Neurocomputing *
YIN ZHOU ET AL: "VoxelNet: end-to-end learning for point cloud based 3D object detection", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
ZHAO HUAQING: "Prior orientation angle estimation in three-dimensional object detection", Transducer and Microsystem Technologies *
CHEN MIN: "Introduction to Cognitive Computing", 30 April 2017 *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723719A (en) * | 2020-06-12 | 2020-09-29 | 中国科学院自动化研究所 | Video target detection method, system and device based on category external memory |
CN111862101A (en) * | 2020-07-15 | 2020-10-30 | 西安交通大学 | 3D point cloud semantic segmentation method under aerial view coding visual angle |
CN111985378A (en) * | 2020-08-13 | 2020-11-24 | 中国第一汽车股份有限公司 | Road target detection method, device and equipment and vehicle |
CN112257605A (en) * | 2020-10-23 | 2021-01-22 | 中国科学院自动化研究所 | Three-dimensional target detection method, system and device based on self-labeling training sample |
CN112418421A (en) * | 2020-11-06 | 2021-02-26 | 常州大学 | Roadside end pedestrian trajectory prediction algorithm based on graph attention self-coding model |
CN112418421B (en) * | 2020-11-06 | 2024-01-23 | 常州大学 | Road side end pedestrian track prediction algorithm based on graph attention self-coding model |
CN112347987A (en) * | 2020-11-30 | 2021-02-09 | 江南大学 | Multimode data fusion three-dimensional target detection method |
CN112464905A (en) * | 2020-12-17 | 2021-03-09 | 湖南大学 | 3D target detection method and device |
CN112464905B (en) * | 2020-12-17 | 2022-07-26 | 湖南大学 | 3D target detection method and device |
CN112668469A (en) * | 2020-12-28 | 2021-04-16 | 西安电子科技大学 | Multi-target detection and identification method based on deep learning |
CN112884723A (en) * | 2021-02-02 | 2021-06-01 | 贵州电网有限责任公司 | Insulator string detection method in three-dimensional laser point cloud data |
CN112884723B (en) * | 2021-02-02 | 2022-08-12 | 贵州电网有限责任公司 | Insulator string detection method in three-dimensional laser point cloud data |
CN113095172A (en) * | 2021-03-29 | 2021-07-09 | 天津大学 | Point cloud three-dimensional object detection method based on deep learning |
CN113269147A (en) * | 2021-06-24 | 2021-08-17 | 浙江海康智联科技有限公司 | Three-dimensional detection method and system based on space and shape, and storage and processing device |
CN113807184A (en) * | 2021-08-17 | 2021-12-17 | 北京百度网讯科技有限公司 | Obstacle detection method and device, electronic equipment and automatic driving vehicle |
CN114663879A (en) * | 2022-02-09 | 2022-06-24 | 中国科学院自动化研究所 | Target detection method and device, electronic equipment and storage medium |
CN114663879B (en) * | 2022-02-09 | 2023-02-21 | 中国科学院自动化研究所 | Target detection method and device, electronic equipment and storage medium |
CN115082902A (en) * | 2022-07-22 | 2022-09-20 | 松立控股集团股份有限公司 | Vehicle target detection method based on laser radar point cloud |
CN115082902B (en) * | 2022-07-22 | 2022-11-11 | 松立控股集团股份有限公司 | Vehicle target detection method based on laser radar point cloud |
CN115183782A (en) * | 2022-09-13 | 2022-10-14 | 毫末智行科技有限公司 | Multi-modal sensor fusion method and device based on joint space loss |
CN115183782B (en) * | 2022-09-13 | 2022-12-09 | 毫末智行科技有限公司 | Multi-modal sensor fusion method and device based on joint space loss |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110879994A (en) | Three-dimensional visual inspection detection method, system and device based on shape attention mechanism | |
CN112257605B (en) | Three-dimensional target detection method, system and device based on self-labeling training sample | |
US20190065824A1 (en) | Spatial data analysis | |
CN113052109A (en) | 3D target detection system and 3D target detection method thereof | |
CN115049700A (en) | Target detection method and device | |
CN113269147B (en) | Three-dimensional detection method and system based on space and shape, and storage and processing device | |
CN110298281B (en) | Video structuring method and device, electronic equipment and storage medium | |
CN113267761B (en) | Laser radar target detection and identification method, system and computer readable storage medium | |
CN112287824B (en) | Binocular vision-based three-dimensional target detection method, device and system | |
CN111709923A (en) | Three-dimensional object detection method and device, computer equipment and storage medium | |
CN113240734B (en) | Vehicle cross-position judging method, device, equipment and medium based on aerial view | |
EP4174792A1 (en) | Method for scene understanding and semantic analysis of objects | |
CN114463736A (en) | Multi-target detection method and device based on multi-mode information fusion | |
CN113362385A (en) | Cargo volume measuring method and device based on depth image | |
CN116246119A (en) | 3D target detection method, electronic device and storage medium | |
CN115588047A (en) | Three-dimensional target detection method based on scene coding | |
CN116051489A (en) | Bird's eye view perspective characteristic diagram processing method and device, electronic equipment and storage medium | |
US20240193788A1 (en) | Method, device, computer system for detecting pedestrian based on 3d point clouds | |
CN117726880A (en) | Traffic cone 3D real-time detection method, system, equipment and medium based on monocular camera | |
Tao et al. | SiLVR: Scalable Lidar-Visual Reconstruction with Neural Radiance Fields for Robotic Inspection | |
Giosan et al. | Superpixel-based obstacle segmentation from dense stereo urban traffic scenarios using intensity, depth and optical flow information | |
Saleem et al. | Effects of ground manifold modeling on the accuracy of stixel calculations | |
Palmer et al. | Scale proportionate histograms of oriented gradients for object detection in co-registered visual and range data | |
CN111414848B (en) | Full-class 3D obstacle detection method, system and medium | |
CN113177903B (en) | Fusion method, system and equipment of foreground point cloud and background point cloud |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200313