CN110879994A - Three-dimensional visual inspection detection method, system and device based on shape attention mechanism - Google Patents

Three-dimensional visual inspection detection method, system and device based on shape attention mechanism

Info

Publication number
CN110879994A
Authority
CN
China
Prior art keywords
feature map
attention
target
module
top view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911213392.9A
Other languages
Chinese (zh)
Inventor
张兆翔
张驰
叶阳阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911213392.9A priority Critical patent/CN110879994A/en
Publication of CN110879994A publication Critical patent/CN110879994A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/64: Scenes; Scene-specific elements; Type of objects; Three-dimensional objects
    • G06V 10/25: Image preprocessing; Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/462: Extraction of image or video features; Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/513: Extraction of image or video features; Sparse representations
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer deep reinforcement learning and pattern recognition, and particularly relates to a three-dimensional visual inspection detection method, system and device based on a shape attention mechanism, aiming at solving the problems that the precision of a single-stage detector is lower than that of a two-stage detector, while the two-stage detector is time-consuming and unsuitable for real-time systems. The invention comprises the following steps: representing the point cloud data by three-dimensional grid voxels; extracting features and encoding a spatial sparse feature map; projecting to a top view and extracting features of different scales; merging the features with deconvolution layers; extracting a shape attention feature map through an attention weight layer and a convolutional encoding layer; and acquiring the target category, position, size and direction through a target classification network and a regression positioning network. The invention uses a sampling strategy based on distance constraints and an attention mechanism based on shape priors, alleviates the instability caused by uneven data distribution, improves on the lack of shape prior in single-stage detectors, and offers high precision, short running time, strong real-time performance and good robustness.

Description

Three-dimensional visual inspection detection method, system and device based on shape attention mechanism
Technical Field
The invention belongs to the field of deep reinforcement learning, computer vision, pattern recognition and machine learning, and particularly relates to a three-dimensional visual inspection detection method, system and device based on a shape attention mechanism.
Background
Three-dimensional object detectors need to output reliable spatial and semantic information, i.e. the three-dimensional position, orientation, occupied volume and category. Compared with two-dimensional object detection, three-dimensional detection provides more detailed information about the target, but the modeling difficulty is higher. Three-dimensional object detection typically employs distance sensors, such as lidar, TOF cameras and stereo cameras, to predict more meaningful and accurate results. It has become a key technology in fields such as autonomous driving, UAVs and robotics. Most accurate three-dimensional object detection algorithms in traffic scenes are based on lidar sensors, which have become the basic sensor for outdoor scene perception, and target perception in traffic scenes, i.e. locating surrounding targets, is a key technology for unmanned vehicles.
Lidar-based three-dimensional target detection involves two important issues. The first is how to generate descriptive low-level features for the sparse, non-uniform point clouds sampled by lidar sensors. Lidar sampling points are dense close to the sensor and sparse far away from it. This uneven distribution of the point cloud may reduce the detection performance of a detector and make the detection results unstable. Many methods rely on hand-crafted feature extraction; however, such detection algorithms are not stable because hand-crafted features do not account for or handle the unbalanced distribution of the laser point cloud well. Object detection and segmentation play an extremely important role in both visual data understanding and perception. The second issue is how to efficiently encode three-dimensional shape information to achieve a more discriminative embedding. Three-dimensional object detection frameworks mainly comprise single-stage detectors and two-stage detectors: the single-stage detector is more efficient, while the two-stage detector reaches higher detection precision. The two-stage detector is less efficient because the region proposal network outputs regions of interest (ROIs) that need to be cropped; however, these cropped ROIs provide a shape prior for each detected object, and higher detection accuracy can be achieved through the subsequent optimization network. The performance of a single-stage detector is lower than that of a two-stage detector due to the lack of shape priors and a subsequent optimization network, yet for real-time systems two-stage detectors are very time consuming. In addition, a three-dimensional shape prior is better suited to the detection of three-dimensional targets.
Disclosure of Invention
In order to solve the above problems in the prior art, namely, the problems that the precision of a single-stage three-dimensional target detector is lower than that of a two-stage detector, and the two-stage detector consumes much time and is not suitable for a real-time system, the invention provides a three-dimensional visual inspection detection method based on a shape attention mechanism, which comprises the following steps:
step S10, laser point cloud data containing a target object are obtained to serve as data to be detected, and the data to be detected are represented through voxels based on a three-dimensional grid;
step S20, acquiring the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected;
step S30, projecting the spatial sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then combining the features of different scales through deconvolution layers to obtain a top view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
In some preferred embodiments, in step S10, "the data to be detected are represented through voxels based on a three-dimensional grid", which is performed by:
D = { p_i = [x_i, y_i, z_i, R_i] }
wherein D represents the voxel representation of the laser point cloud data, x_i, y_i, z_i represent the three-dimensional position information of the ith point in the laser point cloud data relative to the laser radar, and R_i represents the reflectivity of the ith point in the laser point cloud data.
In some preferred embodiments, in step S20, "acquiring the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected", the method includes:
f_s(x, y, z) = F(D)
wherein F(·) represents the feature representation of the voxel obtained by the feature extractor, f_s(x, y, z) is the spatial sparse feature map, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
In some preferred embodiments, in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:
F_att(u, v) = Conv_att(F_FPN(u, v))
wherein F_att(u, v) represents the attention weight feature map corresponding to the top view feature map, F_FPN(u, v) represents the top view feature map, and Conv_att(·) represents the attention weight layer convolution operation.
In some preferred embodiments, in step S40, "obtaining the encoding feature map of the top view feature map through a convolutional encoding layer", the method includes:
F_en(u, v) = Conv_en(F_FPN(u, v))
wherein F_en(u, v) represents the coding feature map corresponding to the top view feature map, F_FPN(u, v) represents the top view feature map, and Conv_en(·) represents the convolutional encoding layer convolution operation.
In some preferred embodiments, in step S50, "multiplying the attention weight feature map to the corresponding region of the coding feature map, and performing feature concatenation to obtain the attention feature map", the method includes:
F_op(u, v) = F_en(u, v) ⊙ Repeat(Reshape(F_att(u, v)))
wherein ⊙ denotes element-wise multiplication, Reshape(·) represents the reshaping (deformation) operation, and Repeat(·) represents the copy operation;
F_hybrid(u, v) = [F_op(u, v), F_FPN(u, v)]
wherein [·, ·] represents the feature concatenation (splicing) operation.
In some preferred embodiments, the target classification network is trained by a cross entropy loss function; the cross entropy loss function is:
L_cls = -(1/N) Σ_i [ y_i·log(x_i) + (1 - y_i)·log(1 - x_i) ]
wherein N represents the number of samples over which the loss is calculated; y_i denotes whether a sample is positive or negative, with 0 representing a negative sample and 1 a positive sample; and x_i represents the network output value for the sample.
In some preferred embodiments, the target regression positioning network is trained by a Smooth L1 loss function; the Smooth L1 loss function is:
SmoothL1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where x represents the residual requiring regression.
On the other hand, the invention provides a three-dimensional visual inspection detection system based on a shape attention mechanism, which comprises an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as data to be detected and to represent the data to be detected through voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxel through a feature extractor and carry out sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected;
the feature pyramid module is configured to project the spatial sparse feature map to a two-dimensional top view plane, obtain features of different scales through a feature pyramid convolution network, and then combine the features of different scales through deconvolution layers to obtain a top view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned three-dimensional visual inspection method based on the shape attention mechanism.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described three-dimensional visual inspection method based on the shape attention mechanism.
The invention has the beneficial effects that:
the three-dimensional visual inspection detection method based on the shape attention mechanism uses a sampling strategy based on distance constraints, which effectively alleviates the unstable results caused by the uneven distribution of radar point cloud data, and solves the lack of shape prior in single-stage detectors through an attention mechanism based on shape priors. It can therefore improve the detection performance of conventional single-stage three-dimensional target detectors, especially for targets with distinct shape characteristics, with high detection precision, short detection time, suitability for real-time systems and good model robustness.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a three-dimensional visual inspection method based on a shape attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of an algorithm structure of an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention;
FIG. 3 is a data set and an exemplary graph of the inspection results of one embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention;
FIG. 4 is a graph showing the comparison of the results of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention with other methods.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention discloses a three-dimensional visual inspection method based on a shape attention mechanism, which comprises the following steps:
step S10, laser point cloud data containing a target object are obtained to serve as data to be detected, and the data to be detected are represented through voxels based on a three-dimensional grid;
step S20, acquiring the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected;
step S30, projecting the spatial sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then combining the features of different scales through deconvolution layers to obtain a top view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
In order to more clearly illustrate the three-dimensional visual inspection method based on the shape attention mechanism of the present invention, the following describes the steps in the embodiment of the method of the present invention in detail with reference to fig. 1.
The three-dimensional visual inspection method based on the shape attention mechanism comprises the following steps of S10-S60, wherein the steps are described in detail as follows:
Step S10, laser point cloud data containing a target object is obtained as data to be detected, and the data to be detected is represented by voxels based on a three-dimensional grid, as shown in formula (1):
D = { p_i = [x_i, y_i, z_i, R_i] }    formula (1)
wherein D represents the voxel representation of the laser point cloud data, x_i, y_i, z_i represent the three-dimensional position information of the ith point of the lidar point cloud in the laser point cloud data, and R_i represents the reflectivity of the ith point in the laser point cloud data.
Assume the lidar point cloud occupies a three-dimensional space of extent H × W × D, where H, W and D represent the height in the vertical direction, the extent in the horizontal direction and the distance, respectively, and the size of each voxel is ΔH × ΔW × ΔD with ΔH = 0.4 m, ΔW = 0.2 m and ΔD = 0.2 m. The size of the voxel grid over the whole three-dimensional space can then be calculated as H/ΔH × W/ΔW × D/ΔD. Each voxel is characterized by a voxel feature encoding layer (VFE). In one embodiment of the invention, the feature extractor describes the sample points in each voxel with 7-dimensional vectors (the three-dimensional coordinates, the reflectivity, and the three-dimensional coordinates relative to the voxel), and adds to each sample point the coordinates (P_x, P_y) of the current pillar center, so that the description vector of each sample point becomes 9-dimensional. In one embodiment of the invention, the feature encoding layer (VFE) consists of a linear layer, a batch normalization layer (BN) and a rectified linear unit layer (ReLU) to extract the vector features of the points.
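Purely for illustration, and not as the patented implementation, the grid arithmetic and the pillar-style voxel feature encoding layer described above could be sketched in PyTorch as follows; the detection range, the maximum number of points per voxel and the output channel width are assumptions introduced here, not values fixed by the invention.

```python
import torch
import torch.nn as nn

# Assumed detection range in metres (H: vertical, W: horizontal, D: distance)
# and the voxel sizes ΔH, ΔW, ΔD given in the description above.
H_RANGE, W_RANGE, D_RANGE = 4.0, 80.0, 70.4
DH, DW, DD = 0.4, 0.2, 0.2
GRID_SIZE = (int(H_RANGE / DH), int(W_RANGE / DW), int(D_RANGE / DD))  # H/ΔH x W/ΔW x D/ΔD

class VFELayer(nn.Module):
    """Voxel feature encoding: linear -> batch norm -> ReLU, then max-pool over the
    points of each voxel, applied to the 9-dimensional point descriptors."""
    def __init__(self, in_dim: int = 9, out_dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (num_voxels, max_points_per_voxel, 9)
        n_v, n_p, _ = points.shape
        x = self.linear(points.reshape(n_v * n_p, -1))
        x = self.relu(self.bn(x)).reshape(n_v, n_p, -1)
        return x.max(dim=1).values        # one feature vector per voxel
```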
Step S20, obtaining the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected, as shown in formula (2):
f_s(x, y, z) = F(D)    formula (2)
wherein F(·) represents the feature representation of the voxel obtained by the feature extractor, f_s(x, y, z) is the spatial sparse feature map, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
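Step S20 relies on sparse convolution for efficiency; as a simplified stand-in (a real system would use a sparse convolution library to exploit the sparsity of the voxel grid), the encoder f_s = F(D) can be sketched with dense 3D convolutions, where the channel counts and strides are assumptions:

```python
import torch.nn as nn

class MiddleEncoder(nn.Module):
    """Dense 3D-convolution stand-in for the sparse convolutional encoder that
    produces the spatial sparse feature map f_s(x, y, z) from the voxel features."""
    def __init__(self, in_ch: int = 64, out_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=(2, 1, 1), padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=(2, 1, 1), padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, voxel_volume):
        # voxel_volume: (B, C, D, H, W) dense voxel feature volume
        return self.net(voxel_volume)
```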
Step S30, projecting the spatial sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then combining the features of different scales through deconvolution layers to obtain a top view feature map.
The spatial sparse feature map f_s(x, y, z) is projected to a top view (i.e. a bird's-eye view): the vertical dimension of f_s(x, y, z) is compressed to obtain the top-view feature map f_2D(u, v). Specifically, assuming the original feature has shape (C, D, H, W), the height dimension is folded into the feature channels to give (C × D, H, W), and a 2D-convolution feature map of the top view is obtained. Features of different scales of f_2D(u, v) are obtained through a feature pyramid convolution network, and the features of different scales are combined through deconvolution layers to obtain the feature map F_FPN(u, v). In one embodiment of the present invention, the feature pyramid convolution layer comprises three convolution groups with (3, 5) convolutional layers, each convolutional layer being followed by a batch normalization layer (BN) and a rectified linear unit layer (ReLU).
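The projection and merge of this step can be sketched as below; the group depths (3, 5 and 5 convolutional layers), the channel width and the up-sampling factors are assumptions for illustration, since the text above only fixes the overall structure:

```python
import torch
import torch.nn as nn

def to_bev(f_s: torch.Tensor) -> torch.Tensor:
    """Fold the vertical dimension into the channels: (B, C, D, H, W) -> (B, C*D, H, W)."""
    b, c, d, h, w = f_s.shape
    return f_s.reshape(b, c * d, h, w)

class BEVFeaturePyramid(nn.Module):
    """Three convolution groups at strides 1, 2 and 2; deconvolutions bring the three
    scales back to a common resolution and the results are concatenated as F_FPN(u, v).
    Assumes the bird's-eye-view map size is divisible by 4."""
    def __init__(self, in_ch: int, ch: int = 128):
        super().__init__()
        def group(cin, cout, stride, depth):
            layers = [nn.Conv2d(cin, cout, 3, stride, 1), nn.BatchNorm2d(cout), nn.ReLU(True)]
            for _ in range(depth - 1):
                layers += [nn.Conv2d(cout, cout, 3, 1, 1), nn.BatchNorm2d(cout), nn.ReLU(True)]
            return nn.Sequential(*layers)
        self.g1 = group(in_ch, ch, 1, 3)
        self.g2 = group(ch, ch, 2, 5)
        self.g3 = group(ch, ch, 2, 5)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(ch, ch, 1, 1), nn.BatchNorm2d(ch), nn.ReLU(True))
        self.up2 = nn.Sequential(nn.ConvTranspose2d(ch, ch, 2, 2), nn.BatchNorm2d(ch), nn.ReLU(True))
        self.up3 = nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, 4), nn.BatchNorm2d(ch), nn.ReLU(True))

    def forward(self, f2d: torch.Tensor) -> torch.Tensor:
        x1 = self.g1(f2d)
        x2 = self.g2(x1)
        x3 = self.g3(x2)
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)  # F_FPN(u, v)
```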
Step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; and acquiring the coding characteristic diagram of the top view characteristic diagram through a convolution coding layer.
Acquiring an attention weight characteristic diagram of the top view characteristic diagram through an attention weight layer, wherein the formula (3) is as follows:
F_att(u, v) = Conv_att(F_FPN(u, v))    formula (3)
wherein F_att(u, v) represents the attention weight feature map corresponding to the top view feature map, F_FPN(u, v) represents the top view feature map, and Conv_att(·) represents the attention weight layer convolution operation.
Acquiring a coding feature map of the top view feature map through a convolution coding layer, wherein the formula (4) is as follows:
F_en(u, v) = Conv_en(F_FPN(u, v))    formula (4)
wherein F_en(u, v) represents the coding feature map corresponding to the top view feature map, F_FPN(u, v) represents the top view feature map, and Conv_en(·) represents the convolutional encoding layer convolution operation.
Step S50, multiplying the attention weight feature map to the corresponding region of the coding feature map, and performing feature concatenation to obtain an attention feature map, as shown in equations (5) and (6):
F_op(u, v) = F_en(u, v) ⊙ Repeat(Reshape(F_att(u, v)))    formula (5)
wherein ⊙ denotes element-wise multiplication, Reshape(·) represents the reshaping (deformation) operation, and Repeat(·) represents the copy operation;
F_hybrid(u, v) = [F_op(u, v), F_FPN(u, v)]    formula (6)
wherein [·, ·] represents the feature concatenation (splicing) operation.
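The attention branch of formulas (3) to (6) can be sketched as follows; the single-channel sigmoid weight map, the channel widths and the choice to concatenate F_op(u, v) with F_FPN(u, v) are assumptions where the description leaves the exact layer shapes open:

```python
import torch
import torch.nn as nn

class ShapeAttention(nn.Module):
    """Shape attention over the top-view feature map: an attention weight map F_att
    re-weights the encoded features F_en element-wise (formula (5)), and the result
    F_op is concatenated with the input to form the hybrid attention map F_hybrid."""
    def __init__(self, in_ch: int, en_ch: int = 128):
        super().__init__()
        self.conv_att = nn.Sequential(nn.Conv2d(in_ch, 1, 1), nn.Sigmoid())   # formula (3)
        self.conv_en = nn.Conv2d(in_ch, en_ch, 3, padding=1)                  # formula (4)

    def forward(self, f_fpn: torch.Tensor) -> torch.Tensor:
        f_att = self.conv_att(f_fpn)              # (B, 1, H, W) attention weights
        f_en = self.conv_en(f_fpn)                # (B, en_ch, H, W) encoded features
        f_op = f_en * f_att.expand_as(f_en)       # Repeat/Reshape then element-wise product
        return torch.cat([f_op, f_fpn], dim=1)    # F_hybrid(u, v), formula (6)
```

The weighting lets regions whose bird's-eye-view footprint matches the expected object shape contribute more strongly to the subsequent classification and regression heads.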
Step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
As shown in fig. 2, the schematic diagram of the algorithm structure of one embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention is divided into three parts: the first part is a distance-based voxel generator (Distance-based Voxel Generator), which transforms the input lidar point cloud into voxels; the second part is the feature extraction layers (Feature Extraction Layers), which encode the voxel features and the three-dimensional spatial features; the third part is the attention region proposal network (Attention RPN), into which the attention mechanism is injected and which outputs the detection result.
The target classification network is trained through a cross entropy loss function, wherein the cross entropy loss function is shown in formula (7):
L_cls = -(1/N) Σ_i [ y_i·log(x_i) + (1 - y_i)·log(1 - x_i) ]    formula (7)
wherein N represents the number of samples over which the loss is calculated; y_i denotes whether a sample is positive or negative, with 0 representing a negative sample and 1 a positive sample; and x_i represents the network output value for the sample.
The target regression positioning network is trained by a Smooth L1 loss function, which is shown in formula (8):
SmoothL1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise    formula (8)
where x represents the residual of the regression.
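As a compact sketch, both losses can be written as below; binary cross-entropy with logits is used as a numerically stable equivalent of formula (7), and the beta parameter defaults to 1 so that the Smooth L1 branch matches formula (8):

```python
import torch
import torch.nn.functional as F

def classification_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over the N sampled anchors, formula (7); labels are 0/1."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())

def smooth_l1_loss(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1 loss over the regression residuals, formula (8)."""
    diff = torch.abs(pred - target)
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()
```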
The attention feature map F_hybrid(u, v) is connected to a target classification network and a target regression positioning network respectively, wherein the target classification network is used for judging whether the detected object is a target, and the target regression positioning network is used for acquiring the position, the size and the direction of the detected object.
In one embodiment of the invention, for the class car in the target classification task, an anchor is set as a positive sample when its intersection over union (IoU) with a target is greater than 0.6, and as a negative sample when the IoU is less than 0.45; for the classes pedestrian and cyclist, an anchor is a positive sample when its IoU with a target is greater than 0.5 and a negative sample when the IoU is less than 0.35. For the regression positioning task, the predefined anchor corresponding to a target vehicle is set to width × length × height = (1.6 × 3.9 × 1.5) meters; the predefined anchor for a target pedestrian is (0.6 × 0.8 × 1.73) meters; and the predefined anchor for a target cyclist is (0.6 × 1.76 × 1.73) meters. A three-dimensional ground-truth bounding box is defined as (x_g, y_g, z_g, l_g, w_g, h_g, θ_g), where x, y, z are the center position of the bounding box, l, w, h are the length, width and height of the three-dimensional target, θ is the rotation angle of the target around the Z axis, the subscript g denotes the ground-truth value, the subscript a denotes the positive-sample anchor, and Δ denotes the corresponding residual; the position, size and direction of the real three-dimensional target are predicted through network learning. The residual (Δx, Δy, Δz) of the bounding-box center position, the residual (Δl, Δw, Δh) of the three-dimensional target size, and the residual Δθ of the rotation angle around the Z axis are given by formulas (9), (10) and (11), respectively:
Δx = (x_g - x_a)/d_a,  Δy = (y_g - y_a)/d_a,  Δz = (z_g - z_a)/h_a,  d_a = √(l_a² + w_a²)    formula (9)
Δl = log(l_g/l_a),  Δw = log(w_g/w_a),  Δh = log(h_g/h_a)    formula (10)
Δθ = sin(θ_g - θ_a)    formula (11)
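A sketch of the target encoding of formulas (9) to (11) is given below; the diagonal normalisation d_a of the x and y residuals follows the convention common to voxel-based detectors and is an assumption here, as is the 7-tuple box layout:

```python
import math

def encode_box(gt, anchor):
    """Residuals of formulas (9)-(11); gt and anchor are (x, y, z, l, w, h, theta) tuples."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = math.sqrt(la ** 2 + wa ** 2)           # anchor diagonal used to normalise x and y
    return (
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,            # centre residuals, formula (9)
        math.log(lg / la), math.log(wg / wa), math.log(hg / ha),   # size residuals, formula (10)
        math.sin(tg - ta),                                         # orientation residual, formula (11)
    )
```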
To illustrate the effectiveness of the invention in detail, the method proposed by the invention is applied to the public driverless data set KITTI, which contains 3 validation classes. As shown in fig. 3, which is an exemplary diagram of a data set and a detection result of an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention, a first column Car represents a detection result of a vehicle, a second column Pedestrian represents a detection result of a Pedestrian, and a third column Cyclist represents a detection result of a rider. Each column has three groups of experimental results, each group comprises an RGB image and a top view of the radar, and the detection results are projected on the images.
In one embodiment of the invention, for the KITTI data set, the train split is used for training and the test split is used for testing. As shown in fig. 4, which is a comparison of the detection results of the method of the present invention with other methods according to an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism, the data set divides each type of test object into three difficulty levels: easy, moderate and hard. The difficulty is determined by the height of each target in the camera image, the occlusion level and the degree of truncation. For the easy level, the height of the sample bounding box is greater than or equal to 40 pixels, the maximum truncation is 15%, and the target is fully visible; for the moderate level, the height of the sample bounding box is greater than or equal to 25 pixels, the maximum truncation is 30%, and the target is partly occluded; for the hard level, the height of the sample bounding box is greater than or equal to 25 pixels, the maximum truncation is 50%, and the target is difficult to see. BEV denotes the top-view detection results and 3D denotes the detection results of the three-dimensional bounding box. The 3D target detection performance is evaluated with the PASCAL criterion (AP, average precision). In the comparison, ARPNET denotes the method of the invention, MV3D denotes a multi-view 3D target detection method, ContFuse denotes a deep continuous-fusion multi-sensor 3D target detection method, AVOD denotes a method that aggregates multi-view data for real-time 3D object detection in unmanned-driving scenes, F-PointNet denotes a frustum point-cloud network for 3D object detection from RGB-D data, SECOND denotes a sparsely embedded convolutional target detection method, and VoxelNet denotes an end-to-end learning-based 3D target detection method for point cloud data.
The three-dimensional visual inspection detection system based on the shape attention mechanism comprises an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as data to be detected and to represent the data to be detected through voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxel through a feature extractor and carry out sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected;
the feature pyramid module is configured to project the spatial sparse feature map to a two-dimensional top view plane, obtain features of different scales through a feature pyramid convolution network, and then combine the features of different scales through deconvolution layers to obtain a top view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the three-dimensional visual inspection system based on the shape attention mechanism provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the above embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the above-mentioned three-dimensional visual inspection method based on the shape attention mechanism.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described three-dimensional visual inspection method based on the shape attention mechanism.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (11)

1. A three-dimensional visual inspection method based on a shape attention mechanism is characterized by comprising the following steps:
step S10, laser point cloud data containing a target object are obtained to serve as data to be detected, and the data to be detected are represented through voxels based on a three-dimensional grid;
step S20, acquiring the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected;
step S30, projecting the spatial sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then combining the features of different scales through deconvolution layers to obtain a top view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
2. The three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S10, "the data to be detected are represented through voxels based on a three-dimensional grid" is performed by:
D = { p_i = [x_i, y_i, z_i, R_i] }
wherein D represents the voxel representation of the laser point cloud data, x_i, y_i, z_i represent the three-dimensional position information of the ith point in the laser point cloud data relative to the laser radar, and R_i represents the reflectivity of the ith point in the laser point cloud data.
3. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S20, "obtaining the feature expression of the voxel by a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected" includes:
f_s(x, y, z) = F(D)
wherein F(·) represents the feature representation of the voxel obtained by the feature extractor, f_s(x, y, z) is the spatial sparse feature map, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
4. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:
F_att(u, v) = Conv_att(F_FPN(u, v))
wherein F_att(u, v) represents the attention weight feature map corresponding to the top view feature map, F_FPN(u, v) represents the top view feature map, and Conv_att(·) represents the attention weight layer convolution operation.
5. A three-dimensional visual inspection method based on shape attention mechanism according to claim 1, wherein in step S40, "obtaining the encoding feature map of the top view feature map by convolution encoding layer" comprises:
F_en(u, v) = Conv_en(F_FPN(u, v))
wherein F_en(u, v) represents the coding feature map corresponding to the top view feature map, F_FPN(u, v) represents the top view feature map, and Conv_en(·) represents the convolutional encoding layer convolution operation.
6. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S50, "multiplying the attention weight feature map to the corresponding region of the coding feature map and performing feature splicing to obtain the attention feature map" includes:
F_op(u, v) = F_en(u, v) ⊙ Repeat(Reshape(F_att(u, v)))
wherein ⊙ denotes element-wise multiplication, Reshape(·) represents the reshaping (deformation) operation, and Repeat(·) represents the copy operation;
F_hybrid(u, v) = [F_op(u, v), F_FPN(u, v)]
wherein [·, ·] represents the feature concatenation (splicing) operation.
7. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the object classification network is trained by cross entropy loss function; the cross entropy loss function is:
L_cls = -(1/N) Σ_i [ y_i·log(x_i) + (1 - y_i)·log(1 - x_i) ]
wherein N represents the number of samples over which the loss is calculated; y_i denotes whether a sample is positive or negative, with 0 representing a negative sample and 1 a positive sample; and x_i represents the network output value for the sample.
8. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the target regression positioning network is trained by Smooth L1 loss function; the Smooth L1 loss function is:
SmoothL1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where x represents the residual of the regression.
9. A three-dimensional visual inspection detection system based on a shape attention mechanism, characterized by comprising an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as data to be detected, and the data to be detected is represented by voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxel through a feature extractor and carry out sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be detected;
the feature pyramid module is configured to project the spatial sparse feature map to a two-dimensional top view plane, obtain features of different scales through a feature pyramid convolution network, and then combine the features of different scales through deconvolution layers to obtain a top view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
10. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for three-dimensional visual inspection based on the shape attention mechanism of any one of claims 1 to 8.
11. A processing apparatus, comprising
A processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
wherein the program is adapted to be loaded and executed by a processor to perform:
the three-dimensional visual inspection method based on the shape attention mechanism as set forth in any one of claims 1 to 8.
CN201911213392.9A 2019-12-02 2019-12-02 Three-dimensional visual inspection detection method, system and device based on shape attention mechanism Pending CN110879994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911213392.9A CN110879994A (en) 2019-12-02 2019-12-02 Three-dimensional visual inspection detection method, system and device based on shape attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911213392.9A CN110879994A (en) 2019-12-02 2019-12-02 Three-dimensional visual inspection detection method, system and device based on shape attention mechanism

Publications (1)

Publication Number Publication Date
CN110879994A true CN110879994A (en) 2020-03-13

Family

ID=69729811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911213392.9A Pending CN110879994A (en) 2019-12-02 2019-12-02 Three-dimensional visual inspection detection method, system and device based on shape attention mechanism

Country Status (1)

Country Link
CN (1) CN110879994A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102896630A (en) * 2011-07-25 2013-01-30 索尼公司 Robot device, method of controlling the same, computer program, and robot system
US20160063754A1 (en) * 2014-08-26 2016-03-03 The Boeing Company System and Method for Detecting a Structural Opening in a Three Dimensional Point Cloud
US20180210896A1 (en) * 2015-07-22 2018-07-26 Hangzhou Hikvision Digital Technology Co., Ltd. Method and device for searching a target in an image
CN106778856A (en) * 2016-12-08 2017-05-31 深圳大学 A kind of object identification method and device
US20180165547A1 (en) * 2016-12-08 2018-06-14 Shenzhen University Object Recognition Method and Device
US20190147245A1 (en) * 2017-11-14 2019-05-16 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN108133191A (en) * 2017-12-25 2018-06-08 燕山大学 A kind of real-time object identification method suitable for indoor environment
CN110070025A (en) * 2019-04-17 2019-07-30 上海交通大学 Objective detection system and method based on monocular image
CN110458112A (en) * 2019-08-14 2019-11-15 上海眼控科技股份有限公司 Vehicle checking method, device, computer equipment and readable storage medium storing program for executing

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YANGYANG YE ET AL: "ARPNET: attention region proposal network for 3D object detection", SCIENCE CHINA INFORMATION SCIENCES *
YANGYANG YE ET AL: "SARPNET: Shape attention regional proposal network for LiDAR-based 3D object detection", NEUROCOMPUTING *
YIN ZHOU ET AL: "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION *
ZHAO HUAQING: "Prior orientation angle estimation in 3D object detection" (三维目标检测中的先验方向角估计), TRANSDUCER AND MICROSYSTEM TECHNOLOGIES (《传感器与微系统》) *
CHEN MIN: "Introduction to Cognitive Computing" (《认知计算导论》), 30 April 2017 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723719A (en) * 2020-06-12 2020-09-29 中国科学院自动化研究所 Video target detection method, system and device based on category external memory
CN111862101A (en) * 2020-07-15 2020-10-30 西安交通大学 3D point cloud semantic segmentation method under aerial view coding visual angle
CN111985378A (en) * 2020-08-13 2020-11-24 中国第一汽车股份有限公司 Road target detection method, device and equipment and vehicle
CN112257605A (en) * 2020-10-23 2021-01-22 中国科学院自动化研究所 Three-dimensional target detection method, system and device based on self-labeling training sample
CN112418421A (en) * 2020-11-06 2021-02-26 常州大学 Roadside end pedestrian trajectory prediction algorithm based on graph attention self-coding model
CN112418421B (en) * 2020-11-06 2024-01-23 常州大学 Road side end pedestrian track prediction algorithm based on graph attention self-coding model
CN112347987A (en) * 2020-11-30 2021-02-09 江南大学 Multimode data fusion three-dimensional target detection method
CN112464905A (en) * 2020-12-17 2021-03-09 湖南大学 3D target detection method and device
CN112464905B (en) * 2020-12-17 2022-07-26 湖南大学 3D target detection method and device
CN112668469A (en) * 2020-12-28 2021-04-16 西安电子科技大学 Multi-target detection and identification method based on deep learning
CN112884723A (en) * 2021-02-02 2021-06-01 贵州电网有限责任公司 Insulator string detection method in three-dimensional laser point cloud data
CN112884723B (en) * 2021-02-02 2022-08-12 贵州电网有限责任公司 Insulator string detection method in three-dimensional laser point cloud data
CN113095172A (en) * 2021-03-29 2021-07-09 天津大学 Point cloud three-dimensional object detection method based on deep learning
CN113269147A (en) * 2021-06-24 2021-08-17 浙江海康智联科技有限公司 Three-dimensional detection method and system based on space and shape, and storage and processing device
CN113807184A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Obstacle detection method and device, electronic equipment and automatic driving vehicle
CN114663879A (en) * 2022-02-09 2022-06-24 中国科学院自动化研究所 Target detection method and device, electronic equipment and storage medium
CN114663879B (en) * 2022-02-09 2023-02-21 中国科学院自动化研究所 Target detection method and device, electronic equipment and storage medium
CN115082902A (en) * 2022-07-22 2022-09-20 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN115082902B (en) * 2022-07-22 2022-11-11 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN115183782A (en) * 2022-09-13 2022-10-14 毫末智行科技有限公司 Multi-modal sensor fusion method and device based on joint space loss
CN115183782B (en) * 2022-09-13 2022-12-09 毫末智行科技有限公司 Multi-modal sensor fusion method and device based on joint space loss

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200313