CN117557993A - Construction method and application of double-frame interaction perception 3D association detection model - Google Patents

Construction method and application of double-frame interaction perception 3D association detection model

Info

Publication number
CN117557993A
CN117557993A (application number CN202410045646.5A; granted as CN117557993B)
Authority
CN
China
Prior art keywords
features
perception
feature
distribution
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410045646.5A
Other languages
Chinese (zh)
Other versions
CN117557993B (en)
Inventor
余红旗
王煜
产思贤
孙晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Pixel Technology Co ltd
Original Assignee
Hangzhou Pixel Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Pixel Technology Co ltd filed Critical Hangzhou Pixel Technology Co ltd
Priority to CN202410045646.5A priority Critical patent/CN117557993B/en
Publication of CN117557993A publication Critical patent/CN117557993A/en
Application granted granted Critical
Publication of CN117557993B publication Critical patent/CN117557993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 20/64: Scenes; scene-specific elements; type of objects; three-dimensional objects
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The method inputs the point cloud data and the image data of a training sample set into a feature extraction network to obtain multi-scale point cloud features and initial image features; inputs the multi-scale point cloud features and the initial image features into a region cross-modal interaction fusion module to obtain bimodal features; inputs the bimodal features into a dual-branch distribution perception module to obtain dual-distribution features; inputs the dual-distribution features into a dual-feature association mining module, which applies local attribute attention and global attribute learning to the attributes of the three-dimensional target to obtain attribute-association attention features; and classifies and regresses the attribute-association attention features to obtain the three-dimensional object. The method further preserves the spatial features of the point cloud data and the semantic features of the image data, remains robust under a variety of scene changes, and offers good model generalization capability and fast model inference.

Description

Construction method and application of double-frame interaction perception 3D association detection model
Technical Field
The application relates to the field of three-dimensional perception, in particular to a construction method and application of a double-frame interactive perception 3D association detection model.
Background
Multimodal three-dimensional perception refers to the use of various sensors and information sources, such as vision, sound and lidar, to obtain three-dimensional information about an environment and to use that information comprehensively to understand and perceive the surroundings. Multimodal three-dimensional perception exploits the complementarity between different sensors to provide richer, more comprehensive information than a single information source, helping to address the challenges of complex scenarios.
Existing multi-modal three-dimensional perception methods are mainly of two kinds: TransFusion and BEVFusion. The former uses two Transformer decoding layers as detection heads: it first generates initial bounding boxes from the lidar features and then adaptively fuses the query image features associated with the spatial context, limiting the spatial extent of the cross-attention around the initial bounding boxes so that the network better attends to relevant locations. The latter uses a unified BEV feature representation: it processes the lidar point cloud and the image separately and projects both into a unified BEV space for fusion.
However, the two existing methods have the following defects:
1) Due to the heterogeneity of the image and point cloud modalities, the two differ markedly in both the raw input space and the feature space. The semantic information of the image modality and the spatial information of the point cloud modality are not fully captured, and an interaction process that would further strengthen the dual-modal representation capability is lacking.
2) Moreover, the BEV is a top-down representation with a global view. Because point cloud data are sparse, spatially discrete, irregular and unordered, the distribution of points is clearly uneven; in particular, occlusion, which is common in autonomous driving scenes, removes some of the points of the occluded target, so the imbalance is seriously aggravated when the three-dimensional point cloud is compressed and projected into the two-dimensional BEV space. Existing methods do not enhance the point-range distribution perception of the BEV features so as to better capture target morphology in locally sparse areas.
3) Meanwhile, three-dimensional objects have more attribute representations than two-dimensional objects, and these attributes are interrelated. Existing methods do not fully exploit this potential multi-attribute correlation to refine the corresponding feature representations and attribute regression tasks.
Disclosure of Invention
The embodiments of the application provide a construction method and application of a double-frame interactive perception 3D association detection model. A region cross-modal interaction fusion module is designed to preserve the bimodal features of the point cloud data and the image data and to further strengthen their representation capability; a dual-branch distribution perception module is designed to perceive the density distribution of the points in the BEV features and, combined with dual-distribution perception representation learning, to obtain dual-distribution features; and a dual-feature association mining module is designed to capture the contextual association features among the attributes of the three-dimensional target, further assisting the regression task of the three-dimensional object.
In a first aspect, an embodiment of the present application provides a method for constructing a dual-framework interaction perception 3D association detection model, including:
acquiring point cloud data and image data of a matched marked three-dimensional object as a training sample set;
inputting the training sample set into a constructed double-frame interactive perception 3D association detection framework for training until an iteration condition is met, to obtain the double-frame interactive perception 3D association detection model, wherein the double-frame interactive perception 3D association detection framework comprises a feature extraction network, a region cross-modal interaction fusion module, a dual-branch distribution perception module and a dual-feature association mining module; the point cloud data and the image data in the training sample set are respectively input into the feature extraction network to obtain multi-scale point cloud features and initial image features; the multi-scale point cloud features and the initial image features are jointly input into the region cross-modal interaction fusion module to obtain bimodal features; the bimodal features are input into the inner distribution perception branch and the outer distribution perception branch of the dual-branch distribution perception module to respectively obtain inner-point perception distribution features and outer-point perception distribution features, which are fused with the bimodal features to obtain dual-distribution features; the dual-distribution features are input into the dual-feature association mining module to respectively conduct local attribute attention and global attribute learning, obtaining global attention features and local attention features; the global attention features and the local attention features are fused to obtain attribute-association attention features; and the attribute-association attention features are classified and regressed to obtain the three-dimensional object;
The multi-scale point cloud features are subjected to channel fusion to obtain voxel enhancement mode features, semantic information feature enhancement is carried out on the initial image features based on the voxel enhancement mode features to obtain image enhancement features, the voxel enhancement mode features are subjected to regional voxel self-attention aggregation to obtain a voxel set, and the voxel set and the image enhancement features are fused in a cross-mode attention interactive fusion mode to obtain bimodal features.
In a second aspect, an embodiment of the present application provides a dual-frame interaction sensing 3D association detection method, including:
inputting point cloud data and image data into the double-frame interactive perception 3D association detection model constructed by the above construction method of the double-frame interactive perception 3D association detection model, to output a three-dimensional object.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to run the computer program to execute the method for constructing the dual-frame interaction-aware 3D correlation detection model.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored therein a computer program comprising program code for controlling a process to execute a process comprising the above method of constructing the double-frame interactive perception 3D association detection model.
The main contributions and innovation points of the invention are as follows:
1) The invention proposes a region cross-modal interaction fusion scheme to enhance the bimodal representation capability and obtain bimodal features: self-attention learning and aggregation are performed on the non-empty voxels in a region, and the spatial features of the point modality and the semantic features of the image modality are interactively fused based on dual-query-set fusion. Specifically, to preserve the spatial features of the point cloud data and the image features, the multi-scale point cloud features extracted by VoxelNet and the initial image features extracted by ResNet-50 are each enhanced within their own modality: the multi-scale point cloud features undergo spatial channel fusion to combine shallow and deep information, and the initial image features are transformed according to the point-to-image-pixel correspondence to enhance the semantic information of the image modality. Secondly, to address the general sparsity of points in three-dimensional space, the scheme further performs region-aware voxel aggregation on the multi-scale point cloud features: for non-empty voxels, ball-query is used to gather the point features of a local region and generate initial voxel aggregation features; then, based on an attention mechanism and to further strengthen the spatial information of the point modality, the corresponding voxel spatial position features are initialized in the self-attention learning of the initial voxel aggregation features and take part in the self-attention enhancement, producing refined region-aware voxel aggregation features as the voxel set. The generated voxel set and the semantically enhanced image enhancement features are then interactively fused. During this bimodal interactive fusion, in order to further preserve the depth information of the point cloud modality, the position features of the three-dimensional voxels are converted into a depth-aware encoding and fused, via a cross-modal attention mechanism, into the dual query set (the camera query and the voxel query), generating a refined image-modality query set; the two modalities are then fused to produce the final bimodal features carrying the information of both modalities.
2) The dual-branch distribution perception module is designed to enhance the point-range distribution information of the bimodal features; combined with dual-distribution perception representation learning, it models the density of points under different range distributions so that the bimodal features carry range-point distribution perception information. Because points in three-dimensional space are sparse, and distant point clouds are sparsely distributed or partially missing due to occlusion, their projection at the BEV view is uneven. To let the network capture this point-distribution information and strengthen the point-range distribution perception of the BEV features, the dual-branch distribution perception module constructed in this scheme learns the point-distribution information of the BEV features through joint dual-distribution perception representations. First, two types of features are generated from the original bimodal features through two different convolution streams, and the important features in the bimodal features are enhanced in a spatial-channel-like enhancement manner; representation learning of dual-distribution perception is then constructed, one being the inner-point perception distribution representation obtained by locally contracted inner-point perception and the other being the outer-point perception distribution representation obtained by locally expanded outer-point perception; the dual-distribution perception representations are fused into the corresponding branches for feature learning, generating the corresponding inner-point perception distribution features and outer-point perception distribution features; finally, the inner-point and outer-point perception distribution features are fused with the original bimodal features to generate the dual-distribution features, i.e. BEV features with point-range distribution perception.
3) The method introduces a dual-feature association mining module to capture the contextual association information among the target attributes. By constructing global attention features and local attention features, it models the association learning between different attributes of the three-dimensional target and uses the captured attribute-association attention features to refine the regression task of the three-dimensional object. First, the dual-distribution features are modeled globally and locally based on an attention mechanism to generate the corresponding dual attention features: one is the global attention feature and the other is the local attention feature. Attribute-association feature mining is then performed on the global and local attention features to obtain attribute-association attention features that represent the association information between the different attributes, and these features are used to assist in further refining the regression task of the three-dimensional object.
In summary, the region cross-modal interaction fusion module designed in this scheme further preserves the spatial features of the point cloud data and the semantic features of the image data, the dual-branch distribution perception module perceives the point-range distribution of the BEV features, and the dual-feature association mining module performs attribute-association mining among the attribute features of the three-dimensional target. Together they improve the three-dimensional perception capability, achieve a stable and robust effect under various scene changes, and offer good model generalization capability and fast model inference.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is an overall block diagram of a dual-frame interactive awareness 3D association detection framework in accordance with an embodiment of the present application;
FIG. 2 is a block diagram of a region cross-modal interaction fusion module of a dual-frame interaction-aware 3D correlation detection frame according to one embodiment of the present application;
FIG. 3 is a block diagram of a dual-branch distribution awareness module of a dual-frame interaction awareness 3D association detection framework in accordance with one embodiment of the present application;
FIG. 4 is a block diagram of a dual feature association mining module of a dual frame interactive awareness 3D association detection frame, according to one embodiment of the present application;
fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
Example 1
The scheme provides a method for constructing a double-frame interactive perception 3D association detection model, which comprises the following steps:
acquiring point cloud data and image data of a matched marked three-dimensional object as a training sample set;
inputting the training sample set into a constructed double-frame interactive perception 3D association detection framework for training until an iteration condition is met, to obtain the double-frame interactive perception 3D association detection model, wherein the double-frame interactive perception 3D association detection framework comprises a feature extraction network, a region cross-modal interaction fusion module, a dual-branch distribution perception module and a dual-feature association mining module; the point cloud data and the image data in the training sample set are respectively input into the feature extraction network to obtain multi-scale point cloud features and initial image features; the multi-scale point cloud features and the initial image features are jointly input into the region cross-modal interaction fusion module to obtain bimodal features; the bimodal features are input into the inner distribution perception branch and the outer distribution perception branch of the dual-branch distribution perception module to respectively obtain inner-point perception distribution features and outer-point perception distribution features, which are fused with the bimodal features to obtain dual-distribution features; the dual-distribution features are input into the dual-feature association mining module to respectively conduct local attribute attention and global attribute learning, obtaining global attention features and local attention features; the global attention features and the local attention features are fused to obtain attribute-association attention features; and the attribute-association attention features are classified and regressed to obtain the three-dimensional object;
The multi-scale point cloud features are subjected to channel fusion to obtain voxel enhancement mode features, semantic information feature enhancement is carried out on the initial image features based on the voxel enhancement mode features to obtain image enhancement features, the voxel enhancement mode features are subjected to regional voxel self-attention aggregation to obtain a voxel set, and the voxel set and the image enhancement features are fused in a cross-mode attention interactive fusion mode to obtain bimodal features.
Fig. 1 is a schematic diagram of the double-frame interactive perception 3D association detection framework provided by this scheme. As shown, the multi-scale point cloud features extracted by the feature extraction network are aggregated by regional voxels to obtain the voxel set of point-cloud aggregated voxels; the initial image features are enhanced by semantic information features to obtain the image enhancement features; the image enhancement features and the voxel set undergo a bimodal attention operation to obtain the bimodal features; the inner-point perception distribution features and the outer-point perception distribution features are obtained by the dual-branch distribution perception module; and the dual-distribution features obtained by fusing the inner-point and outer-point perception distribution features with the bimodal features are input into the dual-feature association mining module for mining, yielding the classification and regression results. In this double-frame interactive perception 3D association detection framework, the bimodal features are preserved by the region cross-modal interaction fusion module, the density distribution of the points within the bimodal features is perceived by the dual-branch distribution perception module and combined with dual-distribution perception representation learning to obtain the dual-distribution features, and the contextual association features among the three-dimensional target attributes in the dual-distribution features are captured by the dual-feature association mining module, improving the accuracy of the regression task of the three-dimensional object.
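For orientation, the following Python/PyTorch sketch mirrors the data flow of FIG. 1 end to end. All class names (e.g. DualFrameDetector), tensor shapes and layer choices are illustrative assumptions made for this sketch and are not the patented implementation; the real backbones are VoxelNet and ResNet-50 and the fusion modules are far richer than the placeholders used here.

import torch
import torch.nn as nn

class DualFrameDetector(nn.Module):
    def __init__(self, c_pts=64, c_img=64, c_bev=128, n_classes=10):
        super().__init__()
        # Stand-ins for the VoxelNet / ResNet-50 backbones.
        self.point_backbone = nn.Conv2d(c_pts, c_bev, 3, padding=1)
        self.image_backbone = nn.Conv2d(3, c_img, 3, padding=1)
        # Region cross-modal interaction fusion (placeholder: concat + 1x1 conv).
        self.cross_modal_fuse = nn.Conv2d(c_bev + c_img, c_bev, 1)
        # Dual-branch (inner / outer) distribution perception (placeholders).
        self.inner_branch = nn.Conv2d(c_bev, c_bev, 3, padding=1)
        self.outer_branch = nn.Conv2d(c_bev, c_bev, 3, padding=1)
        # Detection heads fed by the dual-feature association stage.
        self.cls_head = nn.Conv2d(c_bev, n_classes, 1)
        self.reg_head = nn.Conv2d(c_bev, 7, 1)   # (x, y, z, w, l, h, yaw)

    def forward(self, bev_points, image):
        f_pts = self.point_backbone(bev_points)            # multi-scale point features (collapsed)
        f_img = self.image_backbone(image)                  # initial image features
        f_img = nn.functional.interpolate(f_img, size=f_pts.shape[-2:])
        bimodal = self.cross_modal_fuse(torch.cat([f_pts, f_img], dim=1))
        dual = bimodal + self.inner_branch(bimodal) + self.outer_branch(bimodal)
        return self.cls_head(dual), self.reg_head(dual)

if __name__ == "__main__":
    net = DualFrameDetector()
    cls, reg = net(torch.randn(1, 64, 128, 128), torch.randn(1, 3, 256, 256))
    print(cls.shape, reg.shape)   # torch.Size([1, 10, 128, 128]) torch.Size([1, 7, 128, 128])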
Regarding the selection of the training sample set: in the step of acquiring the point cloud data and the image data of the matched, marked three-dimensional object as the training sample set, the point cloud data in the training sample set are preprocessed, the preprocessing including any of random three-dimensional rotation, random angle rotation, size-scale scaling and the balanced-sample data augmentation GBGS; the image data in the training set undergo image preprocessing, including random picture rotation and picture size-scale scaling.
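A minimal sketch of such preprocessing for one sample is shown below. The rotation here is about the z axis only, the parameter ranges are assumptions, and the GBGS balanced-sample augmentation mentioned above is not reproduced.

import numpy as np

def augment_points(points, max_rot_deg=22.5, scale_range=(0.95, 1.05)):
    """points: (N, 4) array of (x, y, z, intensity)."""
    theta = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg))
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    out = points.copy()
    out[:, :3] = out[:, :3] @ rot.T                    # random rotation about z
    out[:, :3] *= np.random.uniform(*scale_range)      # random global scaling
    return out

def augment_image(image, scale_range=(0.9, 1.1)):
    """image: (H, W, 3) array; nearest-neighbour rescaling via index sampling."""
    s = np.random.uniform(*scale_range)
    h, w = image.shape[:2]
    ys = np.clip((np.arange(int(h * s)) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(int(w * s)) / s).astype(int), 0, w - 1)
    return image[ys][:, xs]

pts = augment_points(np.random.rand(1000, 4).astype(np.float32))
img = augment_image(np.random.rand(224, 224, 3).astype(np.float32))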
Regarding the feature extraction network:
the feature extraction network of the double-frame interactive perception 3D association detection frame comprises a point cloud feature extraction branch for extracting point cloud data and an image feature extraction branch for extracting image data, wherein the point cloud data is input into the point cloud feature extraction branch to be extracted to obtain multi-scale point cloud features, and the image data is input into the image feature extraction branch to be extracted to obtain initial image features. In other words, the feature extraction network of the present solution extracts point cloud data and image data respectively using different branches.
In some embodiments, the point cloud feature extraction branch is VoxelNet and the image feature extraction branch is ResNet-50.
It should be noted that VoxelNet performs the feature encoding of the point cloud data: the point cloud data are input into VoxelNet for feature extraction, which outputs four point-modality feature maps at different scales, and the four feature maps are fused to obtain the multi-scale point cloud features. ResNet-50 performs the feature encoding of the image data: the image data are input into ResNet-50 for feature extraction, which outputs the initial image features.
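The patent does not specify how the four different-scale feature maps are fused; the sketch below assumes upsampling to a common resolution followed by channel concatenation and a 1x1 convolution, with illustrative channel counts.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, feats):
        target = feats[0].shape[-2:]                      # finest-resolution grid
        up = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
              for f in feats]
        return self.fuse(torch.cat(up, dim=1))            # multi-scale point cloud feature

feats = [torch.randn(1, c, 128 // 2 ** i, 128 // 2 ** i)
         for i, c in enumerate((64, 128, 256, 512))]
print(MultiScaleFusion()(feats).shape)                    # torch.Size([1, 256, 128, 128])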
Region cross-modal interaction fusion module:
the regional cross-modal interaction fusion module of the double-frame interaction perception 3D association detection frame provided by the scheme firstly carries out modal enhancement on the multi-scale point cloud features to obtain voxel enhancement mode features, carries out modal enhancement on the initial image features to obtain image enhancement features, then carries out regional perception voxel aggregation on the voxel enhancement features to obtain regional perception aggregated voxel features, defines the regional perception aggregated voxel features as a voxel set, and finally fuses the voxel set and the image enhancement features by using an interaction fusion mode of cross-modal attention so as to obtain bimodal features.
Specifically, as shown in fig. 2, which is a network structure diagram of the region cross-modal interaction fusion module: in the step of channel-fusing the multi-scale point cloud features to obtain the voxel enhancement mode features and performing semantic information feature enhancement on the initial image features based on the voxel enhancement mode features to obtain the image enhancement features, the multi-scale point cloud features input into the region cross-modal interaction fusion module are channel-fused to combine shallow and deep information and obtain the voxel enhancement mode features; the points in three-dimensional space within the voxel enhancement mode features are projected into the pixel grid of the corresponding initial image features according to the correspondence between three-dimensional points and image pixels to generate an attention feature distribution map; and the initial image features are fused with the attention feature distribution map and normalized with a Softmax function to obtain the image enhancement features, thereby enhancing the semantic information of the image.
The image enhancement features are generated as follows: according to the correspondence between the three-dimensional points and the image pixels, the voxel enhancement mode features are projected into the pixel grid of the corresponding two-dimensional initial image features to generate the attention feature distribution map, the initial image features and the attention feature distribution map are fused, and the image enhancement features are obtained by normalization with a Softmax function. In this calculation, Conv(·) denotes a convolution operation, F_img denotes the image enhancement features, P denotes the projection matrix corresponding to the three-dimensional point-image pixels, F_vox denotes the voxel enhancement mode features, and pos denotes the corresponding spatial position information.
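A hedged sketch of this semantic-enhancement step follows: projected voxel responses build an attention map over the image grid, which is Softmax-normalized and used to re-weight the initial image features. Reducing the projection to a precomputed pixel-index lookup and using the channel mean as the per-voxel response are simplifications for this sketch.

import torch

def enhance_image(img_feat, voxel_feat, pixel_uv):
    """img_feat: (C, H, W); voxel_feat: (N, C); pixel_uv: (N, 2) integer pixel coords."""
    C, H, W = img_feat.shape
    attn = torch.zeros(H, W)
    score = voxel_feat.mean(dim=1)                       # crude per-voxel response (assumption)
    attn[pixel_uv[:, 1], pixel_uv[:, 0]] = score         # scatter projected 3D points into the pixel grid
    attn = torch.softmax(attn.flatten(), dim=0).view(1, H, W)
    return img_feat * (1.0 + attn)                       # semantically enhanced image feature

img_feat = torch.randn(64, 32, 32)
voxel_feat = torch.randn(100, 64)
pixel_uv = torch.randint(0, 32, (100, 2))
print(enhance_image(img_feat, voxel_feat, pixel_uv).shape)   # torch.Size([64, 32, 32])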
Further, in the step of obtaining the voxel set through regional voxel self-attention aggregation of the voxel enhancement mode features: for the non-empty voxels in the voxel enhancement mode features, the features of the points in the region and the fused position information of the points are aggregated with the ball-query method, and the aggregated non-empty voxel features are self-attention-enhanced with region-aware self-attention to obtain the voxel set. It should be noted that aggregating the non-empty voxels with the ball-query method strengthens the aggregated non-empty voxel features, and the finally generated voxel set is the region-aware aggregated voxel features.
Specifically, for the non-empty voxels in the voxel enhancement mode features, the features of the corresponding points in the local region are aggregated with a ball-query operation; meanwhile, to preserve the position information of the three-dimensional points, the position encodings of the corresponding points are fused in during aggregation, giving the aggregated non-empty voxel features. In the local-region aggregation formula, V_agg denotes the aggregated non-empty voxel features of the region, G(·) denotes the ball-query grouping operation, K denotes the maximum number of points aggregated per region, p denotes the spatial position information of the corresponding aggregated points, f_p denotes the features of the points aggregated in the region, c denotes the centroid of the voxel aggregated for the corresponding region, and N denotes the number of aggregated voxels.
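A small sketch of ball-query aggregation around non-empty voxel centroids is given below. The radius, the per-region point cap, the concatenation of relative positions, and the max-pool reduction are assumptions; the patent states that point position encodings are fused in during aggregation but does not fix these details.

import torch

def ball_query_aggregate(voxel_centers, points, point_feats, radius=1.0, max_pts=16):
    """voxel_centers: (V, 3); points: (P, 3); point_feats: (P, C) -> (V, C + 3)."""
    agg = []
    for c in voxel_centers:
        d = torch.norm(points - c, dim=1)
        idx = torch.nonzero(d < radius).flatten()[:max_pts]   # ball-query: points within radius
        if idx.numel() == 0:                                   # empty ball: zero feature
            agg.append(torch.zeros(point_feats.shape[1] + 3))
            continue
        local = torch.cat([point_feats[idx], points[idx] - c], dim=1)
        agg.append(local.max(dim=0).values)                    # max-pool over the local region
    return torch.stack(agg)

centers = torch.randn(8, 3)
pts = torch.randn(500, 3)
feats = torch.randn(500, 32)
print(ball_query_aggregate(centers, pts, feats).shape)         # torch.Size([8, 35])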
The non-empty voxel features aggregated by region perception are further enhanced through region-aware self-attention learning. To strengthen the point position information of the region, the spatial information of the non-empty voxels is first initialized in the self-attention enhancement process and then takes part in the self-attention updates and iterations. In the self-attention feature enhancement formula, SA(·) denotes the region-aware self-attention, V_k denotes the aggregated non-empty voxel features at the k-th iteration, A_k denotes the self-attention-enhanced features at the k-th iteration, V_{k+1} denotes the non-empty voxel features at the (k+1)-th iteration, Conv denotes a convolution operation, and FFN denotes a feed-forward neural network.
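The following sketch shows one way to realize this iterative refinement: positions are embedded and added once, then a few rounds of self-attention plus feed-forward updates are applied to the voxel set. The number of iterations, the head count and the layer sizes are assumptions.

import torch
import torch.nn as nn

class RegionSelfAttention(nn.Module):
    def __init__(self, dim=64, heads=4, iters=2):
        super().__init__()
        self.pos_embed = nn.Linear(3, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.iters = iters

    def forward(self, voxel_feat, voxel_pos):
        x = voxel_feat + self.pos_embed(voxel_pos)        # initialise with spatial position
        for _ in range(self.iters):
            a, _ = self.attn(x, x, x)                     # self-attention over the voxel set
            x = x + self.ffn(a)                           # feed-forward update (residual)
        return x

x = RegionSelfAttention()(torch.randn(1, 8, 64), torch.randn(1, 8, 3))
print(x.shape)                                            # torch.Size([1, 8, 64])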
Further, in the step of fusing the voxel set and the image enhancement features by cross-modal attention interactive fusion to obtain the bimodal features: the voxel set and the image enhancement features serve as a dual query set; the position information of the voxel set is converted into a depth-aware encoding with the camera's intrinsic three-dimensional space transformation matrix; the attention weights and predicted offsets of the cross-modal attention between the voxel set and the image enhancement features are computed with Softmax and Linear functions, with the dual query set and the depth encoding taking part in the computation; and the query set of the image enhancement features is updated by the cross-modal attention, fused with the attention weights, to obtain the bimodal features.
Specifically, the voxel set and the image enhancement features enter the bimodal attention as a dual query set, and the query set of the image enhancement features is updated interactively. A depth-aware encoding takes part in the interaction of the dual query set so as to preserve the depth features of the point modality: the position information of the aggregated voxel set is converted into a depth-aware encoding with the camera's intrinsic three-dimensional space transformation matrix. In the depth-aware encoding formula, T denotes the intrinsic three-dimensional space transformation matrix from the voxels of the voxel set to the image enhancement features, p denotes the position information of the voxels in the corresponding region-aggregated voxel set, d denotes the feature dimension, and i denotes the i-th element of the depth-aware encoded feature vector.
The attention weights and predicted offsets corresponding to the voxel set and the image enhancement features are computed with Softmax and Linear functions, and the dual query set and the depth encoding take part in this computation, so that the corresponding dual query set and depth information are updated promptly in every update step. In the update formula of the query set of the image enhancement features, Q denotes the input query set, r denotes a reference point, F denotes the corresponding image enhancement features, M denotes the number of attention heads, W and W' denote learnable weights, K denotes the number of reference points, Δr denotes the predicted offsets, and A denotes the attention weights.
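A simplified, single-head, single-level sketch of this cross-modal update is given below: for each voxel query, offsets and Softmax weights are predicted with Linear layers and used to sample the image-enhancement feature map around a 2D reference point. Reducing the depth-aware positional encoding to a learned projection of (x, y, z), the 0.1 offset scale and the grid_sample-based sampling are all assumptions of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, dim=64, n_points=4):
        super().__init__()
        self.depth_embed = nn.Linear(3, dim)          # stand-in for the depth-aware encoding
        self.offset = nn.Linear(dim, n_points * 2)    # predicted sampling offsets
        self.weight = nn.Linear(dim, n_points)        # attention weights (Softmax-normalised)
        self.n_points = n_points

    def forward(self, query, voxel_xyz, ref_uv, img_feat):
        """query: (Q, C); voxel_xyz: (Q, 3); ref_uv: (Q, 2) in [-1, 1]; img_feat: (1, C, H, W)."""
        q = query + self.depth_embed(voxel_xyz)
        off = self.offset(q).view(-1, self.n_points, 2) * 0.1
        w = torch.softmax(self.weight(q), dim=-1)                      # (Q, P)
        loc = (ref_uv.unsqueeze(1) + off).clamp(-1, 1)                 # (Q, P, 2)
        sampled = F.grid_sample(img_feat, loc.unsqueeze(0),
                                align_corners=False)                   # (1, C, Q, P)
        sampled = sampled.squeeze(0).permute(1, 2, 0)                  # (Q, P, C)
        return query + (w.unsqueeze(-1) * sampled).sum(dim=1)          # fused bimodal query

m = CrossModalAttention()
out = m(torch.randn(10, 64), torch.randn(10, 3),
        torch.rand(10, 2) * 2 - 1, torch.randn(1, 64, 32, 32))
print(out.shape)                                                       # torch.Size([10, 64])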
In this method, the cross-modal attention is fused via the attention weights. Because the cross-modal attention fuses only the nearby local reference points, the image enhancement features of each camera consider only the matching between corresponding points and the associated pixel information when performing feature matching, yielding the bimodal features.
The network structure of the dual-branch distribution perception module is shown in fig. 3. The module comprises a parallel inner distribution perception branch and outer distribution perception branch of identical structure. The two branches adopt joint learning of dual-distribution perception representations to model the point distribution information at different range distances: the locally contracted inner-point perception representation is fused into the inner distribution perception branch, the locally expanded outer-point perception representation is fused into the outer distribution perception branch, and the features of the two forms are further enhanced in a spatial-channel-like enhancement manner to improve the bimodal features at the BEV view. The generated inner-point perception distribution features and outer-point perception distribution features are then fused with the bimodal features to obtain the dual-distribution features, which carry point-range distribution perception.
Specifically, in the step of inputting the bimodal features into the inner distribution perception branch and the outer distribution perception branch of the dual-branch distribution perception module to respectively obtain the inner-point perception distribution features and the outer-point perception distribution features:
BEV projection of the three-dimensional space points is performed on the bimodal features to obtain the initial BEV features, and inner distribution perception learning and outer distribution perception learning are performed on the initial BEV features respectively. The inner distribution perception learning adopts the locally contracted inner-point perception distribution representation: the initial BEV features are input into the inner distribution perception branch and convolved to obtain inner-point perception convolution features; the inner-point perception convolution features undergo a channel attention operation followed by max pooling and average pooling to obtain pooled features; the pooled features and the inner-point perception distribution representation are jointly learned and then convolved to obtain the inner distribution features; the inner distribution features are fused with the inner-point perception convolution features to obtain the inner distribution enhancement features; and the inner distribution enhancement features, after an activation function, are fused with the initial BEV features to obtain the inner-point perception distribution features.
The outer distribution perception learning adopts the locally expanded outer-point perception distribution representation: the initial BEV features are input into the outer distribution perception branch and convolved to obtain outer-point perception convolution features; the outer-point perception convolution features undergo a channel attention operation followed by max pooling and average pooling to obtain pooled features; the pooled features and the outer-point perception distribution representation are jointly learned and then convolved to obtain the outer distribution features; the outer distribution features are fused with the outer-point perception convolution features to obtain the outer distribution enhancement features; and the outer distribution enhancement features, after an activation function, are fused with the initial BEV features to obtain the outer-point perception distribution features.
The dual-branch distribution perception module of this scheme uses two different convolution streams to extract features from the input bimodal features, generating features of two perception forms: the inner-point perception distribution features and the outer-point perception distribution features.
In the formulas for jointly learning the inner-point perception distribution representation and the outer-point perception distribution representation fused into the corresponding branches, D_in denotes the inner-point perception distribution representation, D_out denotes the outer-point perception distribution representation, r_in and r_out are the ranges of the inner-point and outer-point perception distribution representations, T denotes the precondition under which the formula holds, (x_in, y_in) denote the coordinates of points within the range of the inner-point perception distribution representation, and (x_out, y_out) denote the coordinates of points within the range of the outer-point perception distribution representation.
In the formula for obtaining the inner-point and outer-point perception distribution features by fusing the inner and outer distribution enhancement features with the initial BEV features after the activation function, AvgPool denotes the average pooling operation, CA denotes the channel attention, MaxPool denotes the max pooling operation, D denotes the corresponding inner-point or outer-point perception distribution representation, F_pool denotes the pooled features, E_in denotes the inner distribution enhancement features, F_in denotes the inner-point perception distribution features, X_in denotes the features input to the inner distribution perception branch, X_out denotes the features input to the outer distribution perception branch, F_out denotes the outer-point perception distribution features, and E_out denotes the outer distribution enhancement features.
In the formula for fusing the inner-point perception distribution features and the outer-point perception distribution features with the initial BEV features to obtain the dual-distribution features, F_dual denotes the dual-distribution features.
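The sketch below shows one distribution-perception branch and the dual-branch fusion; the inner and outer branches share this structure and differ only in the local range of their perception masks, which are not modeled explicitly here. Channel attention, max/avg pooling and the Sigmoid-gated fusion with the initial BEV feature follow the description above, while the kernel sizes, channel widths and the additive fusion are assumptions.

import torch
import torch.nn as nn

class DistributionBranch(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.conv_in = nn.Conv2d(ch, ch, 3, padding=1)
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.conv_out = nn.Conv2d(2, 1, 7, padding=3)
        self.act = nn.Sigmoid()

    def forward(self, bev):
        x = self.conv_in(bev)                                 # perception convolution feature
        x = x * self.channel_attn(x)                          # channel attention
        pooled = torch.cat([x.max(dim=1, keepdim=True).values,
                            x.mean(dim=1, keepdim=True)], dim=1)
        mask = self.act(self.conv_out(pooled))                # spatial distribution map
        return bev * (1.0 + mask)                             # distribution-enhanced feature

class DualBranchDistribution(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.inner = DistributionBranch(ch)                   # local-contraction (inner) branch
        self.outer = DistributionBranch(ch)                   # local-expansion (outer) branch

    def forward(self, bev):
        return bev + self.inner(bev) + self.outer(bev)        # dual distribution feature

print(DualBranchDistribution()(torch.randn(1, 128, 64, 64)).shape)   # torch.Size([1, 128, 64, 64])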
The network structure of the dual-feature association mining module is shown in fig. 4. The module comprises a local attribute attention branch and a global attribute learning branch. The local attribute attention branch models the attributes of the three-dimensional object in the input dual-distribution features locally to obtain the global attention features, and the global attribute learning branch models them globally to obtain the local attention features. The attribute-association attention features are mined from the local attention features and the global attention features, normalized with a Softmax function to obtain the association attention feature map, and the association attention feature map is classified and regressed to obtain the three-dimensional object; meanwhile, the mined attribute-association attention features and the features of the real labels are jointly optimized with a SmoothL1 loss function.
Specifically, the dual-feature association mining module comprises a local attribute attention branch and a global attribute learning branch. The global attribute learning branch adopts a multi-head attention mechanism composed of a heat-value (heatmap) detection head, a regression detection head, a height detection head, a dimension detection head, a velocity detection head and a rotation (yaw) detection head; the dual-distribution features are input into the global attribute learning branch for overall self-attention learning to obtain the local attention features. The symbols in the corresponding formula denote, respectively, the heatmap detection head, the height detection head, the dimension detection head, the regression detection head, the rotation detection head and the velocity detection head.
The local attribute attention branch splits the dual-distribution features according to the different attribute features and performs self-attention learning to obtain the global attention features; the symbols in the corresponding formula again denote the heatmap detection head, the height detection head, the dimension detection head, the regression detection head, the rotation detection head and the velocity detection head.
In the calculation formula for mining the attribute-association attention features from the local attention features and the global attention features, A_attr denotes the attribute-association attention features, A_g denotes the global attention features, and A_l denotes the local attention features.
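A hedged sketch of this attribute-association mining is shown below: the dual-distribution feature feeds six per-attribute heads, pooled attribute tokens are formed from each head's output, and two attention passes are fused into the attribute-association attention. The patent distinguishes a global branch that models the whole feature and a local branch that splits it per attribute; for brevity both passes here act on the same pooled attribute tokens, and the head names, channel counts and token construction are assumptions.

import torch
import torch.nn as nn

class DualFeatureAssociation(nn.Module):
    def __init__(self, ch=128, dim=64, n_classes=10):
        super().__init__()
        out_ch = {"heatmap": n_classes, "reg": 2, "height": 1,
                  "size": 3, "velocity": 2, "rotation": 2}
        self.heads = nn.ModuleDict({k: nn.Conv2d(ch, c, 1) for k, c in out_ch.items()})
        self.token_proj = nn.ModuleDict({k: nn.Linear(c, dim) for k, c in out_ch.items()})
        self.global_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, dual_feat):
        outs, tokens = {}, []
        for k, head in self.heads.items():
            outs[k] = head(dual_feat)                              # per-attribute prediction map
            tokens.append(self.token_proj[k](outs[k].mean(dim=(2, 3))))
        tok = torch.stack(tokens, dim=1)                           # (B, 6, dim) attribute tokens
        glob, _ = self.global_attn(tok, tok, tok)                  # attention over all attributes jointly
        loc, _ = self.local_attn(tok, tok, tok)                    # attention over per-attribute tokens
        assoc = torch.softmax(glob + loc, dim=-1)                  # attribute-association attention
        return outs, assoc

outs, assoc = DualFeatureAssociation()(torch.randn(2, 128, 64, 64))
print(assoc.shape)                                                 # torch.Size([2, 6, 64])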
The attribute-association attention features are normalized with a Softmax function to obtain the association attention feature map.
the extracted associated attention feature map is used for further enhancing the features of different attributes of the original three-dimensional object, and the associated attention feature map is classified and regressed to obtain the three-dimensional object:
wherein the method comprises the steps ofPredictive feature representing a target property of a corresponding three-dimensional object, < >>Real feature representing target properties of the corresponding three-dimensional object,/->Attention weights representing attribute-associated attention features;
Meanwhile, the SmoothL1 loss function is used to jointly train the attribute-association attention features against the features of the real labels, further optimizing the dual-feature association mining process. In the association loss calculation, A_attr denotes the attribute-association attention features, A_g the global attention features, A_l the local attention features, W_attr the attention weights of the attribute-association attention features, y_pred the predicted feature of a target attribute of the corresponding three-dimensional object, y_gt the real feature of that target attribute, SmoothL1 the loss-function training process, A_pred the predicted attribute-association attention features, and A_gt the attribute-association attention features of the real labels.
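As a minimal illustration of how such an association loss can be wired into training, the sketch below applies a Smooth-L1 penalty between predicted and label-derived attribute-association attention features and adds it to the ordinary detection loss; the weighting factor and tensor shapes are assumptions.

import torch
import torch.nn as nn

smooth_l1 = nn.SmoothL1Loss()

def association_loss(pred_assoc, gt_assoc, det_loss, weight=0.5):
    """pred_assoc / gt_assoc: (B, A, D) attribute-association attention features."""
    return det_loss + weight * smooth_l1(pred_assoc, gt_assoc)

loss = association_loss(torch.randn(2, 6, 64), torch.randn(2, 6, 64),
                        det_loss=torch.tensor(1.0))
print(loss)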
Example 2
Based on the same conception, the application also provides a double-frame interactive perception 3D association detection model, which comprises the following steps:
the point cloud data and the image data marked with the marked three-dimensional object are used as training sample sets and are obtained through training according to the construction method of the double-frame interactive perception 3D association detection model shown in the first embodiment.
In an embodiment of the present disclosure, the point cloud data and the image data are data of a vehicle-mounted camera, and the three-dimensional object is a three-dimensional object on a road.
Example 3
Based on the same conception, the application also provides a double-frame interactive perception 3D association detection method, which comprises the following steps:
inputting point cloud data and image data into the double-frame interactive perception 3D association detection model of the second embodiment to output the three-dimensional object.
Example 4
The present embodiment also provides an electronic device, referring to fig. 5, comprising a memory 404 and a processor 402. The memory 404 stores a computer program, and the processor 402 is arranged to run the computer program to perform the steps of the construction method of the double-frame interactive perception 3D association detection model or of the double-frame interactive perception 3D association detection method of any of the embodiments described above.
In particular, the processor 402 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
The memory 404 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, the memory 404 may comprise a hard disk drive (HDD), a floppy disk drive, a solid-state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 404 may include removable or non-removable (or fixed) media, where appropriate. The memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 404 is non-volatile memory. In particular embodiments, the memory 404 includes read-only memory (ROM) and random-access memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these. The RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where appropriate, and the DRAM may be fast page mode DRAM (FPMDRAM), extended data output DRAM (EDODRAM), synchronous DRAM (SDRAM), or the like.
Memory 404 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions for execution by processor 402.
The processor 402 reads and executes the computer program instructions stored in the memory 404 to implement the construction method of the double-frame interactive perception 3D association detection model or the double-frame interactive perception 3D association detection method according to any of the above embodiments.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402 and the input/output device 408 is connected to the processor 402.
The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission device 406 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
The input-output device 408 is used to input or output information. In the present embodiment, the input information may be point cloud data, image data, or the like, and the output information may be a three-dimensional object or the like.
Alternatively, in the present embodiment, the above-mentioned processor 402 may be configured to execute the following steps by a computer program:
acquiring point cloud data and image data of a matched marked three-dimensional object as a training sample set;
inputting the training sample set into a constructed double-frame interactive perception 3D association detection framework for training until an iteration condition is met, to obtain the double-frame interactive perception 3D association detection model, wherein the double-frame interactive perception 3D association detection framework comprises a feature extraction network, a region cross-modal interaction fusion module, a dual-branch distribution perception module and a dual-feature association mining module; the point cloud data and the image data in the training sample set are respectively input into the feature extraction network to obtain multi-scale point cloud features and initial image features; the multi-scale point cloud features and the initial image features are jointly input into the region cross-modal interaction fusion module to obtain bimodal features; the bimodal features are input into the inner distribution perception branch and the outer distribution perception branch of the dual-branch distribution perception module to respectively obtain inner-point perception distribution features and outer-point perception distribution features, which are fused with the bimodal features to obtain dual-distribution features; the dual-distribution features are input into the dual-feature association mining module to respectively conduct local attribute attention and global attribute learning, obtaining global attention features and local attention features; the global attention features and the local attention features are fused to obtain attribute-association attention features; and the attribute-association attention features are classified and regressed to obtain the three-dimensional object;
The multi-scale point cloud features are channel-fused to obtain voxel-enhanced modal features; semantic information enhancement is performed on the initial image features based on the voxel-enhanced modal features to obtain image enhancement features; regional voxel self-attention aggregation is performed on the voxel-enhanced modal features to obtain a voxel set; and the voxel set and the image enhancement features are fused in a cross-modal attention interactive fusion manner to obtain the bimodal features.
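For readability, a minimal structural sketch of this forward pass is given below. It is illustrative only: tensor shapes, layer widths, and all module internals (simple linear, convolutional, and multi-head attention stand-ins) are assumptions of the sketch, not the implementation described in this application.

```python
# Minimal structural sketch of the forward pass described above.
# All module internals are simplified placeholders with assumed shapes.
import torch
import torch.nn as nn

class DualFrameDetector(nn.Module):
    def __init__(self, c: int = 64, num_classes: int = 3):
        super().__init__()
        self.point_branch = nn.Linear(4, c)                  # stand-in for the multi-scale point cloud backbone
        self.image_branch = nn.Conv2d(3, c, 3, padding=1)    # stand-in for the image backbone
        self.region_fusion = nn.MultiheadAttention(c, 4, batch_first=True)
        self.inner_branch = nn.Linear(c, c)
        self.outer_branch = nn.Linear(c, c)
        self.local_attn = nn.Linear(2 * c, c)
        self.global_attn = nn.Linear(2 * c, c)
        self.cls_head = nn.Linear(c, num_classes)
        self.reg_head = nn.Linear(c, 7)                      # (x, y, z, w, l, h, yaw)

    def forward(self, points, image):
        pc_feat = self.point_branch(points)                              # (B, N, C) point cloud features
        img_feat = self.image_branch(image).flatten(2).transpose(1, 2)   # (B, HW, C) image features
        bimodal, _ = self.region_fusion(pc_feat, img_feat, img_feat)     # region cross-modal fusion
        inner = torch.sigmoid(self.inner_branch(bimodal)) * bimodal      # inner-point perception features
        outer = torch.sigmoid(self.outer_branch(bimodal)) * bimodal      # outer-point perception features
        dual = torch.cat([inner, outer], dim=-1)                         # dual-distribution features
        assoc = self.local_attn(dual) + self.global_attn(dual)           # fused association attention
        return self.cls_head(assoc), self.reg_head(assoc)                # classification and regression

if __name__ == "__main__":
    model = DualFrameDetector()
    cls, box = model(torch.randn(2, 128, 4), torch.randn(2, 3, 32, 32))
    print(cls.shape, box.shape)  # torch.Size([2, 128, 3]) torch.Size([2, 128, 7])
```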
It should be noted that specific examples in this embodiment may refer to the examples described in the foregoing embodiments and optional implementations, and are not repeated here.
In general, the various embodiments may be implemented in hardware or special-purpose circuits, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor, or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or some other pictorial representation, it is to be understood that the blocks, apparatus, systems, techniques, or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware, controllers or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, for example in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products), including software routines, applets, and/or macros, may be stored in any apparatus-readable data storage medium and include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. In this regard, it should also be noted that any block of the logic flows as illustrated may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on a physical medium such as a memory chip or memory block implemented within a processor, a magnetic medium such as a hard disk or floppy disk, or an optical medium such as a DVD and its data variants, a CD, and the like. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as such combinations are not contradictory, they should be regarded as falling within the scope of this description.
The foregoing examples represent only several embodiments of the present application; their description is specific and detailed, but should not therefore be construed as limiting the scope of the present application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the spirit of the present application, and such modifications and improvements fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method for constructing a double-frame interactive perception 3D association detection model, characterized by comprising the following steps:
acquiring paired point cloud data and image data annotated with three-dimensional objects as a training sample set;
inputting the training sample set into a constructed double-frame interactive perception 3D (three-dimensional) association detection framework for training until an iteration condition is met, so as to obtain the double-frame interactive perception 3D association detection model, wherein the double-frame interactive perception 3D association detection framework comprises a feature extraction network, a region cross-modal interaction fusion module, a dual-branch distribution perception module and a dual-feature association mining module; the point cloud data and the image data in the training sample set are respectively input into the feature extraction network to obtain multi-scale point cloud features and initial image features; the multi-scale point cloud features and the initial image features are jointly input into the region cross-modal interaction fusion module to obtain bimodal features; the bimodal features are input into an inner distribution perception branch and an outer distribution perception branch of the dual-branch distribution perception module to obtain inner point perception distribution features and outer point perception distribution features respectively; the inner point perception distribution features and the outer point perception distribution features are fused with the bimodal features to obtain dual-distribution features; the dual-distribution features are input into the dual-feature association mining module, in which local attribute attention and global attribute learning are performed to obtain global attention features and local attention features respectively; the global attention features and the local attention features are fused to obtain associated attention features; and the associated attention features are classified and regressed to obtain the three-dimensional object;
wherein the multi-scale point cloud features are channel-fused to obtain voxel-enhanced modal features; semantic information enhancement is performed on the initial image features based on the voxel-enhanced modal features to obtain image enhancement features; regional voxel self-attention aggregation is performed on the voxel-enhanced modal features to obtain a voxel set; and the voxel set and the image enhancement features are fused in a cross-modal attention interactive fusion manner to obtain the bimodal features.
2. The method for constructing the double-frame interactive perception 3D association detection model according to claim 1, wherein the feature extraction network comprises a point cloud feature extraction branch for extracting features from the point cloud data and an image feature extraction branch for extracting features from the image data; the point cloud data is input into the point cloud feature extraction branch to extract the multi-scale point cloud features, and the image data is input into the image feature extraction branch to extract the initial image features.
3. The method for constructing the double-frame interactive perception 3D association detection model according to claim 1, wherein in the step of channel-fusing the multi-scale point cloud features to obtain the voxel-enhanced modal features and performing semantic information enhancement on the initial image features based on the voxel-enhanced modal features to obtain the image enhancement features, the multi-scale point cloud features input into the region cross-modal interaction fusion module are channel-fused so that shallow and deep information is combined to obtain the voxel-enhanced modal features; points in three-dimensional space in the voxel-enhanced modal features are projected into the pixel grid of the corresponding initial image features according to the correspondence between three-dimensional points and image pixels to generate an attention feature distribution map; and the initial image features and the attention feature distribution map are normalized with a Softmax function and then fused to obtain the image enhancement features.
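By way of illustration only, the sketch below shows one way the projection-based attention map and Softmax fusion described in claim 3 could be coded; the function name, tensor shapes, and scatter scheme are hypothetical, and the actual method relies on the calibrated point-to-pixel correspondence.

```python
# Illustrative sketch (assumed shapes): scatter voxel-enhanced point evidence onto the image
# grid, normalise it with Softmax, and fuse it with the initial image features.
import torch
import torch.nn.functional as F

def image_enhancement(img_feat, points_uv, point_scores):
    """img_feat: (C, H, W); points_uv: (N, 2) integer pixel coords; point_scores: (N,)."""
    C, H, W = img_feat.shape
    attn_map = torch.zeros(H, W)
    u = points_uv[:, 0].clamp(0, W - 1)
    v = points_uv[:, 1].clamp(0, H - 1)
    attn_map[v, u] = point_scores                               # scatter point evidence into the pixel grid
    attn = F.softmax(attn_map.flatten(), dim=0).view(1, H, W)   # normalised attention feature distribution map
    return img_feat + img_feat * attn                           # attention-weighted image enhancement

feat = torch.randn(64, 16, 16)
uv = torch.randint(0, 16, (100, 2))
enhanced = image_enhancement(feat, uv, torch.rand(100))
print(enhanced.shape)  # torch.Size([64, 16, 16])
```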
4. The method for constructing the double-frame interactive perception 3D association detection model according to claim 1, wherein in the step of performing regional voxel self-attention aggregation on the voxel-enhanced modal features to obtain the voxel set, for the non-empty voxels in the voxel-enhanced modal features, the features of the points within each region and the position information of those points are gathered and merged by a ball-query method to obtain aggregated non-empty voxel features, and the aggregated non-empty voxel features are enhanced through self-learning with region-aware self-attention to obtain the voxel set.
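A hedged sketch of the ball-query aggregation and region-aware self-attention of claim 4 follows; the radius, neighbour count, and the simple top-k query are assumptions made for illustration, not the claimed implementation.

```python
# Illustrative sketch: gather per-point features around non-empty voxel centres with a toy
# ball query, then enhance the grouped features with a self-attention pass.
import torch
import torch.nn as nn

def ball_query(centers, points, radius, k):
    """centers: (M, 3); points: (N, 3). Returns indices (M, k) of the k closest points,
    with distances beyond the radius masked out (no padding handling in this toy version)."""
    dist = torch.cdist(centers, points)                      # (M, N) pairwise distances
    dist = dist.masked_fill(dist > radius, float("inf"))
    return dist.topk(k, largest=False).indices               # k smallest distances per centre

centers = torch.randn(32, 3)             # non-empty voxel centres
points = torch.randn(1024, 3)            # raw point positions
feats = torch.randn(1024, 64)            # per-point features
idx = ball_query(centers, points, radius=0.8, k=16)
grouped = feats[idx]                      # (32, 16, 64) gathered regional features

attn = nn.MultiheadAttention(64, 4, batch_first=True)
voxel_set, _ = attn(grouped, grouped, grouped)               # region-aware self-attention enhancement
voxel_set = voxel_set.mean(dim=1)                            # (32, 64) one descriptor per voxel
print(voxel_set.shape)
```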
5. The method for constructing the double-frame interactive perception 3D association detection model according to claim 1, wherein in the step of fusing the voxel set and the image enhancement features in the cross-modal attention interactive fusion manner to obtain the bimodal features, the voxel set and the image enhancement features serve as a dual query set; the position information of the voxel set is converted into a depth-aware encoding by means of the intrinsic three-dimensional space transformation matrix of the camera; the attention weights and the predicted offsets of the cross-modal attention between the voxel set and the image enhancement features are calculated with Softmax and Linear functions, with the dual query set and the depth encoding participating in the calculation; and the query set of the image enhancement features is updated by the cross-modal attention through fusion with the attention weights, so as to obtain the bimodal features.
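The cross-modal attention update of claim 5 could be sketched roughly as below; the depth-aware encoding is replaced here by a simple learned embedding of voxel positions, the predicted offsets are omitted, and all dimensions are assumed, so this is an illustrative approximation rather than the claimed computation.

```python
# Illustrative sketch: voxel queries carrying a positional (depth-aware) encoding attend to
# the enhanced image features; attention weights come from a Softmax over Linear projections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    def __init__(self, c: int = 64):
        super().__init__()
        self.depth_embed = nn.Linear(3, c)   # stand-in for the depth-aware encoding of voxel positions
        self.to_weight = nn.Linear(c, c)
        self.to_value = nn.Linear(c, c)

    def forward(self, voxel_feat, voxel_xyz, img_feat):
        q = voxel_feat + self.depth_embed(voxel_xyz)                              # encoded voxel queries
        w = F.softmax(self.to_weight(q) @ img_feat.transpose(-1, -2), dim=-1)     # attention weights
        return voxel_feat + w @ self.to_value(img_feat)                           # weighted image evidence

fusion = CrossModalFusion()
bimodal = fusion(torch.randn(1, 32, 64), torch.randn(1, 32, 3), torch.randn(1, 256, 64))
print(bimodal.shape)  # torch.Size([1, 32, 64])
```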
6. The method for constructing the double-frame interactive perception 3D association detection model according to claim 1, wherein in the step of inputting the bimodal features into the inner distribution perception branch and the outer distribution perception branch of the dual-branch distribution perception module to obtain the inner point perception distribution features and the outer point perception distribution features respectively,
BEV projection of the three-dimensional space points is performed on the bimodal features to obtain initial BEV features, and inner distribution perception learning and outer distribution perception learning are performed on the initial BEV features respectively; the inner distribution perception learning adopts a locally contracted inner point perception distribution representation: the initial BEV features are input into the inner distribution perception branch and convolved to obtain inner point perception convolution features; the inner point perception convolution features are subjected to a channel attention operation and then to max pooling and average pooling to obtain pooled features; the pooled features and the inner point perception distribution representation are jointly learned and then convolved to obtain inner distribution features; the inner distribution features are fused with the inner point perception convolution features to obtain inner distribution enhancement features; and the inner distribution enhancement features are passed through an activation function and fused with the initial BEV features to obtain the inner point perception distribution features;
the outer distribution perception learning adopts a locally expanded outer point perception distribution representation: the initial BEV features are input into the outer distribution perception branch and convolved to obtain outer point perception convolution features; the outer point perception convolution features are subjected to a channel attention operation and then to max pooling and average pooling to obtain pooled features; the pooled features and the outer point perception distribution representation are jointly learned and then convolved to obtain outer distribution features; the outer distribution features are fused with the outer point perception convolution features to obtain outer distribution enhancement features; and the outer distribution enhancement features are passed through an activation function and fused with the initial BEV features to obtain the outer point perception distribution features.
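For illustration, the sketch below abstracts one distribution perception branch of claim 6; the contracted/expanded distribution representations are reduced to a learnable bias and the layer sizes are assumptions, so the inner and outer branches differ here only through separately learned parameters.

```python
# Illustrative sketch of one distribution perception branch on BEV features: convolution,
# channel attention, max/average pooling, joint convolution, and a gated fusion with the input.
import torch
import torch.nn as nn

class DistributionBranch(nn.Module):
    def __init__(self, c: int = 64):
        super().__init__()
        self.conv_in = nn.Conv2d(c, c, 3, padding=1)
        self.channel_attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.conv_joint = nn.Conv2d(2 * c, c, 3, padding=1)
        self.dist_repr = nn.Parameter(torch.zeros(1, c, 1, 1))  # stand-in for the distribution representation

    def forward(self, bev):
        x = self.conv_in(bev)                                    # perception convolution features
        x = x * self.channel_attn(x)                             # channel attention operation
        pooled = torch.cat([torch.amax(x, dim=(2, 3), keepdim=True).expand_as(x),
                            torch.mean(x, dim=(2, 3), keepdim=True).expand_as(x)], dim=1)
        joint = pooled + torch.cat([self.dist_repr, self.dist_repr], dim=1).expand_as(pooled)
        dist = self.conv_joint(joint)                            # joint learning + convolution
        enhanced = dist + x                                      # distribution enhancement features
        return bev + torch.sigmoid(enhanced) * bev               # activation-gated fusion with the BEV input

bev = torch.randn(2, 64, 32, 32)
inner = DistributionBranch()(bev)   # inner-point perception distribution features (sketch)
outer = DistributionBranch()(bev)   # outer-point perception distribution features (sketch)
print(inner.shape, outer.shape)
```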
7. The method for constructing the double-frame interactive perception 3D association detection model according to claim 1, wherein the dual-feature association mining module comprises a local attribute attention branch and a global attribute learning branch; the local attribute attention branch models the attributes of the three-dimensional object in the input dual-distribution features locally to obtain a global attention feature; the global attribute learning branch models the attributes of the three-dimensional object in the input dual-distribution features globally to obtain a local attention feature; an attribute-associated attention feature is mined from the local attention feature and the global attention feature; the attribute-associated attention feature is normalized with a Softmax function to obtain an associated attention feature map; the associated attention feature map is classified and regressed to obtain the three-dimensional object; and meanwhile, the mined attribute-associated attention feature and the feature of the real label are optimized during training with a SmoothL1 loss function.
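A minimal sketch of the association mining and SmoothL1 supervision of claim 7 is given below; the local branch is approximated by a windowed convolution and the global branch by full self-attention, with assumed shapes and a toy loss computation.

```python
# Illustrative sketch: local and global attribute modelling are fused, normalised with Softmax
# into an association attention map, then classified/regressed; SmoothL1 supervises regression.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociationMining(nn.Module):
    def __init__(self, c: int = 64, num_classes: int = 3):
        super().__init__()
        self.local = nn.Conv1d(c, c, kernel_size=3, padding=1)     # local attribute modelling
        self.glob = nn.MultiheadAttention(c, 4, batch_first=True)  # global attribute modelling
        self.cls_head = nn.Linear(c, num_classes)
        self.reg_head = nn.Linear(c, 7)

    def forward(self, dual_feat):
        local = self.local(dual_feat.transpose(1, 2)).transpose(1, 2)
        glob, _ = self.glob(dual_feat, dual_feat, dual_feat)
        assoc = F.softmax(local + glob, dim=-1) * dual_feat         # association attention map
        return self.cls_head(assoc), self.reg_head(assoc)

miner = AssociationMining()
dual_feat = torch.randn(2, 100, 64)
cls_logits, boxes = miner(dual_feat)
gt_boxes = torch.randn(2, 100, 7)                                   # real-label box attributes (toy data)
loss = F.smooth_l1_loss(boxes, gt_boxes) + F.cross_entropy(
    cls_logits.reshape(-1, 3), torch.randint(0, 3, (200,)))
print(float(loss))
```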
8. A double-frame interactive perception 3D association detection method, characterized by comprising the following steps:
inputting point cloud data and image data into the double-frame interactive perception 3D association detection model constructed by the method for constructing a double-frame interactive perception 3D association detection model according to any one of claims 1 to 7, and outputting a three-dimensional object.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the method for constructing a double-frame interactive perception 3D association detection model according to any one of claims 1 to 7.
10. A readable storage medium, wherein the readable storage medium stores a computer program comprising program code for controlling a process to execute the method for constructing a double-frame interactive perception 3D association detection model according to any one of claims 1 to 7.
CN202410045646.5A 2024-01-12 2024-01-12 Construction method and application of double-frame interaction perception 3D association detection model Active CN117557993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410045646.5A CN117557993B (en) 2024-01-12 2024-01-12 Construction method and application of double-frame interaction perception 3D association detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410045646.5A CN117557993B (en) 2024-01-12 2024-01-12 Construction method and application of double-frame interaction perception 3D association detection model

Publications (2)

Publication Number Publication Date
CN117557993A true CN117557993A (en) 2024-02-13
CN117557993B CN117557993B (en) 2024-03-29

Family

ID=89817082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410045646.5A Active CN117557993B (en) 2024-01-12 2024-01-12 Construction method and application of double-frame interaction perception 3D association detection model

Country Status (1)

Country Link
CN (1) CN117557993B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080065490A (en) * 2007-01-09 2008-07-14 포항공과대학교 산학협력단 Distributed file service method and system for integrated data management in ubiquitous environment
CN111209305A (en) * 2019-11-19 2020-05-29 华为技术有限公司 Data query method, data node, distributed database and computing equipment
US20220230322A1 (en) * 2021-01-21 2022-07-21 Dalian University Of Technology Depth-aware method for mirror segmentation
CN115880333A (en) * 2022-12-05 2023-03-31 东北大学 Three-dimensional single-target tracking method based on multi-mode information fusion
CN115909319A (en) * 2022-12-15 2023-04-04 南京工业大学 Method for detecting 3D object on point cloud based on hierarchical graph network
CN116189278A (en) * 2022-12-05 2023-05-30 重庆邮电大学 Fine granularity basketball action recognition method based on global context awareness
US20230206603A1 (en) * 2022-09-19 2023-06-29 Nanjing University Of Posts And Telecommunications High-precision point cloud completion method based on deep learning and device thereof
CN117173399A (en) * 2023-09-06 2023-12-05 东南大学 Traffic target detection method and system of cross-modal cross-attention mechanism
CN117315372A (en) * 2023-10-31 2023-12-29 电子科技大学 Three-dimensional perception method based on feature enhancement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HONGHUI YANG et al.: "PVT-SSD: Single-Stage 3D Object Detection with Point-Voxel Transformer", ARXIV, 12 May 2023 (2023-05-12), pages 1-12 *
DANG Jisheng; YANG Jun: "3D model recognition and segmentation with multi-feature fusion", Journal of Xidian University, no. 04, 17 August 2020 (2020-08-17), pages 153-161 *
LIU Tianliang; QIAO Qingwei; WAN Junwei; DAI Xiubin; LUO Jiebo: "Human action recognition fusing spatial-temporal dual network streams and visual attention", Journal of Electronics & Information Technology, no. 10, 15 August 2018 (2018-08-15), pages 114-120 *

Also Published As

Publication number Publication date
CN117557993B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
Zhang et al. Reference pose generation for long-term visual localization via learned features and view synthesis
Guerry et al. Snapnet-r: Consistent 3d multi-view semantic labeling for robotics
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
JP6458394B2 (en) Object tracking method and object tracking apparatus
US10366300B1 (en) Systems and methods regarding 2D image and 3D image ensemble prediction models
WO2019177738A1 (en) Systems and methods for reducing data storage in machine learning
AU2017324923A1 (en) Predicting depth from image data using a statistical model
US11670097B2 (en) Systems and methods for 3D image distification
US20230076266A1 (en) Data processing system, object detection method, and apparatus thereof
Wang et al. 3d lidar and stereo fusion using stereo matching network with conditional cost volume normalization
Elharrouss et al. Panoptic segmentation: A review
CN112085840B (en) Semantic segmentation method, semantic segmentation device, semantic segmentation equipment and computer readable storage medium
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
CN113052295B (en) Training method of neural network, object detection method, device and equipment
CN109948637A (en) Object test equipment, method for checking object and computer-readable medium
He et al. Learning scene dynamics from point cloud sequences
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
JP2021510823A (en) Vehicle position identification
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
KR20200136723A (en) Method and apparatus for generating learning data for object recognition using virtual city model
CN111950702A (en) Neural network structure determining method and device
CN113724128A (en) Method for expanding training sample
KR102462499B1 (en) System and method for creating digital twin maps using multiple intelligent autonomous driving robots
CN115222896A (en) Three-dimensional reconstruction method and device, electronic equipment and computer-readable storage medium
CN113657393B (en) Shape prior missing image semi-supervised segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant