CN116894963A - Object detection method and system based on context clustering and multi-mode fusion - Google Patents

Object detection method and system based on context clustering and multi-mode fusion

Info

Publication number
CN116894963A
CN116894963A (application CN202310660880.4A)
Authority
CN
China
Prior art keywords
target
data
detection result
fusion
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310660880.4A
Other languages
Chinese (zh)
Inventor
何为
邓振淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202310660880.4A priority Critical patent/CN116894963A/en
Publication of CN116894963A publication Critical patent/CN116894963A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S13/00Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
    • G01S13/86Combinations of radar systems with non-radar systems, e.g. sonar, direction finder
    • G01S13/867Combination of radar systems with cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/86Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Remote Sensing (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and system based on context clustering and multi-mode fusion, relating to the technical fields of computer vision and radar signal processing. The method comprises: acquiring image data and point cloud data and performing data enhancement processing; extracting features from the enhanced image data and calculating the corresponding heat map and attribute parameters from the visual branch features to obtain a first detection result of the target; performing rectangular expansion on the enhanced point cloud data to obtain an expanded point cloud rectangular surface; performing regional association between the expanded surface and the first detection result to obtain associated data, and extracting features from it to obtain radar branch features; fusing the visual branch features and the radar branch features, calculating a second detection result, and inputting the second detection result together with the first detection result into a decoder to obtain the final detection result of the target. The multi-mode feature fusion network structure based on context clustering designed by the invention realizes fusion and complementation of multi-mode feature data and improves the accuracy and efficiency of target detection.

Description

Object detection method and system based on context clustering and multi-mode fusion
Technical Field
The invention relates to the technical field of computer vision and radar signal processing, in particular to a method and a system for detecting targets based on context clustering and multi-mode fusion.
Background
In the fields of automatic driving and assisted driving, the target detection performance of the intelligent perception system has become one of the key factors determining how well the environment is perceived. In recent years, target detection algorithms based on single-mode sensors such as vision and radar, and in particular detection algorithms based on deep learning, have been studied extensively. However, in real application scenes, a single-mode detection algorithm is easily affected by factors such as the environment and detection efficiency, leading to unstable performance and poor robustness. For example, a detection algorithm based only on a vision sensor is highly susceptible to complex conditions such as rain, fog or low illumination, which degrade detection performance or even cause the algorithm to fail outright. These problems have a major impact on the operation of automatic driving and driver-assistance systems.
In recent years, target detection algorithms based on multi-mode sensor information fusion have received a great deal of attention, because they can improve the detection accuracy of a perception system for surrounding objects. Existing multi-mode fusion methods are mainly fusion networks based on convolutional neural network structures or on self-attention mechanisms. While these networks can fuse multi-mode data to achieve better target detection, they typically require a large number of network parameters to train the model and thus place high demands on the computing performance of the deployment device. Information fusion based on vision sensors and radar sensors has become a very promising research direction, because the two modalities complement each other well and are relatively inexpensive. A vision sensor can accurately capture target class features in visual space under normal conditions, but it cannot perceive object depth and speed, and its perception deteriorates further under complex conditions such as occlusion. The data provided by a radar sensor is relatively sparse in the spatial dimension, but it is unaffected by rain, fog or low illumination and can measure both the distance and the speed of reflective objects in the environment. Fusing the data of these two modes therefore compensates for the shortcomings of each sensor and provides more accurate and more robust target detection.
The prior art discloses an object detection method based on semantic segmentation enhancement, which comprises: preparing annotated images and partitioning the collected images; designing a deep convolutional neural network based on semantic segmentation enhancement, comprising a backbone sub-network, a segmentation sub-network and a detection sub-network, where the backbone sub-network extracts general image features, the segmentation sub-network extracts semantic segmentation features and predicts a segmentation heat map for each object class, and the detection sub-network uses a class-specific detector to extract and predict features of that class; and training the deep convolutional neural network with the training data set and computing the detection result of an image with the trained network. This reference is mainly based on a convolutional neural network structure; although it can fuse multi-mode data to a certain extent, it requires a large number of network parameters to train the model, so it places high demands on the computing performance of the deployment device and its target detection efficiency is low.
Disclosure of Invention
To overcome the shortcomings of existing single-mode target detection algorithms, which are easily affected by factors such as the environment and detection efficiency and therefore suffer from unstable performance and poor robustness, the invention provides a method and a system for target detection based on context clustering and multi-mode fusion, which exploit the fusion and complementation of multi-mode feature data to improve the accuracy and efficiency of target detection.
In order to solve the technical problems, the technical scheme of the invention is as follows:
the invention provides a target detection method based on context clustering and multi-mode fusion, which comprises the following steps:
s1: obtaining paired image data and point cloud data, and carrying out data enhancement processing on the image data and the point cloud data to obtain the image data and the point cloud data after data enhancement;
s2: extracting features of the image data after the data enhancement to obtain visual branch features;
s3: based on the visual branch characteristics, calculating a corresponding heat map and attribute parameters to obtain a first detection result of the target;
s4: mapping the point cloud data after data enhancement into an image coordinate system and performing rectangular expansion to obtain an expanded point cloud rectangular surface;
s5: performing region association on the first detection result of the target and the expanded point cloud rectangular surface to obtain association data;
s6: extracting features of the associated data to obtain radar branch features;
s7: performing feature fusion on the visual branch feature and the radar branch feature to obtain fusion features;
s8: calculating a second detection result of the target based on the fusion characteristic;
s9: and inputting the first detection result of the target and the second detection result of the target into a preset decoder to obtain a final detection result of the target.
Preferably, in the step S1, the image data is acquired by using a vision sensor, and the point cloud data is acquired by using a radar sensor; the data enhancement processing of the image data and the point cloud data comprises random horizontal flipping of the data and random shifting of the data.
Preferably, in the step S2, the image data after the data enhancement is input into an existing deep aggregation network model to perform feature extraction, so as to obtain the visual branch feature.
Preferably, in the step S3, the visual branch feature is input into a trained first regression network, and a heat map and attribute parameters represented by the visual branch feature are calculated to obtain a first detection result of the target; the attribute parameters include target size, offset, three-dimensional size, depth, and direction.
Preferably, the specific method for obtaining the trained first regression network is as follows:
acquiring visual branch characteristics of image training data, inputting the visual branch characteristics into a constructed first regression network, and calculating a corresponding heat map, a target size, an offset, a three-dimensional size, a depth and a direction;
A focal loss function is set for supervised training of the heat map, specifically:

$$L_{k}=-\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases}$$

where $L_{k}$ denotes the focal loss value, $N$ denotes the number of targets in the image training data, $Y_{xyc}$ denotes the real heat map of the target, $\hat{Y}_{xyc}$ denotes the predicted heat map of the target, and $\alpha$ and $\beta$ denote the first and second hyperparameters of the focal loss;
The average absolute error is set as the loss function for optimizing the target size, offset, three-dimensional size, depth and direction, specifically:

$$L_{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|f\left(x_{i}\right)-y_{i}\right|$$

where $L_{MAE}$ denotes the average absolute error loss value, $N$ denotes the number of targets in the image training data, $f(x_{i})$ denotes the first prediction result of the $i$-th target, and $y_{i}$ denotes the real label of the $i$-th target;
When the average absolute error loss value and the focal loss value are minimized, the trained first regression network is obtained.
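By way of illustration only, and not as part of the claimed method, the training objectives above can be sketched in PyTorch roughly as follows. The patent gives its loss formulas only in figure form, so the CenterNet-style focal-loss variant below and the default hyperparameter values α = 2 and β = 4 are assumptions consistent with the variable definitions.

```python
import torch
import torch.nn.functional as F

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0):
    """Focal loss over predicted and ground-truth heat maps of shape (B, C, H, W)."""
    pos_mask = gt.eq(1).float()                 # locations of real target centres
    neg_mask = 1.0 - pos_mask
    pred = pred.clamp(1e-6, 1 - 1e-6)           # numerical stability
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos_mask
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg_mask
    num_pos = pos_mask.sum().clamp(min=1.0)     # N: number of targets
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

def attribute_mae_loss(pred_attrs, gt_attrs):
    """Average absolute error over the size, offset, 3-D size, depth and direction heads."""
    return F.l1_loss(pred_attrs, gt_attrs, reduction="mean")
```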
To address the problem that the height information in the radar sensor's point cloud data is inaccurate, the point cloud data after data enhancement is mapped into the image coordinate system and expanded into rectangles, yielding the expanded point cloud rectangular surfaces together with their size and position information in the plane of the image coordinate system.
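A minimal sketch of this expansion step is given below. It assumes the radar points have already been transformed into the camera coordinate frame and that a standard pinhole projection with intrinsic matrix K is used; the fixed pillar height and width in metres are placeholder values, since the text does not state concrete dimensions.

```python
import numpy as np

def expand_radar_points(points_cam, K, pillar_h=1.5, pillar_w=0.5):
    """Project radar points (N, 3) in camera coordinates onto the image plane and
    expand each point into a 2-D rectangle (pillar) to compensate for the radar's
    poor height resolution. Returns a list of (u_min, v_min, u_max, v_max, depth)."""
    rects = []
    for x, y, z in points_cam:
        if z <= 0:                               # point lies behind the camera
            continue
        corners = np.array([
            [x - pillar_w / 2, y, z],            # bottom-left corner (camera y points down)
            [x + pillar_w / 2, y - pillar_h, z],  # top-right corner
        ])
        uv = (K @ corners.T).T
        uv = uv[:, :2] / uv[:, 2:3]              # perspective division
        u_min, v_max = uv[0]
        u_max, v_min = uv[1]
        rects.append((u_min, v_min, u_max, v_max, z))
    return rects
```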
Preferably, the first detection result of the target and the expanded point cloud rectangular surface are subjected to regional association by using a cone association method, so that association data are obtained.
In the cone association method, a threshold $\tau_{d}$ is set around the centre of the target's three-dimensional bounding box, and all expanded radar detection targets that fall within the region $[-\tau_{d}, \tau_{d}]$ are associated with the first detection result; the threshold $\tau_{d}$ is computed from the maximum value $z^{3D}_{max}$ and the minimum value $z^{3D}_{min}$ of the three-dimensional bounding box along the depth (z) axis.
If a plurality of radar-detected targets exist within the threshold range, the point cloud rectangular surface with the smallest z-axis coordinate is selected. In addition, because the depth estimate in the first detection result output by the trained first regression network is uncertain (the vision sensor that acquires the image data cannot measure depth directly), a relaxation factor $\delta$ is introduced to enlarge the association region so that more radar points can be associated with the object; the relaxation factor $\delta$ increases the threshold $\tau_{d}$ in proportion to the size of the object.
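The sketch below shows one possible reading of this association step. The threshold formula itself appears in the patent only as a figure, so computing τ_d as half the z-extent of the estimated three-dimensional box and enlarging the window by the relaxation factor δ are assumptions; the helper names are placeholders.

```python
def frustum_associate(det, radar_rects, delta=0.1):
    """Associate expanded radar rectangles with one first-stage detection.

    det: dict with the estimated 2-D box ('bbox' = (u1, v1, u2, v2)), the estimated
         centre depth ('depth') and the 3-D box z-extent ('z_min', 'z_max').
    radar_rects: iterable of (u_min, v_min, u_max, v_max, depth) pillars.
    delta: relaxation factor that enlarges the depth window (assumed usage).
    """
    tau_d = 0.5 * (det["z_max"] - det["z_min"])    # assumed form of the threshold
    tau_d *= (1.0 + delta)                         # relax to absorb depth-estimation error
    u1, v1, u2, v2 = det["bbox"]

    candidates = []
    for (ru1, rv1, ru2, rv2, r_depth) in radar_rects:
        overlaps = not (ru2 < u1 or ru1 > u2 or rv2 < v1 or rv1 > v2)
        in_window = abs(r_depth - det["depth"]) <= tau_d
        if overlaps and in_window:
            candidates.append((ru1, rv1, ru2, rv2, r_depth))

    # If several radar targets fall inside the window, keep the one with the smallest z.
    return min(candidates, key=lambda r: r[4]) if candidates else None
```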
Preferably, in the step S6, feature extraction is performed on the associated data, and the specific method for obtaining the radar branch feature is as follows:
For each point cloud rectangular surface associated with the first detection result of a target, a corresponding radar branch feature $F_{radar}$ is generated at that position. In the calculation formula, $i=1,2,3$ index the three feature channels of the radar branch feature; $F^{j}_{i}$ denotes the radar branch feature of the $j$-th target on the $i$-th channel; $M_{i}$ denotes the normalization factor of the $i$-th channel; $f_{i}$ denotes the depth or radial-velocity feature value on the $i$-th channel, with $x$ and $y$ denoting the components of the target velocity in the horizontal and vertical directions respectively; $(\hat{c}^{j}_{x}, \hat{c}^{j}_{y})$ denotes the centre point of the $j$-th target; $w_{j}$ and $h_{j}$ denote the width and height of the two-dimensional bounding box of the $j$-th target; and $\gamma$ is a hyperparameter that controls the size of the target's two-dimensional bounding box.
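As a rough illustration of how the radar branch feature map could be rasterised, the sketch below fills a γ-scaled box around each associated target centre with normalised depth and velocity values on three channels. Because the formula is reproduced in the patent only as a figure, this filling rule and the normalisation factors are assumptions based on the variable definitions above.

```python
import numpy as np

def radar_branch_features(dets, feat_h, feat_w, gamma=0.3, norms=(60.0, 20.0, 20.0)):
    """Build a 3-channel radar feature map (depth, v_x, v_y).

    dets: list of dicts with the target centre ('cx', 'cy'), 2-D box size ('w', 'h')
          and the associated radar measurements ('depth', 'vx', 'vy'),
          all expressed in feature-map units.
    norms: per-channel normalization factors M_i (placeholder values).
    """
    F_radar = np.zeros((3, feat_h, feat_w), dtype=np.float32)
    for d in dets:
        values = (d["depth"], d["vx"], d["vy"])
        half_w, half_h = gamma * d["w"] / 2, gamma * d["h"] / 2
        x1 = max(int(d["cx"] - half_w), 0)
        x2 = min(int(d["cx"] + half_w) + 1, feat_w)
        y1 = max(int(d["cy"] - half_h), 0)
        y2 = min(int(d["cy"] + half_h) + 1, feat_h)
        for i in range(3):
            F_radar[i, y1:y2, x1:x2] = values[i] / norms[i]
    return F_radar
```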
Preferably, in the step S7, the visual branch feature and the radar branch feature are input into a constructed context clustering fusion network to perform feature fusion, so as to obtain fusion features; the formula of feature fusion is:
$$F_{fused}=\Phi_{CCN}\left(F_{radar}\oplus F_{camera}\right)$$

where $F_{fused}$ denotes the fusion feature, $\Phi_{CCN}$ denotes the context clustering fusion network, $F_{radar}$ denotes the radar branch feature, $F_{camera}$ denotes the visual branch feature, and $\oplus$ denotes feature concatenation.
The context clustering fusion network can adaptively adjust the cluster centres, effectively capturing the feature correlation information between the different modes to obtain the final fusion features; performing target detection with these fusion features therefore yields more accurate and more robust results. Because the context clustering fusion network uses neither convolution nor a self-attention mechanism, the number of network parameters is greatly reduced, which improves both the training speed and the detection efficiency of the network. The context clustering fusion network consists mainly of two parts, feature aggregation and feature dispatching, and clusters the input features into several clusters; the feature points of each cluster are aggregated and then dispatched back. Context clustering proceeds as follows: given a set of feature points $P$, a linear projection of $P$ yields $P_{s}$, which is used to compute the similarity between features; $c$ cluster centres are set among all features in the feature space, the value of each centre is computed as the average of its $k$ nearest feature points, and the cosine similarity between $P_{s}$ and the set of centre points is computed. Because the similarity computation implicitly reflects both the distance between features and their feature similarity, each feature point is then assigned to its most similar centre, finally producing $c$ clusters and thereby achieving feature aggregation; the aggregated features are redistributed back to the original dimension through a fully connected layer.
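Purely as an illustrative sketch, the fusion step could be organised as below in PyTorch. The projection dimension, the number of cluster centres c and the use of adaptive pooling to initialise the centres are assumptions; the actual context clustering block averages each centre over its k nearest feature points, which is simplified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextClusterFusion(nn.Module):
    """Simplified context-clustering fusion: concatenate camera and radar feature maps,
    assign every feature point to its most similar centre, aggregate each cluster and
    dispatch the aggregated value back to the cluster members."""

    def __init__(self, cam_ch, radar_ch, dim=64, centers=16):
        super().__init__()
        in_ch = cam_ch + radar_ch
        self.proj = nn.Conv2d(in_ch, dim, 1)           # P -> P_s used for similarity
        self.value = nn.Conv2d(in_ch, dim, 1)
        self.out = nn.Conv2d(dim, in_ch, 1)            # dispatch back to the input width
        self.center_pool = nn.AdaptiveAvgPool2d(int(centers ** 0.5))

    def forward(self, f_cam, f_radar):
        x = torch.cat([f_cam, f_radar], dim=1)         # feature concatenation
        b, _, h, w = x.shape
        sim_feat = self.proj(x)                        # (b, d, h, w)
        val_feat = self.value(x)
        cents = self.center_pool(sim_feat)             # (b, d, s, s): cluster centres
        c = cents.shape[-1] ** 2
        sim = torch.matmul(                            # cosine similarity (b, hw, c)
            F.normalize(sim_feat.flatten(2), dim=1).transpose(1, 2),
            F.normalize(cents.flatten(2), dim=1),
        )
        assign = F.one_hot(sim.argmax(-1), c).float()  # each point -> most similar centre
        v = val_feat.flatten(2).transpose(1, 2)        # (b, hw, d)
        agg = torch.matmul(assign.transpose(1, 2), v)  # aggregate features per cluster
        agg = agg / assign.sum(1).clamp(min=1).unsqueeze(-1)
        dispatched = torch.matmul(assign, agg)         # dispatch aggregates back to points
        dispatched = dispatched.transpose(1, 2).reshape(b, -1, h, w)
        return self.out(dispatched) + x                # fused feature map F_fused
```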
Preferably, in the step S8, the fusion features are input into a trained secondary regression network, and the depth, speed, direction and class of the target are recalculated;
For the depth, speed and direction of the target, the average absolute error is still adopted as the loss function for supervised training;
Aiming at the category, a cross entropy loss function is set for optimization, specifically:

$$L_{BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log\sigma\left(x_{i}\right)+\left(1-y_{i}\right)\log\left(1-\sigma\left(x_{i}\right)\right)\right]$$

where $L_{BCE}$ denotes the cross entropy loss value and $\sigma(x_{i})$ denotes the second prediction result of the $i$-th target;
When the average absolute error loss value and the cross entropy loss value are minimized, the trained secondary regression network is obtained, and the second prediction result of the corresponding $i$-th target is taken as the second detection result of the target.
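For the category head, a binary cross-entropy of the form implied by the variable definitions can be sketched as follows; treating y_i as a per-target 0/1 label and averaging over targets is an assumption.

```python
import torch

def category_bce_loss(logits, labels):
    """Binary cross-entropy over the re-estimated category scores.

    logits: raw scores x_i of the secondary regression network.
    labels: ground-truth indicators y_i in {0, 1}.
    """
    probs = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)   # sigma(x_i)
    return -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()
```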
Preferably, in step S9, the preset decoder is a bounding box decoder.
The invention also provides a target detection system based on context clustering and multi-mode fusion, which is used for realizing the target detection method, and comprises the following steps:
the data acquisition processing module is used for acquiring paired image data and point cloud data, and carrying out data enhancement processing on the image data and the point cloud data to obtain the image data and the point cloud data after data enhancement;
the visual feature extraction module is used for carrying out feature extraction on the image data after the data enhancement to obtain visual branch features;
the first detection module is used for calculating a corresponding heat map and attribute parameters based on the visual branch characteristics to obtain a first detection result of the target;
the point cloud expansion module is used for mapping the point cloud data after the data enhancement into an image coordinate system and carrying out rectangular expansion to obtain an expanded point cloud rectangular surface;
the multi-mode data association module is used for carrying out area association on the first detection result of the target and the expanded point cloud rectangular surface to obtain association data;
the radar feature extraction module is used for extracting features of the associated data to obtain radar branch features;
the feature fusion module is used for carrying out feature fusion on the vision branch features and the radar branch features to obtain fusion features;
the second detection module is used for calculating a second detection result of the target based on the fusion characteristic;
the target detection module is used for inputting the first detection result of the target and the second detection result of the target into a preset decoder to obtain a final detection result of the target.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
firstly, obtaining paired image data and point cloud data, performing data enhancement processing, and performing feature extraction on the image data subjected to the data enhancement processing to obtain visual branch features; based on the visual branch characteristics, calculating a corresponding heat map and attribute parameters to obtain a first detection result of the target; mapping the point cloud data after data enhancement into an image coordinate system, carrying out rectangular expansion to obtain an expanded point cloud rectangular surface, and carrying out regional association with a first detection result of a target to obtain associated data; extracting features of the associated data to obtain radar branch features; feature fusion is carried out on the features of the two modes of the visual branch feature and the radar branch feature, and a second detection result of the target is calculated based on the fusion features; and finally, inputting the first detection result and the second detection result of the target into a preset decoder to obtain the final detection result of the target. The invention utilizes the multi-mode characteristic data to carry out fusion complementation, thereby improving the precision and efficiency of target detection.
Drawings
Fig. 1 is a flow chart of a method for detecting targets based on context clustering and multi-modal fusion according to embodiment 1.
Fig. 2 is a schematic diagram of a context clustering and multi-modal fusion-based target detection method according to embodiment 2.
Fig. 3 is a schematic diagram of an expanded rectangular surface of a point cloud according to embodiment 2.
Fig. 4 is a schematic diagram of the cone association method described in embodiment 2.
Fig. 5 is a diagram of the final detection result of the two-dimensional object in the first scenario described in embodiment 2.
Fig. 6 is a diagram of the final detection result of the two-dimensional object in the second scenario described in embodiment 2.
Fig. 7 is a diagram of the final detection result of the two-dimensional object in the third scenario described in embodiment 2.
Fig. 8 is a diagram of the final detection result of the three-dimensional object in the first scenario described in embodiment 2.
Fig. 9 is a diagram of the final detection result of the three-dimensional object in the second scenario described in embodiment 2.
Fig. 10 is a diagram of the final detection result of the three-dimensional object in the third scenario described in embodiment 2.
Fig. 11 is a schematic structural diagram of a context clustering and multi-modal fusion-based target detection system according to embodiment 3.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides a target detection method based on context clustering and multi-mode fusion, as shown in fig. 1, including:
s1: obtaining paired image data and point cloud data, and carrying out data enhancement processing on the image data and the point cloud data to obtain the image data and the point cloud data after data enhancement;
s2: extracting features of the image data after the data enhancement to obtain visual branch features;
s3: based on the visual branch characteristics, calculating a corresponding heat map and attribute parameters to obtain a first detection result of the target;
s4: mapping the point cloud data after data enhancement into an image coordinate system and performing rectangular expansion to obtain an expanded point cloud rectangular surface;
s5: performing region association on the first detection result of the target and the expanded point cloud rectangular surface to obtain association data;
s6: extracting features of the associated data to obtain radar branch features;
s7: performing feature fusion on the visual branch feature and the radar branch feature to obtain fusion features;
s8: calculating a second detection result of the target based on the fusion characteristic;
s9: and inputting the first detection result of the target and the second detection result of the target into a preset decoder to obtain a final detection result of the target.
In a specific implementation process, the embodiment firstly acquires paired image data and point cloud data, performs data enhancement processing, and performs feature extraction on the image data after the data enhancement processing to acquire visual branch features; based on the visual branch characteristics, calculating a corresponding heat map and attribute parameters to obtain a first detection result of the target; mapping the point cloud data after data enhancement into an image coordinate system, carrying out rectangular expansion to obtain an expanded point cloud rectangular surface, and carrying out regional association with a first detection result of a target to obtain associated data; extracting features of the associated data to obtain radar branch features; feature fusion is carried out on the features of the two modes of the visual branch feature and the radar branch feature, and a second detection result of the target is calculated based on the fusion features; and finally, inputting the first detection result and the second detection result of the target into a preset decoder to obtain the final detection result of the target. In the embodiment, the multi-mode characteristic data are used for fusion complementation, so that the accuracy and the efficiency of target detection are improved.
Example 2
The embodiment provides a target detection method based on context clustering and multi-mode fusion, as shown in fig. 2, including:
s1: obtaining paired image data and point cloud data, and carrying out data enhancement processing on the image data and the point cloud data to obtain the image data and the point cloud data after data enhancement;
Specifically, the image data is acquired using a vision sensor, and the point cloud data is acquired using a radar sensor; the data enhancement processing of the image data and the point cloud data comprises random horizontal flipping and random shifting of the data. In the present embodiment, the probability of random horizontal flipping is set to 50%, and the range of random shifting is set to 0% to 20% (a schematic sketch of this paired augmentation is given below);
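By way of illustration, a minimal sketch of the paired augmentation is given below. It assumes the radar points have already been projected into image coordinates so the same flip and shift can be applied to both modalities, and it uses a wrap-around roll as a simple stand-in for the unspecified shift implementation.

```python
import random
import numpy as np

def augment_pair(image, radar_uv, flip_prob=0.5, max_shift_ratio=0.2):
    """Apply the same random horizontal flip and random shift to an image (H, W, 3)
    and its paired radar points given in image coordinates (N, 2)."""
    w = image.shape[1]
    radar_uv = radar_uv.copy()

    if random.random() < flip_prob:                          # horizontal flip, p = 0.5
        image = image[:, ::-1].copy()
        radar_uv[:, 0] = w - 1 - radar_uv[:, 0]

    shift = int(random.uniform(0.0, max_shift_ratio) * w)    # shift by 0%-20% of the width
    image = np.roll(image, shift, axis=1)                    # wrap-around as a simplification
    radar_uv[:, 0] = (radar_uv[:, 0] + shift) % w
    return image, radar_uv
```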
s2: inputting the image data with the enhanced data into the existing deep aggregation network model for feature extraction to obtain visual branch features;
s3: inputting the visual branch characteristics into a trained first regression network, and calculating a heat map and attribute parameters represented by the visual branch characteristics to obtain a first detection result of a target; the attribute parameters include target size, offset, three-dimensional size, depth and direction;
the specific method for obtaining the trained first regression network comprises the following steps:
acquiring visual branch characteristics of image training data, inputting the visual branch characteristics into a constructed first regression network, and calculating a corresponding heat map, a target size, an offset, a three-dimensional size, a depth and a direction;
A focal loss function is set for supervised training of the heat map, specifically:

$$L_{k}=-\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases}$$

where $L_{k}$ denotes the focal loss value, $N$ denotes the number of targets in the image training data, $Y_{xyc}$ denotes the real heat map of the target, $\hat{Y}_{xyc}$ denotes the predicted heat map of the target, and $\alpha$ and $\beta$ denote the first and second hyperparameters of the focal loss;
The average absolute error is set as the loss function for optimizing the target size, offset, three-dimensional size, depth and direction, specifically:

$$L_{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|f\left(x_{i}\right)-y_{i}\right|$$

where $L_{MAE}$ denotes the average absolute error loss value, $N$ denotes the number of targets in the image training data, $f(x_{i})$ denotes the first prediction result of the $i$-th target, and $y_{i}$ denotes the real label of the $i$-th target;
When the average absolute error loss value and the focal loss value are minimized, the trained first regression network is obtained.
S4: mapping the point cloud data after data enhancement into an image coordinate system and performing rectangular expansion to obtain an expanded point cloud rectangular surface; as shown in fig. 3, aiming at the problem that the point cloud data of the radar sensor is inaccurate in height information, mapping the point cloud data after data enhancement into an image coordinate system and performing rectangular expansion to obtain an expanded point cloud rectangular surface, and obtaining the size and position information of the expanded point cloud rectangular surface in the plane of the image coordinate system;
s5: performing region association on the first detection result of the target and the expanded point cloud rectangular surface by using a cone association method to obtain association data. As shown in fig. 4, in the cone association method a threshold $\tau_{d}$ is set around the centre of the target's three-dimensional bounding box, and all expanded radar detection targets that fall within the region $[-\tau_{d}, \tau_{d}]$ are associated with the first detection result; the threshold $\tau_{d}$ is computed from the maximum value $z^{3D}_{max}$ and the minimum value $z^{3D}_{min}$ of the three-dimensional bounding box along the depth (z) axis.
If a plurality of radar-detected targets exist within the threshold range, the point cloud rectangular surface with the smallest z-axis coordinate is selected. In addition, because the depth estimate in the first detection result output by the trained first regression network is uncertain (the vision sensor that acquires the image data cannot measure depth directly), a relaxation factor $\delta$ is introduced to enlarge the association region so that more radar points can be associated with the object; the relaxation factor $\delta$ increases the threshold $\tau_{d}$ in proportion to the size of the object.
S6: extracting features from the associated data to obtain radar branch features. Specifically, for each point cloud rectangular surface associated with the first detection result of a target, a corresponding radar branch feature $F_{radar}$ is generated at that position. In the calculation formula, $i=1,2,3$ index the three feature channels of the radar branch feature; $F^{j}_{i}$ denotes the radar branch feature of the $j$-th target on the $i$-th channel; $M_{i}$ denotes the normalization factor of the $i$-th channel; $f_{i}$ denotes the depth or radial-velocity feature value on the $i$-th channel, with $x$ and $y$ denoting the components of the target velocity in the horizontal and vertical directions respectively; $(\hat{c}^{j}_{x}, \hat{c}^{j}_{y})$ denotes the centre point of the $j$-th target; $w_{j}$ and $h_{j}$ denote the width and height of the two-dimensional bounding box of the $j$-th target; and $\gamma$ is a hyperparameter that controls the size of the target's two-dimensional bounding box;
s7: inputting the visual branch characteristics and the radar branch characteristics into a constructed context clustering fusion network to perform characteristic fusion to obtain fusion characteristics; the formula of feature fusion is:
$$F_{fused}=\Phi_{CCN}\left(F_{radar}\oplus F_{camera}\right)$$

where $F_{fused}$ denotes the fusion feature, $\Phi_{CCN}$ denotes the context clustering fusion network, $F_{radar}$ denotes the radar branch feature, $F_{camera}$ denotes the visual branch feature, and $\oplus$ denotes feature concatenation;
The context clustering fusion network can adaptively adjust the cluster centres, effectively capturing the feature correlation information between the different modes to obtain the final fusion features; performing target detection with these fusion features therefore yields more accurate and more robust results. Because the context clustering fusion network uses neither convolution nor a self-attention mechanism, the number of network parameters is greatly reduced, which improves both the training speed and the detection efficiency of the network. The context clustering fusion network consists mainly of two parts, feature aggregation and feature dispatching, and clusters the input features into several clusters; the feature points of each cluster are aggregated and then dispatched back. Context clustering proceeds as follows: given a set of feature points $P$, a linear projection of $P$ yields $P_{s}$, which is used to compute the similarity between features; $c$ cluster centres are set among all features in the feature space, the value of each centre is computed as the average of its $k$ nearest feature points, and the cosine similarity between $P_{s}$ and the set of centre points is computed. Because the similarity computation implicitly reflects both the distance between features and their feature similarity, each feature point is then assigned to its most similar centre, finally producing $c$ clusters and thereby achieving feature aggregation; the aggregated features are redistributed back to the original dimension through a fully connected layer.
The context clustering fusion network has significantly reduced network parameters compared with other fusion networks, such as convolutional neural networks and self-attention networks, which enables the model to have faster target detection speed on low-computing-capability devices;
the context clustering fusion network can adaptively adjust a clustering center, so that feature association information among different modes is effectively captured to obtain final fusion features, and target detection is carried out by using the fusion features, so that a result is more accurate and more robust;
and the context clustering fusion network has better interpretability on the feature fusion result due to the adoption of a simplified clustering algorithm.
S8: the fusion characteristics are input into a trained secondary regression network, and the depth, speed, direction and category of the target are recalculated;
For the depth, speed and direction of the target, the average absolute error is still adopted as the loss function for supervised training;
Aiming at the category, a cross entropy loss function is set for optimization, specifically:

$$L_{BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log\sigma\left(x_{i}\right)+\left(1-y_{i}\right)\log\left(1-\sigma\left(x_{i}\right)\right)\right]$$

where $L_{BCE}$ denotes the cross entropy loss value and $\sigma(x_{i})$ denotes the second prediction result of the $i$-th target;
When the average absolute error loss value and the cross entropy loss value are minimized, the trained secondary regression network is obtained, and the second prediction result of the corresponding $i$-th target is taken as the second detection result of the target;
s9: and inputting the first detection result of the target and the second detection result of the target into a preset boundary box decoder to obtain a final detection result of the target.
In a specific implementation process, the method provided in this embodiment was implemented and verified on the public nuScenes dataset. Figs. 5-7 show the final detection result graphs of two-dimensional targets in the three selected scenes, and figs. 8-10 show the final detection result graphs of three-dimensional targets in the same scenes. As can be seen from figs. 5-10, the method provided in this embodiment can accurately detect both two-dimensional and three-dimensional targets in different scenes.
Example 3
The present embodiment further provides a context clustering and multi-mode fusion-based target detection system, configured to implement the target detection method described in embodiment 1 or 2, as shown in fig. 11, including:
the data acquisition processing module is used for acquiring paired image data and point cloud data, and carrying out data enhancement processing on the image data and the point cloud data to obtain the image data and the point cloud data after data enhancement;
the visual feature extraction module is used for carrying out feature extraction on the image data after the data enhancement to obtain visual branch features;
the first detection module is used for calculating a corresponding heat map and attribute parameters based on the visual branch characteristics to obtain a first detection result of the target;
the point cloud expansion module is used for mapping the point cloud data after the data enhancement into an image coordinate system and carrying out rectangular expansion to obtain an expanded point cloud rectangular surface;
the multi-mode data association module is used for carrying out area association on the first detection result of the target and the expanded point cloud rectangular surface to obtain association data;
the radar feature extraction module is used for extracting features of the associated data to obtain radar branch features;
the feature fusion module is used for carrying out feature fusion on the vision branch features and the radar branch features to obtain fusion features;
the second detection module is used for calculating a second detection result of the target based on the fusion characteristic;
the target detection module is used for inputting the first detection result of the target and the second detection result of the target into a preset decoder to obtain a final detection result of the target.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. A target detection method based on context clustering and multi-mode fusion is characterized by comprising the following steps:
s1: obtaining paired image data and point cloud data, and carrying out data enhancement processing on the image data and the point cloud data to obtain the image data and the point cloud data after data enhancement;
s2: extracting features of the image data after the data enhancement to obtain visual branch features;
s3: based on the visual branch characteristics, calculating a corresponding heat map and attribute parameters to obtain a first detection result of the target;
s4: mapping the point cloud data after data enhancement into an image coordinate system and performing rectangular expansion to obtain an expanded point cloud rectangular surface;
s5: performing region association on the first detection result of the target and the expanded point cloud rectangular surface to obtain association data;
s6: extracting features of the associated data to obtain radar branch features;
s7: performing feature fusion on the visual branch feature and the radar branch feature to obtain fusion features;
s8: calculating a second detection result of the target based on the fusion characteristic;
s9: and inputting the first detection result of the target and the second detection result of the target into a preset decoder to obtain a final detection result of the target.
2. The method for detecting a target based on context clustering and multi-modal fusion according to claim 1, wherein in the step S1, the data enhancement processing performed on the image data and the point cloud data includes random horizontal flipping of data and random movement of data.
3. The method for detecting a target based on context clustering and multi-modal fusion according to claim 1, wherein in the step S2, the image data after data enhancement is input into an existing deep aggregation network model for feature extraction to obtain the visual branch feature $F_{camera}$.
4. The method for detecting a target based on context clustering and multi-modal fusion according to claim 1 or 3, wherein in the step S3, the visual branch feature is input into a trained first regression network, and a heat map and attribute parameters represented by the visual branch feature are calculated to obtain a first detection result of the target; the attribute parameters include target size, offset, three-dimensional size, depth, and direction.
5. The method for detecting a target based on context clustering and multi-modal fusion according to claim 4, wherein the specific method for obtaining the trained first regression network is as follows:
acquiring visual branch characteristics of image training data, inputting the visual branch characteristics into a constructed first regression network, and calculating a corresponding heat map, a target size, an offset, a three-dimensional size, a depth and a direction;
and setting a focal loss function for supervised training of the heat map, specifically:

$$L_{k}=-\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases}$$

where $L_{k}$ denotes the focal loss value, $N$ denotes the number of targets in the image training data, $Y_{xyc}$ denotes the real heat map of the target, $\hat{Y}_{xyc}$ denotes the predicted heat map of the target, and $\alpha$ and $\beta$ denote the first and second hyperparameters of the focal loss;
and setting the average absolute error as a loss function for supervised training of the target size, offset, three-dimensional size, depth and direction, specifically:

$$L_{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|f\left(x_{i}\right)-y_{i}\right|$$

where $L_{MAE}$ denotes the average absolute error loss value, $N$ denotes the number of targets in the image training data, $f(x_{i})$ denotes the first prediction result of the $i$-th target, and $y_{i}$ denotes the real label of the $i$-th target;
and when the average absolute error loss value and the focal loss value are at a minimum, obtaining a trained first regression network.
6. The method for detecting the target based on context clustering and multi-modal fusion according to claim 4, wherein the first detection result of the target and the expanded point cloud rectangular surface are subjected to regional association by using a cone association method to obtain association data.
7. The method for detecting the target based on context clustering and multi-modal fusion according to claim 6, wherein in the step S6, the feature extraction is performed on the associated data, and the specific method for obtaining the radar branch feature is as follows:
for each point cloud rectangular surface associated with the first detection result of a target, generating a corresponding radar branch feature $F_{radar}$ at that position, wherein in the calculation formula $i=1,2,3$ index the three feature channels of the radar branch feature; $F^{j}_{i}$ denotes the radar branch feature of the $j$-th target on the $i$-th channel; $M_{i}$ denotes the normalization factor of the $i$-th channel; $f_{i}$ denotes the depth or radial-velocity feature value on the $i$-th channel, with $x$ and $y$ denoting the components of the target velocity in the horizontal and vertical directions respectively; $(\hat{c}^{j}_{x}, \hat{c}^{j}_{y})$ denotes the centre point of the $j$-th target; $w_{j}$ and $h_{j}$ denote the width and height of the two-dimensional bounding box of the $j$-th target; and $\gamma$ is a hyperparameter that controls the size of the target's two-dimensional bounding box.
8. The method for detecting a target based on context clustering and multi-modal fusion according to claim 7, wherein in the step S7, the visual branch feature and the radar branch feature are input into a constructed context clustering fusion network to perform feature fusion, so as to obtain fusion features; the formula of feature fusion is:
$$F_{fused}=\Phi_{CCN}\left(F_{radar}\oplus F_{camera}\right)$$

where $F_{fused}$ denotes the fusion feature, $\Phi_{CCN}$ denotes the context clustering fusion network, $F_{radar}$ denotes the radar branch feature, $F_{camera}$ denotes the visual branch feature, and $\oplus$ denotes feature concatenation.
9. The method for detecting a target based on context clustering and multi-modal fusion according to claim 8, wherein in step S8, the fusion features are input into a trained secondary regression network, and the depth, speed, direction and class of the target are recalculated;
for the depth, speed and direction of the target, the average absolute error is still adopted as a loss function for supervised training;
for the category, setting a cross entropy loss function to optimize, specifically:
$$L_{BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log\sigma\left(x_{i}\right)+\left(1-y_{i}\right)\log\left(1-\sigma\left(x_{i}\right)\right)\right]$$

where $L_{BCE}$ denotes the cross entropy loss value and $\sigma(x_{i})$ denotes the second prediction result of the $i$-th target;
and when the average absolute error loss value and the cross entropy loss value are minimum, obtaining a trained secondary regression network, and taking a second prediction result of the corresponding ith target as a second detection result of the target.
10. A context clustering and multi-modal fusion-based object detection system for implementing the object detection method according to any one of claims 1 to 9, comprising:
the data acquisition processing module is used for acquiring paired image data and point cloud data, and carrying out data enhancement processing on the image data and the point cloud data to obtain the image data and the point cloud data after data enhancement;
the visual feature extraction module is used for carrying out feature extraction on the image data after the data enhancement to obtain visual branch features;
the first detection module is used for calculating a corresponding heat map and attribute parameters based on the visual branch characteristics to obtain a first detection result of the target;
the point cloud expansion module is used for mapping the point cloud data after the data enhancement into an image coordinate system and carrying out rectangular expansion to obtain an expanded point cloud rectangular surface;
the multi-mode data association module is used for carrying out area association on the first detection result of the target and the expanded point cloud rectangular surface to obtain association data;
the radar feature extraction module is used for extracting features of the associated data to obtain radar branch features;
the feature fusion module is used for carrying out feature fusion on the vision branch features and the radar branch features to obtain fusion features;
the second detection module is used for calculating a second detection result of the target based on the fusion characteristic;
the target detection module is used for inputting the first detection result of the target and the second detection result of the target into a preset decoder to obtain a final detection result of the target.
CN202310660880.4A 2023-06-05 2023-06-05 Object detection method and system based on context clustering and multi-mode fusion Pending CN116894963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310660880.4A CN116894963A (en) 2023-06-05 2023-06-05 Object detection method and system based on context clustering and multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310660880.4A CN116894963A (en) 2023-06-05 2023-06-05 Object detection method and system based on context clustering and multi-mode fusion

Publications (1)

Publication Number Publication Date
CN116894963A true CN116894963A (en) 2023-10-17

Family

ID=88312734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310660880.4A Pending CN116894963A (en) 2023-06-05 2023-06-05 Object detection method and system based on context clustering and multi-mode fusion

Country Status (1)

Country Link
CN (1) CN116894963A (en)

Similar Documents

Publication Publication Date Title
Wei et al. Toward automatic building footprint delineation from aerial images using CNN and regularization
CN112396650B (en) Target ranging system and method based on fusion of image and laser radar
CN107272021B (en) Object detection using radar and visually defined image detection areas
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
Asvadi et al. 3D object tracking using RGB and LIDAR data
CN111201451A (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN113506318B (en) Three-dimensional target perception method under vehicle-mounted edge scene
CN107369158B (en) Indoor scene layout estimation and target area extraction method based on RGB-D image
CN110663060B (en) Method, device, system and vehicle/robot for representing environmental elements
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN105160649A (en) Multi-target tracking method and system based on kernel function unsupervised clustering
CN104331901A (en) TLD-based multi-view target tracking device and method
Jung et al. Object detection and tracking-based camera calibration for normalized human height estimation
CN110992424B (en) Positioning method and system based on binocular vision
CN113281718B (en) 3D multi-target tracking system and method based on laser radar scene flow estimation
CN113255779B (en) Multi-source perception data fusion identification method, system and computer readable storage medium
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
WO2023130842A1 (en) Camera pose determining method and apparatus
Yang et al. On-road vehicle tracking using keypoint-based representation and online co-training
CN116894963A (en) Object detection method and system based on context clustering and multi-mode fusion
Su Vanishing points in road recognition: A review
Dekkiche et al. Vehicles detection in stereo vision based on disparity map segmentation and objects classification
John et al. Sensor fusion and registration of lidar and stereo camera without calibration objects
Dai Semantic Detection of Vehicle Violation Video Based on Computer 3D Vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination