CN114862968A - Attention mechanism-based laser radar and camera automatic calibration method and device - Google Patents

Attention mechanism-based laser radar and camera automatic calibration method and device

Info

Publication number
CN114862968A
Authority
CN
China
Prior art keywords
target
attention
point cloud
laser radar
rgb image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210577412.6A
Other languages
Chinese (zh)
Inventor
李健 (Li Jian)
孙毅 (Sun Yi)
王宇茹 (Wang Yuru)
徐昕 (Xu Xin)
孙振平 (Sun Zhenping)
杨晓慧 (Yang Xiaohui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210577412.6A priority Critical patent/CN114862968A/en
Publication of CN114862968A publication Critical patent/CN114862968A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/48Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S17/00
    • G01S7/497Means for monitoring or calibrating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Abstract

The application relates to an attention-mechanism-based laser radar and camera automatic calibration method and device, computer equipment, and a storage medium. The method comprises the following steps: inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps; calculating a perspective attention map from the scene features, and performing bilinear interpolation sampling on the target feature maps to obtain an initial feature set; performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain a target feature set; performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result; and optimizing the cross-modal target matching result by using a cascaded particle swarm optimization algorithm to obtain the relative pose. By adopting the method, automatic calibration of the laser radar and the camera can be realized.

Description

Attention mechanism-based laser radar and camera automatic calibration method and device
Technical Field
The present application relates to the field of laser radar and camera calibration technologies, and in particular, to an attention-based method and apparatus for automatically calibrating a laser radar and a camera, a computer device, and a storage medium.
Background
The task of calibrating the relative poses of a laser radar and a camera is to compute their relative pose, i.e., the relative rotation matrix and translation matrix, through a calibration algorithm. The most important part of this task is finding corresponding features in the radar data and the camera images; the relative pose is then solved by an optimization algorithm such as EPnP. The problem behind automatic calibration of a laser radar and a camera is a cross-modal matching problem: a human can easily find corresponding scenes, targets, and edges in uncalibrated laser radar-camera data. For example, the manual calibration method relies on human cross-modal matching ability to find corresponding feature associations and thereby realize calibration. This calibration method is time-consuming and labor-intensive, places high requirements on the calibration scene, and cannot perform online calibration. The basis for realizing automatic calibration is to automatically find corresponding features in a natural environment. Because the features of the two modalities (point cloud and RGB image) differ greatly, some calibration algorithms reduce the difficulty of automatically matching radar and camera features by designing calibration objects with specific shapes (such as a circular, square, or diamond calibration plate), but this approach likewise places high requirements on the calibration scene. To reduce the difficulty of searching for corresponding features, some algorithms that do not depend on a specific calibration object require a rough pose prior of the laser radar and the camera to limit the search range for corresponding features (i.e., the fields of view of the laser radar and the camera are first roughly aligned), and then correct the prior pose according to detail information such as edges.
However, existing methods often find corresponding feature associations by designing specific calibration references, such as diamond plates or spherical calibration objects, to assist the algorithm. This is relatively cumbersome, requires real-time human participation, and the calibration object must be carried along. Automatic calibration algorithms have therefore been proposed, but current automatic calibration algorithms still face two problems: algorithms that do not depend on a specific calibration object require initial parameters, while algorithms that depend on a specific calibration object need no human participation but rely excessively on the geometric prior of the specific calibration object; as a result, online calibration cannot be achieved, and cross-modal feature associations cannot be automatically extracted from natural scenes.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device and a storage medium for automatic calibration of a lidar and a camera based on an attention mechanism, which can achieve automatic calibration of the lidar and the camera.
An attention mechanism-based automatic calibration method for a laser radar and a camera, the method comprising:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
carrying out bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of an RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
optimizing the result of cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
In one embodiment, the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

where P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
In one embodiment, calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map includes:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), where S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
In one embodiment, the initial feature set includes the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set includes the following steps:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), where F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), where F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
In one embodiment, performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set includes:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

where X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
In one embodiment, the performing target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result includes:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
In one embodiment, the performing target association attention according to the confidence and the target feature set to obtain a cross-modal target matching result includes:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), where P_R denotes the confidence and T denotes the transpose operation.
An attention mechanism based lidar and camera automatic calibration apparatus, the apparatus comprising:
the encoding module is used for acquiring an RGB image and a laser radar point cloud shot by a camera; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
the overlapped visual angle calculation module is used for calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
the sampling module is used for performing bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of the RGB image to obtain an initial characteristic set;
the graph structure coding module is used for carrying out graph structure coding on the initial feature set by utilizing a pre-trained cross-modal attention target association network to obtain a target feature set;
the cross-modal target matching module is used for performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
the result optimization module is used for optimizing the result of cross-modal target matching by utilizing a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
carrying out bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of an RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
optimizing the result of cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
carrying out bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of an RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
optimizing the result of cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
With the attention-mechanism-based laser radar and camera automatic calibration method and device, computer equipment, and storage medium, the RGB image and the laser radar point cloud are first input into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps corresponding to the RGB image and the laser radar point cloud; the trained cross-modal attention target association network comprises an encoding module and a graph structure encoding module. A perspective attention map is calculated from the scene features of the laser radar point cloud and the scene features of the RGB image, and the overlapping field of view of the camera and the laser radar can be obtained from this map. Bilinear interpolation sampling is then performed on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain an initial feature set, and graph structure encoding is performed on the initial feature set to obtain target features that fuse the context information of each target and its surrounding environment. Finally, target association attention is performed on the perspective attention map and the target feature set to obtain a cross-modal target matching result. The method can automatically find the overlapping field of view of the camera and the laser radar, reduce the interference of targets outside the overlapping field of view, and, in combination with the graph structure encoding module, encode the features of targets in the image and the point cloud according to context information.
Drawings
FIG. 1 is a schematic flow chart illustrating an exemplary method for automatic calibration of a lidar and a camera based on an attention mechanism;
FIG. 2 is a flow diagram of the operation of a pre-trained cross-modal attention target association network in one embodiment;
FIG. 3 is a graphical illustration of calibration results in one embodiment;
FIG. 4 is a diagram of a target feature set in one embodiment;
FIG. 5 is a diagram illustrating results of cross-modal object matching in one embodiment;
FIG. 6 is a block diagram of an embodiment of an automatic calibration apparatus for a lidar and a camera based on an attention mechanism;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an attention-based lidar and camera automatic calibration method, comprising the following steps:
Step 102: acquiring an RGB image captured by a camera and a laser radar point cloud; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps corresponding to the RGB image and the laser radar point cloud.
The pre-trained cross-modal attention target association network comprises a convolutional neural network coding module, a graph structure coding module and a target association attention module.
The convolutional neural network encoding module is used for encoding the RGB image and the laser radar point cloud to obtain scene features and target feature maps.

The graph structure encoding module is used for performing graph structure encoding on the initial features sampled from the target feature maps to obtain target features that fuse the context relationship between each target and the environment information.
The target association attention module is configured to perform target association attention on the perspective attention diagram and the target feature set to obtain a result of cross-modal target matching, and a workflow diagram of the pre-trained cross-modal attention target association network is shown in fig. 2, where ATOP represents the pre-trained cross-modal attention target association network.
Step 104: calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain a perspective attention map.

The perspective attention map represents the overlapping field of view of the RGB image and the laser radar point cloud; matching target regions are then sought within this overlapping field of view.
Step 106: performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain an initial feature set.

The features of each natural target (for example, people and vehicles rather than specially designed calibration objects) in the laser radar point cloud and the RGB image are obtained through interpolation sampling; these features form the initial feature set, which includes the initial features of the laser radar point cloud and the initial features of the RGB image.
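As an illustrative aside, the bilinear sampling X = Bilinear(F, O) can be sketched in PyTorch with grid_sample. The tensor shapes and the helper name sample_target_features below are assumptions for illustration, not the patent's implementation.

```python
# Minimal sketch (assumed shapes) of X = Bilinear(F, O): sample one feature
# vector per detected target center from a dense feature map.
import torch
import torch.nn.functional as F


def sample_target_features(feature_map: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """feature_map: (1, C, H, W); centers: (N, 2) pixel coordinates (x, y).
    Returns an (N, C) tensor of bilinearly interpolated target features."""
    _, _, h, w = feature_map.shape
    grid = centers.float().clone()
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0   # x -> [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0   # y -> [-1, 1]
    grid = grid.view(1, 1, -1, 2)                   # (1, 1, N, 2) sampling grid
    out = F.grid_sample(feature_map, grid, mode='bilinear', align_corners=True)
    return out.squeeze(0).squeeze(1).transpose(0, 1)  # (N, C)
```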
Step 108: performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain a target feature set.

The features in the initial feature set do not fully consider the context relationship between each target and its surrounding environment; the present application therefore encodes the initial feature set through the graph structure encoding module to obtain improved target features that incorporate context information.
Step 110: performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result.

The overlapping field of view of the RGB image and the laser radar point cloud is obtained from the perspective attention map, and the features of each natural target in the two data sources are obtained from the target feature set. Performing target association attention on the two yields the cross-modal target matching result; in this way, cross-modal feature associations are automatically extracted from the natural scene and an association between the laser radar data and the image data is established, with no need for initial parameters or manually designed calibration objects.
Step 112: optimizing the cross-modal target matching result by using a cascaded particle swarm optimization algorithm to obtain the relative pose; the relative pose is the automatic calibration result.

Due to target detection errors, the centers and vertices of objects in the RGB image often deviate considerably from the centers and vertices of the corresponding objects in the laser radar point cloud, and these errors can seriously interfere with the optimization result.
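The text names cascaded particle swarm optimization as the optimizer; a generic two-stage PSO over a 6-DoF pose vector is sketched below. The cost function, search radii, hyperparameters, and the coarse-to-fine cascade are illustrative assumptions, not the patent's exact Point-PSO formulation.

```python
# Minimal sketch of a cascaded particle swarm optimization over a pose vector
# pose = [rx, ry, rz, tx, ty, tz]; `reprojection_cost` is assumed to measure
# the pixel distance between matched lidar targets projected into the image
# and their associated image targets.
import numpy as np


def pso(cost, center, radius, n_particles=64, n_iters=100, w=0.7, c1=1.5, c2=1.5):
    dim = center.size
    rng = np.random.default_rng(0)
    pos = center + rng.uniform(-radius, radius, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([cost(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        costs = np.array([cost(p) for p in pos])
        improved = costs < pbest_cost
        pbest[improved] = pos[improved]
        pbest_cost[improved] = costs[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    return gbest


def cascaded_pso(reprojection_cost):
    # Coarse pass over a wide range, then a refinement pass around its result.
    coarse = pso(reprojection_cost, np.zeros(6), np.array([0.5] * 3 + [1.0] * 3))
    fine = pso(reprojection_cost, coarse, np.array([0.05] * 3 + [0.1] * 3))
    return fine
```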
In the attention-mechanism-based laser radar and camera automatic calibration method, the RGB image and the laser radar point cloud are first input into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps corresponding to the RGB image and the laser radar point cloud; the trained cross-modal attention target association network comprises an encoding module and a graph structure encoding module. A perspective attention map is calculated from the scene features of the laser radar point cloud and the scene features of the RGB image, and the overlapping field of view of the camera and the laser radar can be obtained from this map. Bilinear interpolation sampling is then performed on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain an initial feature set, and graph structure encoding is performed on the initial feature set to obtain target features that fuse the context information of each target and its surrounding environment. Finally, target association attention is performed on the perspective attention map and the target feature set to obtain a cross-modal target matching result. The method can automatically find the overlapping field of view of the camera and the laser radar, reduce the interference of targets outside the overlapping field of view, and, in combination with the graph structure encoding module and the similarity measurement of target features, encode the features of targets in the image and the point cloud according to context information.
In one embodiment, the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

where P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
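A minimal sketch of this loss follows, mirroring the cross-entropy form given above; the epsilon term and tensor API are illustrative assumptions.

```python
# Minimal sketch of the association loss: cross-entropy between the predicted
# association matrix P (m x n) and the manually labeled matrix P_gt.
import torch


def association_loss(P: torch.Tensor, P_gt: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """P, P_gt: (m, n) association probabilities; returns a scalar loss."""
    m, n = P.shape
    return -(P_gt * torch.log(P + eps)).sum() / (m * n)
```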
In one embodiment, calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map includes:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), where S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
In one embodiment, the initial feature set includes the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set includes the following steps:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), where F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), where F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
The pre-trained cross-modal attention target association network comprises two convolutional neural network encoding modules, one per modality; each encoding module outputs two kinds of features: a target feature map {F_I, F_R} and scene features {S_I, S_R}. Define the targets contained in the RGB image and the laser radar point cloud as O_I and O_R, respectively, each target being represented by its center. The initial features of the targets can then be sampled from the feature maps {F_I, F_R} by bilinear interpolation:

X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32)

X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32)

In the above formulas, N denotes the number of targets in the RGB image, M denotes the number of targets in the laser radar point cloud, and 32 denotes the feature dimension. Because the difference between the viewing angles of the RGB image and the laser radar point cloud is large, directly searching the whole laser radar field of view for a three-dimensional target matching an image target means the matching accuracy is often disturbed by targets outside the overlapping field of view. Since the field of view of the laser radar point cloud is far larger than that of the image, the viewing direction corresponding to the image is found from the feature map of the laser radar point cloud by computing the perspective attention map: the features of the laser radar point cloud are divided equally into 64 parts to obtain a local scene description of the laser radar point cloud, S_R ∈ R^(1×32×4×16), and the perspective attention map is computed by:

A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16)

P_R = Bilinear(A, O_R)

The present invention uses the perspective attention map A to force the network to concentrate its matching effort on the overlapping field of view; the training of A is implicit in the training of cross-modal target matching. P_R denotes the confidence that a laser radar point cloud target O_R lies within the overlapping field of view of the RGB image and the laser radar point cloud.
In one embodiment, performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set includes:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

where X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
In one embodiment, the performing target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result includes:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
In one embodiment, the performing target association attention according to the confidence and the target feature set to obtain a cross-modal target matching result includes:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), where P_R denotes the confidence and T denotes the transpose operation.
The initial features in the initial feature set do not fully consider the context relationship. The present invention performs graph structure convolution on the initial features by providing a graph structure encoding module in the cross-modal attention target association network, so that the initial features fuse the context relationship between each target and its surrounding environment to yield the target features, as shown in fig. 4, where the cross marks are the target features. The graph structure encoding module is designed based on the multi-head self-attention mechanism (MHSA). A basic attention computation unit is defined as:

w(X) = softmax( (X·W_Q)(X·W_K)^T / √d ) · (X·W_V)

where (W_Q, W_K, W_V) are three weights to be learned and d is the feature dimension.
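A minimal sketch of the graph structure encoder follows, assuming each attention computation unit w(X) takes the scaled dot-product form above, with k units concatenated and passed through an MLP; the layer sizes are illustrative.

```python
# Minimal sketch of the graph structure encoder: k attention units w(X),
# concatenated (cat) and mixed by a small MLP (mlp), as in the formula above.
import math
import torch
import torch.nn as nn


class AttentionUnit(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)   # W_Q
        self.W_k = nn.Linear(d, d, bias=False)   # W_K
        self.W_v = nn.Linear(d, d, bias=False)   # W_V
        self.d = d

    def forward(self, X: torch.Tensor) -> torch.Tensor:   # X: (num_targets, d)
        scores = self.W_q(X) @ self.W_k(X).transpose(0, 1) / math.sqrt(self.d)
        return torch.softmax(scores, dim=-1) @ self.W_v(X)


class GraphStructureEncoder(nn.Module):
    def __init__(self, d: int = 32, k: int = 4):
        super().__init__()
        self.units = nn.ModuleList(AttentionUnit(d) for _ in range(k))
        self.mlp = nn.Sequential(nn.Linear(k * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([w(X) for w in self.units], dim=-1))
```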
Based on the target features (X_I, X_R) and the perspective attention map A, target association attention is performed, and the cross-modal target matching result is computed as P = softmax((X_R · X_I^T) · P_R). Fig. 5 shows a cross-modal target matching result: the top left of the figure is the perspective attention map, the bottom left is the range projection of the laser point cloud, and the right is the visible-light picture.
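A minimal sketch of this matching step follows; normalizing each laser radar target row over the image targets is an assumption about the softmax axis.

```python
# Minimal sketch of P = softmax((X_R X_I^T) * P_R): similarity between M lidar
# target features and N image target features, weighted by overlap confidence.
import torch


def cross_modal_matching(X_R: torch.Tensor, X_I: torch.Tensor, P_R: torch.Tensor) -> torch.Tensor:
    """X_R: (M, 32) lidar target features; X_I: (N, 32) image target features;
    P_R: (M,) overlap confidences. Returns P: (M, N) association probabilities."""
    sim = X_R @ X_I.transpose(0, 1)      # (M, N) similarity matrix
    sim = sim * P_R.unsqueeze(1)         # weight each lidar-target row
    return torch.softmax(sim, dim=-1)    # row-wise association distribution
```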
In one embodiment, the method of the present invention (ATOP) is verified on the NUDT data set and the KITTI public data set. In the optimization stage, the average rotation errors (RRE) obtained from the initialization are 0.507° on NUDT and 0.260° on KITTI, and the average translation errors (RTE) are 112 mm and 144 mm; after Point-PSO refinement, the errors are reduced to 0.037° and 0.040°, and to 30 mm and 24 mm, respectively.
TABLE 1
(The table is reproduced as an image in the original publication; it tabulates the rotation and translation errors summarized above.)
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, these steps are not performed in a strictly limited order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of their performance is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided an attention-based lidar and camera automatic calibration apparatus, comprising: an encoding module 602, an overlapped view calculation module 604, a sampling module 606, a graph structure encoding module 608, a cross-modal object matching module 610, and a result optimization module 612, wherein:
the encoding module 602 is configured to acquire an RGB image and a lidar point cloud captured by a camera; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
the overlapped view angle calculation module 604 is configured to perform calculation according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain a view angle attention diagram;
the sampling module 606 is configured to perform bilinear interpolation sampling on a target feature map of the laser radar point cloud and a target feature map of the RGB image to obtain an initial feature set;
a graph structure encoding module 608, configured to perform graph structure encoding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
a cross-modal target matching module 610, configured to perform target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result;
a result optimization module 612, configured to optimize a cross-modal target matching result by using a cascaded particle swarm optimization algorithm to obtain a relative pose; the relative attitude is an automatic calibration result.
In one embodiment, the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

where P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
In one embodiment, the overlapped view angle calculation module 604 is further configured to calculate according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map, including:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), where S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
In one embodiment, the initial feature set used by the sampling module 606 includes the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set includes the following steps:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), where F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), where F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
In one embodiment, the graph structure encoding module 608 is further configured to perform graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set, including:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

where X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
In one embodiment, the cross-modal target matching module 610 is further configured to perform target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result, including:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
In one embodiment, the cross-modal target matching module 610 is further configured to perform target association attention according to the confidence and the target feature set, and obtain a result of cross-modal target matching, including:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), where P_R denotes the confidence and T denotes the transpose operation.
For specific limitations of the laser radar and camera automatic calibration device based on the attention mechanism, reference may be made to the above limitations of a laser radar and camera automatic calibration method based on the attention mechanism, which are not described herein again. The various modules in the attention-based lidar and camera autocalibration apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for automatic calibration of a lidar and a camera based on an attention mechanism. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An attention mechanism-based automatic calibration method for a laser radar and a camera is characterized by comprising the following steps:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a perspective attention map;
carrying out bilinear interpolation sampling on the target characteristic graph of the laser radar point cloud and the target characteristic graph of the RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result;
optimizing the result of the cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
2. The method of claim 1, wherein the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

wherein P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
3. The method of claim 1, wherein calculating from the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map comprises:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), wherein S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
4. The method of any one of claims 1 to 3, wherein the initial feature set comprises the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set comprises:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), wherein F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), wherein F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
5. The method according to claim 4, wherein performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set comprises:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

wherein X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
6. The method of claim 5, wherein performing target association attention on the perspective attention map and a target feature set to obtain a cross-modal target matching result comprises:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
7. The method of claim 6, wherein performing target association attention according to the confidence and the target feature set to obtain a cross-modal target matching result comprises:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), wherein P_R denotes the confidence and T denotes the transpose operation.
8. An attention mechanism-based automatic laser radar and camera calibration device, which is characterized by comprising:
the encoding module is used for acquiring an RGB image and a laser radar point cloud shot by a camera; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
the overlapped visual angle calculation module is used for calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a perspective attention map;
the sampling module is used for carrying out bilinear interpolation sampling on the target characteristic graph of the laser radar point cloud and the target characteristic graph of the RGB image to obtain an initial characteristic set;
the graph structure coding module is used for carrying out graph structure coding on the initial feature set by utilizing the pre-trained cross-modal attention target association network to obtain a target feature set;
the cross-modal target matching module is used for performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result;
the result optimization module is used for optimizing the result of the cross-modal target matching by utilizing a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210577412.6A 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device Pending CN114862968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210577412.6A CN114862968A (en) 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210577412.6A CN114862968A (en) 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device

Publications (1)

Publication Number Publication Date
CN114862968A true CN114862968A (en) 2022-08-05

Family

ID=82640135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210577412.6A Pending CN114862968A (en) 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device

Country Status (1)

Country Link
CN (1) CN114862968A (en)

Similar Documents

Publication Publication Date Title
US11145078B2 (en) Depth information determining method and related apparatus
CN110517278B (en) Image segmentation and training method and device of image segmentation network and computer equipment
CN108805898B (en) Video image processing method and device
CN110070564B (en) Feature point matching method, device, equipment and storage medium
US11651552B2 (en) Systems and methods for fine adjustment of roof models
CN109191554B (en) Super-resolution image reconstruction method, device, terminal and storage medium
CN115655262B (en) Deep learning perception-based multi-level semantic map construction method and device
US20220270323A1 (en) Computer Vision Systems and Methods for Supplying Missing Point Data in Point Clouds Derived from Stereoscopic Image Pairs
CN114937125B (en) Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
CN112287730A (en) Gesture recognition method, device, system, storage medium and equipment
CN113378897A (en) Neural network-based remote sensing image classification method, computing device and storage medium
CN112733641A (en) Object size measuring method, device, equipment and storage medium
Lentsch et al. Slicematch: Geometry-guided aggregation for cross-view pose estimation
CN112258565A (en) Image processing method and device
CN117542122A (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
US20230350418A1 (en) Position determination by means of neural networks
CN116630442B (en) Visual SLAM pose estimation precision evaluation method and device
CN111652245B (en) Vehicle contour detection method, device, computer equipment and storage medium
CN117132649A (en) Ship video positioning method and device for artificial intelligent Beidou satellite navigation fusion
CN114202554A (en) Mark generation method, model training method, mark generation device, model training device, mark method, mark device, storage medium and equipment
CN111721283B (en) Precision detection method and device for positioning algorithm, computer equipment and storage medium
CN114067371B (en) Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN113673288A (en) Idle parking space detection method and device, computer equipment and storage medium
CN114862968A (en) Attention mechanism-based laser radar and camera automatic calibration method and device
CN112184766B (en) Object tracking method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination