CN114862968A - Attention mechanism-based laser radar and camera automatic calibration method and device - Google Patents

Attention mechanism-based laser radar and camera automatic calibration method and device

Info

Publication number
CN114862968A
Authority
CN
China
Prior art keywords
target
attention
point cloud
laser radar
rgb image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210577412.6A
Other languages
Chinese (zh)
Inventor
李健 (Li Jian)
孙毅 (Sun Yi)
王宇茹 (Wang Yuru)
徐昕 (Xu Xin)
孙振平 (Sun Zhenping)
杨晓慧 (Yang Xiaohui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210577412.6A priority Critical patent/CN114862968A/en
Publication of CN114862968A publication Critical patent/CN114862968A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/48Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S17/00
    • G01S7/497Means for monitoring or calibrating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Abstract

The application relates to an attention-mechanism-based laser radar and camera automatic calibration method and device, computer equipment, and a storage medium. The method comprises the following steps: inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps; calculating a perspective attention map from the scene features, and performing bilinear interpolation sampling on the target feature maps to obtain an initial feature set; performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain a target feature set; performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result; and optimizing the cross-modal target matching result by using a cascaded particle swarm optimization algorithm to obtain the relative pose. By adopting the method, automatic calibration of the laser radar and the camera can be realized.

Description

Attention mechanism-based laser radar and camera automatic calibration method and device
Technical Field
The present application relates to the field of laser radar and camera calibration technologies, and in particular, to an attention-based method and apparatus for automatically calibrating a laser radar and a camera, a computer device, and a storage medium.
Background
The task of calibrating the relative poses of a laser radar and a camera is to compute their relative pose, i.e., the relative rotation matrix and translation matrix, through a calibration algorithm. The most important part of this task is finding corresponding features in the radar data and the camera images; the relative pose is then solved by an optimization algorithm such as EPnP. The problem behind automatic calibration of a laser radar and a camera is a cross-modal matching problem: a human can easily find corresponding scenes, targets, and edges in uncalibrated laser radar-camera data. For example, the manual calibration method relies on human cross-modal matching ability to find corresponding feature associations and thereby realize calibration. This calibration method is time-consuming and labor-intensive, places high requirements on the calibration scene, and cannot perform online calibration. The basis for realizing automatic calibration is to automatically find corresponding features in a natural environment. Because the features of the two modalities (point cloud and RGB image) differ greatly, some calibration algorithms reduce the difficulty of automatically matching radar and camera features by designing calibration objects with specific shapes (such as a circular, square, or diamond calibration plate), but this approach likewise places high requirements on the calibration scene. To reduce the difficulty of searching for corresponding features, some algorithms that do not depend on a specific calibration object require a rough pose prior of the laser radar and the camera to limit the search range for corresponding features (i.e., the fields of view of the laser radar and the camera are first roughly aligned), and then correct the prior pose according to detail information such as edges.
However, existing methods often find corresponding feature associations by designing specific calibration references, such as diamond plates or spherical calibration objects, to assist the algorithm. This is relatively cumbersome, requires real-time human participation, and the calibration object must be carried along. Automatic calibration algorithms have therefore been proposed, but current automatic calibration algorithms still face two problems: algorithms that do not depend on a specific calibration object require initial parameters, while algorithms that depend on a specific calibration object need no human participation but rely excessively on the geometric prior of the specific calibration object; as a result, online calibration cannot be achieved, and cross-modal feature associations cannot be automatically extracted from natural scenes.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device and a storage medium for automatic calibration of a lidar and a camera based on an attention mechanism, which can achieve automatic calibration of the lidar and the camera.
An attention mechanism-based automatic calibration method for a laser radar and a camera, the method comprising:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
carrying out bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of an RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
optimizing the result of cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
In one embodiment, the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

where P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
In one embodiment, calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map includes:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), where S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
In one embodiment, the initial feature set includes the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set includes the following steps:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), where F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), where F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
In one embodiment, performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set includes:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

where X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
In one embodiment, the performing target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result includes:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
In one embodiment, the performing target association attention according to the confidence and the target feature set to obtain a cross-modal target matching result includes:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), where P_R denotes the confidence and T denotes the transpose operation.
An attention mechanism based lidar and camera automatic calibration apparatus, the apparatus comprising:
the encoding module is used for acquiring an RGB image and a laser radar point cloud shot by a camera; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
the overlapped visual angle calculation module is used for calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
the sampling module is used for performing bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of the RGB image to obtain an initial characteristic set;
the graph structure coding module is used for carrying out graph structure coding on the initial feature set by utilizing a pre-trained cross-modal attention target association network to obtain a target feature set;
the cross-modal target matching module is used for performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
the result optimization module is used for optimizing the result of cross-modal target matching by utilizing a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
carrying out bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of an RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
optimizing the result of cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a visual angle attention diagram;
carrying out bilinear interpolation sampling on a target characteristic diagram of the laser radar point cloud and a target characteristic diagram of an RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the visual angle attention diagram and the target feature set to obtain a cross-modal target matching result;
optimizing the result of cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
With the attention-mechanism-based laser radar and camera automatic calibration method and device, computer equipment, and storage medium, the RGB image and the laser radar point cloud are first input into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps corresponding to the RGB image and the laser radar point cloud; the trained cross-modal attention target association network comprises an encoding module and a graph structure encoding module. A perspective attention map is calculated from the scene features of the laser radar point cloud and the scene features of the RGB image, and the overlapping field of view of the camera and the laser radar can be obtained from this map. Bilinear interpolation sampling is then performed on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain an initial feature set, and graph structure encoding is performed on the initial feature set to obtain target features that fuse the context information of each target and its surrounding environment. Finally, target association attention is performed on the perspective attention map and the target feature set to obtain a cross-modal target matching result. The method can automatically find the overlapping field of view of the camera and the laser radar, reduce the interference of targets outside the overlapping field of view, and, in combination with the graph structure encoding module, encode the features of targets in the image and the point cloud according to context information.
Drawings
FIG. 1 is a schematic flow chart illustrating an exemplary method for automatic calibration of a lidar and a camera based on an attention mechanism;
FIG. 2 is a flow diagram of the operation of a pre-trained cross-modal attention target association network in one embodiment;
FIG. 3 is a graphical illustration of calibration results in one embodiment;
FIG. 4 is a diagram of a target feature set in one embodiment;
FIG. 5 is a diagram illustrating results of cross-modal object matching in one embodiment;
FIG. 6 is a block diagram of an embodiment of an automatic calibration apparatus for a lidar and a camera based on an attention mechanism;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an attention-based lidar and camera automatic calibration method, comprising the following steps:
Step 102: acquiring an RGB image captured by a camera and a laser radar point cloud; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps corresponding to the RGB image and the laser radar point cloud.
The pre-trained cross-modal attention target association network comprises a convolutional neural network coding module, a graph structure coding module and a target association attention module.
The convolutional neural network encoding module is used for encoding the RGB image and the laser radar point cloud to obtain scene features and target feature maps.

The graph structure encoding module is used for performing graph structure encoding on the initial features sampled from the target feature maps to obtain target features that fuse the context relationship between each target and the environment information.
The target association attention module is configured to perform target association attention on the perspective attention diagram and the target feature set to obtain a result of cross-modal target matching, and a workflow diagram of the pre-trained cross-modal attention target association network is shown in fig. 2, where ATOP represents the pre-trained cross-modal attention target association network.
Step 104: calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain a perspective attention map.

The perspective attention map represents the overlapping field of view of the RGB image and the laser radar point cloud; matching target regions are then sought within this overlapping field of view.
Step 106: performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain an initial feature set.

The features of each natural target (for example, people and vehicles rather than specially designed calibration objects) in the laser radar point cloud and the RGB image are obtained through interpolation sampling; these features form the initial feature set, which includes the initial features of the laser radar point cloud and the initial features of the RGB image.
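As an illustrative aside, the bilinear sampling X = Bilinear(F, O) can be sketched in PyTorch with grid_sample. The tensor shapes and the helper name sample_target_features below are assumptions for illustration, not the patent's implementation.

```python
# Minimal sketch (assumed shapes) of X = Bilinear(F, O): sample one feature
# vector per detected target center from a dense feature map.
import torch
import torch.nn.functional as F


def sample_target_features(feature_map: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """feature_map: (1, C, H, W); centers: (N, 2) pixel coordinates (x, y).
    Returns an (N, C) tensor of bilinearly interpolated target features."""
    _, _, h, w = feature_map.shape
    grid = centers.float().clone()
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0   # x -> [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0   # y -> [-1, 1]
    grid = grid.view(1, 1, -1, 2)                   # (1, 1, N, 2) sampling grid
    out = F.grid_sample(feature_map, grid, mode='bilinear', align_corners=True)
    return out.squeeze(0).squeeze(1).transpose(0, 1)  # (N, C)
```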
Step 108: performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain a target feature set.

The features in the initial feature set do not fully consider the context relationship between each target and its surrounding environment; the present application therefore encodes the initial feature set through the graph structure encoding module to obtain improved target features that incorporate context information.
Step 110: performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result.

The overlapping field of view of the RGB image and the laser radar point cloud is obtained from the perspective attention map, and the features of each natural target in the two data sources are obtained from the target feature set. Performing target association attention on the two yields the cross-modal target matching result; in this way, cross-modal feature associations are automatically extracted from the natural scene and an association between the laser radar data and the image data is established, with no need for initial parameters or manually designed calibration objects.
Step 112: optimizing the cross-modal target matching result by using a cascaded particle swarm optimization algorithm to obtain the relative pose; the relative pose is the automatic calibration result.

Due to target detection errors, the centers and vertices of objects in the RGB image often deviate considerably from the centers and vertices of the corresponding objects in the laser radar point cloud, and these errors can seriously interfere with the optimization result.
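The text names cascaded particle swarm optimization as the optimizer; a generic two-stage PSO over a 6-DoF pose vector is sketched below. The cost function, search radii, hyperparameters, and the coarse-to-fine cascade are illustrative assumptions, not the patent's exact Point-PSO formulation.

```python
# Minimal sketch of a cascaded particle swarm optimization over a pose vector
# pose = [rx, ry, rz, tx, ty, tz]; `reprojection_cost` is assumed to measure
# the pixel distance between matched lidar targets projected into the image
# and their associated image targets.
import numpy as np


def pso(cost, center, radius, n_particles=64, n_iters=100, w=0.7, c1=1.5, c2=1.5):
    dim = center.size
    rng = np.random.default_rng(0)
    pos = center + rng.uniform(-radius, radius, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([cost(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        costs = np.array([cost(p) for p in pos])
        improved = costs < pbest_cost
        pbest[improved] = pos[improved]
        pbest_cost[improved] = costs[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    return gbest


def cascaded_pso(reprojection_cost):
    # Coarse pass over a wide range, then a refinement pass around its result.
    coarse = pso(reprojection_cost, np.zeros(6), np.array([0.5] * 3 + [1.0] * 3))
    fine = pso(reprojection_cost, coarse, np.array([0.05] * 3 + [0.1] * 3))
    return fine
```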
In the attention-mechanism-based laser radar and camera automatic calibration method, the RGB image and the laser radar point cloud are first input into a pre-trained cross-modal attention target association network for encoding to obtain scene features and target feature maps corresponding to the RGB image and the laser radar point cloud; the trained cross-modal attention target association network comprises an encoding module and a graph structure encoding module. A perspective attention map is calculated from the scene features of the laser radar point cloud and the scene features of the RGB image, and the overlapping field of view of the camera and the laser radar can be obtained from this map. Bilinear interpolation sampling is then performed on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain an initial feature set, and graph structure encoding is performed on the initial feature set to obtain target features that fuse the context information of each target and its surrounding environment. Finally, target association attention is performed on the perspective attention map and the target feature set to obtain a cross-modal target matching result. The method can automatically find the overlapping field of view of the camera and the laser radar, reduce the interference of targets outside the overlapping field of view, and, in combination with the graph structure encoding module and the similarity measurement of target features, encode the features of targets in the image and the point cloud according to context information.
In one embodiment, the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

where P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
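A minimal sketch of this loss follows, mirroring the cross-entropy form given above; the epsilon term and tensor API are illustrative assumptions.

```python
# Minimal sketch of the association loss: cross-entropy between the predicted
# association matrix P (m x n) and the manually labeled matrix P_gt.
import torch


def association_loss(P: torch.Tensor, P_gt: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """P, P_gt: (m, n) association probabilities; returns a scalar loss."""
    m, n = P.shape
    return -(P_gt * torch.log(P + eps)).sum() / (m * n)
```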
In one embodiment, calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map includes:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), where S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
In one embodiment, the initial feature set includes the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set includes the following steps:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), where F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), where F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
The pre-trained cross-modal attention target association network comprises two convolutional neural network encoding modules, one per modality; each encoding module outputs two kinds of features: a target feature map {F_I, F_R} and scene features {S_I, S_R}. Define the targets contained in the RGB image and the laser radar point cloud as O_I and O_R, respectively, each target being represented by its center. The initial features of the targets can then be sampled from the feature maps {F_I, F_R} by bilinear interpolation:

X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32)

X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32)

In the above formulas, N denotes the number of targets in the RGB image, M denotes the number of targets in the laser radar point cloud, and 32 denotes the feature dimension. Because the difference between the viewing angles of the RGB image and the laser radar point cloud is large, directly searching the whole laser radar field of view for a three-dimensional target matching an image target means the matching accuracy is often disturbed by targets outside the overlapping field of view. Since the field of view of the laser radar point cloud is far larger than that of the image, the viewing direction corresponding to the image is found from the feature map of the laser radar point cloud by computing the perspective attention map: the features of the laser radar point cloud are divided equally into 64 parts to obtain a local scene description of the laser radar point cloud, S_R ∈ R^(1×32×4×16), and the perspective attention map is computed by:

A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16)

P_R = Bilinear(A, O_R)

The present invention uses the perspective attention map A to force the network to concentrate its matching effort on the overlapping field of view; the training of A is implicit in the training of cross-modal target matching. P_R denotes the confidence that a laser radar point cloud target O_R lies within the overlapping field of view of the RGB image and the laser radar point cloud.
In one embodiment, performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set includes:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

where X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
In one embodiment, the performing target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result includes:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
In one embodiment, the performing target association attention according to the confidence and the target feature set to obtain a cross-modal target matching result includes:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), where P_R denotes the confidence and T denotes the transpose operation.
The initial features in the initial feature set do not fully consider the context relationship. The present invention performs graph structure convolution on the initial features by providing a graph structure encoding module in the cross-modal attention target association network, so that the initial features fuse the context relationship between each target and its surrounding environment to yield the target features, as shown in fig. 4, where the cross marks are the target features. The graph structure encoding module is designed based on the multi-head self-attention mechanism (MHSA). A basic attention computation unit is defined as:

w(X) = softmax( (X·W_Q)(X·W_K)^T / √d ) · (X·W_V)

where (W_Q, W_K, W_V) are three weights to be learned and d is the feature dimension.
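A minimal sketch of the graph structure encoder follows, assuming each attention computation unit w(X) takes the scaled dot-product form above, with k units concatenated and passed through an MLP; the layer sizes are illustrative.

```python
# Minimal sketch of the graph structure encoder: k attention units w(X),
# concatenated (cat) and mixed by a small MLP (mlp), as in the formula above.
import math
import torch
import torch.nn as nn


class AttentionUnit(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)   # W_Q
        self.W_k = nn.Linear(d, d, bias=False)   # W_K
        self.W_v = nn.Linear(d, d, bias=False)   # W_V
        self.d = d

    def forward(self, X: torch.Tensor) -> torch.Tensor:   # X: (num_targets, d)
        scores = self.W_q(X) @ self.W_k(X).transpose(0, 1) / math.sqrt(self.d)
        return torch.softmax(scores, dim=-1) @ self.W_v(X)


class GraphStructureEncoder(nn.Module):
    def __init__(self, d: int = 32, k: int = 4):
        super().__init__()
        self.units = nn.ModuleList(AttentionUnit(d) for _ in range(k))
        self.mlp = nn.Sequential(nn.Linear(k * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([w(X) for w in self.units], dim=-1))
```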
Based on the target features (X_I, X_R) and the perspective attention map A, target association attention is performed, and the cross-modal target matching result is computed as P = softmax((X_R · X_I^T) · P_R). Fig. 5 shows a cross-modal target matching result: the top left of the figure is the perspective attention map, the bottom left is the range projection of the laser point cloud, and the right is the visible-light picture.
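A minimal sketch of this matching step follows; normalizing each laser radar target row over the image targets is an assumption about the softmax axis.

```python
# Minimal sketch of P = softmax((X_R X_I^T) * P_R): similarity between M lidar
# target features and N image target features, weighted by overlap confidence.
import torch


def cross_modal_matching(X_R: torch.Tensor, X_I: torch.Tensor, P_R: torch.Tensor) -> torch.Tensor:
    """X_R: (M, 32) lidar target features; X_I: (N, 32) image target features;
    P_R: (M,) overlap confidences. Returns P: (M, N) association probabilities."""
    sim = X_R @ X_I.transpose(0, 1)      # (M, N) similarity matrix
    sim = sim * P_R.unsqueeze(1)         # weight each lidar-target row
    return torch.softmax(sim, dim=-1)    # row-wise association distribution
```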
In one embodiment, the method of the present invention (ATOP) is verified on the NUDT data set and the KITTI public data set. In the optimization stage, the average rotation errors (RRE) obtained from the initialization are 0.507° on NUDT and 0.260° on KITTI, and the average translation errors (RTE) are 112 mm and 144 mm; after Point-PSO refinement, the errors are reduced to 0.037° and 0.040°, and to 30 mm and 24 mm, respectively.
TABLE 1
(The table is reproduced as an image in the original publication; it tabulates the rotation and translation errors summarized above.)
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, these steps are not performed in a strictly limited order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of their performance is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided an attention-based lidar and camera automatic calibration apparatus, comprising: an encoding module 602, an overlapped view calculation module 604, a sampling module 606, a graph structure encoding module 608, a cross-modal object matching module 610, and a result optimization module 612, wherein:
the encoding module 602 is configured to acquire an RGB image and a lidar point cloud captured by a camera; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
the overlapped view angle calculation module 604 is configured to perform calculation according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain a view angle attention diagram;
the sampling module 606 is configured to perform bilinear interpolation sampling on a target feature map of the laser radar point cloud and a target feature map of the RGB image to obtain an initial feature set;
a graph structure encoding module 608, configured to perform graph structure encoding on the initial feature set by using a pre-trained cross-modal attention target association network to obtain a target feature set;
a cross-modal target matching module 610, configured to perform target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result;
a result optimization module 612, configured to optimize a cross-modal target matching result by using a cascaded particle swarm optimization algorithm to obtain a relative pose; the relative attitude is an automatic calibration result.
In one embodiment, the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

where P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
In one embodiment, the overlapped view angle calculation module 604 is further configured to calculate according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map, including:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), where S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
In one embodiment, the initial feature set used by the sampling module 606 includes the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set includes the following steps:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), where F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), where F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
In one embodiment, the graph structure encoding module 608 is further configured to perform graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set, including:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

where X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
In one embodiment, the cross-modal target matching module 610 is further configured to perform target association attention on the perspective attention diagram and the target feature set to obtain a cross-modal target matching result, including:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
In one embodiment, the cross-modal target matching module 610 is further configured to perform target association attention according to the confidence and the target feature set, and obtain a result of cross-modal target matching, including:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), where P_R denotes the confidence and T denotes the transpose operation.
For specific limitations of the laser radar and camera automatic calibration device based on the attention mechanism, reference may be made to the above limitations of a laser radar and camera automatic calibration method based on the attention mechanism, which are not described herein again. The various modules in the attention-based lidar and camera autocalibration apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for automatic calibration of a lidar and a camera based on an attention mechanism. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An attention mechanism-based automatic calibration method for a laser radar and a camera is characterized by comprising the following steps:
acquiring an RGB image and a laser radar point cloud shot by a camera;
inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a perspective attention map;
carrying out bilinear interpolation sampling on the target characteristic graph of the laser radar point cloud and the target characteristic graph of the RGB image to obtain an initial characteristic set;
carrying out graph structure coding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain a target feature set;
performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result;
optimizing the result of the cross-modal target matching by using a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
2. The method of claim 1, wherein the loss function of the pre-trained cross-modal attention target association network is

L = -(1/(m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} P_ij^gt · log(P_ij)

wherein P_ij denotes the association probability of the i-th point cloud target and the j-th image target predicted by the model, P_ij^gt denotes the manually labeled probability, m denotes the number of 3D targets in the point cloud, and n denotes the number of targets in the image.
3. The method of claim 1, wherein calculating from the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map comprises:

calculating according to the scene features of the laser radar point cloud and the scene features of the RGB image to obtain the perspective attention map A = softmax(conv(S_R, S_I)), A ∈ R^(1×32×4×16), wherein S_R denotes the scene features of the laser radar point cloud, S_I denotes the scene features of the RGB image, softmax denotes the softmax function, and conv denotes a convolution operation.
4. The method of any one of claims 1 to 3, wherein the initial feature set comprises the initial features of the laser radar point cloud and the initial features of the RGB image;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud and the target feature map of the RGB image to obtain the initial feature set comprises:

performing bilinear interpolation sampling on the target feature map of the RGB image to obtain the initial features of the RGB image, X_I = Bilinear(F_I, O_I), X_I ∈ R^(N×32), wherein F_I denotes the target feature map of the RGB image, N denotes the number of targets in the image, 32 denotes the feature dimension, O_I denotes the targets of the RGB image, and Bilinear denotes the bilinear interpolation operation;

performing bilinear interpolation sampling on the target feature map of the laser radar point cloud to obtain the initial features of the laser radar point cloud, X_R = Bilinear(F_R, O_R), X_R ∈ R^(M×32), wherein F_R denotes the target feature map of the laser radar point cloud, M denotes the number of targets in the point cloud, and O_R denotes the targets of the laser radar point cloud.
5. The method according to claim 4, wherein performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set comprises:

performing graph structure encoding on the initial feature set by using the pre-trained cross-modal attention target association network to obtain the target feature set

Y = mlp(cat(w_1(X), w_2(X), …, w_k(X)))

wherein X denotes the initial feature set, w(X) denotes an attention computation unit, k denotes the number of attention computation units, mlp denotes a multi-layer nonlinear perceptron, and cat denotes feature concatenation.
6. The method of claim 5, wherein performing target association attention on the perspective attention map and a target feature set to obtain a cross-modal target matching result comprises:
calculating, according to the perspective attention map, the confidence that each target of the laser radar point cloud exists in the overlapping field of view of the RGB image and the laser radar point cloud;
and performing target association attention according to the confidence coefficient and the target feature set to obtain a cross-modal target matching result.
7. The method of claim 6, wherein performing target association attention according to the confidence and the target feature set to obtain a cross-modal target matching result comprises:
performing target association attention according to the confidence and the target feature set, the cross-modal target matching result being P = softmax((X_R · X_I^T) · P_R), wherein P_R denotes the confidence and T denotes the transpose operation.
8. An attention mechanism-based automatic laser radar and camera calibration device, which is characterized by comprising:
the encoding module is used for acquiring an RGB image and a laser radar point cloud shot by a camera; inputting the RGB image and the laser radar point cloud into a pre-trained cross-modal attention target association network for coding to obtain a scene characteristic and a target characteristic map corresponding to the RGB image and the laser radar point cloud;
the overlapped visual angle calculation module is used for calculating according to the scene characteristics of the laser radar point cloud and the scene characteristics of the RGB image to obtain a perspective attention map;
the sampling module is used for carrying out bilinear interpolation sampling on the target characteristic graph of the laser radar point cloud and the target characteristic graph of the RGB image to obtain an initial characteristic set;
the graph structure coding module is used for carrying out graph structure coding on the initial feature set by utilizing the pre-trained cross-modal attention target association network to obtain a target feature set;
the cross-modal target matching module is used for performing target association attention on the perspective attention map and the target feature set to obtain a cross-modal target matching result;
the result optimization module is used for optimizing the result of the cross-modal target matching by utilizing a cascade particle swarm optimization algorithm to obtain a relative attitude; the relative attitude is an automatic calibration result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210577412.6A 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device Pending CN114862968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210577412.6A CN114862968A (en) 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210577412.6A CN114862968A (en) 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device

Publications (1)

Publication Number Publication Date
CN114862968A true CN114862968A (en) 2022-08-05

Family

ID=82640135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210577412.6A Pending CN114862968A (en) 2022-05-25 2022-05-25 Attention mechanism-based laser radar and camera automatic calibration method and device

Country Status (1)

Country Link
CN (1) CN114862968A (en)

Similar Documents

Publication Publication Date Title
US11145078B2 (en) Depth information determining method and related apparatus
CN110517278B (en) Image segmentation and training method and device of image segmentation network and computer equipment
CN108805898B (en) Video image processing method and device
CN110070564B (en) Feature point matching method, device, equipment and storage medium
US11651552B2 (en) Systems and methods for fine adjustment of roof models
CN109191554B (en) Super-resolution image reconstruction method, device, terminal and storage medium
CN115655262B (en) Deep learning perception-based multi-level semantic map construction method and device
US20220270323A1 (en) Computer Vision Systems and Methods for Supplying Missing Point Data in Point Clouds Derived from Stereoscopic Image Pairs
CN114937125B (en) Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
CN112287730A (en) Gesture recognition method, device, system, storage medium and equipment
CN113378897A (en) Neural network-based remote sensing image classification method, computing device and storage medium
CN112733641A (en) Object size measuring method, device, equipment and storage medium
Lentsch et al. Slicematch: Geometry-guided aggregation for cross-view pose estimation
CN112258565A (en) Image processing method and device
CN117542122A (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
US20230350418A1 (en) Position determination by means of neural networks
CN116630442B (en) Visual SLAM pose estimation precision evaluation method and device
CN111652245B (en) Vehicle contour detection method, device, computer equipment and storage medium
CN117132649A (en) Ship video positioning method and device for artificial intelligent Beidou satellite navigation fusion
CN114202554A (en) Mark generation method, model training method, mark generation device, model training device, mark method, mark device, storage medium and equipment
CN111721283B (en) Precision detection method and device for positioning algorithm, computer equipment and storage medium
CN114067371B (en) Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN113673288A (en) Idle parking space detection method and device, computer equipment and storage medium
CN114862968A (en) Attention mechanism-based laser radar and camera automatic calibration method and device
CN112184766B (en) Object tracking method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination