CN116823929A - Cross-modal matching positioning method and system based on visual image and point cloud map - Google Patents

Cross-modal matching positioning method and system based on visual image and point cloud map

Info

Publication number
CN116823929A
CN116823929A CN202310588600.3A CN202310588600A
Authority
CN
China
Prior art keywords
pose
point cloud
map
visual image
positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310588600.3A
Other languages
Chinese (zh)
Inventor
江昆
杨殿阁
苗津毓
刘茂林
王云龙
杨彦鼎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310588600.3A priority Critical patent/CN116823929A/en
Publication of CN116823929A publication Critical patent/CN116823929A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal matching positioning method and system based on a visual image and a point cloud map, wherein the method comprises the following steps: acquiring a visual image and an initial pose of a camera, and obtaining a laser point cloud local map based on the initial pose of the visual image; projecting based on the laser point cloud local map to obtain a point cloud projection depth map at the initial-pose view angle; inputting the point cloud projection depth map, the visual image and an initialized pose update amount into a trained pose-solving network to obtain an optimized pose update amount; and superimposing the pose update amount on the initial pose of the visual image to obtain the optimized pose, with the final pose obtained after multiple cycles to complete positioning. The invention addresses the problems of low positioning accuracy and high cost in existing autonomous driving.

Description

Cross-modal matching positioning method and system based on visual image and point cloud map
Technical Field
The invention relates to the technical field of visual positioning, in particular to a cross-mode matching positioning method and system based on a visual image and a point cloud map.
Background
High-level autonomous driving tasks require high-precision position and attitude (hereinafter referred to as pose) information, and therefore require intelligent vehicles to be capable of high-precision localization. In high-level autonomous driving tasks, GPS signals are easily interfered with and offer poor accuracy, so intelligent vehicles generally rely on a high-precision environment map for map matching and positioning.
In high-precision map construction, the mainstream approach uses a lidar to survey the environment. Positioning algorithms that take a laser point cloud as input generally rely on registration of point cloud geometry: by optimizing the current vehicle pose, the currently acquired three-dimensional laser point cloud is aligned as closely as possible with the point cloud map of the surrounding area, so that the most probable vehicle pose is estimated and positioning is achieved. Laser map matching and positioning algorithms are insensitive to interference that changes environmental appearance, such as weather and seasons, and thus offer better robustness and higher accuracy. However, high-performance lidars are expensive and difficult to deploy at scale; they are suitable only for a small number of dedicated mapping vehicles that construct high-precision maps, not for household and commercial vehicles performing localization. Moreover, positioning algorithms based on laser point cloud matching, such as the classical ICP and NDT algorithms, are sensitive to the initial pose, their pose optimization is difficult to converge, and they are easily disturbed in challenging environments.
Another low-cost solution is to use monocular camera data for mapping and positioning. Visual mapping schemes often use algorithms such as visual simultaneous localization and mapping (SLAM), visual odometry (VO), or structure from motion (SfM) to construct a three-dimensional point cloud map of the environment; unlike a laser point cloud map, a visual point cloud map may additionally contain visual feature information. Visual map matching and positioning algorithms localize by matching features in the currently acquired image with three-dimensional landmark points in the visual point cloud map: the current vehicle pose is optimized so that the error between the image feature points and the reprojections of their matched three-dimensional landmark points is as small as possible, and the most probable vehicle pose is estimated to achieve positioning. Visual map matching and positioning is low-cost and easy to deploy widely, but the accuracy of a visual map is relatively poor and susceptible to changes in environmental appearance, which is hard to tolerate in large-scale, changeable autonomous driving scenes.
Disclosure of Invention
The invention provides a cross-modal matching positioning method and system based on a visual image and a point cloud map, which are used to solve the problems of low positioning accuracy and high cost in existing autonomous driving.
The invention provides a cross-mode matching positioning method based on a visual image and a point cloud map, which comprises the following steps:
acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, which is provided by the invention, the visual image and the initial pose of the camera are obtained, and the laser point cloud local map is obtained based on the initial pose of the visual image, and the method specifically comprises the following steps:
acquiring a visual image of a camera and rough positioning of an initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, which is provided by the invention, projection is performed based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle, and the method specifically comprises the following steps:
according to the initial pose of the camera and the internal parameters of the camera, projecting a laser point cloud local map under a world coordinate system to the camera coordinate system;
and re-projecting the laser point cloud local map under the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map under the initial pose view angle of the camera.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, the point cloud projection depth map, the visual image and the initialized pose updating quantity are input into a trained pose solving network to obtain the optimized pose updating quantity, and the method specifically comprises the following steps:
the pose solving network adopts a full-attention network, and a visual image and a point cloud projection depth map are input to the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, the pose solving network training process comprises the following steps:
giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random pose transformation to the laser point cloud map to obtain a noise-added laser point cloud;
transforming and projecting the noise-added laser point cloud through the intrinsic and extrinsic parameters of the camera to obtain a new point cloud projection depth map;
inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
and supervising the relative pose so that it approaches the known ground-truth relative pose, thereby optimizing the parameters of the pose-solving network.
According to the cross-modal matching positioning method based on the visual image and the point cloud map provided by the invention, the pose update amount is superimposed on the initial pose of the visual image to obtain the optimized pose, and the final pose is obtained through multiple cycles to complete positioning, which specifically comprises the following steps:
in the first pose optimization process, using the rough positioning as the initial pose;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
The invention also provides a cross-modal matching and positioning system based on the visual image and the point cloud map, which comprises the following steps:
the data acquisition module is used for acquiring a visual image and an initial pose of the camera and acquiring a laser point cloud local map based on the initial pose of the visual image;
the projection module is used for projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
the pose solving module is used for inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and the pose optimization module is used for superposing the pose updating quantity to the initial pose of the visual image to obtain an optimized pose, and obtaining a final pose after multiple times of circulation to finish positioning.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the data acquisition module acquires the visual image of the camera and the rough positioning of the initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the projection module projects the laser point cloud local map under the world coordinate system to the camera coordinate system according to the initial pose of the camera and the internal parameters of the camera;
and re-projecting the laser point cloud local map under the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map under the initial pose view angle of the camera.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the pose solving module adopts a full-attention network according to the pose solving network, and inputs the visual image and the point cloud projection depth map into the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the pose optimization module takes the rough positioning as the initial pose in the first pose optimization process;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
The invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the cross-mode matching positioning method based on the visual image and the point cloud map when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a cross-modality matching localization method based on a visual image and a point cloud map as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a cross-modal matching localization method based on a visual image and a point cloud map as described in any one of the above.
According to the cross-modal matching positioning method and system based on the visual image and the point cloud map, the visual image and the point cloud projection depth map are input into the pose-solving network, and the final pose is obtained through multiple rounds of optimization. In the positioning stage, only a low-cost visual camera is needed as the sensor, so the cost is low and the method is better suited to large-scale commercial use. Compared with traditional laser point cloud positioning algorithms such as ICP and NDT, the method iteratively optimizes the pose in the positioning stage: it repeatedly retrieves the point cloud projection depth map from the laser point cloud map, refines the pose, and solves the pose with an end-to-end trained neural network, so the optimization process is smoother, insensitive to the initial pose, and better suited to extreme scenes with large GPS signal errors. In the mapping stage, a lidar is used for mapping; the point cloud map is more accurate and insensitive to changes in environmental appearance, and since the optimized laser point cloud map contains no point cloud features, the storage space required for the map is smaller. In addition, since the laser point cloud map is more accurate than a visual point cloud map, the positioning accuracy is higher and can meet the accuracy requirements of autonomous vehicles for the positioning function.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is one of the flow diagrams of the cross-modal matching positioning method based on the visual image and the point cloud map provided by the invention;
FIG. 2 is a second schematic flow chart of a cross-modal matching positioning method based on a visual image and a point cloud map;
FIG. 3 is a third flow chart of the cross-modal matching localization method based on the visual image and the point cloud map provided by the invention;
FIG. 4 is a fourth schematic flow chart of a cross-modal matching positioning method based on a visual image and a point cloud map;
FIG. 5 is a fifth flow chart of the cross-modal matching localization method based on the visual image and the point cloud map provided by the invention;
FIG. 6 is a schematic diagram of the module connection of the cross-modal matching localization system based on visual images and point cloud maps provided by the invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention;
FIG. 8 is a diagram of an optimized positioning framework for multiple cycles provided by the present invention;
fig. 9 is a schematic diagram of a pose solving network structure provided by the invention.
Reference numerals:
110: a data acquisition module; 120: a projection module; 130: the pose solving module; 140: the pose optimization module;
710: a processor; 720: a communication interface; 730: a memory; 740: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention relates to a cross-modal matching positioning method based on a visual image and a point cloud map, which is described below with reference to fig. 1-5, and comprises the following steps:
s100, acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
s200, projecting based on the laser point cloud local map to obtain a point cloud projection depth map under the initial pose view angle;
s300, inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and S400, superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
According to the invention, camera data is used to perform visual matching positioning within a laser point cloud map, which addresses both the accuracy and the cost problem. The laser point cloud map is highly accurate; it is constructed by dedicated mapping vehicles, and its one-time cost is acceptable. Mass-produced visual cameras are inexpensive and suitable for large-scale commercial use. By matching camera data of a different modality with laser point cloud data, the invention achieves low-cost, high-precision positioning.
The method comprises the steps of obtaining a visual image and an initial pose of a camera, and obtaining a laser point cloud local map based on the initial pose of the visual image, and specifically comprises the following steps:
s101, acquiring a visual image of a camera and rough positioning of an initial pose;
s102, searching in a pre-acquired laser point cloud map based on the rough positioning;
s103, generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
In the invention, the laser point cloud map is constructed offline, and the invention does not restrict the construction method of the laser point cloud map. The system takes as input a visual image $I$ acquired when the vehicle needs to be positioned, together with its GPS positioning signal, and outputs the current pose of the vehicle. The system uses the camera initial pose $Q_k$ as a rough positioning and searches the laser point cloud map to obtain the laser point cloud local map $P^w$ near $Q_k$, where the superscript $w$ indicates that $P^w$ is expressed in the world coordinate system. Positioning accuracy can be improved through the laser point cloud map.
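For illustration, the local-map retrieval step can be sketched as a radius query around the rough GPS position; the 50 m radius, the KD-tree index, and all function names below are assumptions of this sketch rather than details fixed by the invention:

```python
import numpy as np
from scipy.spatial import cKDTree

def query_local_map(map_points_w, t_init_w, radius=50.0):
    """Retrieve the laser point cloud local map P^w near the initial pose.

    map_points_w : (N, 3) global laser point cloud map in the world frame
    t_init_w     : (3,)   translation of the rough initial pose Q_k
    radius       : search radius in meters (illustrative choice)
    """
    tree = cKDTree(map_points_w)                    # spatial index over the map
    idx = tree.query_ball_point(t_init_w, r=radius)
    return map_points_w[idx]                        # local map in world coordinates
```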
Projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle, wherein the method specifically comprises the following steps of:
s201, projecting a laser point cloud local map under a world coordinate system to a camera coordinate system according to an initial pose of the camera and internal parameters of the camera;
s202, the laser point cloud local map under the camera coordinate system is projected to a normalized pixel coordinate system, and a point cloud projection depth map under the initial pose view angle of the camera is obtained.
In the invention, according to the initial pose $Q_k$ of the camera and the camera intrinsic matrix $K$, the laser point cloud local map in the world coordinate system is transformed into the camera coordinate system $c$ and then projected into the normalized pixel coordinate system, yielding a point cloud projection depth map $D$ with the same resolution as the visual image $I$. For each map point $p^w$ the calculation is:

$$p^c = (X, Y, Z)^T = Q_k^{-1} p^w \qquad (1)$$

$$(x, y, 1)^T = \tfrac{1}{Z} K p^c \qquad (2)$$

$$D(y, x) = Z \qquad (3)$$
The invention does not restrict point cloud preprocessing operations (such as outlier rejection and motion compensation) during the point cloud map projection process.
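A minimal sketch of the projection of equations (1)-(3), assuming an ideal pinhole camera and no point cloud preprocessing; the z-buffering strategy and the function names are illustrative choices, not prescribed by the invention:

```python
import numpy as np

def project_depth_map(points_w, T_cw, K, h, w):
    """Project the local map P^w to a depth map D at the initial-pose view,
    following Eqs. (1)-(3).

    points_w : (N, 3) laser point cloud local map in the world frame
    T_cw     : (4, 4) world-to-camera transform derived from the pose Q_k
    K        : (3, 3) camera intrinsic matrix
    h, w     : target resolution, matching the visual image I
    """
    # Eq. (1): transform map points into the camera coordinate system
    pts_c = (T_cw[:3, :3] @ points_w.T + T_cw[:3, 3:4]).T
    pts_c = pts_c[pts_c[:, 2] > 0]                 # keep points in front of the camera
    # Eq. (2): perspective projection to (normalized) pixel coordinates
    uv = (K @ pts_c.T).T
    x = (uv[:, 0] / uv[:, 2]).astype(int)
    y = (uv[:, 1] / uv[:, 2]).astype(int)
    z = pts_c[:, 2]
    # Eq. (3): D(y, x) = Z, keeping the nearest point per pixel (z-buffer)
    D = np.zeros((h, w))
    inside = (x >= 0) & (x < w) & (y >= 0) & (y < h)
    order = np.argsort(-z[inside])                 # write far points first ...
    xi, yi, zi = x[inside][order], y[inside][order], z[inside][order]
    D[yi, xi] = zi                                 # ... so near points overwrite them
    return D
```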
Inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity, wherein the method specifically comprises the following steps of:
s301, the pose solving network adopts a full-attention network, and a visual image and a point cloud projection depth map are input to the full-attention network;
s302, the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
s303, processing the point cloud projection depth map by a point cloud feature encoder to obtain a Gao Weidian cloud feature map;
s304, calculating the similarity of the visual features and each point cloud feature in the Gao Weidian cloud feature map by taking each visual feature in the high-dimensional visual feature map as a reference, wherein the high-dimensional visual features and the Gao Weidian cloud features belong to features of different modes, so as to obtain three-dimensional feature matching cost;
s305, using the initialized pose updating quantity as a search value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the search value, and generating the optimized pose updating quantity.
In the invention, the pose-solving network is a network with two input branches and a single output branch, and adopts an attention mechanism (a full-attention network, i.e., a Transformer) to find matching relations between data of different modalities and to solve for the pose. The two input branches take the visual image $I$ and the point cloud projection depth map $D$ as inputs, which pass through a visual feature encoder and a point cloud feature encoder respectively, yielding a high-dimensional visual feature map $F_I \in \mathbb{R}^{h \times w \times c}$ and a high-dimensional point cloud feature map $F_D \in \mathbb{R}^{h \times w \times c}$ of different modalities, where $h$ and $w$ are respectively the height and width of the feature maps and $c$ is the feature dimension.
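Purely as an illustration of the two input branches, the feature encoders might be sketched as small convolutional backbones; the invention does not fix the encoder architecture, so the layer counts and channel widths below are assumptions:

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Encoder template for one input branch: maps an image-like tensor to
    an h x w feature map with c channels (two strided convolutions here)."""
    def __init__(self, in_ch, c=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, c, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)                 # (B, c, h, w)

visual_encoder = FeatureEncoder(in_ch=3)   # visual image I  -> F_I
depth_encoder  = FeatureEncoder(in_ch=1)   # depth map D     -> F_D
```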
Taking each visual feature $f_I(i, j)$ in the high-dimensional visual feature map as a reference, the similarity between that visual feature and each point cloud feature $f_D(u, v)$ in the high-dimensional point cloud feature map is calculated, yielding the three-dimensional feature matching cost $C \in \mathbb{R}^{h \times w \times hw}$. After the three-dimensional feature matching cost $C$ is obtained, the previous pose update amount $\Delta Q_i$ is used as the retrieval value $q$; the feature matching cost is processed to obtain the key value $k$ and the content value $v$, and the retrieval value $q$ is updated by a decoder model based on the attention mechanism to obtain the optimized pose update amount $\Delta Q_{i+1}$.
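A sketch of the three-dimensional feature matching cost under these definitions; the dot-product similarity and the 1/sqrt(c) scaling are illustrative assumptions, since the patent does not fix the similarity measure:

```python
import torch

def matching_cost(F_I, F_D):
    """Three-dimensional feature matching cost C: for each visual feature
    f_I(i, j), its similarity against every point cloud feature f_D(u, v).

    F_I, F_D : (B, c, h, w) feature maps from the two encoder branches
    returns  : (B, h, w, h*w) cost volume C
    """
    B, c, h, w = F_I.shape
    fi = F_I.flatten(2).transpose(1, 2)    # (B, h*w, c) visual features
    fd = F_D.flatten(2)                    # (B, c, h*w) point cloud features
    C = torch.bmm(fi, fd) / c ** 0.5       # scaled dot-product similarity
    return C.view(B, h, w, h * w)
```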
The full-attention network can be stacked over multiple layers. The first layer randomly initializes the input pose update amount $\Delta Q_0$; each subsequent layer (the $(i+1)$-th layer) takes the pose update amount $\Delta Q_i$ output by the previous layer as input and iteratively optimizes it to obtain $\Delta Q_{i+1}$. This process mimics the iterative optimization strategy of the classical Gauss-Newton method; it is more stable and more easily reaches the optimal result.
In the full-attention network, a linear encoding function $\mathrm{Embed}_q(\cdot)$ converts the input pose update amount $\Delta Q$ into a high-dimensional feature vector used as the retrieval value $q$:

$$q = \mathrm{Embed}_q(\Delta Q) \qquad (4)$$

The feature matching cost $C$ is flattened and treated as $h \times w$ feature vectors of dimension $hw$, which likewise pass through the linear encoding functions $\mathrm{Embed}_k(\cdot)$ and $\mathrm{Embed}_v(\cdot)$ to obtain the key value $k$ and the content value $v$:

$$k = \mathrm{Embed}_k(C), \qquad v = \mathrm{Embed}_v(C) \qquad (5, 6)$$
The retrieval value $q$ is updated according to the attention mechanism:

$$q = \mathrm{BN}\big((q^T k)\, v + q\big) \qquad (7)$$

where $\mathrm{BN}(\cdot)$ denotes layer normalization. The invention is not limited to a particular attention mechanism; multi-head attention, self-attention, cross-attention, deformable attention and the like can all be applied without obstacle. Finally, the updated retrieval value $q$ is converted into a pose update amount through a linear decoding function:

$$\Delta Q = \mathrm{Decode}(q) \qquad (8)$$
in the invention, the pose solving network training process is as follows:
s401, giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random posture transformation to the laser point cloud map to obtain a noise-added laser point cloud;
s402, converting and projecting the laser point cloud based on noise addition through the internal parameters and the external parameters of the camera to obtain a new point cloud projection depth map;
s403, inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
s404, supervising the relative pose, enabling the relative pose to approach a known true value relative pose, and optimizing parameters of a pose solving network.
In the cross-modal visual matching positioning system provided by the invention, the pose-solving network needs its parameters trained in a targeted manner. The network requires training data containing visual images and laser point clouds, where the two are different-modality data acquired simultaneously; the intrinsic parameters $K$ of the visual camera and the relative pose $T_{cl}$ between the camera and the lidar are obtained with a sensor intrinsic/extrinsic calibration tool.
During training, given a visual image $I$ and a laser point cloud $P_l$ acquired at a certain moment, the point cloud $P_l$ in the lidar coordinate system is transformed into the camera coordinate system according to the intrinsics $K$ and the extrinsics $T_{cl}$, and then projected into the pixel coordinate system to obtain the point cloud projection depth map $D$ corresponding to the visual image $I$:

$$p^c = T_{cl}\, p^l \qquad (9)$$

$$(x, y, 1)^T = \tfrac{1}{Z} K p^c \qquad (10)$$

$$D(y, x) = Z \qquad (11)$$
At this point the relative pose between $I$ and $D$ should be zero. In an actual positioning scene, however, the point cloud projection depth map $D$ projected from the laser point cloud map at the initial pose generally differs from the pose of $I$. To simulate this situation when training the pose-solving network, the invention applies a random pose transformation $T_{rand}$ to the laser point cloud $P_l$, obtaining the noise-added laser point cloud $P_{l'}$:

$$P_{l'} = T_{rand}\, P_l \qquad (12)$$
$P_{l'}$ is then transformed and projected with the intrinsic and extrinsic parameters to obtain a new point cloud projection depth map $D'$. Inputting $I$ and $D'$ into the pose-solving network yields the relative pose $Q$ between them. Since the true relative pose $T_{rand}$ between $I$ and $D'$ is known, the relative pose is supervised directly so that the predicted relative pose $Q$ approaches the ground-truth relative pose $T_{rand}$, and the parameters $\theta$ of the pose-solving network are thereby optimized:

$$\theta^* = \arg\min_\theta \left( \lVert Q - T_{rand} \rVert \right) \qquad (13)$$
The invention does not restrict the method of computing the difference between the predicted relative pose $Q$ and the true relative pose $T_{rand}$: for example, one may directly take the Euclidean or L1 distance of the transformation matrices, or convert the transformations into translation and rotation quantities and then take the Euclidean or L1 distance.
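A minimal sketch of one training step under equations (12)-(13); random_pose and pose_to_vector are hypothetical helpers, and the L1 loss is just one of the distance choices the invention leaves open:

```python
import torch
import torch.nn.functional as F

def training_step(net, optimizer, image, points_l, K, T_cl, project_fn):
    """One supervised training step: perturb the laser point cloud by a
    random pose T_rand (Eq. 12), re-project it to a depth map D', and
    supervise the predicted relative pose against T_rand (Eq. 13).

    project_fn is assumed to implement Eqs. (9)-(11) (lidar -> camera ->
    pixel coordinates); random_pose and pose_to_vector are hypothetical
    helpers, not part of the patent."""
    T_rand = random_pose()                              # hypothetical: random 4x4 transform
    # Eq. (12): noise-added point cloud P_l' = T_rand P_l
    points_noised = (T_rand[:3, :3] @ points_l.T + T_rand[:3, 3:4]).T
    D_prime = project_fn(points_noised, K, T_cl)        # new depth map D'
    Q = net(image, D_prime)                             # predicted relative pose
    # Eq. (13): drive Q toward the known ground truth T_rand (L1 distance
    # is one of the metric choices the invention leaves open)
    loss = F.l1_loss(Q, pose_to_vector(T_rand))         # hypothetical vectorization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```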
The pose update amount is superimposed on the initial pose of the visual image to obtain the optimized pose, and the final pose is obtained through multiple cycles to complete positioning; this specifically comprises the following steps:
in the first pose optimization process, using the rough positioning as the initial pose;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
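Putting the pieces together, the outer optimization loop described above might be sketched as follows, reusing the earlier sketches; compose_pose and invert_pose are hypothetical pose-algebra helpers:

```python
def localize(image, map_points_w, K, net, pose_init, n_cycles=2):
    """Outer optimization loop: each cycle retrieves the local map at the
    current pose estimate, re-projects it to a depth map, and refines the
    pose with the pose-solving network. Reuses query_local_map and
    project_depth_map from the earlier sketches; compose_pose and
    invert_pose are hypothetical pose-algebra helpers."""
    pose = pose_init                              # first cycle: rough GPS positioning
    h, w = image.shape[-2:]
    for _ in range(n_cycles):
        local_map = query_local_map(map_points_w, pose[:3, 3])
        D = project_depth_map(local_map, invert_pose(pose), K, h, w)
        delta = net(image, D)                     # optimized pose update amount
        pose = compose_pose(pose, delta)          # superimpose update on current pose
    return pose                                   # final pose: positioning complete
```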
In one embodiment, system verification is performed on the public KITTI dataset to validate the effectiveness of the invention. Training is performed on sequences 03, 05, 06, 07, 08 and 09, and testing on sequence 00. The laser point cloud map is constructed using a laser SLAM algorithm. In the test, the pose-solving network is invoked twice, i.e., the pose is optimized over two cycles. The network used comprises a 6-layer full-attention network, i.e., the pose is iteratively optimized six times within the network.
Table 1: Cross-modal matching positioning system results
Referring to Table 1, although the initial relative pose error of the visual image is very large, positioning accuracy improves markedly after a single round of cross-modal matching positioning. Because the point cloud projection depth map is affected by the initial pose, performing map retrieval and projection again using the once-optimized pose to obtain a new point cloud projection depth map, and then matching and positioning again, further improves the result. The final positioning accuracy of the system is high, and since only a monocular camera is required at positioning time, the positioning cost is low.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, the visual image and the point cloud projection depth map are input into the pose-solving network, and the final pose is obtained through multiple rounds of optimization. In the positioning stage, only a low-cost visual camera is needed as the sensor, so the cost is low and the method is better suited to large-scale commercial use. Compared with traditional laser point cloud positioning algorithms such as ICP and NDT, the method iteratively optimizes the pose in the positioning stage: it repeatedly retrieves the point cloud projection depth map from the laser point cloud map, refines the pose, and solves the pose with an end-to-end trained neural network, so the optimization process is smoother, insensitive to the initial pose, and better suited to extreme scenes with large GPS signal errors. In the mapping stage, a lidar is used for mapping; the point cloud map is more accurate and insensitive to changes in environmental appearance, and since the optimized laser point cloud map contains no point cloud features, the storage space required for the map is smaller. In addition, since the laser point cloud map is more accurate than a visual point cloud map, the positioning accuracy is higher and can meet the accuracy requirements of autonomous vehicles for the positioning function.
Referring to fig. 6, 8 and 9, the invention also discloses a cross-modal matching positioning system based on the visual image and the point cloud map, the system comprises:
the data acquisition module 110 is used for acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
the projection module 120 is configured to project a point cloud projection depth map under an initial pose view based on the laser point cloud local map;
the pose solving module 130 is configured to input the point cloud projection depth map, the visual image and the initialized pose updating amount into a trained pose solving network, so as to obtain an optimized pose updating amount;
and the pose optimization module 140 is configured to superimpose the pose update amount on the initial pose of the visual image to obtain the optimized pose, and to obtain the final pose after multiple cycles, completing positioning.
The data acquisition module 110 acquires a visual image of the camera and rough positioning of an initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
The projection module 120 projects the laser point cloud local map under the world coordinate system to the camera coordinate system according to the initial pose of the camera and the internal parameters of the camera;
and re-projects the laser point cloud local map from the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map at the initial-pose view angle of the camera.
The pose solving module 130 adopts a full-attention network, and inputs the visual image and the point cloud projection depth map to the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
The pose solving network training process comprises the following steps:
giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random pose transformation to the laser point cloud map to obtain a noise-added laser point cloud;
transforming and projecting the noise-added laser point cloud through the intrinsic and extrinsic parameters of the camera to obtain a new point cloud projection depth map;
inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
and supervising the relative pose so that it approaches the known ground-truth relative pose, thereby optimizing the parameters of the pose-solving network.
The pose optimization module 140 takes rough positioning as an initial pose in the first pose optimization process;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the visual image and the point cloud projection depth map are input into the pose-solving network, and the final pose is obtained through multiple rounds of optimization. In the positioning stage, only a low-cost visual camera is needed as the sensor, so the cost is low and the system is better suited to large-scale commercial use. Compared with traditional laser point cloud positioning algorithms such as ICP and NDT, the system iteratively optimizes the pose in the positioning stage: it repeatedly retrieves the point cloud projection depth map from the laser point cloud map, refines the pose, and solves the pose with an end-to-end trained neural network, so the optimization process is smoother, insensitive to the initial pose, and better suited to extreme scenes with large GPS signal errors. In the mapping stage, a lidar is used for mapping; the point cloud map is more accurate and insensitive to changes in environmental appearance, and since the optimized laser point cloud map contains no point cloud features, the storage space required for the map is smaller. In addition, since the laser point cloud map is more accurate than a visual point cloud map, the positioning accuracy is higher and can meet the accuracy requirements of autonomous vehicles for the positioning function.
Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a cross-modality matching localization method based on visual images and point cloud maps, the method comprising: acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute a cross-modal matching positioning method based on a visual image and a point cloud map provided by the above methods, and the method includes: acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the cross-modal matching localization method based on a visual image and a point cloud map provided by the above methods, the method comprising: acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The cross-modal matching positioning method based on the visual image and the point cloud map is characterized by comprising the following steps of:
acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
2. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein the obtaining the visual image and the initial pose of the camera, and obtaining the laser point cloud local map based on the initial pose of the visual image, specifically comprises:
acquiring a visual image of a camera and rough positioning of an initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
3. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein the method is characterized by obtaining a point cloud projection depth map under an initial pose view angle by projecting based on the laser point cloud local map, and specifically comprises the following steps:
according to the initial pose of the camera and the internal parameters of the camera, projecting a laser point cloud local map under a world coordinate system to the camera coordinate system;
and re-projecting the laser point cloud local map under the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map under the initial pose view angle of the camera.
4. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein the point cloud projection depth map, the visual image and the initialized pose updating amount are input into a trained pose solving network to obtain the optimized pose updating amount, and specifically comprises the following steps:
the pose solving network adopts a full-attention network, and visual images and a point cloud projection depth map are input into the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
5. The cross-modal matching positioning method based on the visual image and the point cloud map as claimed in claim 4, wherein the pose solving network training process is as follows:
giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random pose transformation to the laser point cloud map to obtain a noise-added laser point cloud;
transforming and projecting the noise-added laser point cloud through the intrinsic and extrinsic parameters of the camera to obtain a new point cloud projection depth map;
inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
and supervising the relative pose so that it approaches the known ground-truth relative pose, thereby optimizing the parameters of the pose-solving network.
6. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein superimposing the pose update amount on the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose through multiple cycles to complete positioning, specifically comprises:
in the first pose optimization process, using the rough positioning as the initial pose;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
7. A cross-modality matching and positioning system based on visual images and point cloud maps, the system comprising:
the data acquisition module is used for acquiring a visual image and an initial pose of the camera and acquiring a laser point cloud local map based on the initial pose of the visual image;
the projection module is used for projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
the pose solving module is used for inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and the pose optimization module is used for superposing the pose updating quantity to the initial pose of the visual image to obtain an optimized pose, and obtaining a final pose after multiple times of circulation to finish positioning.
8. The cross-modal matching and positioning system based on the visual image and the point cloud map according to claim 7, wherein the pose optimization module uses the rough positioning as the initial pose in the first pose optimization process;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the cross-modality matching localization method based on a visual image and a point cloud map as claimed in any one of claims 1 to 6 when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a cross-modality matching localization method based on a visual image and a point cloud map as claimed in any one of claims 1 to 6.
CN202310588600.3A 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map Pending CN116823929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310588600.3A CN116823929A (en) 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310588600.3A CN116823929A (en) 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map

Publications (1)

Publication Number Publication Date
CN116823929A true CN116823929A (en) 2023-09-29

Family

ID=88126664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310588600.3A Pending CN116823929A (en) 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map

Country Status (1)

Country Link
CN (1) CN116823929A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765084A (en) * 2024-02-21 2024-03-26 电子科技大学 Visual positioning method for iterative solution based on dynamic branch prediction
CN117765084B (en) * 2024-02-21 2024-05-03 电子科技大学 Visual positioning method for iterative solution based on dynamic branch prediction

Similar Documents

Publication Publication Date Title
Chang et al. Kimera-multi: a system for distributed multi-robot metric-semantic simultaneous localization and mapping
Fan et al. Learning collision-free space detection from stereo images: Homography matrix brings better data augmentation
Wang et al. 3d lidar and stereo fusion using stereo matching network with conditional cost volume normalization
CN111902826A (en) Positioning, mapping and network training
CN113139996B (en) Point cloud registration method and system based on three-dimensional point cloud geometric feature learning
CN114001733B (en) Map-based consistent efficient visual inertial positioning algorithm
CN112183171A (en) Method and device for establishing beacon map based on visual beacon
CN113538218B (en) Weak pairing image style migration method based on pose self-supervision countermeasure generation network
CN110260866A (en) A kind of robot localization and barrier-avoiding method of view-based access control model sensor
Rhodes et al. LIDAR-based relative navigation of non-cooperative objects using point Cloud Descriptors
CN116823929A (en) Cross-modal matching positioning method and system based on visual image and point cloud map
Liu et al. Plc-vio: Visual–inertial odometry based on point-line constraints
Ishihara et al. Deep radio-visual localization
CN115471748A (en) Monocular vision SLAM method oriented to dynamic environment
Gao et al. Gyro-net: IMU gyroscopes random errors compensation method based on deep learning
Jo et al. Mixture density-PoseNet and its application to monocular camera-based global localization
CN114088103B (en) Method and device for determining vehicle positioning information
CN111833395B (en) Direction-finding system single target positioning method and device based on neural network model
CN112598730A (en) Method for determining the positioning pose of an at least partially automated mobile platform
Wu et al. Self-supervised monocular depth estimation scale recovery using ransac outlier removal
CN113483769A (en) Particle filter based vehicle self-positioning method, system, device and medium
Sun et al. Accurate deep direct geo-localization from ground imagery and phone-grade gps
Duan Visual smart navigation for UAV mission-oriented flight
Cattaneo et al. CMRNext: Camera to LiDAR Matching in the Wild for Localization and Extrinsic Calibration
US20240153139A1 (en) Object pose estimation in the context of neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination