CN116823929A - Cross-modal matching positioning method and system based on visual image and point cloud map - Google Patents

Cross-modal matching positioning method and system based on visual image and point cloud map

Info

Publication number
CN116823929A
CN116823929A CN202310588600.3A CN202310588600A
Authority
CN
China
Prior art keywords
pose
point cloud
map
visual image
positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310588600.3A
Other languages
Chinese (zh)
Inventor
江昆
杨殿阁
苗津毓
刘茂林
王云龙
杨彦鼎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310588600.3A priority Critical patent/CN116823929A/en
Publication of CN116823929A publication Critical patent/CN116823929A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal matching positioning method and system based on a visual image and a point cloud map, wherein the method comprises the following steps: acquiring a visual image and an initial pose of a camera, and obtaining a laser point cloud local map based on the initial pose of the visual image; projecting based on the laser point cloud local map to obtain a point cloud projection depth map at the initial-pose view angle; inputting the point cloud projection depth map, the visual image and an initialized pose update amount into a trained pose-solving network to obtain an optimized pose update amount; and superimposing the pose update amount on the initial pose of the visual image to obtain the optimized pose, with the final pose obtained after multiple cycles to complete positioning. The invention addresses the problems of low positioning accuracy and high cost in existing autonomous driving.

Description

Cross-modal matching positioning method and system based on visual image and point cloud map
Technical Field
The invention relates to the technical field of visual positioning, in particular to a cross-mode matching positioning method and system based on a visual image and a point cloud map.
Background
High-level autonomous driving tasks require high-precision position and attitude (hereinafter referred to as pose) information, and therefore require intelligent vehicles to be capable of high-precision localization. In high-level autonomous driving tasks, GPS signals are easily interfered with and offer poor accuracy, so intelligent vehicles generally rely on a high-precision environment map for map matching and positioning.
In high-precision map construction, the mainstream approach uses a lidar to survey the environment. Positioning algorithms that take a laser point cloud as input generally rely on registration of point cloud geometry: by optimizing the current vehicle pose, the currently acquired three-dimensional laser point cloud is aligned as closely as possible with the point cloud map of the surrounding area, so that the most probable vehicle pose is estimated and positioning is achieved. Laser map matching and positioning algorithms are insensitive to interference that changes environmental appearance, such as weather and seasons, and thus offer better robustness and higher accuracy. However, high-performance lidars are expensive and difficult to deploy at scale; they are suitable only for a small number of dedicated mapping vehicles that construct high-precision maps, not for household and commercial vehicles performing localization. Moreover, positioning algorithms based on laser point cloud matching, such as the classical ICP and NDT algorithms, are sensitive to the initial pose, their pose optimization is difficult to converge, and they are easily disturbed in challenging environments.
Another low-cost solution is to use monocular camera data for mapping and positioning. Visual mapping schemes often use algorithms such as visual simultaneous localization and mapping (SLAM), visual odometry (VO), or structure from motion (SfM) to construct a three-dimensional point cloud map of the environment; unlike a laser point cloud map, a visual point cloud map may additionally contain visual feature information. Visual map matching and positioning algorithms localize by matching features in the currently acquired image with three-dimensional landmark points in the visual point cloud map: the current vehicle pose is optimized so that the error between the image feature points and the reprojections of their matched three-dimensional landmark points is as small as possible, and the most probable vehicle pose is estimated to achieve positioning. Visual map matching and positioning is low-cost and easy to deploy widely, but the accuracy of a visual map is relatively poor and susceptible to changes in environmental appearance, which is hard to tolerate in large-scale, changeable autonomous driving scenes.
Disclosure of Invention
The invention provides a cross-modal matching positioning method and system based on a visual image and a point cloud map, which are used to solve the problems of low positioning accuracy and high cost in existing autonomous driving.
The invention provides a cross-mode matching positioning method based on a visual image and a point cloud map, which comprises the following steps:
acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, which is provided by the invention, the visual image and the initial pose of the camera are obtained, and the laser point cloud local map is obtained based on the initial pose of the visual image, and the method specifically comprises the following steps:
acquiring a visual image of a camera and rough positioning of an initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, which is provided by the invention, projection is performed based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle, and the method specifically comprises the following steps:
according to the initial pose of the camera and the internal parameters of the camera, projecting a laser point cloud local map under a world coordinate system to the camera coordinate system;
and re-projecting the laser point cloud local map under the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map under the initial pose view angle of the camera.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, the point cloud projection depth map, the visual image and the initialized pose updating quantity are input into a trained pose solving network to obtain the optimized pose updating quantity, and the method specifically comprises the following steps:
the pose solving network adopts a full-attention network, and a visual image and a point cloud projection depth map are input to the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, the pose solving network training process comprises the following steps:
giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random pose transformation to the laser point cloud map to obtain a noise-added laser point cloud;
transforming and projecting the noise-added laser point cloud through the intrinsic and extrinsic parameters of the camera to obtain a new point cloud projection depth map;
inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
and supervising the relative pose so that it approaches the known ground-truth relative pose, thereby optimizing the parameters of the pose-solving network.
According to the cross-modal matching positioning method based on the visual image and the point cloud map provided by the invention, the pose update amount is superimposed on the initial pose of the visual image to obtain the optimized pose, and the final pose is obtained through multiple cycles to complete positioning, which specifically comprises the following steps:
in the first pose optimization process, using the rough positioning as the initial pose;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
The invention also provides a cross-modal matching and positioning system based on the visual image and the point cloud map, which comprises the following steps:
the data acquisition module is used for acquiring a visual image and an initial pose of the camera and acquiring a laser point cloud local map based on the initial pose of the visual image;
the projection module is used for projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
the pose solving module is used for inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and the pose optimization module is used for superposing the pose updating quantity to the initial pose of the visual image to obtain an optimized pose, and obtaining a final pose after multiple times of circulation to finish positioning.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the data acquisition module acquires the visual image of the camera and the rough positioning of the initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the projection module projects the laser point cloud local map under the world coordinate system to the camera coordinate system according to the initial pose of the camera and the internal parameters of the camera;
and re-projecting the laser point cloud local map under the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map under the initial pose view angle of the camera.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the pose solving module adopts a full-attention network according to the pose solving network, and inputs the visual image and the point cloud projection depth map into the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the pose optimization module takes the rough positioning as the initial pose in the first pose optimization process;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
The invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the cross-mode matching positioning method based on the visual image and the point cloud map when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a cross-modality matching localization method based on a visual image and a point cloud map as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a cross-modal matching localization method based on a visual image and a point cloud map as described in any one of the above.
According to the cross-modal matching positioning method and system based on the visual image and the point cloud map, the visual image and the point cloud projection depth map are input into the pose-solving network, and the final pose is obtained through multiple rounds of optimization. In the positioning stage, only a low-cost visual camera is needed as the sensor, so the cost is low and the method is better suited to large-scale commercial use. Compared with traditional laser point cloud positioning algorithms such as ICP and NDT, the method iteratively optimizes the pose in the positioning stage: it repeatedly retrieves the point cloud projection depth map from the laser point cloud map, refines the pose, and solves the pose with an end-to-end trained neural network, so the optimization process is smoother, insensitive to the initial pose, and better suited to extreme scenes with large GPS signal errors. In the mapping stage, a lidar is used for mapping; the point cloud map is more accurate and insensitive to changes in environmental appearance, and since the optimized laser point cloud map contains no point cloud features, the storage space required for the map is smaller. In addition, since the laser point cloud map is more accurate than a visual point cloud map, the positioning accuracy is higher and can meet the accuracy requirements of autonomous vehicles for the positioning function.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is one of the flow diagrams of the cross-modal matching positioning method based on the visual image and the point cloud map provided by the invention;
FIG. 2 is a second schematic flow chart of a cross-modal matching positioning method based on a visual image and a point cloud map;
FIG. 3 is a third flow chart of the cross-modal matching localization method based on the visual image and the point cloud map provided by the invention;
FIG. 4 is a fourth schematic flow chart of a cross-modal matching positioning method based on a visual image and a point cloud map;
FIG. 5 is a fifth flow chart of the cross-modal matching localization method based on the visual image and the point cloud map provided by the invention;
FIG. 6 is a schematic diagram of the module connection of the cross-modal matching localization system based on visual images and point cloud maps provided by the invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention;
FIG. 8 is a diagram of an optimized positioning framework for multiple cycles provided by the present invention;
fig. 9 is a schematic diagram of a pose solving network structure provided by the invention.
Reference numerals:
110: a data acquisition module; 120: a projection module; 130: the pose solving module; 140: the pose optimization module;
710: a processor; 720: a communication interface; 730: a memory; 740: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention relates to a cross-modal matching positioning method based on a visual image and a point cloud map, which is described below with reference to fig. 1-5, and comprises the following steps:
s100, acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
s200, projecting based on the laser point cloud local map to obtain a point cloud projection depth map under the initial pose view angle;
s300, inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and S400, superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
According to the invention, camera data is used to perform visual matching positioning within a laser point cloud map, which addresses both the accuracy and the cost problem. The laser point cloud map is highly accurate; it is constructed by dedicated mapping vehicles, and its one-time cost is acceptable. Mass-produced visual cameras are inexpensive and suitable for large-scale commercial use. By matching camera data of a different modality with laser point cloud data, the invention achieves low-cost, high-precision positioning.
The method comprises the steps of obtaining a visual image and an initial pose of a camera, and obtaining a laser point cloud local map based on the initial pose of the visual image, and specifically comprises the following steps:
s101, acquiring a visual image of a camera and rough positioning of an initial pose;
s102, searching in a pre-acquired laser point cloud map based on the rough positioning;
s103, generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
In the invention, the laser point cloud map is constructed offline, and the invention does not restrict the construction method of the laser point cloud map. The system takes as input a visual image $I$ acquired when the vehicle needs to be positioned, together with its GPS positioning signal, and outputs the current pose of the vehicle. The system uses the camera initial pose $Q_k$ as a rough positioning and searches the laser point cloud map to obtain the laser point cloud local map $P^w$ near $Q_k$, where the superscript $w$ indicates that $P^w$ is expressed in the world coordinate system. Positioning accuracy can be improved through the laser point cloud map.
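For illustration, the local-map retrieval step can be sketched as a radius query around the rough GPS position; the 50 m radius, the KD-tree index, and all function names below are assumptions of this sketch rather than details fixed by the invention:

```python
import numpy as np
from scipy.spatial import cKDTree

def query_local_map(map_points_w, t_init_w, radius=50.0):
    """Retrieve the laser point cloud local map P^w near the initial pose.

    map_points_w : (N, 3) global laser point cloud map in the world frame
    t_init_w     : (3,)   translation of the rough initial pose Q_k
    radius       : search radius in meters (illustrative choice)
    """
    tree = cKDTree(map_points_w)                    # spatial index over the map
    idx = tree.query_ball_point(t_init_w, r=radius)
    return map_points_w[idx]                        # local map in world coordinates
```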
Projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle, wherein the method specifically comprises the following steps of:
s201, projecting a laser point cloud local map under a world coordinate system to a camera coordinate system according to an initial pose of the camera and internal parameters of the camera;
s202, the laser point cloud local map under the camera coordinate system is projected to a normalized pixel coordinate system, and a point cloud projection depth map under the initial pose view angle of the camera is obtained.
In the invention, according to the initial pose $Q_k$ of the camera and the camera intrinsic matrix $K$, the laser point cloud local map in the world coordinate system is transformed into the camera coordinate system $c$ and then projected into the normalized pixel coordinate system, yielding a point cloud projection depth map $D$ with the same resolution as the visual image $I$. For each map point $p^w$ the calculation is:

$$p^c = (X, Y, Z)^T = Q_k^{-1} p^w \qquad (1)$$

$$(x, y, 1)^T = \tfrac{1}{Z} K p^c \qquad (2)$$

$$D(y, x) = Z \qquad (3)$$
The invention does not restrict point cloud preprocessing operations (such as outlier rejection and motion compensation) during the point cloud map projection process.
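A minimal sketch of the projection of equations (1)-(3), assuming an ideal pinhole camera and no point cloud preprocessing; the z-buffering strategy and the function names are illustrative choices, not prescribed by the invention:

```python
import numpy as np

def project_depth_map(points_w, T_cw, K, h, w):
    """Project the local map P^w to a depth map D at the initial-pose view,
    following Eqs. (1)-(3).

    points_w : (N, 3) laser point cloud local map in the world frame
    T_cw     : (4, 4) world-to-camera transform derived from the pose Q_k
    K        : (3, 3) camera intrinsic matrix
    h, w     : target resolution, matching the visual image I
    """
    # Eq. (1): transform map points into the camera coordinate system
    pts_c = (T_cw[:3, :3] @ points_w.T + T_cw[:3, 3:4]).T
    pts_c = pts_c[pts_c[:, 2] > 0]                 # keep points in front of the camera
    # Eq. (2): perspective projection to (normalized) pixel coordinates
    uv = (K @ pts_c.T).T
    x = (uv[:, 0] / uv[:, 2]).astype(int)
    y = (uv[:, 1] / uv[:, 2]).astype(int)
    z = pts_c[:, 2]
    # Eq. (3): D(y, x) = Z, keeping the nearest point per pixel (z-buffer)
    D = np.zeros((h, w))
    inside = (x >= 0) & (x < w) & (y >= 0) & (y < h)
    order = np.argsort(-z[inside])                 # write far points first ...
    xi, yi, zi = x[inside][order], y[inside][order], z[inside][order]
    D[yi, xi] = zi                                 # ... so near points overwrite them
    return D
```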
Inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity, wherein the method specifically comprises the following steps of:
s301, the pose solving network adopts a full-attention network, and a visual image and a point cloud projection depth map are input to the full-attention network;
s302, the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
s303, processing the point cloud projection depth map by a point cloud feature encoder to obtain a Gao Weidian cloud feature map;
s304, calculating the similarity of the visual features and each point cloud feature in the Gao Weidian cloud feature map by taking each visual feature in the high-dimensional visual feature map as a reference, wherein the high-dimensional visual features and the Gao Weidian cloud features belong to features of different modes, so as to obtain three-dimensional feature matching cost;
s305, using the initialized pose updating quantity as a search value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the search value, and generating the optimized pose updating quantity.
In the invention, the pose-solving network is a network with two input branches and a single output branch, and adopts an attention mechanism (a full-attention network, i.e., a Transformer) to find matching relations between data of different modalities and to solve for the pose. The two input branches take the visual image $I$ and the point cloud projection depth map $D$ as inputs, which pass through a visual feature encoder and a point cloud feature encoder respectively, yielding a high-dimensional visual feature map $F_I \in \mathbb{R}^{h \times w \times c}$ and a high-dimensional point cloud feature map $F_D \in \mathbb{R}^{h \times w \times c}$ of different modalities, where $h$ and $w$ are respectively the height and width of the feature maps and $c$ is the feature dimension.
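Purely as an illustration of the two input branches, the feature encoders might be sketched as small convolutional backbones; the invention does not fix the encoder architecture, so the layer counts and channel widths below are assumptions:

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Encoder template for one input branch: maps an image-like tensor to
    an h x w feature map with c channels (two strided convolutions here)."""
    def __init__(self, in_ch, c=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, c, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)                 # (B, c, h, w)

visual_encoder = FeatureEncoder(in_ch=3)   # visual image I  -> F_I
depth_encoder  = FeatureEncoder(in_ch=1)   # depth map D     -> F_D
```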
Taking each visual feature $f_I(i, j)$ in the high-dimensional visual feature map as a reference, the similarity between that visual feature and each point cloud feature $f_D(u, v)$ in the high-dimensional point cloud feature map is calculated, yielding the three-dimensional feature matching cost $C \in \mathbb{R}^{h \times w \times hw}$. After the three-dimensional feature matching cost $C$ is obtained, the previous pose update amount $\Delta Q_i$ is used as the retrieval value $q$; the feature matching cost is processed to obtain the key value $k$ and the content value $v$, and the retrieval value $q$ is updated by a decoder model based on the attention mechanism to obtain the optimized pose update amount $\Delta Q_{i+1}$.
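A sketch of the three-dimensional feature matching cost under these definitions; the dot-product similarity and the 1/sqrt(c) scaling are illustrative assumptions, since the patent does not fix the similarity measure:

```python
import torch

def matching_cost(F_I, F_D):
    """Three-dimensional feature matching cost C: for each visual feature
    f_I(i, j), its similarity against every point cloud feature f_D(u, v).

    F_I, F_D : (B, c, h, w) feature maps from the two encoder branches
    returns  : (B, h, w, h*w) cost volume C
    """
    B, c, h, w = F_I.shape
    fi = F_I.flatten(2).transpose(1, 2)    # (B, h*w, c) visual features
    fd = F_D.flatten(2)                    # (B, c, h*w) point cloud features
    C = torch.bmm(fi, fd) / c ** 0.5       # scaled dot-product similarity
    return C.view(B, h, w, h * w)
```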
The full-attention network can be stacked over multiple layers. The first layer randomly initializes the input pose update amount $\Delta Q_0$; each subsequent layer (the $(i+1)$-th layer) takes the pose update amount $\Delta Q_i$ output by the previous layer as input and iteratively optimizes it to obtain $\Delta Q_{i+1}$. This process mimics the iterative optimization strategy of the classical Gauss-Newton method; it is more stable and more easily reaches the optimal result.
In the full-attention network, a linear encoding function $\mathrm{Embed}_q(\cdot)$ converts the input pose update amount $\Delta Q$ into a high-dimensional feature vector used as the retrieval value $q$:

$$q = \mathrm{Embed}_q(\Delta Q) \qquad (4)$$

The feature matching cost $C$ is flattened and treated as $h \times w$ feature vectors of dimension $hw$, which likewise pass through the linear encoding functions $\mathrm{Embed}_k(\cdot)$ and $\mathrm{Embed}_v(\cdot)$ to obtain the key value $k$ and the content value $v$:

$$k = \mathrm{Embed}_k(C), \qquad v = \mathrm{Embed}_v(C) \qquad (5, 6)$$
The retrieval value $q$ is updated according to the attention mechanism:

$$q = \mathrm{BN}\big((q^T k)\, v + q\big) \qquad (7)$$

where $\mathrm{BN}(\cdot)$ denotes layer normalization. The invention is not limited to a particular attention mechanism; multi-head attention, self-attention, cross-attention, deformable attention and the like can all be applied without obstacle. Finally, the updated retrieval value $q$ is converted into a pose update amount through a linear decoding function:

$$\Delta Q = \mathrm{Decode}(q) \qquad (8)$$
in the invention, the pose solving network training process is as follows:
s401, giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random posture transformation to the laser point cloud map to obtain a noise-added laser point cloud;
s402, converting and projecting the laser point cloud based on noise addition through the internal parameters and the external parameters of the camera to obtain a new point cloud projection depth map;
s403, inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
s404, supervising the relative pose, enabling the relative pose to approach a known true value relative pose, and optimizing parameters of a pose solving network.
In the cross-modal visual matching positioning system provided by the invention, the pose-solving network needs its parameters trained in a targeted manner. The network requires training data containing visual images and laser point clouds, where the two are different-modality data acquired simultaneously; the intrinsic parameters $K$ of the visual camera and the relative pose $T_{cl}$ between the camera and the lidar are obtained with a sensor intrinsic/extrinsic calibration tool.
During training, given a visual image $I$ and a laser point cloud $P_l$ acquired at a certain moment, the point cloud $P_l$ in the lidar coordinate system is transformed into the camera coordinate system according to the intrinsics $K$ and the extrinsics $T_{cl}$, and then projected into the pixel coordinate system to obtain the point cloud projection depth map $D$ corresponding to the visual image $I$:

$$p^c = T_{cl}\, p^l \qquad (9)$$

$$(x, y, 1)^T = \tfrac{1}{Z} K p^c \qquad (10)$$

$$D(y, x) = Z \qquad (11)$$
At this point the relative pose between $I$ and $D$ should be zero. In an actual positioning scene, however, the point cloud projection depth map $D$ projected from the laser point cloud map at the initial pose generally differs from the pose of $I$. To simulate this situation when training the pose-solving network, the invention applies a random pose transformation $T_{rand}$ to the laser point cloud $P_l$, obtaining the noise-added laser point cloud $P_{l'}$:

$$P_{l'} = T_{rand}\, P_l \qquad (12)$$
$P_{l'}$ is then transformed and projected with the intrinsic and extrinsic parameters to obtain a new point cloud projection depth map $D'$. Inputting $I$ and $D'$ into the pose-solving network yields the relative pose $Q$ between them. Since the true relative pose $T_{rand}$ between $I$ and $D'$ is known, the relative pose is supervised directly so that the predicted relative pose $Q$ approaches the ground-truth relative pose $T_{rand}$, and the parameters $\theta$ of the pose-solving network are thereby optimized:

$$\theta^* = \arg\min_\theta \left( \lVert Q - T_{rand} \rVert \right) \qquad (13)$$
The invention does not restrict the method of computing the difference between the predicted relative pose $Q$ and the true relative pose $T_{rand}$: for example, one may directly take the Euclidean or L1 distance of the transformation matrices, or convert the transformations into translation and rotation quantities and then take the Euclidean or L1 distance.
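A minimal sketch of one training step under equations (12)-(13); random_pose and pose_to_vector are hypothetical helpers, and the L1 loss is just one of the distance choices the invention leaves open:

```python
import torch
import torch.nn.functional as F

def training_step(net, optimizer, image, points_l, K, T_cl, project_fn):
    """One supervised training step: perturb the laser point cloud by a
    random pose T_rand (Eq. 12), re-project it to a depth map D', and
    supervise the predicted relative pose against T_rand (Eq. 13).

    project_fn is assumed to implement Eqs. (9)-(11) (lidar -> camera ->
    pixel coordinates); random_pose and pose_to_vector are hypothetical
    helpers, not part of the patent."""
    T_rand = random_pose()                              # hypothetical: random 4x4 transform
    # Eq. (12): noise-added point cloud P_l' = T_rand P_l
    points_noised = (T_rand[:3, :3] @ points_l.T + T_rand[:3, 3:4]).T
    D_prime = project_fn(points_noised, K, T_cl)        # new depth map D'
    Q = net(image, D_prime)                             # predicted relative pose
    # Eq. (13): drive Q toward the known ground truth T_rand (L1 distance
    # is one of the metric choices the invention leaves open)
    loss = F.l1_loss(Q, pose_to_vector(T_rand))         # hypothetical vectorization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```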
The pose update amount is superimposed on the initial pose of the visual image to obtain the optimized pose, and the final pose is obtained through multiple cycles to complete positioning; this specifically comprises the following steps:
in the first pose optimization process, using the rough positioning as the initial pose;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
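Putting the pieces together, the outer optimization loop described above might be sketched as follows, reusing the earlier sketches; compose_pose and invert_pose are hypothetical pose-algebra helpers:

```python
def localize(image, map_points_w, K, net, pose_init, n_cycles=2):
    """Outer optimization loop: each cycle retrieves the local map at the
    current pose estimate, re-projects it to a depth map, and refines the
    pose with the pose-solving network. Reuses query_local_map and
    project_depth_map from the earlier sketches; compose_pose and
    invert_pose are hypothetical pose-algebra helpers."""
    pose = pose_init                              # first cycle: rough GPS positioning
    h, w = image.shape[-2:]
    for _ in range(n_cycles):
        local_map = query_local_map(map_points_w, pose[:3, 3])
        D = project_depth_map(local_map, invert_pose(pose), K, h, w)
        delta = net(image, D)                     # optimized pose update amount
        pose = compose_pose(pose, delta)          # superimpose update on current pose
    return pose                                   # final pose: positioning complete
```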
In one embodiment, system verification is performed on the public KITTI dataset to validate the effectiveness of the invention. Training is performed on sequences 03, 05, 06, 07, 08 and 09, and testing on sequence 00. The laser point cloud map is constructed using a laser SLAM algorithm. In the test, the pose-solving network is invoked twice, i.e., the pose is optimized over two cycles. The network used comprises a 6-layer full-attention network, i.e., the pose is iteratively optimized six times within the network.
Table 1: Cross-modal matching positioning system results
Referring to Table 1, although the initial relative pose error of the visual image is very large, positioning accuracy improves markedly after a single round of cross-modal matching positioning. Because the point cloud projection depth map is affected by the initial pose, performing map retrieval and projection again using the once-optimized pose to obtain a new point cloud projection depth map, and then matching and positioning again, further improves the result. The final positioning accuracy of the system is high, and since only a monocular camera is required at positioning time, the positioning cost is low.
According to the cross-modal matching positioning method based on the visual image and the point cloud map, the visual image and the point cloud projection depth map are input into the pose-solving network, and the final pose is obtained through multiple rounds of optimization. In the positioning stage, only a low-cost visual camera is needed as the sensor, so the cost is low and the method is better suited to large-scale commercial use. Compared with traditional laser point cloud positioning algorithms such as ICP and NDT, the method iteratively optimizes the pose in the positioning stage: it repeatedly retrieves the point cloud projection depth map from the laser point cloud map, refines the pose, and solves the pose with an end-to-end trained neural network, so the optimization process is smoother, insensitive to the initial pose, and better suited to extreme scenes with large GPS signal errors. In the mapping stage, a lidar is used for mapping; the point cloud map is more accurate and insensitive to changes in environmental appearance, and since the optimized laser point cloud map contains no point cloud features, the storage space required for the map is smaller. In addition, since the laser point cloud map is more accurate than a visual point cloud map, the positioning accuracy is higher and can meet the accuracy requirements of autonomous vehicles for the positioning function.
Referring to fig. 6, 8 and 9, the invention also discloses a cross-modal matching positioning system based on the visual image and the point cloud map, the system comprises:
the data acquisition module 110 is used for acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
the projection module 120 is configured to project a point cloud projection depth map under an initial pose view based on the laser point cloud local map;
the pose solving module 130 is configured to input the point cloud projection depth map, the visual image and the initialized pose updating amount into a trained pose solving network, so as to obtain an optimized pose updating amount;
and the pose optimization module 140 is configured to superimpose the pose update amount on the initial pose of the visual image to obtain the optimized pose, and to obtain the final pose after multiple cycles, completing positioning.
The data acquisition module 110 acquires a visual image of the camera and rough positioning of an initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
The projection module 120 projects the laser point cloud local map under the world coordinate system to the camera coordinate system according to the initial pose of the camera and the internal parameters of the camera;
and re-projects the laser point cloud local map from the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map at the initial-pose view angle of the camera.
The pose solving module 130 adopts a full-attention network, and inputs the visual image and the point cloud projection depth map to the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
The pose solving network training process comprises the following steps:
giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random pose transformation to the laser point cloud map to obtain a noise-added laser point cloud;
transforming and projecting the noise-added laser point cloud through the intrinsic and extrinsic parameters of the camera to obtain a new point cloud projection depth map;
inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
and supervising the relative pose so that it approaches the known ground-truth relative pose, thereby optimizing the parameters of the pose-solving network.
The pose optimization module 140 takes rough positioning as an initial pose in the first pose optimization process;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
According to the cross-modal matching positioning system based on the visual image and the point cloud map, the visual image and the point cloud projection depth map are input into the pose-solving network, and the final pose is obtained through multiple rounds of optimization. In the positioning stage, only a low-cost visual camera is needed as the sensor, so the cost is low and the system is better suited to large-scale commercial use. Compared with traditional laser point cloud positioning algorithms such as ICP and NDT, the system iteratively optimizes the pose in the positioning stage: it repeatedly retrieves the point cloud projection depth map from the laser point cloud map, refines the pose, and solves the pose with an end-to-end trained neural network, so the optimization process is smoother, insensitive to the initial pose, and better suited to extreme scenes with large GPS signal errors. In the mapping stage, a lidar is used for mapping; the point cloud map is more accurate and insensitive to changes in environmental appearance, and since the optimized laser point cloud map contains no point cloud features, the storage space required for the map is smaller. In addition, since the laser point cloud map is more accurate than a visual point cloud map, the positioning accuracy is higher and can meet the accuracy requirements of autonomous vehicles for the positioning function.
Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a cross-modality matching localization method based on visual images and point cloud maps, the method comprising: acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute a cross-modal matching positioning method based on a visual image and a point cloud map provided by the above methods, and the method includes: acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the cross-modal matching localization method based on a visual image and a point cloud map provided by the above methods, the method comprising: acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The cross-modal matching positioning method based on the visual image and the point cloud map is characterized by comprising the following steps of:
acquiring a visual image and an initial pose of a camera, and acquiring a laser point cloud local map based on the initial pose of the visual image;
projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and superposing the pose updating quantity to the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose after multiple times of circulation to finish positioning.
2. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein the obtaining the visual image and the initial pose of the camera, and obtaining the laser point cloud local map based on the initial pose of the visual image, specifically comprises:
acquiring a visual image of a camera and rough positioning of an initial pose;
searching in a pre-acquired laser point cloud map based on the rough positioning;
and generating a laser point cloud local map in the world coordinate system near the initial pose of the camera.
3. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein the method is characterized by obtaining a point cloud projection depth map under an initial pose view angle by projecting based on the laser point cloud local map, and specifically comprises the following steps:
according to the initial pose of the camera and the internal parameters of the camera, projecting a laser point cloud local map under a world coordinate system to the camera coordinate system;
and re-projecting the laser point cloud local map under the camera coordinate system to a normalized pixel coordinate system to obtain a point cloud projection depth map under the initial pose view angle of the camera.
4. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein the point cloud projection depth map, the visual image and the initialized pose updating amount are input into a trained pose solving network to obtain the optimized pose updating amount, and specifically comprises the following steps:
the pose solving network adopts a full-attention network, and visual images and a point cloud projection depth map are input into the full-attention network;
the visual image is processed by a visual feature encoder to obtain a high-dimensional visual feature map;
the point cloud projection depth map is processed by a point cloud feature encoder to obtain a high-dimensional point cloud feature map;
the high-dimensional visual features and the high-dimensional point cloud features belong to features of different modalities, and the similarity of the visual features and each point cloud feature in the high-dimensional point cloud feature map is calculated by taking each visual feature in the high-dimensional visual feature map as a reference to obtain a three-dimensional feature matching cost;
and using the initialized pose updating quantity as a retrieval value, processing the three-dimensional feature matching cost to obtain a key value and a content value, updating the retrieval value, and generating the optimized pose updating quantity.
5. The cross-modal matching positioning method based on the visual image and the point cloud map as claimed in claim 4, wherein the pose solving network training process is as follows:
giving a visual image and a laser point cloud map acquired at a certain moment, and applying a random pose transformation to the laser point cloud map to obtain a noise-added laser point cloud;
transforming and projecting the noise-added laser point cloud through the intrinsic and extrinsic parameters of the camera to obtain a new point cloud projection depth map;
inputting the visual image and the new point cloud projection depth map into a pose solving network to obtain a relative pose between the visual image and the new point cloud projection depth map;
and supervising the relative pose so that it approaches the known ground-truth relative pose, thereby optimizing the parameters of the pose-solving network.
6. The cross-modal matching positioning method based on the visual image and the point cloud map according to claim 1, wherein superimposing the pose update amount on the initial pose of the visual image to obtain the optimized pose, and obtaining the final pose through multiple cycles to complete positioning, specifically comprises:
in the first pose optimization process, using the rough positioning as the initial pose;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
7. A cross-modality matching and positioning system based on visual images and point cloud maps, the system comprising:
the data acquisition module is used for acquiring a visual image and an initial pose of the camera and acquiring a laser point cloud local map based on the initial pose of the visual image;
the projection module is used for projecting based on the laser point cloud local map to obtain a point cloud projection depth map under an initial pose view angle;
the pose solving module is used for inputting the point cloud projection depth map, the visual image and the initialized pose updating quantity into a trained pose solving network to obtain an optimized pose updating quantity;
and the pose optimization module is used for superposing the pose updating quantity to the initial pose of the visual image to obtain an optimized pose, and obtaining a final pose after multiple times of circulation to finish positioning.
8. The cross-modal matching and positioning system based on the visual image and the point cloud map according to claim 7, wherein the pose optimization module uses the rough positioning as the initial pose in the first pose optimization process;
and in each pose optimization process, taking the pose after the previous optimization as the initial pose in the current optimization process, searching and projecting from the laser point cloud map to obtain a new point cloud projection depth map, and carrying out iterative optimization on the pose to obtain the final pose, thereby completing positioning.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the cross-modality matching localization method based on a visual image and a point cloud map as claimed in any one of claims 1 to 6 when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a cross-modality matching localization method based on a visual image and a point cloud map as claimed in any one of claims 1 to 6.
CN202310588600.3A 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map Pending CN116823929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310588600.3A CN116823929A (en) 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310588600.3A CN116823929A (en) 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map

Publications (1)

Publication Number Publication Date
CN116823929A true CN116823929A (en) 2023-09-29

Family

ID=88126664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310588600.3A Pending CN116823929A (en) 2023-05-23 2023-05-23 Cross-modal matching positioning method and system based on visual image and point cloud map

Country Status (1)

Country Link
CN (1) CN116823929A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765084A (en) * 2024-02-21 2024-03-26 电子科技大学 Visual positioning method for iterative solution based on dynamic branch prediction
CN117765084B (en) * 2024-02-21 2024-05-03 电子科技大学 Visual positioning method for iterative solution based on dynamic branch prediction

Similar Documents

Publication Publication Date Title
Chang et al. Kimera-multi: a system for distributed multi-robot metric-semantic simultaneous localization and mapping
Fan et al. Learning collision-free space detection from stereo images: Homography matrix brings better data augmentation
Wang et al. 3d lidar and stereo fusion using stereo matching network with conditional cost volume normalization
CN111902826A (en) Positioning, mapping and network training
CN113139996B (en) Point cloud registration method and system based on three-dimensional point cloud geometric feature learning
CN114001733B (en) Map-based consistent efficient visual inertial positioning algorithm
CN112183171A (en) Method and device for establishing beacon map based on visual beacon
CN113538218B (en) Weak pairing image style migration method based on pose self-supervision countermeasure generation network
CN110260866A (en) A kind of robot localization and barrier-avoiding method of view-based access control model sensor
Rhodes et al. LIDAR-based relative navigation of non-cooperative objects using point Cloud Descriptors
CN116823929A (en) Cross-modal matching positioning method and system based on visual image and point cloud map
Liu et al. Plc-vio: Visual–inertial odometry based on point-line constraints
Ishihara et al. Deep radio-visual localization
CN115471748A (en) Monocular vision SLAM method oriented to dynamic environment
Gao et al. Gyro-net: IMU gyroscopes random errors compensation method based on deep learning
Jo et al. Mixture density-PoseNet and its application to monocular camera-based global localization
CN114088103B (en) Method and device for determining vehicle positioning information
CN111833395B (en) Direction-finding system single target positioning method and device based on neural network model
CN112598730A (en) Method for determining the positioning pose of an at least partially automated mobile platform
Wu et al. Self-supervised monocular depth estimation scale recovery using ransac outlier removal
CN113483769A (en) Particle filter based vehicle self-positioning method, system, device and medium
Sun et al. Accurate deep direct geo-localization from ground imagery and phone-grade gps
Duan Visual smart navigation for UAV mission-oriented flight
Cattaneo et al. CMRNext: Camera to LiDAR Matching in the Wild for Localization and Extrinsic Calibration
US20240153139A1 (en) Object pose estimation in the context of neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination