CN112344922A - Monocular vision odometer positioning method and system - Google Patents

Monocular vision odometer positioning method and system

Info

Publication number
CN112344922A
CN112344922A (application CN202011153385.7A)
Authority
CN
China
Prior art keywords: loss value, video sequence, image, monocular, odometer positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011153385.7A
Other languages
Chinese (zh)
Other versions
CN112344922B (en)
Inventor
Gao Wei (高伟)
Wan Yiming (万一鸣)
Wu Yihong (吴毅红)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202011153385.7A priority Critical patent/CN112344922B/en
Publication of CN112344922A publication Critical patent/CN112344922A/en
Application granted granted Critical
Publication of CN112344922B publication Critical patent/CN112344922B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/005 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 with correlation of navigation data from several sources, e.g. map or contour matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The invention relates to a monocular visual odometer positioning method and system. The method comprises: acquiring a training data set, wherein the training data set comprises a plurality of video sequences and each video sequence comprises multiple frames of continuous images; and establishing a monocular visual odometer positioning model according to each video sequence, specifically by stacking each pair of adjacent frame images to obtain a corresponding stacked image, extracting high-dimensional features from each stacked image through a FlowNet encoder, sequentially extracting local information and global information from the high-dimensional features through an LCGR module, and obtaining a relative pose through full-connection regression processing according to the local information and the global information. Based on the monocular visual odometer positioning model, the relative pose can be accurately determined from a video sequence to be measured, improving positioning accuracy.

Description

Monocular vision odometer positioning method and system
Technical Field
The invention relates to the technical field of computer vision, and in particular to a monocular visual odometer positioning method and system for SLAM (simultaneous localization and mapping) based on local-global information fusion and dynamic-object perception.
Background
The visual odometer is an important component of mobile robots, autonomous navigation and augmented reality. Visual odometers can be classified into monocular visual odometers (Monocular VO) and binocular visual odometers (Stereo VO) according to the number of cameras used. Monocular VO is generally more challenging than binocular VO, but is widely studied because it requires only one camera and is lighter and cheaper. A classical visual odometry pipeline comprises camera calibration, feature detection, feature matching, outlier rejection, motion estimation, scale estimation and back-end optimization. Such methods achieve good results in most conditions, but still fail in scenes with occlusion, large illumination changes or little texture.
In recent years, deep learning techniques have been successfully applied to face recognition, target tracking, speech recognition, machine translation and more. Deep learning methods, represented by convolutional neural networks, play a very important role in computer vision: compared with traditional methods, deep networks are remarkably effective at extracting image features and discovering latent patterns. Many researchers have therefore applied deep learning to pose estimation and related fields, letting deep networks learn the geometric relationship between images directly and realize end-to-end pose estimation. The end-to-end approach completely abandons the feature extraction, feature matching, camera calibration and graph optimization steps of traditional methods, and obtains the camera pose directly from the input images. Although convolutional networks can cope with some extreme situations, their overall accuracy is lower than that of traditional methods, and the limited generalization ability of such networks is another important obstacle to their practical application. In addition, most deep learning methods do not consider the influence of dynamic objects in the scene, which also lowers positioning accuracy.
Disclosure of Invention
In order to solve the above problems in the prior art, i.e. to improve the positioning accuracy, the present invention aims to provide a monocular vision odometer positioning method and system.
In order to solve the technical problems, the invention provides the following scheme:
a monocular visual odometer positioning method, the monocular visual odometer positioning method comprising:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
Optionally, the extracting, by the LCGR module, local information and global information from the high-dimensional features sequentially includes:
performing a convolution operation on each high-dimensional feature with K groups of 3D convolution kernels to obtain local information of different temporal lengths, wherein the size of the k-th group of convolution kernels is k × 3 × 3, and the output channel numbers satisfy
$$\sum_{k=1}^{K} C_k < C,$$
where $C$ is the number of input channels;
based on Bi-ConvLSTM, global information of the video sequence is extracted from each high-dimensional feature.
Optionally, the relative pose comprises a displacement and an attitude;
the establishing of the monocular visual odometer positioning model according to each video sequence further comprises:
calculating the displacement loss value $L_{trans}$ and the attitude loss value $L_{rot}$ according to the following formulas:
$$L_{trans} = \frac{1}{T}\sum_{t=1}^{T}\left\|\hat{p}_t - p_t\right\|_2^2;$$
$$L_{rot} = \frac{1}{T}\sum_{t=1}^{T}\left\|\hat{\phi}_t - \phi_t\right\|_2^2;$$
wherein $\hat{p}_t$ is the predicted displacement, $\hat{\phi}_t$ is the predicted angle, $p_t$ and $\phi_t$ are the corresponding true values, $t$ denotes the image number, $t = 1, 2, \dots, T$, where $T$ is the number of images, and $\|\cdot\|_2$ denotes the two-norm;
according to the displacement loss value $L_{trans}$ and the attitude loss value $L_{rot}$, determining a total loss value;
and adjusting the monocular vision odometer positioning model according to the total loss value.
Optionally, the total loss value further comprises an optical flow loss value, a constraint loss value and an epipolar loss value;
the establishing of the monocular visual odometer positioning model according to each video sequence further comprises:
determining, through an optical flow and mask estimation module, an optical flow loss value $L_{photometric}$, a constraint loss value $L_{reg}$ and an epipolar loss value $L_e$ according to each pair of adjacent frame images and the optical flow output by the FlowNet encoder; and
determining the total loss value according to the displacement loss value $L_{trans}$, the attitude loss value $L_{rot}$, the optical flow loss value $L_{photometric}$, the constraint loss value $L_{reg}$ and the epipolar loss value $L_e$.
Optionally, the optical flow loss value $L_{photometric}$, the constraint loss value $L_{reg}$, the epipolar loss value $L_e$ and the total loss value $L_{total}$ are determined according to the following formulas:
$$L_{photometric} = \frac{1}{T-1}\sum_{t=1}^{T-1}\sum_{i,j} C(i,j)\left|I(i,j,t+1) - I'(i,j,t+1)\right|;$$
$$L_{reg} = -\sum_{i,j}\log C(i,j);$$
$$L_e = \left|q^{T} K^{-T} [t]_{\times} R K^{-1} p\right|;$$
$$L_{total} = L_{trans} + 100\,L_{rot} + L_{photometric} + L_e + L_{reg};$$
wherein $(i,j)$ denotes a pixel coordinate position; $I_t$ denotes the image of the $t$-th frame; $I'(i,j,t+1)$ denotes the image synthesized from the two consecutive frames $I(i,j,t)$ and $I(i,j,t+1)$ together with the optical flow output by the FlowNet encoder; $C(i,j)$ is the mask value at location $(i,j)$, indicating the confidence that the pixel can be successfully synthesized; for two consecutive frames $I(i,j,t)$ and $I(i,j,t+1)$, the estimated optical flow provides the pixel correspondence between the source image and the target image, $q$ being a pixel position in the target image and $p$ its corresponding pixel position in the source image; $K$ is the camera intrinsic matrix; and $R$ and $t$ are the rotation and translation of the relative pose between the source and target images.
Optionally, the monocular visual odometer positioning method further comprises:
the size of each image is adjusted to a uniform size.
In order to solve the technical problems, the invention also provides the following scheme:
a monocular visual odometer positioning system, the monocular visual odometer positioning system comprising:
an obtaining unit, configured to obtain a training data set, where the training data set includes a plurality of video sequences, and each video sequence includes multiple frames of continuous images;
the modeling unit is used for establishing a monocular vision odometer positioning model according to each video sequence;
wherein the modeling unit includes:
the stacking module is used for stacking each adjacent frame image to obtain a corresponding stacked image;
the characteristic extraction module is used for extracting high-dimensional characteristics from each stacked image through a FlowNet encoder;
the information extraction module is used for sequentially extracting local information and global information from the high-dimensional features through the LCGR module;
the pose determining module is used for obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and the positioning unit is used for obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
Optionally, the modeling unit further comprises:
a preprocessing module, connected to the acquisition unit and the stacking module respectively, and configured to adjust each image to a uniform size and send the resized images to the stacking module.
In order to solve the technical problems, the invention also provides the following scheme:
a monocular visual odometer positioning system comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
In order to solve the technical problems, the invention also provides the following scheme:
a computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
Embodiments of the invention provide the following technical effects:
According to the invention, high-dimensional features are extracted from each stacked image through the FlowNet encoder, local information and global information are sequentially extracted from the high-dimensional features through the LCGR module, and the relative pose is then obtained through full-connection regression processing, thereby establishing a monocular vision odometer positioning model. A video sequence to be measured is positioned through this model, so the relative pose can be accurately determined and the positioning precision is improved.
Drawings
FIG. 1 is a flow chart of a monocular visual odometer positioning method of the present invention;
FIG. 2 is a schematic block diagram of a monocular visual odometer positioning system of the present invention.
Description of the symbols:
the system comprises an acquisition unit-1, a modeling unit-2, a preprocessing module-20, a stacking module-21, a feature extraction module-22, an information extraction module-23, a pose determination module-24 and a positioning unit-3.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention aims to provide a monocular vision odometer positioning method in which high-dimensional features are extracted from each stacked image through a FlowNet encoder, local information and global information are sequentially extracted from the high-dimensional features through an LCGR module, and a relative pose is then obtained through full-connection regression processing, thereby establishing a monocular vision odometer positioning model. A video sequence to be measured is positioned through this model, the relative pose can be accurately determined, and the positioning precision is improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the monocular visual odometer positioning method of the present invention comprises:
Step 10: acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises multiple frames of continuous images;
Step 20: establishing a monocular vision odometer positioning model according to each video sequence;
Step 30: obtaining the relative pose to be detected according to the video sequence to be detected, based on the monocular vision odometer positioning model.
In step 20, the establishing a monocular vision odometer positioning model according to each video sequence specifically includes:
Step 210: stacking each adjacent frame image to obtain a corresponding stacked image;
Step 220: extracting high-dimensional features from each stacked image through a FlowNet (optical flow network) encoder;
Step 230: sequentially extracting local information and global information from the high-dimensional features through an LCGR (local constraint and global recovery network) module;
Step 240: obtaining the relative pose through full-connection regression processing according to the local information and the global information.
Preferably, the establishing a monocular visual odometry positioning model according to each video sequence further includes: step 200: the size of each image is adjusted to a uniform size. In this embodiment, the size is 384 x 1280 pixels.
In step 220, given N+1 consecutive frames of images from time t to t+N, N stacked images are obtained, from which the FlowNet encoder extracts N sets of 6 × 20 × 1024 high-dimensional features.
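As a minimal sketch of steps 210 and 220 (assuming a PyTorch setting; `flownet_encoder` is a hypothetical stand-in for the FlowNet encoder, whose layers the patent does not specify), adjacent frames are stacked along the channel axis and then encoded:

```python
import torch

def stack_adjacent(frames: torch.Tensor) -> torch.Tensor:
    """frames: (B, N+1, 3, 384, 1280) -> (B, N, 6, 384, 1280).

    Each adjacent pair (frame t, frame t+1) is concatenated along the
    channel axis, giving N stacked 6-channel images.
    """
    return torch.cat([frames[:, :-1], frames[:, 1:]], dim=2)

def encode(flownet_encoder, stacked: torch.Tensor) -> torch.Tensor:
    # A FlowNet-style encoder downsamples 384 x 1280 by a factor of 64
    # to 6 x 20 with 1024 channels, matching the 6 x 20 x 1024 features
    # described in the text.
    B, N = stacked.shape[:2]
    feats = flownet_encoder(stacked.flatten(0, 1))   # (B*N, 1024, 6, 20)
    return feats.view(B, N, 1024, 6, 20)
```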
In step 230, the extracting, by the LCGR module (as shown in Table 1), local information and global information from the high-dimensional features sequentially includes:
Step 231: performing a convolution operation on each high-dimensional feature with K groups of 3D convolution kernels to obtain local information over different temporal lengths, wherein the k-th group of convolution kernels has size k × 3 × 3 and the output channel numbers satisfy
$$\sum_{k=1}^{K} C_k < C,$$
where $C$ is the number of input channels, so that a more compact high-dimensional feature is obtained after convolution. In this embodiment, K is 2, the convolution stride is 1, and both $C_k$ are 128.
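A hedged sketch of this local branch follows; the spatial padding, the concatenation of the K branch outputs, and the truncation used to align their temporal lengths are assumptions, since the text fixes only the kernel sizes, the stride and $C_k$:

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """K groups of 3D convolutions; group k has temporal extent k."""
    def __init__(self, in_ch: int = 1024, out_ch: int = 128, K: int = 2):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, kernel_size=(k, 3, 3),
                      stride=1, padding=(k // 2, 1, 1))
            for k in range(1, K + 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1024, N, 6, 20); each branch sees a different temporal
        # window k, yielding local information of different lengths.
        outs = [branch(x) for branch in self.branches]
        n = min(o.shape[2] for o in outs)            # align temporal dims
        return torch.cat([o[:, :, :n] for o in outs], dim=1)
```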
Table 1: structure of the LCGR module (rendered as an image in the source; not reproduced here).
Step 232: extracting global information of the video sequence from the high-dimensional features based on Bi-ConvLSTM (bidirectional convolutional long short-term memory).
Further, the relative pose includes displacement and attitude.
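Returning to step 232: the patent does not detail the Bi-ConvLSTM layer, so the following is one conventional formulation offered as a sketch, with a ConvLSTM cell per direction and the per-step hidden states concatenated:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four LSTM gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], 1)), 4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        return o.sigmoid() * c.tanh(), c

class BiConvLSTM(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int):
        super().__init__()
        self.fwd = ConvLSTMCell(in_ch, hid_ch)
        self.bwd = ConvLSTMCell(in_ch, hid_ch)

    def forward(self, seq):
        # seq: list of N feature maps, each of shape (B, C, H, W).
        B, _, H, W = seq[0].shape
        h = seq[0].new_zeros(B, self.fwd.hid_ch, H, W)
        state_f, state_b = (h, h.clone()), (h.clone(), h.clone())
        out_f, out_b = [], []
        for x in seq:                          # forward direction
            hf, cf = self.fwd(x, state_f); state_f = (hf, cf); out_f.append(hf)
        for x in reversed(seq):                # backward direction
            hb, cb = self.bwd(x, state_b); state_b = (hb, cb); out_b.append(hb)
        out_b.reverse()
        return [torch.cat(p, dim=1) for p in zip(out_f, out_b)]
```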
Further to step 240, the establishing a monocular vision odometer positioning model according to each video sequence also comprises:
Step 241: calculating the displacement loss value $L_{trans}$ and the attitude loss value $L_{rot}$ according to the following formulas:
$$L_{trans} = \frac{1}{T}\sum_{t=1}^{T}\left\|\hat{p}_t - p_t\right\|_2^2;$$
$$L_{rot} = \frac{1}{T}\sum_{t=1}^{T}\left\|\hat{\phi}_t - \phi_t\right\|_2^2;$$
wherein $\hat{p}_t$ is the predicted displacement, $\hat{\phi}_t$ is the predicted angle, $p_t$ and $\phi_t$ are the corresponding true values, $t$ denotes the image number, $t = 1, 2, \dots, T$, where $T$ is the number of images, and $\|\cdot\|_2$ denotes the two-norm;
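As a minimal sketch (framework and tensor shapes are assumptions), the two supervised losses above translate directly into code:

```python
import torch

def pose_losses(p_hat, p_gt, phi_hat, phi_gt):
    """All inputs (T, 3): predicted/true displacements and angles.

    Returns the mean squared two-norm losses L_trans and L_rot.
    """
    l_trans = (p_hat - p_gt).norm(dim=1).pow(2).mean()
    l_rot = (phi_hat - phi_gt).norm(dim=1).pow(2).mean()
    return l_trans, l_rot
```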
step 242: according to the displacement loss value LtransAnd attitude loss value LrotDetermining a total loss value;
step 243: and adjusting the monocular vision odometer positioning model according to the total loss value.
Further, the total loss value also includes an optical flow loss value, a constraint loss value and an epipolar loss value.
Correspondingly, the establishing of the monocular visual odometer positioning model according to each video sequence further comprises:
determining, through an optical flow and mask estimation module, an optical flow loss value $L_{photometric}$, a constraint loss value $L_{reg}$ and an epipolar loss value $L_e$ according to each pair of adjacent frame images and the optical flow output by the FlowNet encoder; and
determining the total loss value according to the displacement loss value $L_{trans}$, the attitude loss value $L_{rot}$, the optical flow loss value $L_{photometric}$, the constraint loss value $L_{reg}$ and the epipolar loss value $L_e$.
The optical flow loss value $L_{photometric}$, the constraint loss value $L_{reg}$, the epipolar loss value $L_e$ and the total loss value $L_{total}$ are determined according to the following formulas:
$$L_{photometric} = \frac{1}{T-1}\sum_{t=1}^{T-1}\sum_{i,j} C(i,j)\left|I(i,j,t+1) - I'(i,j,t+1)\right|;$$
$$L_{reg} = -\sum_{i,j}\log C(i,j);$$
$$L_e = \left|q^{T} K^{-T} [t]_{\times} R K^{-1} p\right|;$$
$$L_{total} = L_{trans} + 100\,L_{rot} + L_{photometric} + L_e + L_{reg};$$
wherein $(i,j)$ denotes a pixel coordinate position; $I_t$ denotes the image of the $t$-th frame; $I'(i,j,t+1)$ denotes the image synthesized from the two consecutive frames $I(i,j,t)$ and $I(i,j,t+1)$ together with the optical flow output by the FlowNet encoder; $C(i,j)$ is the mask value at location $(i,j)$, indicating the confidence that the pixel can be successfully synthesized; for two consecutive frames $I(i,j,t)$ and $I(i,j,t+1)$, the estimated optical flow provides the pixel correspondence between the source image and the target image, $q$ being a pixel position in the target image and $p$ its corresponding pixel position in the source image; $K$ is the camera intrinsic matrix; and $R$ and $t$ are the rotation and translation of the relative pose between the source and target images.
The optical flow and mask estimation module works as follows. Given two consecutive images $I(i,j,t)$ and $I(i,j,t+1)$ and the optical flow output by the FlowNet encoder, one can synthesize
$$I'(i,j,t+1) = I(i + u_{i,j},\; j + v_{i,j},\; t),$$
where $u$ and $v$ are the horizontal and vertical components of the optical flow. The photometric error is obtained by comparing the synthesized $I'(i,j,t+1)$ with the original $I(i,j,t+1)$, i.e. $\sum_{i,j}\left|I(i,j,t+1) - I'(i,j,t+1)\right|$. This process can be implemented by differentiable bilinear interpolation. Given a sequence of images $I_1, I_2, \dots, I_T$, the overall loss function is
$$L_{photometric} = \frac{1}{T-1}\sum_{t=1}^{T-1}\sum_{i,j}\left|I(i,j,t+1) - I'(i,j,t+1)\right|.$$
To mitigate the effect of photometrically inconsistent regions of the scene on gradient propagation, the optical flow prediction component simultaneously estimates a mask whose value represents the probability that each pixel can be successfully synthesized. The mask is estimated by adding a branch convolutional layer before the last layer of FlowNet, with a sigmoid activation function. The final loss function becomes
$$L_{photometric} = \frac{1}{T-1}\sum_{t=1}^{T-1}\sum_{i,j} C(i,j)\left|I(i,j,t+1) - I'(i,j,t+1)\right|.$$
Furthermore, to prevent $C(i,j)$ from being optimized to 0, it is necessary to pull $C(i,j)$ toward 1; the penalty of this constraint is
$$L_{reg} = -\sum_{i,j}\log C(i,j).$$
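A hedged sketch of this synthesis-and-mask scheme, assuming bilinear warping via `grid_sample` and the cross-entropy form of $L_{reg}$ adopted above:

```python
import torch
import torch.nn.functional as F

def warp(src: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """src: (B, 3, H, W); flow: (B, 2, H, W) in pixels. Bilinear synthesis."""
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=src.device),
                            torch.arange(W, device=src.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float() + flow.permute(0, 2, 3, 1)
    gx = 2 * grid[..., 0] / (W - 1) - 1          # normalize x to [-1, 1]
    gy = 2 * grid[..., 1] / (H - 1) - 1          # normalize y to [-1, 1]
    return F.grid_sample(src, torch.stack((gx, gy), dim=-1),
                         align_corners=True)

def photometric_and_reg_losses(tgt, src, flow, mask, eps=1e-6):
    syn = warp(src, flow)                        # I'(i, j, t+1)
    l_photo = (mask * (tgt - syn).abs()).mean()  # masked photometric L1
    l_reg = -mask.clamp_min(eps).log().mean()    # pulls C(i, j) toward 1
    return l_photo, l_reg
```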
in order to solve the problems brought by dynamic objects in a scene, the invention utilizes pose true value construction antipodal constraints to explicitly prompt a network to output a lower mask value to a moving object region to relieve the influence on gradient propagation. Given two images of adjacent frames, the estimated optical flow provides a pixel correspondence between the source image and the target image. Assuming that G is the pixel position in the target image, and its corresponding pixel position in the source image is p, the epipolar constraint can be expressed as:
qTK-T[t]×RK-1p=0。
the final antipodal losses are:
Le=|qTA-T[r]×RK-1p|。
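Under the stated notation ($q$, $p$ as homogeneous pixel coordinates, $K$ the intrinsics, $R$ and $t$ the relative pose), a sketch of the epipolar term; batching over matched pixel pairs is an illustrative addition:

```python
import torch

def skew(t: torch.Tensor) -> torch.Tensor:
    """3-vector t -> 3x3 skew-symmetric matrix [t]x."""
    tx, ty, tz = t
    zero = t.new_zeros(())
    return torch.stack([
        torch.stack([zero, -tz, ty]),
        torch.stack([tz, zero, -tx]),
        torch.stack([-ty, tx, zero]),
    ])

def epipolar_loss(q, p, K, R, t):
    """q, p: (M, 3) homogeneous pixel coords. Mean |q^T K^-T [t]x R K^-1 p|."""
    K_inv = torch.inverse(K)
    F_mat = K_inv.t() @ skew(t) @ R @ K_inv      # fundamental matrix
    return ((q @ F_mat) * p).sum(dim=1).abs().mean()
```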
the monocular visual odometer localization model of the present invention was trained on KITTIVO/SLAM. The data set comprises 22 video sequences, wherein 00-10 provide pose truth values, and 11-21 only provide original video sequences. Many dynamic objects are contained in these 22 video sequences, which is very challenging for monocular VOs. The pictures in the training are all adjusted to 384 x 1280 pixels, the initial learning rate is 0.0001, the batch size is 2, and the learning rate is halved every 10 epochs.
In addition, the invention also provides a monocular vision odometer positioning system which can improve the positioning precision.
As shown in fig. 2, the monocular visual odometer positioning system of the present invention includes an obtaining unit 1, a modeling unit 2, and a positioning unit 3.
The acquiring unit 1 is configured to acquire a training data set, where the training data set includes a plurality of video sequences, and each video sequence includes multiple frames of continuous images;
the modeling unit 2 is used for establishing a monocular vision odometer positioning model according to each video sequence;
the positioning unit 3 is used for obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
Preferably, the modeling unit 2 includes a stacking module 21, a feature extraction module 22, an information extraction module 23 and a pose determination module 24. Specifically:
the stacking module 21 is configured to perform stacking processing on each adjacent frame image to obtain a corresponding stacked image;
the feature extraction module 22 is configured to extract high-dimensional features from each stacked image through a FlowNet encoder;
the information extraction module 23 is configured to sequentially extract local information and global information from the high-dimensional features through the LCGR module;
the pose determining module 24 is configured to obtain a relative pose through full-connected regression processing according to the local information and the global information.
Further, the modeling unit 2 also includes a preprocessing module 20, which is connected to the obtaining unit 1 and the stacking module 21 respectively, and which is configured to adjust each image to a uniform size and send the resized images to the stacking module 21.
In addition, the invention also provides the following scheme: a monocular visual odometer positioning system comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
Further, the present invention also provides a computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform operations of:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
Compared with the prior art, the monocular visual odometer positioning system and the computer readable storage medium have the same beneficial effects as the monocular visual odometer positioning method, and are not repeated herein.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A monocular visual odometer positioning method, comprising:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
2. The monocular visual odometer positioning method according to claim 1, wherein the extracting, by the LCGR module, the local information and the global information from the high-dimensional features in sequence specifically includes:
performing a convolution operation on each high-dimensional feature with K groups of 3D convolution kernels to obtain local information of different temporal lengths, wherein the size of the k-th group of convolution kernels is k × 3 × 3, and the output channel numbers satisfy
$$\sum_{k=1}^{K} C_k < C,$$
where $C$ is the number of input channels;
based on Bi-ConvLSTM, global information of the video sequence is extracted from each high-dimensional feature.
3. The monocular visual odometer positioning method of claim 1, wherein the relative pose comprises a displacement and an attitude;
the establishing of the monocular visual odometer positioning model according to each video sequence further comprises:
calculating the displacement loss value $L_{trans}$ and the attitude loss value $L_{rot}$ according to the following formulas:
$$L_{trans} = \frac{1}{T}\sum_{t=1}^{T}\left\|\hat{p}_t - p_t\right\|_2^2;$$
$$L_{rot} = \frac{1}{T}\sum_{t=1}^{T}\left\|\hat{\phi}_t - \phi_t\right\|_2^2;$$
wherein $\hat{p}_t$ is the predicted displacement, $\hat{\phi}_t$ is the predicted angle, $p_t$ and $\phi_t$ are the corresponding true values, $t$ denotes the image number, $t = 1, 2, \dots, T$, where $T$ is the number of images, and $\|\cdot\|_2$ denotes the two-norm;
determining a total loss value according to the displacement loss value $L_{trans}$ and the attitude loss value $L_{rot}$; and
adjusting the monocular vision odometer positioning model according to the total loss value.
4. The monocular visual odometer positioning method of claim 3, wherein the total loss value further comprises an optical flow loss value, a constraint loss value and an epipolar loss value;
the establishing of the monocular visual odometer positioning model according to each video sequence further comprises:
determining, through an optical flow and mask estimation module, an optical flow loss value $L_{photometric}$, a constraint loss value $L_{reg}$ and an epipolar loss value $L_e$ according to each pair of adjacent frame images and the optical flow output by the FlowNet encoder; and
determining the total loss value according to the displacement loss value $L_{trans}$, the attitude loss value $L_{rot}$, the optical flow loss value $L_{photometric}$, the constraint loss value $L_{reg}$ and the epipolar loss value $L_e$.
5. The monocular visual odometer positioning method of claim 4, wherein the optical flow loss value $L_{photometric}$, the constraint loss value $L_{reg}$, the epipolar loss value $L_e$ and the total loss value $L_{total}$ are determined according to the following formulas:
$$L_{photometric} = \frac{1}{T-1}\sum_{t=1}^{T-1}\sum_{i,j} C(i,j)\left|I(i,j,t+1) - I'(i,j,t+1)\right|;$$
$$L_{reg} = -\sum_{i,j}\log C(i,j);$$
$$L_e = \left|q^{T} K^{-T} [t]_{\times} R K^{-1} p\right|;$$
$$L_{total} = L_{trans} + 100\,L_{rot} + L_{photometric} + L_e + L_{reg};$$
wherein $(i,j)$ denotes a pixel coordinate position; $I_t$ denotes the image of the $t$-th frame; $I'(i,j,t+1)$ denotes the image synthesized from the two consecutive frames $I(i,j,t)$ and $I(i,j,t+1)$ together with the optical flow output by the FlowNet encoder; $C(i,j)$ is the mask value at location $(i,j)$, indicating the confidence that the pixel can be successfully synthesized; for two consecutive frames $I(i,j,t)$ and $I(i,j,t+1)$, the estimated optical flow provides the pixel correspondence between the source image and the target image, $q$ being a pixel position in the target image and $p$ its corresponding pixel position in the source image; $K$ is the camera intrinsic matrix; and $R$ and $t$ are the rotation and translation of the relative pose between the source and target images.
6. The monocular visual odometer positioning method of any one of claims 1-5, further comprising:
the size of each image is adjusted to a uniform size.
7. A monocular visual odometer positioning system, comprising:
an obtaining unit, configured to obtain a training data set, where the training data set includes a plurality of video sequences, and each video sequence includes multiple frames of continuous images;
the modeling unit is used for establishing a monocular vision odometer positioning model according to each video sequence;
wherein the modeling unit includes:
the stacking module is used for stacking each adjacent frame image to obtain a corresponding stacked image;
the characteristic extraction module is used for extracting high-dimensional characteristics from each stacked image through a FlowNet encoder;
the information extraction module is used for sequentially extracting local information and global information from the high-dimensional features through the LCGR module;
the pose determining module is used for obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and the positioning unit is used for obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
8. The monocular visual odometer positioning system of claim 7, wherein the modeling unit further comprises:
a preprocessing module, respectively connected with the acquisition unit and the stacking module, and configured to adjust the size of each image to be uniform and send the images to the stacking module.
9. A monocular visual odometer positioning system comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
10. A computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to:
acquiring a training data set, wherein the training data set comprises a plurality of video sequences, and each video sequence comprises a plurality of frames of continuous images;
establishing a monocular vision odometer positioning model according to each video sequence;
wherein the establishing a monocular vision odometer positioning model according to each video sequence specifically comprises:
stacking each adjacent frame image to obtain a corresponding stacked image;
extracting high-dimensional features from each stacked image through a FlowNet encoder;
sequentially extracting local information and global information from the high-dimensional features through an LCGR module;
obtaining a relative pose through full-connection regression processing according to the local information and the global information;
and obtaining the relative pose to be detected according to the video sequence to be detected based on the monocular vision odometer positioning model.
CN202011153385.7A 2020-10-26 2020-10-26 Monocular vision odometer positioning method and system Active CN112344922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011153385.7A CN112344922B (en) 2020-10-26 2020-10-26 Monocular vision odometer positioning method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011153385.7A CN112344922B (en) 2020-10-26 2020-10-26 Monocular vision odometer positioning method and system

Publications (2)

Publication Number Publication Date
CN112344922A (en) 2021-02-09
CN112344922B CN112344922B (en) 2022-10-21

Family

ID=74360257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011153385.7A Active CN112344922B (en) 2020-10-26 2020-10-26 Monocular vision odometer positioning method and system

Country Status (1)

Country Link
CN (1) CN112344922B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658023A (en) * 2016-12-21 2017-05-10 山东大学 End-to-end visual odometer and method based on deep learning
US20180365579A1 (en) * 2017-06-15 2018-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for evaluating a matching degree of multi-domain information based on artificial intelligence, device and medium
US20190080167A1 (en) * 2017-09-13 2019-03-14 TuSimple Data acquistion and input of neural network system for deep odometry assisted by static scene optical flow
CN110782490A (en) * 2019-09-24 2020-02-11 武汉大学 Video depth map estimation method and device with space-time consistency
CN110910447A (en) * 2019-10-31 2020-03-24 北京工业大学 Visual odometer method based on dynamic and static scene separation
CN111080699A (en) * 2019-12-11 2020-04-28 中国科学院自动化研究所 Monocular vision odometer method and system based on deep learning
CN111103577A (en) * 2020-01-07 2020-05-05 湖南大学 End-to-end laser radar calibration method based on cyclic neural network
CN111127557A (en) * 2019-12-13 2020-05-08 中国电子科技集团公司第二十研究所 Visual SLAM front-end attitude estimation method based on deep learning
CN111311685A (en) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU/monocular image
CN111462135A (en) * 2020-03-31 2020-07-28 华东理工大学 Semantic mapping method based on visual S L AM and two-dimensional semantic segmentation
CN111489372A (en) * 2020-03-11 2020-08-04 天津大学 Video foreground and background separation method based on cascade convolution neural network
WO2020186943A1 (en) * 2019-03-15 2020-09-24 京东方科技集团股份有限公司 Mobile device posture determination apparatus and method, and visual odometer
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Eddy Ilg et al., "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," 2017 IEEE Conference on Computer Vision and Pattern Recognition *
Qu Hao et al., "Research on a Visual/Inertial Integrated Odometry Algorithm Based on an Attention Model," Navigation Positioning and Timing (《导航定位与授时》) *

Also Published As

Publication number Publication date
CN112344922B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
Park et al. High-precision depth estimation using uncalibrated LiDAR and stereo fusion
CN110766024B (en) Deep learning-based visual odometer feature point extraction method and visual odometer
CN113108771B (en) Movement pose estimation method based on closed-loop direct sparse visual odometer
CN110009675B (en) Method, apparatus, medium, and device for generating disparity map
CN108648216B (en) Visual odometer implementation method and system based on optical flow and deep learning
CN111105432A (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN111798373A (en) Rapid unmanned aerial vehicle image stitching method based on local plane hypothesis and six-degree-of-freedom pose optimization
CN111882602A (en) Visual odometer implementation method based on ORB feature points and GMS matching filter
CN111127522A (en) Monocular camera-based depth optical flow prediction method, device, equipment and medium
CN114037762A (en) Real-time high-precision positioning method based on image and high-precision map registration
CN112767480A (en) Monocular vision SLAM positioning method based on deep learning
CN116402876A (en) Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium
CN114170304B (en) Camera positioning method based on multi-head self-attention and replacement attention
CN113345032B (en) Initialization map building method and system based on wide-angle camera large distortion map
CN114266823A (en) Monocular SLAM method combining SuperPoint network characteristic extraction
CN112344922B (en) Monocular vision odometer positioning method and system
CN117115271A (en) Binocular camera external parameter self-calibration method and system in unmanned aerial vehicle flight process
CN116824433A (en) Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network
CN116740488A (en) Training method and device for feature extraction model for visual positioning
CN116128966A (en) Semantic positioning method based on environmental object
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN112419411B (en) Realization method of vision odometer based on convolutional neural network and optical flow characteristics
US20240153120A1 (en) Method to determine the depth from images by self-adaptive learning of a neural network and system thereof
CN113112547A (en) Robot, repositioning method thereof, positioning device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant