CN112115786A - Monocular vision odometer method based on attention U-net - Google Patents
Monocular vision odometer method based on attention U-net
- Publication number
- CN112115786A (application CN202010813907.5A)
- Authority
- CN
- China
- Prior art keywords
- attention
- image
- sequence
- network
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C22/00—Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers, using pedometers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application discloses a monocular visual odometry method and device based on attention U-net. In the method, a monocular image sequence is obtained and adjacent images are passed sequentially through a shot boundary recognition algorithm that recognizes shot boundaries from the continuous frames; the whole module then operates on key frames, and a Gaussian pyramid reduces the dimension of the original images so that the subsequent amount of calculation is reduced. An attention-based local feature enhancement method is used: an attention mechanism is added to a U-net self-encoding network, the key frame sequence is input into the network, and the attention-based method first distinguishes texture regions from smooth regions; when locating the position of high-frequency details, the attention mechanism acts as a feature selector that enhances the high-frequency features and suppresses noise in the smooth regions. The feature-enhanced sequence is input into a final Bi-LSTM (Bi-directional Long Short-Term Memory network), where at each timestamp the image of frame t_{n+1} approximated from frame t_n is used as input for the reverse sequence, and the camera pose of each timestamp is acquired according to the context.
Description
Technical Field
The application relates to the field of image enhancement and visual odometry, in particular to a monocular visual odometry method and a monocular visual odometry system based on attention U-net.
Background
For a mobile robot to achieve autonomous navigation, its position and attitude must first be determined, i.e., localization. Visual odometry (VO) estimates the pose of the agent from the adjacent-frame image stream acquired by a single camera or multiple cameras, and the environment can also be reconstructed. VO mostly estimates the pose of the current frame by computing the motion between frames; its goal is to compute the motion trajectory of the camera between frames, thereby reducing drift for loop-closure detection and mapping at the back end. Visual odometry based on deep learning requires no complex geometric operations, and its end-to-end form makes the deep-learning-based approach more concise.
On this basis, researchers have attempted to explore new intelligent approaches to computing image homography: the camera pose is acquired by collecting an image sequence in real time, enriching the understanding of the images through neural-network learning, and obtaining feature matches between adjacent frames. Konda et al. first realized a deep-learning-based VO by extracting visual motion and depth information: after estimating depth information from stereo images, a convolutional neural network (CNN) predicts the change in camera speed and direction through a softmax function. Kendall et al. implemented an end-to-end localization system using a CNN, with RGB images as input and camera poses as output. The system proposes PoseNet, a 23-layer deep convolutional network, and uses transfer learning from a classification database to solve the complex image regression problem. Compared with traditional local visual features, the learned features are more robust to illumination, motion blur, camera intrinsics and the like. Costante et al. used dense optical flow instead of RGB images as the input to the CNN. The system designs three different CNN architectures for VO feature learning and achieves robustness of the algorithm under image blur, underexposure and similar conditions. However, the experimental results also show that the training data have a great influence on the algorithm: when the inter-frame motion of the image sequence is large, the algorithm error is large, mainly because high-speed training samples are lacking in the training data.
In existing feature extraction networks, CNN-based extraction is generally adopted, but in environments with complex illumination, complex texture and the like it is difficult to extract effective features, or the extracted features are not prominent enough, leaving large errors. In a common CNN processing pipeline, high-frequency information is easily lost in the next layer; this loss can be alleviated with residual links, which further strengthen the high-frequency signals. In addition, in terms of data association, existing visual odometry methods generally consider only the propagation of the forward sequence when processing a picture sequence, usually ignore the effect of the reverse sequence, and do not fully mine the contextual association.
Disclosure of Invention
It is an object of the present invention to provide an attention U-net based monocular visual odometry method that overcomes, or at least partially solves or mitigates, the above problems.
According to one aspect of the invention, a monocular image sequence is obtained, adjacent images are passed sequentially through a shot boundary identification algorithm that identifies shot boundaries from the continuous frames, the whole module operates on key frames, and a Gaussian pyramid reduces the dimension of the original images so that the subsequent amount of calculation is reduced.
To obtain the key frame sequence, shot boundaries are identified by dividing each frame into non-overlapping grids of size 16 × 16 and computing the corresponding grid histogram difference d between two adjacent frames using the chi-square distance:
H_i represents the histogram of the i-th frame and H_{i+1} represents the histogram of the (i+1)-th frame; I denotes the image block at the same position in both frames. The mean histogram difference between two consecutive frames is calculated as follows:
D is the average histogram difference of two consecutive frames and d_k is the chi-square difference between the k-th image blocks; N represents the total number of image blocks in the image. Shot boundaries are identified on frames whose histogram difference is greater than a threshold T_shot:
The obtained picture sequence is then reduced in dimension with the existing Gaussian pyramid, using a convolution with stride 2 to reduce each image to 1/4 of its original size.
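For illustration, a minimal Python sketch of this key-frame screening and pyramid reduction step is given below, assuming the standard chi-square distance between block histograms (the formula images of the original are not reproduced here), blocks of 16 × 16 pixels, BGR input frames, and an empirically chosen threshold T_shot; all function names and the bin count are illustrative, not part of the original disclosure.

```python
# Sketch of the key-frame screening step (assumed chi-square histogram distance).
import cv2
import numpy as np

def grid_histograms(gray, grid=16, bins=32):
    """Split a grayscale frame into non-overlapping grid x grid pixel blocks
    and return one normalized intensity histogram per block."""
    h, w = gray.shape
    h, w = h - h % grid, w - w % grid                      # drop the ragged border
    blocks = gray[:h, :w].reshape(h // grid, grid, w // grid, grid)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, grid * grid)
    hists = np.stack([np.histogram(b, bins=bins, range=(0, 256))[0] for b in blocks])
    return hists.astype(np.float64) / (grid * grid)        # normalize per block

def mean_chi_square(prev, curr, eps=1e-8):
    """Average chi-square distance D between corresponding block histograms."""
    d_k = ((prev - curr) ** 2 / (prev + curr + eps)).sum(axis=1)
    return d_k.mean()

def select_keyframes(frames, t_shot=0.5):
    """Mark a shot boundary (key frame) wherever D exceeds the threshold t_shot
    (a data-dependent tuning parameter)."""
    keyframes = [frames[0]]
    prev_hist = grid_histograms(cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY))
    for frame in frames[1:]:
        hist = grid_histograms(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if mean_chi_square(prev_hist, hist) > t_shot:
            keyframes.append(frame)
        prev_hist = hist
    return keyframes

def downsample(img):
    """Gaussian-pyramid reduction of each image to 1/4 of its original area."""
    return cv2.pyrDown(img)
```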
According to another aspect of the invention, feature reconstruction is performed to identify and strengthen multi-texture regions and to enhance high-frequency detail. L_{i-1} denotes the input of the i-th convolutional layer, and the output of the i-th layer is represented as:
L_i = σ(W_i * L_{i-1} + b_i)
where * denotes the convolution operation and σ the nonlinear activation (ReLU). The feature reconstruction network consists of a convolutional layer for feature extraction, a number of stacked dense blocks, and a sub-pixel convolution layer serving as the upsampling module; the dense blocks are built from residual modules (Resblocks) and show strong learning capacity for object recognition. Let H_i be the input of the i-th residual block; its output F_i can be expressed as:
F_i = φ_i(H_i, W_i) + H_i
The residual block contains two convolutional layers. Specifically, the residual block function can be expressed as follows:
φ_i(H_i; W_i) = σ_2(W_i^2 * σ_1(W_i^1 * H_i))
where W_i^1 and W_i^2 are the weights of the two convolutional layers and σ_1, σ_2 denote the normalization. The input H_i of the i-th residual module is a concatenation of the outputs of the previous residual modules; a convolutional layer with a 1 × 1 kernel controls how much of the previous states should be retained, adaptively learning weights for the different states. The input of the i-th residual block is represented as:
H_i = σ_0(W_i^0 * [F_1, F_2, ..., F_{i-1}])
where W_i^0 represents the 1 × 1 convolution weight and σ_0 denotes the ReLU normalization.
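A minimal PyTorch sketch of the residual module (Resblock) and the 1 × 1 fusion described by the equations above is given below; the channel width, the number of stacked blocks, and the use of ReLU for σ_0, σ_1, σ_2 are assumptions made for illustration, not values fixed by the text.

```python
# Sketch of F_i = phi_i(H_i, W_i) + H_i with H_i formed by a 1x1 fusion of previous outputs.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, h):
        # phi_i(H_i; W_i) = sigma_2(W_i^2 * sigma_1(W_i^1 * H_i)), plus the skip connection.
        return self.act(self.conv2(self.act(self.conv1(h)))) + h

class DenseBlock(nn.Module):
    """Stack of Resblocks; each block receives a 1x1-fused concatenation of all
    previous outputs: H_i = sigma_0(W_i^0 * [F_1, ..., F_{i-1}])."""
    def __init__(self, channels=64, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList([ResBlock(channels) for _ in range(num_blocks)])
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels * (i + 1), channels, 1) for i in range(num_blocks)])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        outputs = [x]
        for block, fuse in zip(self.blocks, self.fuse):
            h = self.act(fuse(torch.cat(outputs, dim=1)))   # 1x1 fusion of previous states
            outputs.append(block(h))
        return outputs[-1]
```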
Attention is then generated and the exact location of texture is determined. A U-net-like structure is adopted, with the convolutional layers replaced by dense blocks. In the compression path, the convolutional layers first extract low-level features of the interpolated image; 2 × 2 max pooling is then used to reduce the dimensionality of the data and obtain a larger receptive field, with pooling applied twice in the compression path. In the expansion path, a deconvolution layer is added to upsample the existing feature maps. By combining the low-level features with the high-level features in the expansion path, the output can accurately decide whether a region is textured and whether the feature reconstruction network is needed to repair it. The feature channel of the network output is 1, and at the last layer a sigmoid activation controls the output mask between 0 and 1. The higher the probability that a pixel belongs to a texture region, the closer its mask value is to 1, which means these pixels require more attention; otherwise, the mask value will be closer to 0.
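The following is a hedged sketch of such an attention generation branch: a U-net-like network with a two-stage 2 × 2 max-pooling compression path, a deconvolution expansion path with skip connections, and a single-channel sigmoid mask output. For brevity the sketch uses plain convolution blocks where the text prescribes dense blocks, and all channel widths are assumptions.

```python
# Sketch of the attention (texture-mask) generation network.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class AttentionMaskNet(nn.Module):
    def __init__(self, in_channels=3, width=32):
        super().__init__()
        self.enc1 = conv_block(in_channels, width)           # low-level features
        self.enc2 = conv_block(width, width * 2)
        self.enc3 = conv_block(width * 2, width * 4)
        self.pool = nn.MaxPool2d(2)                           # 2x2 max pooling, used twice
        self.up2 = nn.ConvTranspose2d(width * 4, width * 2, 2, stride=2)
        self.dec2 = conv_block(width * 4, width * 2)
        self.up1 = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
        self.dec1 = conv_block(width * 2, width)
        self.head = nn.Conv2d(width, 1, 1)                    # single output channel

    def forward(self, ilr):
        # ilr: interpolated image, spatial size assumed divisible by 4.
        e1 = self.enc1(ilr)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.head(d1))                   # mask in [0, 1]
```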
Attention-based residual learning obtains the enhanced image residual from the output of the feature enhancement network and the generated mask, while the pre-enhancement features, interpolated and taken as the input of the attention generation network, are used to obtain the final attention generation result. The residual is obtained as the element-wise product of the feature reconstruction network output and the mask value, and the final attention generation result is obtained by adding the interpolated LR image that served as the input of the attention generation network. It can be expressed as:
HR_c(i, j) = F_c(i, j) × M(i, j) + ILR_c(i, j)
where F = [F_1; F_2; F_3] is the output of the feature reconstruction network with 3 output channels, M is the mask value, ILR = [ILR_1, ILR_2, ILR_3] is the interpolated image, and HR = [HR_1, HR_2, HR_3] is the final enhancement result of the process. i and j denote the pixel position in each channel, and c denotes the channel index.
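A minimal sketch of this fusion, directly following HR_c(i, j) = F_c(i, j) × M(i, j) + ILR_c(i, j), is given below; the names feat_net and mask_net are placeholders for the feature reconstruction network and the attention generation network sketched above, not fixed API names.

```python
# Attention-gated residual fusion of the reconstructed residual and the interpolated image.
def attention_fusion(feat_net, mask_net, ilr):
    """ilr: interpolated image tensor of shape (B, 3, H, W)."""
    residual = feat_net(ilr)       # F: 3-channel residual from the feature reconstruction net
    mask = mask_net(ilr)           # M: (B, 1, H, W) mask in [0, 1]
    return residual * mask + ilr   # mask broadcasts over the 3 channels
```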
According to yet another aspect of the invention, the camera pose at each timestamp is acquired from the enhanced image sequence. A CNN containing 4 convolutions is used, with one 7 × 7 layer, two 5 × 5 layers and finally a 3 × 3 convolution kernel; the CNN output is then subjected to Bi-LSTM sequence modeling. Given the features x_t at time t, the Bi-LSTM at time t is updated as:
s_t = f(U x_t + W s_{t-1})
s'_t = f(U' x_t + W' s_{t+1})
A_t = f(W A_{t-1} + U x_t)
A'_t = f(W' A'_{t+1} + U' x_t)
y_t = g(V A_t + V' A'_t)
where s_t and s'_t are the memory variables in the forward and reverse directions at time t, A_t and A'_t are the hidden-layer variables in the forward and reverse directions at time t, and y_t is the output variable. Further, f and g denote nonlinear activation functions, and U, U', W, W', V, V' denote the weight matrices corresponding to the respective variables. Two fully connected layers are then added after the Bi-LSTM to integrate the learned image-sequence information into a final accurate 6-degree-of-freedom pose consisting of three rotations and three translations. A method for artificially synthesizing pictures is also provided, assuming that the camera moves linearly over a short time:
where P represents the interception ratio of the image center at the current time, v the current speed, T the sampling period of the camera, α a constant, n the n-th frame from the current time, and 2.57 an empirical value obtained from experiments.
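The following is a hedged PyTorch sketch of this pose regression stage: a 4-convolution CNN with 7 × 7, 5 × 5, 5 × 5 and 3 × 3 kernels feeding a bidirectional LSTM, followed by two fully connected layers that output a 6-degree-of-freedom pose per timestamp. Channel widths, strides, the hidden size, and the stacked-image-pair input are assumptions, not values fixed by the text.

```python
# Sketch of the CNN + Bi-LSTM pose regression head.
import torch
import torch.nn as nn

class PoseBiLSTM(nn.Module):
    def __init__(self, in_channels=6, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))
        self.bilstm = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * hidden, 128), nn.ReLU(inplace=True),
                                nn.Linear(128, 6))            # 3 rotations + 3 translations

    def forward(self, seq):
        # seq: (B, T, C, H, W), e.g. image pairs stacked along the channel axis.
        b, t, c, h, w = seq.shape
        feats = self.cnn(seq.reshape(b * t, c, h, w)).reshape(b, t, -1)
        hidden, _ = self.bilstm(feats)                         # forward + reverse context
        return self.fc(hidden)                                 # (B, T, 6) per-timestamp pose
```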
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow diagram of the process of the present application;
FIG. 2 is a block schematic diagram of a monocular visual odometry approach based on attention U-net according to one embodiment of the present application;
FIG. 3 schematically illustrates a feature-enhanced model structure of a preferred embodiment of the present invention;
FIG. 4 shows a Bi-LSTM model structure based on a context environment;
FIG. 5 illustrates a computing device provided by an embodiment of the present application;
FIG. 6 is a computer-readable storage medium provided by embodiments of the present application;
FIG. 7 shows the odometry results generated on sequences 01 and 07 of the KITTI data set according to an embodiment of the present application.
Detailed Description
The invention is described below with reference to the accompanying drawings and the detailed description.
FIG. 1 is a flow chart of a monocular visual odometry method based on attention U-net according to an embodiment of the present application. Referring to fig. 1, a monocular visual odometry method and system based on attention U-net according to an embodiment of the present application may include:
step S1: obtaining a monocular image sequence, screening out a key frame sequence by a shot boundary identification method, and reducing the dimension of the image by a Gaussian pyramid.
Step S2: and reconstructing the key frame after dimensionality reduction by adopting a full convolution network model formed by a convolution neural network, a plurality of stacked residual blocks and sub-pixel deconvolution.
Step S3: and replacing the convolution block with a residual block by using a U-net-like structure, solving the problem of gradient disappearance caused by depth, inputting the reduced-dimension key frame, and generating a texture mask of the image.
Step S4: and finishing texture alignment of the whole learning scene instance by using a corresponding entity feature consistency theory in the scene, and fusing data by using the obtained alignment relation, thereby realizing the reality of visual input to strengthen the local scene features.
Step S5: based on the enhanced image, combining with the provided artificial synthesis picture method, generating an image at the t + n moment (t is less than or equal to 3) according to the image at the t moment, and providing reverse data for pose estimation.
Step S6: and (3) fusing and reducing dimensions of the enhanced image sequence and the artificially synthesized data by using a CNN network containing 4 convolutions, performing Bi-LSTM sequence modeling on the output of the CNN, and finally inferring the pose change of each timestamp.
The embodiment of the application provides a monocular visual odometry method based on attention U-net. In the provided method, a real-time image sequence is obtained and key frames are screened out by the shot boundary identification method; the dimension of the key frame sequence is reduced, high-frequency details are then recovered by the feature reconstruction network, and the visual image is reconstructed. In addition, the key frame sequence is passed through the attention-based U-net-like network to distinguish texture regions from smooth regions; when locating the position of high-frequency details, the attention mechanism acts as a feature selector that enhances the high-frequency features, suppresses noise in the smooth regions, and generates a texture mask. The enhanced image residual is obtained from the output of the feature enhancement network and the generated mask, and the interpolated pre-enhancement features, taken as the input of the attention generation network, are added to obtain the final attention generation result.
The experimental data set adopted by the method is the KITTI data set (created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago), currently the largest international benchmark for computer vision algorithms in autonomous driving scenarios. The KITTI acquisition platform comprises 2 grayscale cameras, 2 color cameras, a Velodyne 3D lidar, 4 optical lenses and 1 GPS navigation system. The entire data set consists of 389 pairs of stereo images and optical flow maps (each image contains up to 15 vehicles and 30 pedestrians, with varying degrees of occlusion), 39.2 km of visual odometry sequences, and images of more than 200,000 3D annotated objects.
S1: Shot boundaries are identified for each frame by dividing each frame into non-overlapping grids of size 16 × 16 and computing the corresponding grid histogram difference d between two adjacent frames using the chi-square distance:
H_i represents the histogram of the i-th frame and H_{i+1} represents the histogram of the (i+1)-th frame; I denotes the image block at the same position in both frames. The mean histogram difference between two consecutive frames is calculated as follows:
D is the average histogram difference of two consecutive frames and d_k is the chi-square difference between the k-th image blocks; N represents the total number of image blocks in the image. Shot boundaries are identified on frames whose histogram difference is greater than a threshold T_shot:
The obtained picture sequence is then reduced in dimension with the existing Gaussian pyramid, using a convolution with stride 2 to reduce each image to 1/4 of its original size.
S2: Feature reconstruction identifies and strengthens the multi-texture regions and enhances high-frequency details. L_{i-1} denotes the input of the i-th convolutional layer, and the output of the i-th layer is represented as:
L_i = σ(W_i * L_{i-1} + b_i)
where * denotes the convolution operation and σ the nonlinear activation (ReLU). The feature reconstruction network consists of a convolutional layer for feature extraction, a number of stacked dense blocks, and a sub-pixel convolution layer serving as the upsampling module; the dense blocks are built from residual modules (Resblocks) and show strong learning capacity for object recognition. Let H_i be the input of the i-th residual block; its output F_i can be expressed as:
F_i = φ_i(H_i, W_i) + H_i
The residual block contains two convolutional layers. Specifically, the residual block function can be expressed as follows:
φ_i(H_i; W_i) = σ_2(W_i^2 * σ_1(W_i^1 * H_i))
where W_i^1 and W_i^2 are the weights of the two convolutional layers and σ_1, σ_2 denote the normalization. The input H_i of the i-th residual module is a concatenation of the outputs of the previous residual modules; a convolutional layer with a 1 × 1 kernel controls how much of the previous states should be retained, adaptively learning weights for the different states. The input of the i-th residual block is represented as:
H_i = σ_0(W_i^0 * [F_1, F_2, ..., F_{i-1}])
where W_i^0 represents the 1 × 1 convolution weight and σ_0 denotes the ReLU normalization.
S3: The network consists of a contraction path, an expansion path and skip connections, and takes as input the bilinearly interpolated original image (at the required size). The redundancy added by interpolation reduces the information loss of forward propagation and benefits the accurate segmentation of texture regions and smooth regions. A U-net-like structure is used in which the convolution blocks are replaced by residual blocks, addressing the gradient-vanishing problem caused by depth; owing to the reusability of the residual network, stacking residual blocks also greatly reduces the number of parameters. The attention-based real-time scene feature texture recovery is implemented as follows:
1) In the compression path, low-level features are first extracted using convolutional layers. Max pooling is then used to reduce the dimensionality of the data, yielding a larger receptive field. By pooling twice in the compression path, the network can use a larger area to predict whether a pixel belongs to a high-frequency region.
2) In the expansion path, a deconvolution layer is added to upsample the existing feature maps. The low-level features contain much useful information, a large part of which is lost during forward propagation. By combining the low-level features with the high-level features in the expansion path, the output can accurately decide whether a region is textured and whether the feature reconstruction network is needed to repair it. The last layer of the network uses the sigmoid activation function to control the output mask between 0 and 1. The higher the probability that a pixel belongs to a texture region, the closer its mask value is to 1, which means these pixels need more attention; otherwise, the mask value will be closer to 0.
S4: The residual of the HR image is obtained as the element-wise product of the feature reconstruction network output and the mask value, and the final attention generation result is obtained by adding the interpolated LR image that served as the input of the attention generation network. It can be expressed as:
HR_c(i, j) = F_c(i, j) × M(i, j) + ILR_c(i, j)
where F = [F_1; F_2; F_3] is the output of the feature reconstruction network with 3 output channels, M is the mask value, ILR = [ILR_1, ILR_2, ILR_3] is the interpolated image, and HR = [HR_1, HR_2, HR_3] is the final enhancement result of the process. i and j denote the pixel position in each channel, and c denotes the channel index. The attention generation network increases the residual values from texture regions, while the residual values from non-texture regions approach 0. The mask M is a feature selector that enhances high-frequency characteristics and suppresses noise, so that in the output image high-frequency details are restored while noise in smooth areas is removed.
S5: A method for artificially synthesizing pictures is provided, in which the camera is assumed to move linearly over a short time:
where P represents the interception ratio of the image center at the current time, v the current speed, T the sampling period of the camera, α a constant, n the n-th frame from the current time, and 2.57 an empirical value obtained from experiments.
S6: The camera pose at each timestamp is acquired from the enhanced image sequence. A CNN containing 4 convolutions is used, with one 7 × 7 layer, two 5 × 5 layers and finally a 3 × 3 convolution kernel; the CNN output is subjected to Bi-LSTM sequence modeling. Given the features x_t at time t, the Bi-LSTM at time t is updated as:
s_t = f(U x_t + W s_{t-1})
s'_t = f(U' x_t + W' s_{t+1})
A_t = f(W A_{t-1} + U x_t)
A'_t = f(W' A'_{t+1} + U' x_t)
y_t = g(V A_t + V' A'_t)
where s_t and s'_t are the memory variables in the forward and reverse directions at time t, A_t and A'_t are the hidden-layer variables in the forward and reverse directions at time t, and y_t is the output variable. Further, f and g denote nonlinear activation functions, and U, U', W, W', V, V' denote the weight matrices corresponding to the respective variables. Two fully connected layers are then added after the Bi-LSTM to integrate the learned image-sequence information into a final accurate 6-degree-of-freedom pose, consisting of three rotations and three translations.
FIG. 2 is a block schematic diagram of a monocular visual odometry approach based on attention U-net according to one embodiment of the present application. The apparatus may generally include a preprocessing module, a feature reconstruction module, an attention enhancement module, a residual learning module, an artificial data synthesis module, and a context-based pose inference module.
FIG. 3 shows a feature-enhanced model structure consisting of a feature reconstruction module, an attention-based local feature enhancement module and a residual learning module, according to a preferred embodiment of the present invention.
FIG. 4 shows a Bi-LSTM model structure based on a context environment, with the output of the previous CNN model as the Bi-LSTM input, obtaining a camera pose estimate for each timestamp.
The invention aims to protect a local feature strengthening method and a context-based pose inference method. Deep-learning-based visual odometry mainly estimates the camera pose from the alignment of local features; however, existing methods are not accurate enough at texture alignment in multi-texture or complex-illumination environments. In addition, camera pose estimation is a sequence problem, and existing methods consider only the forward image sequence. By artificially synthesizing data and adding reverse constraints, the context constraints are fully mined, and a more accurate camera pose is obtained at each timestamp.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
An embodiment of the present application also provides a computing device. Referring to FIG. 5, the computing device comprises a memory 520, a processor 510 and a computer program stored in the memory 520 and executable by the processor 510; the computer program is stored in a space 530 for program code in the memory 520 and, when executed by the processor 510, implements a method 531 for performing any of the methods according to the present invention.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 6, the computer readable storage medium comprises a storage unit for program code provided with a program 531' for performing the steps of the method according to the invention, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, and any combination thereof.
The final example result is shown in FIG. 7: based on the results on sequences 01 and 07 of the KITTI data set, it displays the comparison between the camera pose trajectory estimated by the model and the ground-truth trajectory.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (8)
1. A monocular visual odometry method based on attention U-net, comprising:
obtaining a monocular image sequence, passing adjacent images sequentially through a shot boundary recognition algorithm that recognizes shot boundaries from the continuous frames, operating the whole module on key frames, and using a Gaussian pyramid to reduce the dimension of the original images so that the subsequent amount of calculation is reduced.
using an attention-based local feature enhancement method, adding an attention mechanism to a U-net self-encoding network, inputting the key frame sequence into the network, and first distinguishing texture regions from smooth regions with the attention-based method; when locating the position of high-frequency details, the attention mechanism acts as a feature selector that enhances the high-frequency features and suppresses noise in the smooth regions.
inputting the feature-strengthened sequence into the final Bi-LSTM, where at each timestamp the image of frame t_{n+1} approximated from frame t_n is used as input for the reverse sequence, and acquiring the camera pose of each timestamp.
2. The method of claim 1, wherein the key frame sequence is computed by identifying shot boundaries for each frame, dividing each frame into non-overlapping grids of size 16 × 16, and computing the corresponding grid histogram difference d between two adjacent frames using the chi-square distance:
H_i represents the histogram of the i-th frame and H_{i+1} represents the histogram of the (i+1)-th frame; I denotes the image block at the same position in both frames. The mean histogram difference between two consecutive frames is calculated as follows:
D is the average histogram difference of two consecutive frames and d_k is the chi-square difference between the k-th image blocks; N represents the total number of image blocks in the image. Shot boundaries are identified on frames whose histogram difference is greater than a threshold T_shot:
The obtained picture sequence is then reduced in dimension with the existing Gaussian pyramid, using a convolution with stride 2 to reduce each image to 1/4 of its original size.
3. The method of claim 1, wherein feature reconstruction is performed to identify and strengthen multi-texture regions and to enhance high-frequency details. L_{i-1} denotes the input of the i-th convolutional layer, and the output of the i-th layer is represented as:
L_i = σ(W_i * L_{i-1} + b_i)
where * denotes the convolution operation and σ the nonlinear activation (ReLU). The feature reconstruction network consists of a convolutional layer for feature extraction, a number of stacked dense blocks, and a sub-pixel convolution layer serving as the upsampling module; the dense blocks are built from residual modules (Resblocks) and show strong learning capacity for object recognition. Let H_i be the input of the i-th residual block; its output F_i can be expressed as:
F_i = φ_i(H_i, W_i) + H_i
The residual block contains two convolutional layers. Specifically, the residual block function can be expressed as follows:
φ_i(H_i; W_i) = σ_2(W_i^2 * σ_1(W_i^1 * H_i))
where W_i^1 and W_i^2 are the weights of the two convolutional layers and σ_1, σ_2 denote the normalization. The input H_i of the i-th residual module is a concatenation of the outputs of the previous residual modules; a convolutional layer with a 1 × 1 kernel controls how much of the previous states should be retained, adaptively learning weights for the different states. The input of the i-th residual block is represented as:
H_i = σ_0(W_i^0 * [F_1, F_2, ..., F_{i-1}])
where W_i^0 represents the 1 × 1 convolution weight and σ_0 denotes the ReLU normalization.
4. The method of claim 1, wherein attention is generated and the exact location of texture is determined. A U-net-like structure is adopted, with the convolutional layers replaced by dense blocks. In the compression path, the convolutional layers first extract low-level features of the interpolated image; 2 × 2 max pooling is then used to reduce the dimensionality of the data and obtain a larger receptive field, with pooling applied twice in the compression path. In the expansion path, a deconvolution layer is added to upsample the existing feature maps. By combining the low-level features with the high-level features in the expansion path, the output can accurately decide whether a region is textured and whether the feature reconstruction network is needed to repair it. The feature channel of the network output is 1, and at the last layer a sigmoid activation controls the output mask between 0 and 1. The higher the probability that a pixel belongs to a texture region, the closer its mask value is to 1, which means these pixels require more attention; otherwise, the mask value will be closer to 0.
5. The method of claim 1, wherein attention-based residual learning obtains the enhanced image residual from the output of the feature enhancement network and the generated mask, and the pre-enhancement features, interpolated and taken as the input of the attention generation network, are used to obtain the final attention generation result. The residual of the HR image is obtained as the element-wise product of the feature reconstruction network output and the mask value, and the final attention generation result is obtained by adding the interpolated LR image that served as the input of the attention generation network. It can be expressed as:
HR_c(i, j) = F_c(i, j) × M(i, j) + ILR_c(i, j)
where F = [F_1; F_2; F_3] is the output of the feature reconstruction network with 3 output channels, M is the mask value, ILR = [ILR_1, ILR_2, ILR_3] is the interpolated image, and HR = [HR_1, HR_2, HR_3] is the final enhancement result of the process. i and j denote the pixel position in each channel, and c denotes the channel index.
6. The method of claim 1, wherein the camera pose at each timestamp is acquired from the enhanced image sequence. A CNN containing 4 convolutions is used, with one 7 × 7 layer, two 5 × 5 layers and finally a 3 × 3 convolution kernel; the CNN output is subjected to Bi-LSTM sequence modeling. Given the features x_t at time t, the Bi-LSTM at time t is updated as:
s_t = f(U x_t + W s_{t-1})
s'_t = f(U' x_t + W' s_{t+1})
A_t = f(W A_{t-1} + U x_t)
A'_t = f(W' A'_{t+1} + U' x_t)
y_t = g(V A_t + V' A'_t)
where s_t and s'_t are the memory variables in the forward and reverse directions at time t, A_t and A'_t are the hidden-layer variables in the forward and reverse directions at time t, and y_t is the output variable. Further, f and g denote nonlinear activation functions, and U, U', W, W', V, V' denote the weight matrices corresponding to the respective variables. Two fully connected layers are then added after the Bi-LSTM to integrate the learned image-sequence information into a final accurate 6-degree-of-freedom pose, consisting of three rotations and three translations.
7. The method according to claims 1 and 6, characterized in that, since acquiring a reverse sequence is not practical in real applications, a method for artificially synthesizing pictures is proposed in order to add a reverse constraint, assuming that the camera moves linearly over a short time:
where P represents the interception ratio of the image center at the current time, v the current speed, T the sampling period of the camera, α a constant, n the n-th frame from the current time, and 2.57 an empirical value obtained from experiments.
8. The method of claims 1-7, wherein the data set used in the method is a KITTI data set in a laboratory environment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010813907.5A CN112115786B (en) | 2020-08-13 | 2020-08-13 | Monocular vision odometer method based on attention U-net |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010813907.5A CN112115786B (en) | 2020-08-13 | 2020-08-13 | Monocular vision odometer method based on attention U-net |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112115786A true CN112115786A (en) | 2020-12-22 |
CN112115786B CN112115786B (en) | 2024-08-13 |
Family
ID=73803967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010813907.5A Active CN112115786B (en) | 2020-08-13 | 2020-08-13 | Monocular vision odometer method based on attention U-net |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112115786B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200139973A1 (en) * | 2018-11-01 | 2020-05-07 | GM Global Technology Operations LLC | Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle |
CN110473254A (en) * | 2019-08-20 | 2019-11-19 | 北京邮电大学 | A kind of position and orientation estimation method and device based on deep neural network |
CN110889361A (en) * | 2019-11-20 | 2020-03-17 | 北京影谱科技股份有限公司 | ORB feature visual odometer learning method and device based on image sequence |
CN111405360A (en) * | 2020-03-25 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Video processing method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Zhang Congcong; He Ning: "Human action recognition method based on a key-frame two-stream convolutional network", Journal of Nanjing University of Information Science & Technology (Natural Science Edition), no. 06, 28 November 2019 (2019-11-28) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112268564A (en) * | 2020-12-25 | 2021-01-26 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle landing space position and attitude end-to-end estimation method |
CN113989318A (en) * | 2021-10-20 | 2022-01-28 | 电子科技大学 | Monocular vision odometer pose optimization and error correction method based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN112115786B (en) | 2024-08-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |