CN112115786A - Monocular vision odometer method based on attention U-net - Google Patents

Monocular vision odometer method based on attention U-net

Info

Publication number
CN112115786A
Authority
CN
China
Prior art keywords
attention
image
sequence
network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010813907.5A
Other languages
Chinese (zh)
Other versions
CN112115786B (en)
Inventor
刘瑞军
王向上
张伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202010813907.5A priority Critical patent/CN112115786B/en
Publication of CN112115786A publication Critical patent/CN112115786A/en
Application granted granted Critical
Publication of CN112115786B publication Critical patent/CN112115786B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C22/00Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers, using pedometers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a monocular visual odometer method and device based on attention U-net. The method obtains a monocular image sequence and passes a plurality of adjacent images sequentially through a shot-boundary recognition algorithm to identify shot boundaries from the continuous frames; the whole module then operates on key frames, and a Gaussian pyramid is used to reduce the dimension of the original images so as to reduce the subsequent computation. An attention-based local feature enhancement method is used: an attention mechanism is added to a U-net self-encoding network, the key-frame sequence is input into the network, and the attention-based method first distinguishes texture regions from smooth regions; when locating the positions of high-frequency details, the attention mechanism acts as a feature selector that enhances the high-frequency features and suppresses noise in the smooth regions. The feature-enhanced sequence is input into a final Bi-LSTM (bidirectional Long Short-Term Memory network); at each timestamp, an image approximating frame t_{n+1} is generated from frame t_n and used as input for the reverse sequence, and the camera pose of each timestamp is acquired according to the context.

Description

Monocular vision odometer method based on attention U-net
Technical Field
The application relates to the field of image enhancement and visual odometry, in particular to a monocular visual odometry method and a monocular visual odometry system based on attention U-net.
Background
For a mobile robot to achieve autonomous navigation, it must first determine its own position and attitude, that is, perform localization. Visual odometry (VO) serves this purpose: it estimates the pose of the agent from the adjacent-frame image streams acquired by one or more cameras, and can also be used to reconstruct the environment. VO mostly estimates the pose of the current frame by computing the motion between frames; its goal is to compute the motion trajectory of the camera across frames, thereby reducing drift for back-end loop-closure detection and mapping. Visual odometry based on deep learning requires no complex geometric computation, and its end-to-end form makes the deep-learning-based approach more concise.
On this basis, researchers have attempted to explore new intelligent approaches to image homography computation. By collecting the image sequence in real time, enriching the understanding of the images through neural-network learning, and obtaining feature matches between adjacent frames, the camera pose is acquired. Konda et al. first realized a deep-learning-based VO by extracting visual motion and depth information: after estimating depth from stereo images, a convolutional neural network (CNN) predicts changes in camera speed and direction through a softmax function. Kendall et al. implemented an end-to-end localization system using a CNN, with RGB images as input and camera poses as output. Their system, PoseNet, is a 23-layer deep convolutional network that uses transfer learning from a classification dataset to solve a complex image regression problem. Compared with traditional local visual features, the learned features are more robust to illumination, motion blur, camera intrinsics, and the like. Costante et al. used dense optical flow instead of RGB images as the CNN input, designing three different CNN architectures for VO feature learning and achieving robustness under image blur, underexposure, and similar conditions. However, the experimental results also show that the training data strongly influence the algorithm: when the inter-frame motion of the image sequence is large, the error becomes large, mainly because high-speed training samples are lacking in the training data.
To address these problems, existing feature extraction networks generally use a CNN to extract instances, but in environments with complex illumination, complex texture, and the like, effective features are difficult to extract or the extracted features are not prominent enough, leading to large errors. In a typical CNN pipeline, high-frequency information is easily lost from one layer to the next; residual connections can mitigate this loss and further strengthen the high-frequency signal. In addition, with respect to data association, existing visual odometry methods generally consider only the propagation of the forward sequence when processing an image sequence, usually ignore the contribution of the reverse sequence, and do not fully mine the contextual relationships.
Disclosure of Invention
It is an object of the present invention to provide an attention U-net based monocular visual odometry method that overcomes, or at least partially solves or mitigates, the above problems.
According to one aspect of the invention, a monocular image sequence is obtained; a plurality of adjacent images are passed sequentially through a shot-boundary identification algorithm, shot boundaries are identified from the continuous frames, the whole module operates on key frames, and a Gaussian pyramid is used to reduce the dimension of the original images, thereby reducing the subsequent computation.
To obtain the key-frame sequence, shot boundaries are identified by dividing each frame into non-overlapping grids of size 16 × 16 and computing the corresponding grid histogram difference d between two adjacent frames using the chi-square distance:
[Equation: chi-square distance d between the corresponding grid histograms of two adjacent frames]
H_i represents the histogram of the i-th frame and H_{i+1} the histogram of the (i+1)-th frame; I denotes the image block at the same position in both frames. The mean histogram difference between two consecutive frames is calculated as follows:
[Equation: D = (1/N) Σ_k d_k, the average of the block-wise chi-square differences]
D is the average histogram difference of two consecutive frames, d_k is the chi-square difference between the k-th image blocks, and N is the total number of image blocks in the image. Shot boundaries are identified on frames whose histogram difference is greater than a threshold T_shot.
The obtained picture sequence is then reduced in dimension with a standard Gaussian pyramid, down to 1/4 of the original image by a convolution with stride 2.
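To make this key-frame selection concrete, the following Python/NumPy sketch follows the description above under stated assumptions: frames are single-channel intensity arrays, each 16 × 16 block is summarized by a histogram, and the bin count, the exact chi-square form, and the value of the threshold T_shot are placeholders, since the original formulas are only reproduced as images in the patent.

```python
import numpy as np

def block_histograms(frame, block=16, bins=32):
    """Split a grayscale frame into non-overlapping block x block patches
    and return one intensity histogram per patch."""
    h, w = frame.shape
    hists = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            patch = frame[y:y + block, x:x + block]
            hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
            hists.append(hist.astype(np.float64))
    return np.stack(hists)                       # shape: (num_blocks, bins)

def chi_square(h1, h2, eps=1e-8):
    """Assumed chi-square distance between two block histograms."""
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def is_shot_boundary(frame_a, frame_b, t_shot=0.5):
    """Mark a shot boundary when the mean block-wise chi-square
    difference D between consecutive frames exceeds T_shot."""
    ha, hb = block_histograms(frame_a), block_histograms(frame_b)
    d_k = np.array([chi_square(a, b) for a, b in zip(ha, hb)])
    D = d_k.mean()                               # average over the N blocks
    return D > t_shot, D

def select_key_frames(frames, t_shot=0.5):
    """Keep the first frame plus every frame that starts a new shot."""
    keys = [0]
    for i in range(len(frames) - 1):
        boundary, _ = is_shot_boundary(frames[i], frames[i + 1], t_shot)
        if boundary:
            keys.append(i + 1)
    return keys
```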
According to another aspect of the invention, feature reconstruction is performed to identify and strengthen multi-texture regions and to strengthen high-frequency details. L_{i-1} represents the input to the i-th convolutional layer, and the output of the i-th layer is expressed as:
L_i = σ(W_i * L_{i-1} + b_i)
the method comprises the following steps that a, a feature reconstruction network and a residual module (Resblock), wherein the operation of convolution is denoted by sigma, nonlinear activation (ReLU), the feature reconstruction network is composed of a convolution layer for feature extraction, a plurality of stacked dense blocks and a sub-pixel convolution layer serving as an upsampling module, and the dense blocks are composed of the residual module (Resblock) and show strong object recognition learning capacity. Let HiFor the input of the i-th residual block, output FiCan be expressed as:
F_i = φ_i(H_i, W_i) + H_i
The residual block contains two convolutional layers. Specifically, the residual block function can be expressed as follows:
φ_i(H_i; W_i) = σ_2(W_i^2 * σ_1(W_i^1 * H_i))
where W_i^1 and W_i^2 are the weights of the two convolutional layers and σ_1, σ_2 denote the activations. The input H_i of the i-th residual module is a concatenation of the outputs of the previous residual modules; a convolutional layer with a 1 × 1 kernel is used to control how much of the previous states should be retained, adaptively learning the weights of the different states. The input to the i-th residual block is represented as:
H_i = σ_0(W_i^0 * [F_1, F_2, ..., F_{i-1}])
W_i^0 represents the 1 × 1 convolution weight and σ_0 denotes the ReLU activation.
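The sketch below illustrates, in PyTorch, one way to realize the residual module F_i = φ_i(H_i) + H_i and the 1 × 1 fusion H_i = σ_0(W_i^0 * [F_1, ..., F_{i-1}]) described above. It is a minimal illustration rather than the patented network: the 3 × 3 kernels, the channel width, the number of residual modules, and treating the dense-block input as F_0 are assumptions, and the feature-extraction and sub-pixel upsampling layers of the full reconstruction network are omitted.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """F_i = phi_i(H_i) + H_i, where phi_i is two 3x3 convolutions
    with ReLU activations (kernel size is an assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, h):
        out = self.relu(self.conv2(self.relu(self.conv1(h))))
        return out + h

class DenseBlock(nn.Module):
    """Stacks residual modules; the input to the i-th module is a 1x1
    convolution over the concatenation of all previous outputs:
    H_i = sigma_0(W_i^0 * [F_1, ..., F_{i-1}]) (the block input is
    treated as F_0, an assumption)."""
    def __init__(self, channels, num_res=4):
        super().__init__()
        self.res = nn.ModuleList(ResBlock(channels) for _ in range(num_res))
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels * (i + 1), channels, kernel_size=1)
            for i in range(num_res)
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        outputs = [x]
        for res, fuse in zip(self.res, self.fuse):
            h = self.relu(fuse(torch.cat(outputs, dim=1)))  # 1x1 fusion
            outputs.append(res(h))
        return outputs[-1]
```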
Attention is then generated to determine the exact locations of texture. A U-net-like structure is adopted with the convolutional layers replaced by dense blocks. In the compression path, a convolutional layer first extracts low-level features of the interpolated image; the dimensionality of the data is then reduced using 2 × 2 max pooling to obtain a larger receptive field, with pooling applied twice along the compression path. In the expansion path, deconvolution layers are added to upsample the existing feature maps. By combining the low-level and high-level features in the expansion path, the output can accurately determine whether a region is textured and whether it needs to be repaired by the feature reconstruction network. The network outputs a single feature channel, and at the last layer a sigmoid activation constrains the output mask to the range 0 to 1. The higher the probability that a pixel belongs to a texture region, the closer its mask value is to 1, meaning these pixels require more attention; otherwise, the mask value will be closer to 0.
Attention-based residual learning obtains the enhanced image residual from the output of the feature enhancement network and the generated mask, while the pre-enhancement features are interpolated and taken as the input of the attention generation network to obtain the final attention generation result. The residual of the HR image is obtained as the element-wise product of the feature reconstruction network output and the mask values, and the final attention generation result is obtained by adding the interpolated LR image that serves as the input of the attention generation network. It can be expressed as:
HR_c(i,j) = F_c(i,j) × M(i,j) + ILR_c(i,j)
where F = [F_1; F_2; F_3] is the output of the feature reconstruction network with 3 output channels, M is the mask value, ILR = [ILR_1, ILR_2, ILR_3] is the interpolated image, and HR = [HR_1, HR_2, HR_3] is the final enhancement result of the process. i and j denote the pixel position in each channel, and c denotes the channel index.
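The fusion step itself is a per-pixel, per-channel operation. A minimal sketch, assuming the three tensors share the same spatial size and the mask has a single channel:

```python
import torch

def attention_fusion(f, mask, ilr):
    """HR_c(i, j) = F_c(i, j) x M(i, j) + ILR_c(i, j)

    f:    (B, 3, H, W) output of the feature reconstruction network
    mask: (B, 1, H, W) attention mask in [0, 1]
    ilr:  (B, 3, H, W) interpolated input image
    """
    return f * mask + ilr  # the single-channel mask broadcasts over the 3 channels
```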
According to yet another aspect of the invention, the camera pose at each timestamp is acquired from the enhanced image sequence. A CNN with 4 convolutions is used: one 7 × 7 layer, two 5 × 5 layers, and finally a 3 × 3 convolution kernel. The output of the CNN is fed to Bi-LSTM sequence modeling; given the features at time t, the Bi-LSTM at time t is updated as:
s_t = f(U x_t + W s_{t-1})
s'_t = f(U' x_t + W' s_{t+1})
A_t = f(W A_{t-1} + U x_t)
A'_t = f(W' A'_{t+1} + U' x_t)
y_t = g(V A_t + V' A'_t)
where s_t and s'_t are the memory variables in the forward and reverse directions at time t, A_t and A'_t are the hidden-layer variables in the forward and reverse directions at time t, and y_t is the output variable. Further, f and g denote nonlinear activation functions, and U, U', W, W', V, V' denote the weight matrices corresponding to the respective variables. Two fully connected layers are then added after the Bi-LSTM to integrate the trained image-sequence information into a final accurate 6-degree-of-freedom pose consisting of three rotations and three translations. A method for artificially synthesizing pictures is also provided, assuming that the camera moves linearly over a short time:
[Equation: artificial picture synthesis — the interception rate P expressed in terms of the current speed v, the camera sampling period T, the constant α, the frame index n, and the empirical value 2.57]
where P represents the interception rate of the image center at the current time, v the current speed, T the sampling period of the camera, α a constant, n the n-th frame at the current time, and 2.57 an empirical value obtained from experiments.
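As an illustration of this pose-regression stage, the sketch below chains a CNN encoder with the stated kernel sizes (7 × 7, 5 × 5, 5 × 5, 3 × 3), a bidirectional LSTM over the per-frame features, and two fully connected layers that regress a 6-degree-of-freedom pose per timestamp. Channel widths, strides, the hidden size, and the global pooling are assumptions; only the kernel sizes and the CNN → Bi-LSTM → two-fully-connected-layer structure follow the description.

```python
import torch
import torch.nn as nn

class PoseBiLSTM(nn.Module):
    """CNN encoder (7x7, 5x5, 5x5, 3x3 kernels) followed by a Bi-LSTM
    and two fully connected layers regressing a 6-DoF pose per frame."""
    def __init__(self, in_ch=3, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 7, stride=2, padding=3), nn.ReLU(True),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1),               # (B*T, 128, 1, 1)
        )
        self.bilstm = nn.LSTM(128, hidden, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Sequential(                   # two fully connected layers
            nn.Linear(2 * hidden, hidden), nn.ReLU(True),
            nn.Linear(hidden, 6),                  # 3 rotations + 3 translations
        )

    def forward(self, frames):
        """frames: (B, T, C, H, W) enhanced key-frame sequence."""
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.view(b * t, c, h, w)).view(b, t, -1)
        ctx, _ = self.bilstm(feats)                # forward + reverse context
        return self.fc(ctx)                        # (B, T, 6) poses
```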
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow diagram of the process of the present application;
FIG. 2 is a block schematic diagram of a monocular visual odometry approach based on attention U-net according to one embodiment of the present application;
FIG. 3 schematically illustrates a feature-enhanced model structure of a preferred embodiment of the present invention;
FIG. 4 shows a Bi-LSTM model structure based on a context environment;
FIG. 5 illustrates a computing device provided by an embodiment of the present application;
FIG. 6 is a computer-readable storage medium provided by embodiments of the present application;
FIG. 7 shows odometry results generated from sequences 01 and 07 of the KITTI dataset according to an embodiment of the present application.
Detailed Description
The invention is described below with reference to the accompanying drawings and the detailed description.
FIG. 1 is a flow chart of a monocular visual odometry method based on attention U-net according to an embodiment of the present application. Referring to fig. 1, a monocular visual odometry method and system based on attention U-net according to an embodiment of the present application may include:
step S1: obtaining a monocular image sequence, screening out a key frame sequence by a shot boundary identification method, and reducing the dimension of the image by a Gaussian pyramid.
Step S2: reconstruct the dimension-reduced key frames with a fully convolutional network model composed of a convolutional neural network, several stacked residual blocks, and sub-pixel deconvolution.
Step S3: using a U-net-like structure with the convolution blocks replaced by residual blocks to alleviate the vanishing-gradient problem caused by depth, input the dimension-reduced key frames and generate a texture mask for the image.
Step S4: complete texture alignment of the learned scene instances using the consistency of corresponding entity features in the scene, and fuse the data using the obtained alignment relation, so that realistic visual input strengthens the local scene features.
Step S5: based on the enhanced image and the proposed artificial picture-synthesis method, generate the image at time t + n (n ≤ 3) from the image at time t, providing reverse-sequence data for pose estimation.
Step S6: fuse and reduce the dimension of the enhanced image sequence and the artificially synthesized data with a CNN containing 4 convolutions, perform Bi-LSTM sequence modeling on the CNN output, and finally infer the pose change at each timestamp.
The embodiment of the application provides a monocular visual odometry method based on attention U-net. In the method provided by the embodiment, a real-time image sequence is obtained, key frames are screened out by a shot-boundary identification method, and the key-frame sequence is reduced in dimension. High-frequency details are then recovered through a feature reconstruction network to reconstruct the visual image. In addition, the key-frame sequence is passed through an attention-based U-net-like network to distinguish texture regions from smooth regions; when locating the positions of high-frequency details, the attention mechanism acts as a feature selector that enhances the high-frequency features, suppresses noise in the smooth regions, and generates a texture mask. The output of the feature enhancement network and the generated mask yield the enhanced image residual, and the pre-enhancement features are interpolated as the input of the attention generation network to obtain the final attention generation result.
The experimental dataset used in the method is the KITTI dataset (created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute in the United States), currently the largest international evaluation dataset for computer vision algorithms in autonomous-driving scenarios. The KITTI acquisition platform includes 2 grayscale cameras, 2 color cameras, a Velodyne 3D lidar, 4 optical lenses, and 1 GPS navigation system. The entire dataset consists of 389 pairs of stereo images and optical-flow maps (each image contains up to 15 vehicles and 30 pedestrians with varying degrees of occlusion), 39.2 km of visual odometry sequences, and images of over 200,000 3D-annotated objects.
S1, shot boundaries are identified for each frame by dividing each frame into non-overlapping grids of size 16 × 16. The corresponding grid histogram difference d between two adjacent frames is calculated using the chi-square distance:
[Equation: chi-square distance d between the corresponding grid histograms of two adjacent frames]
H_i represents the histogram of the i-th frame and H_{i+1} the histogram of the (i+1)-th frame; I denotes the image block at the same position in both frames. The mean histogram difference between two consecutive frames is calculated as follows:
[Equation: D = (1/N) Σ_k d_k, the average of the block-wise chi-square differences]
D is the average histogram difference of two consecutive frames, d_k is the chi-square difference between the k-th image blocks, and N is the total number of image blocks in the image. Shot boundaries are identified on frames whose histogram difference is greater than a threshold T_shot:
[Equation: shot-boundary decision — a frame is marked as a shot boundary when D exceeds the threshold T_shot]
The obtained picture sequence is then reduced in dimension with a standard Gaussian pyramid, down to 1/4 of the original image by a convolution with stride 2.
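For the dimensionality reduction itself, a single stride-2 convolution halves the height and width, keeping 1/4 of the original pixels. Below is a minimal PyTorch sketch of one Gaussian-pyramid level; the 5 × 5 binomial kernel is an assumed stand-in for whatever smoothing kernel the implementation actually uses.

```python
import torch
import torch.nn.functional as F

# 5x5 binomial approximation of a Gaussian kernel (assumed values), normalized to sum to 1
_g = torch.tensor([1.0, 4.0, 6.0, 4.0, 1.0])
GAUSS = (_g[:, None] * _g[None, :]) / 256.0

def pyramid_down(img):
    """One Gaussian-pyramid level: depthwise blur followed by stride-2
    subsampling, keeping 1/4 of the original pixels.
    img: (B, C, H, W) float tensor."""
    c = img.shape[1]
    kernel = GAUSS.to(img)[None, None].repeat(c, 1, 1, 1)  # (C, 1, 5, 5)
    return F.conv2d(img, kernel, stride=2, padding=2, groups=c)

# usage sketch: key_frame = torch.rand(1, 3, 376, 1240); small = pyramid_down(key_frame)
```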
S2, feature reconstruction is performed to identify and strengthen the multi-texture regions and to strengthen the high-frequency details. L_{i-1} represents the input to the i-th convolutional layer, and the output of the i-th layer is expressed as:
L_i = σ(W_i * L_{i-1} + b_i)
the method comprises the following steps that a, a feature reconstruction network and a residual module (Resblock), wherein the operation of convolution is denoted by sigma, nonlinear activation (ReLU), the feature reconstruction network is composed of a convolution layer for feature extraction, a plurality of stacked dense blocks and a sub-pixel convolution layer serving as an upsampling module, and the dense blocks are composed of the residual module (Resblock) and show strong object recognition learning capacity. Let HiFor i-th residual blockInput, output FiCan be expressed as:
F_i = φ_i(H_i, W_i) + H_i
The residual block contains two convolutional layers. Specifically, the residual block function can be expressed as follows:
φ_i(H_i; W_i) = σ_2(W_i^2 * σ_1(W_i^1 * H_i))
where W_i^1 and W_i^2 are the weights of the two convolutional layers and σ_1, σ_2 denote the activations. The input H_i of the i-th residual module is a concatenation of the outputs of the previous residual modules; a convolutional layer with a 1 × 1 kernel is used to control how much of the previous states should be retained, adaptively learning the weights of the different states. The input to the i-th residual block is represented as:
H_i = σ_0(W_i^0 * [F_1, F_2, ..., F_{i-1}])
W_i^0 represents the 1 × 1 convolution weight and σ_0 denotes the ReLU activation.
S3, the network consists of a contraction path, an expansion path, and skip connections, and takes as input a bilinearly interpolated original image (at the required size). The redundancy added by interpolation reduces the information loss of forward propagation and benefits the accurate segmentation of texture regions and smooth regions. A U-net-like structure is used with the convolution blocks replaced by residual blocks, alleviating the vanishing-gradient problem caused by depth; owing to the reusability of the residual network, stacking residual blocks also greatly reduces the number of parameters. The attention-based real-time recovery of scene feature texture proceeds as follows:
1) In the compression path, low-level features are first extracted with convolutional layers. Max pooling is then used to reduce the dimensionality of the data, yielding a larger receptive field. With pooling applied twice in the compression path, the network can use a larger area to predict whether a pixel belongs to a high-frequency region.
2) In the expansion path, deconvolution layers are added to upsample the existing feature maps. The low-level features contain much useful information, much of which is lost during forward propagation. By combining the low-level and high-level features in the expansion path, the output can accurately determine whether a region is textured and whether it needs to be repaired by the feature reconstruction network. The last layer of the network uses the sigmoid activation function to constrain the output mask to the range 0 to 1: the higher the probability that a pixel belongs to a texture region, the closer its mask value is to 1, meaning these pixels need more attention; otherwise, the mask value will be closer to 0. A structural sketch of this mask generator is given after these two steps.
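A compact sketch of the mask generator outlined in steps 1) and 2): a compression path with two 2 × 2 max poolings, an expansion path with deconvolution and skip connections, and a sigmoid that outputs a single-channel mask. Plain 3 × 3 convolutions stand in for the residual/dense blocks of the full network, and all channel widths are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # stand-in for the residual/dense blocks of the full network
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True))

class AttentionMaskNet(nn.Module):
    """U-net-like mask generator: compression path with two 2x2 poolings,
    expansion path with deconvolution + skip connections, sigmoid output."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.enc3 = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, 1, kernel_size=1)

    def forward(self, x):                    # x: interpolated image
        e1 = self.enc1(x)                    # low-level features
        e2 = self.enc2(self.pool(e1))        # first 2x2 pooling
        e3 = self.enc3(self.pool(e2))        # second 2x2 pooling
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.head(d1))  # single-channel mask in [0, 1]
```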
S4, the residual of the HR image is obtained as the element-wise product of the feature reconstruction network output and the mask values, and the final attention generation result is obtained by adding the interpolated LR image that serves as the input of the attention generation network. It can be expressed as:
HR_c(i,j) = F_c(i,j) × M(i,j) + ILR_c(i,j)
where F = [F_1; F_2; F_3] is the output of the feature reconstruction network with 3 output channels, M is the mask value, ILR = [ILR_1, ILR_2, ILR_3] is the interpolated image, and HR = [HR_1, HR_2, HR_3] is the final enhancement result of the process. i and j denote the pixel position in each channel, and c denotes the channel index. The attention generation network increases the residual values from texture regions, while the residual values from non-texture regions approach 0. The mask M is a feature selector that enhances high-frequency characteristics and suppresses noise, so that in the output image the high-frequency details are restored and the noise in smooth areas is removed.
S5, a method for artificially synthesizing pictures is provided, in which the camera is assumed to move linearly over a short time:
[Equation: artificial picture synthesis — the interception rate P expressed in terms of the current speed v, the camera sampling period T, the constant α, the frame index n, and the empirical value 2.57]
where P represents the interception rate of the image center at the current time, v the current speed, T the sampling period of the camera, α a constant, n the n-th frame at the current time, and 2.57 an empirical value obtained from experiments.
S6, the camera pose at each timestamp is acquired from the strengthened image sequence. A CNN with 4 convolutions is used: one 7 × 7 layer, two 5 × 5 layers, and finally a 3 × 3 convolution kernel. The output of the CNN is fed to Bi-LSTM sequence modeling; given the features at time t, the Bi-LSTM at time t is updated as:
s_t = f(U x_t + W s_{t-1})
s'_t = f(U' x_t + W' s_{t+1})
A_t = f(W A_{t-1} + U x_t)
A'_t = f(W' A'_{t+1} + U' x_t)
y_t = g(V A_t + V' A'_t)
where s_t and s'_t are the memory variables in the forward and reverse directions at time t, A_t and A'_t are the hidden-layer variables in the forward and reverse directions at time t, and y_t is the output variable. Further, f and g denote nonlinear activation functions, and U, U', W, W', V, V' denote the weight matrices corresponding to the respective variables. Two fully connected layers are then added after the Bi-LSTM to integrate the trained image-sequence information into a final accurate 6-degree-of-freedom pose, comprising three rotations and three translations.
FIG. 2 is a block schematic diagram of a monocular visual odometry approach based on attention U-net according to one embodiment of the present application. The apparatus may generally include a preprocessing module, a feature reconstruction module, an attention enhancement module, a residual learning module, an artificial data synthesis module, and a context-based pose inference module.
FIG. 3 shows a feature-enhancement model structure consisting of a feature reconstruction module, an attention-based local feature enhancement module, and a residual learning module according to a preferred embodiment of the present invention.
FIG. 4 shows a Bi-LSTM model structure based on a context environment, with the output of the previous CNN model as the Bi-LSTM input, obtaining a camera pose estimate for each timestamp.
The invention aims to protect a local feature enhancement method and a context-based pose inference method. Deep-learning-based visual odometry mainly estimates the camera pose from local feature alignment; however, existing methods are not accurate enough at texture alignment in multi-texture or complex-illumination environments. In addition, camera pose estimation is a serialization problem, and existing methods consider only the forward image sequence; by artificially synthesizing data and adding reverse constraints, the contextual constraints are fully exploited and a more accurate camera pose is obtained at each timestamp.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
An embodiment of the present application also provides a computing device, which, with reference to fig. 5, comprises a memory 520, a processor 510 and a computer program stored in said memory 520 and executable by said processor 510, the computer program being stored in a space 530 for program code in the memory 520, the computer program, when executed by the processor 510, implementing a method 531 for performing any of the methods according to the present invention.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 6, the computer readable storage medium comprises a storage unit for program code provided with a program 531' for performing the steps of the method according to the invention, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The final example results are shown in FIG. 7: based on the runs on sequences 01 and 07 of the KITTI dataset, the camera pose trajectory estimated by the model is compared with the ground-truth trajectory.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A monocular visual odometry method based on attention U-net, comprising:
the monocular image sequence is obtained; a plurality of adjacent images are passed sequentially through a shot-boundary recognition algorithm, shot boundaries are identified from the continuous frames, the whole module operates on key frames, and a Gaussian pyramid is used to reduce the dimension of the original images, thereby reducing the subsequent computation.
Using an attention-based local feature enhancement method, an attention mechanism is added to a U-net self-encoding network and the key-frame sequence is input into the network; texture regions are first distinguished from smooth regions using the attention-based method, and when locating the positions of high-frequency details, the attention mechanism acts as a feature selector that enhances the high-frequency features and suppresses noise in the smooth regions.
The feature-strengthened sequence is input into the final Bi-LSTM; at each timestamp, an image approximating frame t_{n+1} is generated from frame t_n and used as input for the reverse sequence, and the camera pose of each timestamp is acquired.
2. The method of claim 1, wherein the key-frame sequence computation identifies shot boundaries for each frame by dividing each frame into non-overlapping grids of size 16 × 16 and calculating the corresponding grid histogram difference d between two adjacent frames using the chi-square distance:
[Equation: chi-square distance d between the corresponding grid histograms of two adjacent frames]
H_i represents the histogram of the i-th frame and H_{i+1} the histogram of the (i+1)-th frame; I denotes the image block at the same position in both frames. The mean histogram difference between two consecutive frames is calculated as follows:
[Equation: D = (1/N) Σ_k d_k, the average of the block-wise chi-square differences]
D is the average histogram difference of two consecutive frames, d_k is the chi-square difference between the k-th image blocks, and N is the total number of image blocks in the image. Shot boundaries are identified on frames whose histogram difference is greater than a threshold T_shot:
[Equation: shot-boundary decision — a frame is marked as a shot boundary when D exceeds the threshold T_shot]
The obtained picture sequence is then reduced in dimension with a standard Gaussian pyramid, down to 1/4 of the original image by a convolution with stride 2.
3. The method of claim 1, wherein feature reconstruction is performed to identify and strengthen multi-texture regions and to strengthen high-frequency details. L_{i-1} represents the input to the i-th convolutional layer, and the output of the i-th layer is expressed as:
L_i = σ(W_i * L_{i-1} + b_i)
the method comprises the following steps that a, a feature reconstruction network and a residual module (Resblock), wherein the operation of convolution is denoted by sigma, nonlinear activation (ReLU), the feature reconstruction network is composed of a convolution layer for feature extraction, a plurality of stacked dense blocks and a sub-pixel convolution layer serving as an upsampling module, and the dense blocks are composed of the residual module (Resblock) and show strong object recognition learning capacity. Let HiFor the input of the i-th residual block, output FiCan be expressed as:
F_i = φ_i(H_i, W_i) + H_i
The residual block contains two convolutional layers. Specifically, the residual block function can be expressed as follows:
φ_i(H_i; W_i) = σ_2(W_i^2 * σ_1(W_i^1 * H_i))
where W_i^1 and W_i^2 are the weights of the two convolutional layers and σ_1, σ_2 denote the activations. The input H_i of the i-th residual module is a concatenation of the outputs of the previous residual modules; a convolutional layer with a 1 × 1 kernel is used to control how much of the previous states should be retained, adaptively learning the weights of the different states. The input to the i-th residual block is represented as:
H_i = σ_0(W_i^0 * [F_1, F_2, ..., F_{i-1}])
W_i^0 represents the 1 × 1 convolution weight and σ_0 denotes the ReLU activation.
4. The method of claim 1, wherein attention is generated and the exact locations of texture are determined. A U-net-like structure is adopted with the convolutional layers replaced by dense blocks; in the compression path, a convolutional layer first extracts low-level features of the interpolated image, and the dimensionality of the data is then reduced using 2 × 2 max pooling to obtain a larger receptive field, with pooling applied twice in the compression path. In the expansion path, deconvolution layers are added to upsample the existing feature maps. By combining the low-level and high-level features in the expansion path, the output can accurately determine whether a region is textured and whether it needs to be repaired by the feature reconstruction network. The network outputs a single feature channel, and at the last layer a sigmoid activation constrains the output mask to the range 0 to 1. The higher the probability that a pixel belongs to a texture region, the closer its mask value is to 1, meaning these pixels require more attention; otherwise, the mask value will be closer to 0.
5. The method of claim 1, wherein attention-based residual learning obtains the enhanced image residual from the output of the feature enhancement network and the generated mask, and the interpolated pre-enhancement features serve as the input of the attention generation network to obtain the final attention generation result. The residual of the HR image is obtained as the element-wise product of the feature reconstruction network output and the mask values, and the final attention generation result is obtained by adding the interpolated LR image that serves as the input of the attention generation network. It can be expressed as:
HR_c(i,j) = F_c(i,j) × M(i,j) + ILR_c(i,j)
where F = [F_1; F_2; F_3] is the output of the feature reconstruction network with 3 output channels, M is the mask value, ILR = [ILR_1, ILR_2, ILR_3] is the interpolated image, and HR = [HR_1, HR_2, HR_3] is the final enhancement result of the process. i and j denote the pixel position in each channel, and c denotes the channel index.
6. The method of claim 1, wherein the camera pose at each timestamp is acquired from the enhanced image sequence. A CNN with 4 convolutions is used: one 7 × 7 layer, two 5 × 5 layers, and finally a 3 × 3 convolution kernel; the output of the CNN is fed to Bi-LSTM sequence modeling, and given the features at time t, the Bi-LSTM at time t is updated as:
s_t = f(U x_t + W s_{t-1})
s'_t = f(U' x_t + W' s_{t+1})
A_t = f(W A_{t-1} + U x_t)
A'_t = f(W' A'_{t+1} + U' x_t)
y_t = g(V A_t + V' A'_t)
where s_t and s'_t are the memory variables in the forward and reverse directions at time t, A_t and A'_t are the hidden-layer variables in the forward and reverse directions at time t, and y_t is the output variable. Further, f and g denote nonlinear activation functions, and U, U', W, W', V, V' denote the weight matrices corresponding to the respective variables. Two fully connected layers are then added after the Bi-LSTM to integrate the trained image-sequence information into a final accurate 6-degree-of-freedom pose, comprising three rotations and three translations.
7. The method according to claims 1 and 6, characterized in that, since acquiring a reverse sequence is impractical in real applications, a method of artificially synthesizing pictures is proposed in order to add a reverse constraint, assuming that the camera moves linearly over a short time:
[Equation: artificial picture synthesis — the interception rate P expressed in terms of the current speed v, the camera sampling period T, the constant α, the frame index n, and the empirical value 2.57]
where P represents the interception rate of the image center at the current time, v the current speed, T the sampling period of the camera, α a constant, n the n-th frame at the current time, and 2.57 an empirical value obtained from experiments.
8. The method of claims 1-7, wherein the data set used in the method is a KITTI data set in a laboratory environment.
CN202010813907.5A 2020-08-13 2020-08-13 Monocular vision odometer method based on attention U-net Active CN112115786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010813907.5A CN112115786B (en) 2020-08-13 2020-08-13 Monocular vision odometer method based on attention U-net

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010813907.5A CN112115786B (en) 2020-08-13 2020-08-13 Monocular vision odometer method based on attention U-net

Publications (2)

Publication Number Publication Date
CN112115786A true CN112115786A (en) 2020-12-22
CN112115786B CN112115786B (en) 2024-08-13

Family

ID=73803967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010813907.5A Active CN112115786B (en) 2020-08-13 2020-08-13 Monocular vision odometer method based on attention U-net

Country Status (1)

Country Link
CN (1) CN112115786B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112268564A (en) * 2020-12-25 2021-01-26 中国人民解放军国防科技大学 Unmanned aerial vehicle landing space position and attitude end-to-end estimation method
CN113989318A (en) * 2021-10-20 2022-01-28 电子科技大学 Monocular vision odometer pose optimization and error correction method based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200139973A1 (en) * 2018-11-01 2020-05-07 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle
CN110473254A (en) * 2019-08-20 2019-11-19 北京邮电大学 A kind of position and orientation estimation method and device based on deep neural network
CN110889361A (en) * 2019-11-20 2020-03-17 北京影谱科技股份有限公司 ORB feature visual odometer learning method and device based on image sequence
CN111405360A (en) * 2020-03-25 2020-07-10 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张聪聪 (Zhang Congcong); 何宁 (He Ning): "Human action recognition method based on a key-frame two-stream convolutional network", Journal of Nanjing University of Information Science & Technology (Natural Science Edition), no. 06, 28 November 2019 (2019-11-28) *

Also Published As

Publication number Publication date
CN112115786B (en) 2024-08-13

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
Zou et al. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency
WO2022036777A1 (en) Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN111797881B (en) Image classification method and device
WO2021249114A1 (en) Target tracking method and target tracking device
CN113610087B (en) Priori super-resolution-based image small target detection method and storage medium
CN112529904A (en) Image semantic segmentation method and device, computer readable storage medium and chip
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN112115786B (en) Monocular vision odometer method based on attention U-net
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN113096239A (en) Three-dimensional point cloud reconstruction method based on deep learning
CN116097307A (en) Image processing method and related equipment
CN110889868B (en) Monocular image depth estimation method combining gradient and texture features
CN116402851A (en) Infrared dim target tracking method under complex background
CN117934308A (en) Lightweight self-supervision monocular depth estimation method based on graph convolution network
CN112541972A (en) Viewpoint image processing method and related equipment
CN114494699A (en) Image semantic segmentation method and system based on semantic propagation and foreground and background perception
CN112509014B (en) Robust interpolation light stream computing method matched with pyramid shielding detection block
CN117710429A (en) Improved lightweight monocular depth estimation method integrating CNN and transducer
CN112686828A (en) Video denoising method, device, equipment and storage medium
CN112561925A (en) Image segmentation method, system, computer device and storage medium
CN117132651A (en) Three-dimensional human body posture estimation method integrating color image and depth image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant