WO2024082602A1 - End-to-end visual odometry method and device - Google Patents

End-to-end visual odometry method and device Download PDF

Info

Publication number
WO2024082602A1
Authority
WO
WIPO (PCT)
Prior art keywords
current frame
image information
pooling layer
layer
data processed
Prior art date
Application number
PCT/CN2023/091529
Other languages
English (en)
French (fr)
Inventor
王祎男
梁贵友
关瀛洲
曹礼军
翟诺
王迪
曹容川
张天奇
Original Assignee
中国第一汽车股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国第一汽车股份有限公司 filed Critical 中国第一汽车股份有限公司
Publication of WO2024082602A1 publication Critical patent/WO2024082602A1/zh


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20048Transform domain processing
    • G06T2207/20056Discrete and fast Fourier transform, [DFT, FFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle

Definitions

  • The present application relates to the field of autonomous driving technology, and in particular to an end-to-end visual odometry method and an end-to-end visual odometry device.
  • Simultaneous Localization and Mapping (SLAM) is one of the important research directions in the field of computer vision.
  • In research related to autonomous driving, SLAM is one of the key core technologies.
  • Visual Odometry (VO) is the front end of the SLAM framework; it analyzes and processes image sequences from the on-board navigation video using computer vision techniques and outputs the estimated pose of the vehicle.
  • VO takes the image sequences collected at adjacent moments as input, generates a preliminarily optimized local map while estimating the vehicle motion, and provides it to the back end for further optimization.
  • Traditional VO methods mainly include feature point method and direct method.
  • the feature point method needs to extract the feature points of the image sequence, build a geometric model through feature matching, and estimate the motion of the vehicle.
  • the direct method usually estimates the vehicle motion between adjacent image sequences based on the assumption of photometric invariance.
  • the accuracy of VO pose estimation affects the overall trajectory accuracy of the SLAM system.
  • traditional feature extraction algorithms are easily affected by noise, lighting conditions and viewing angles, and their robustness is poor.
  • In addition, the types of feature points extracted by such algorithms are relatively limited, which affects the accuracy of subsequent feature matching and, in turn, the accuracy of the output pose estimation.
  • Deep VO is a widely used end-to-end VO algorithm. It is a supervised learning method that can directly estimate the corresponding pose of the vehicle from the input image sequence.
  • The object of the present invention is to provide an end-to-end visual odometry method to solve at least one of the above-mentioned technical problems.
  • An end-to-end visual odometry method for obtaining pose estimation information of a camera device on a vehicle, the end-to-end visual odometry method comprising:
  • the image information of the current frame and the brightness image information of the current frame are fused to obtain the fused image information of the current frame;
  • the image information of the previous frame of the current frame and the brightness image information of the previous frame of the current frame are fused to obtain the fused image information of the previous frame of the current frame;
  • the feature extraction of the fused image information of the current frame and the fused image information of the previous frame of the current frame is performed by a jump-fusion-FCNN method to obtain the fused image feature;
  • pose estimation information of the camera device is obtained according to the fused image features.
  • performing grayscale transformation processing on the current frame image information to obtain brightness image information of the current frame includes:
  • the current frame source image sequence is transformed into a grayscale space, and each pixel of the current frame image information is grouped, so that each pixel is divided into three groups, wherein the three groups include a current frame dark pixel group, a current frame medium pixel group, and a current frame bright pixel group;
  • the grayscale conversion process is performed on the image information of the previous frame of the current frame to obtain the brightness image information of the previous frame of the current frame, including:
  • Transforming an image sequence of a frame before the current frame into a grayscale space and performing set division on each pixel of the image information of the frame before the current frame, so as to divide each pixel into three sets, wherein the three sets include a dark pixel set of the frame before the current frame, a medium pixel set of the frame before the current frame, and a bright pixel set of the frame before the current frame;
  • Grayscale transformation is performed on image information of a frame previous to the current frame according to the exposure level, and grayscale values of underexposed pixels are expanded, thereby obtaining brightness image information of a frame previous to the current frame.
  • fusing the current frame image information and the brightness image information of the current frame to obtain the current frame fused image information includes:
  • the current frame image information and the brightness image information of the current frame are fused using the following formula:
  • Fusion(I, I′) = ω_p · I + (1 − ω_p) · I′, where ω_p denotes the weight at the position of pixel p in the current frame image information;
  • I is the source image sequence of the current frame;
  • I′ is the brightness image information of the current frame;
  • Fusion(I, I′) denotes the fused image information of the current frame;
  • G(x) denotes a Gaussian filter;
  • F and F⁻¹ denote the Fourier transform and its inverse transform, respectively;
  • H_{n×n} denotes an n×n matrix in which every element equals 1/n²; the real part and the imaginary part of the complex matrix are taken, respectively;
  • I′_i(p) denotes the pixel value of pixel p after expansion;
  • I(p) denotes the grayscale value of pixel p;
  • SM(I) is the saliency map.
  • Extracting features of the fused image information of the current frame and the fused image information of the previous frame of the current frame by using the jump-fusion-FCNN method to obtain fused image features includes:
  • obtaining an FCNN neural network model that includes five pooling layers and seven convolutional layers, where the five pooling layers are referred to as the first to fifth pooling layers and the seven convolutional layers are referred to as the first to seventh convolutional layers;
  • superimposing the fused image information of the current frame and the fused image information of the previous frame of the current frame to form final input image information, and inputting the final input image information into the FCNN neural network model, so that it is processed in sequence by the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the fourth pooling layer, the fifth convolutional layer, the fifth pooling layer, the sixth convolutional layer and the seventh convolutional layer;
  • generating a first path feature from the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer; generating a second path feature from the data processed by the second, third and fourth pooling layers and the seventh convolutional layer; generating a third path feature from the data processed by the first, second, third and fourth pooling layers and the seventh convolutional layer;
  • the first path feature, the second path feature and the third path feature are fused to obtain the fused image feature.
  • the first pooling layer, the second pooling layer, the third pooling layer, the fourth pooling layer and the fifth pooling layer have different parameters respectively;
  • Generating the first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer, and the data processed by the seventh convolutional layer includes:
  • downsampling the data processed by the third pooling layer by a factor of 4 and the data processed by the fourth pooling layer by a factor of 2, summing the downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the three different depths to obtain the first path feature.
  • generating the second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer, and the data processed by the seventh convolutional layer includes:
  • the data downsampled by a factor of 8, the data downsampled by a factor of 4 and the data downsampled by a factor of 2 are summed element-wise with the data processed by the seventh convolutional layer, and the prediction results of the four different depths are merged to obtain the second path feature.
  • generating a third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer, and the data processed by the seventh convolutional layer includes:
  • the data downsampled by a factor of 16, the data downsampled by a factor of 8, the data downsampled by a factor of 4 and the data downsampled by a factor of 2 are summed element-wise with the data processed by the seventh convolutional layer, and the prediction results of the five different depths are merged to obtain the third path feature.
  • the parameters of the pooling layer include image size parameters and the number of channels;
  • the parameters of the convolution layer include image size parameters and the number of channels;
  • the image size parameter of the first pooling layer is (M/2) ⁇ (N/2); the number of channels of the first pooling layer is 64;
  • the image size parameter of the second pooling layer is (M/4) ⁇ (N/4); the number of channels of the second pooling layer is 128;
  • the image size parameter of the third pooling layer is (M/8) ⁇ (N/8); the number of channels of the third pooling layer is 256;
  • the image size parameter of the fourth pooling layer is (M/16) ⁇ (N/16); the number of channels of the fourth pooling layer is 256;
  • the image size parameter of the fifth pooling layer is (M/32) ⁇ (N/32); the number of channels of the fifth pooling layer is 512;
  • the image size parameter of the sixth convolutional layer is 4096 ⁇ (M/32) ⁇ (N/32); the number of channels of the sixth convolutional layer is 512;
  • the image size parameter of the seventh convolutional layer is 4096 ⁇ (M/32) ⁇ (N/32); the number of channels of the seventh convolutional layer is 512.
  • acquiring pose estimation information according to the fused image features includes:
  • the fused image features are input into a long short-term memory neural network to obtain the pose estimation information of the camera device.
  • The present application also provides an end-to-end visual odometry device, the end-to-end visual odometry device comprising:
  • an image acquisition module, configured to acquire image information of a current frame and image information of the frame before the current frame provided by a camera device;
  • a grayscale transformation processing module, configured to perform grayscale transformation processing on the image information of the current frame and the image information of the previous frame of the current frame, so as to obtain the brightness image information of the current frame and the brightness image information of the previous frame of the current frame;
  • a fusion module, configured to fuse the current frame image information with the brightness image information of the current frame to obtain the current frame fused image information, and to fuse the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to obtain the fused image information of the previous frame of the current frame;
  • a feature extraction module, configured to extract features of the fused image information of the current frame and the fused image information of the previous frame of the current frame by using the jump-fusion-FCNN method to obtain fused image features;
  • a pose estimation module, configured to obtain the pose estimation information of the camera device according to the fused image features.
  • The end-to-end visual odometry method of the present application obtains the brightness image of the source image sequence by grayscale transformation, and designs an image fusion algorithm based on spectral residual theory to merge the image sequence and its brightness image, which enhances the contrast of the image and provides more detail information.
  • To improve the accuracy of image feature extraction and reduce the error in pose estimation, the present application designs a feature extraction algorithm based on jump-fusion-FCNN: it improves the traditional fully convolutional neural network (FCNN), proposes a jump-fusion-FCNN network model, and constructs three different paths for feature extraction. In each path, prediction results of different depths are fused by downsampling to obtain a feature map. The three feature maps are then merged to obtain the fused image features, taking into account both the structural information and the detail information of the image.
  • FIG. 1 is a flow chart of an end-to-end visual odometry method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an electronic device capable of implementing the end-to-end visual odometry method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the architecture of an end-to-end visual odometry method according to an embodiment of the present application.
  • FIG. 1 is a flow chart of an end-to-end visual odometry method according to an embodiment of the present application.
  • The end-to-end visual odometry method of the present application is used to obtain the pose estimation information of a camera device on a vehicle.
  • The end-to-end visual odometry method shown in FIG. 1 and FIG. 3 includes:
  • Step 1: acquire the image information of the current frame and the image information of the previous frame provided by the camera device;
  • Step 2: perform grayscale transformation processing on the image information of the current frame and the image information of the previous frame of the current frame, so as to obtain the brightness image information of the current frame and the brightness image information of the previous frame of the current frame;
  • Step 3: fuse the current frame image information and the brightness image information of the current frame to obtain the current frame fused image information;
  • Step 4: fuse the image information of the previous frame of the current frame and the brightness image information of the previous frame of the current frame to obtain the fused image information of the previous frame of the current frame, and extract features of the fused image information of the current frame and the fused image information of the previous frame of the current frame by using the jump-fusion-FCNN method to obtain the fused image features;
  • Step 5: obtain the pose estimation information of the camera device based on the fused image features.
  • The end-to-end visual odometry method of the present application obtains the brightness image of the source image sequence by grayscale transformation, and designs an image fusion algorithm based on spectral residual theory to merge the image sequence and its brightness image, thereby enhancing the contrast of the image and providing more detail information.
  • To improve the accuracy of image feature extraction and reduce the error in pose estimation, the present application designs a feature extraction algorithm based on jump-fusion-FCNN: it improves the traditional fully convolutional neural network (FCNN), proposes a jump-fusion-FCNN network model, and constructs three different paths for feature extraction. In each path, the prediction results of different depths are fused by downsampling to obtain a feature map. The three feature maps are merged to obtain the fused image features, taking into account both the structural information and the detail information of the image.
  • grayscale conversion processing is performed on the current frame image information to obtain the brightness image information of the current frame, including:
  • the current frame source image sequence is transformed into a grayscale space, and each pixel of the current frame image information is grouped, so that each pixel is divided into three groups, wherein the three groups include a current frame dark pixel group, a current frame medium pixel group, and a current frame bright pixel group;
  • the grayscale of the current frame source image sequence is transformed according to the exposure, and the grayscale value of the underexposed pixel is expanded to obtain the brightness image information of the current frame.
  • The source image sequence is transformed into grayscale space, and the pixels in the source image I are divided into a dark class (I_D), a medium class (I_M) and a bright class (I_B). Assuming p is a pixel in the source image I, p is classified by the following formula.
  • I_D denotes the dark pixel set;
  • I_M denotes the medium pixel set;
  • I_B denotes the bright pixel set;
  • I(p) denotes the grayscale value of pixel p.
  • τ₁ and τ₂ are two thresholds, which can be obtained by the multi-threshold Otsu algorithm.
  • the grayscale transformation is performed on the current frame source image sequence to expand the grayscale value of the underexposed pixels.
  • the calculation method is as follows.
  • spectral residual theory is used to perform saliency detection on the source image and its brightness image to achieve fusion of the two images.
  • fusing the current frame image information and the current frame brightness image information to obtain the current frame fused image information includes:
  • Fusion(I, I′) = ω_p · I + (1 − ω_p) · I′, where ω_p denotes the weight at the position of pixel p in the current frame image information;
  • I is the source image sequence of the current frame;
  • I′ is the brightness image information of the current frame;
  • Fusion(I, I′) denotes the fused image information of the current frame;
  • G(x) denotes a Gaussian filter;
  • F and F⁻¹ denote the Fourier transform and its inverse transform, respectively;
  • H_{n×n} denotes an n×n matrix in which every element equals 1/n²; the real part and the imaginary part of the complex matrix are taken, respectively;
  • I′_i(p) denotes the pixel value of pixel p after expansion;
  • I(p) denotes the grayscale value of pixel p;
  • SM(I) is the saliency map.
  • performing grayscale transformation processing on image information of a frame before the current frame, thereby obtaining brightness image information of the frame before the current frame includes:
  • Transforming an image sequence of a frame before the current frame into a grayscale space and performing set division on each pixel of the image information of the frame before the current frame, so as to divide each pixel into three sets, wherein the three sets include a dark pixel set of the frame before the current frame, a medium pixel set of the frame before the current frame, and a bright pixel set of the frame before the current frame;
  • Grayscale transformation is performed on image information of a frame previous to the current frame according to the exposure level, and grayscale values of underexposed pixels are expanded, thereby obtaining brightness image information of a frame previous to the current frame.
  • the jump-fusion-FCNN method is used to extract features of the fused image information of the current frame and the fused image information of the previous frame of the current frame to obtain the fused image features, including:
  • Get an FCNN neural network model which includes five pooling layers and seven convolutional layers, where the five pooling layers are respectively called the first pooling layer, the second pooling layer, the third pooling layer, the fourth pooling layer, and the fifth pooling layer; the seven convolutional layers are respectively called the first convolutional layer, the second convolutional layer, the third convolutional layer, the fourth convolutional layer, the fifth convolutional layer, the sixth convolutional layer, and the seventh convolutional layer;
  • Superimpose the fused image information of the current frame and the fused image information of the previous frame of the current frame to form the final input image information, and input the final input image information into the FCNN neural network model, so that it is processed in sequence by the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the fourth pooling layer, the fifth convolutional layer, the fifth pooling layer, the sixth convolutional layer and the seventh convolutional layer; the first, second and third path features are then generated from the corresponding pooling-layer and seventh-convolutional-layer outputs as described below.
  • the first path feature, the second path feature and the third path feature are fused to obtain a fused image feature.
  • the first pooling layer, the second pooling layer, the third pooling layer, the fourth pooling layer and the fifth pooling layer have different parameters respectively;
  • generating the first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer, and the data processed by the seventh convolutional layer includes:
  • the data processed by the third pooling layer is downsampled by a factor of 4, and the data processed by the fourth pooling layer is downsampled by a factor of 2;
  • the downsampled data are summed element-wise with the data processed by the seventh convolutional layer.
  • the prediction results of three different depths are merged to obtain the first path feature.
  • generating the second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer, and the data processed by the seventh convolutional layer includes:
  • the data downsampled by a factor of 8, the data downsampled by a factor of 4 and the data downsampled by a factor of 2 are summed element-wise with the data processed by the seventh convolutional layer.
  • the prediction results of the four different depths are merged to obtain the second path features.
  • generating a third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer, and the data processed by the seventh convolutional layer includes:
  • the data downsampled by a factor of 16, the data downsampled by a factor of 8, the data downsampled by a factor of 4 and the data downsampled by a factor of 2 are summed element-wise with the data processed by the seventh convolutional layer.
  • the prediction results of the five different depths are merged to obtain the third path features.
  • the parameters of the pooling layer include image size parameters and the number of channels;
  • the parameters of the convolution layer include image size parameters and the number of channels;
  • the image size parameter of the first pooling layer is (M/2) ⁇ (N/2); the number of channels of the first pooling layer is 64;
  • the image size parameter of the second pooling layer is (M/4) ⁇ (N/4); the number of channels of the second pooling layer is 128;
  • the image size parameter of the third pooling layer is (M/8) ⁇ (N/8); the number of channels of the third pooling layer is 256;
  • the image size parameter of the fourth pooling layer is (M/16) ⁇ (N/16); the number of channels of the fourth pooling layer is 256;
  • the image size parameter of the fifth pooling layer is (M/32) ⁇ (N/32); the number of channels of the fifth pooling layer is 512;
  • the image size parameter of the sixth convolutional layer is 4096 ⁇ (M/32) ⁇ (N/32); the number of channels of the sixth convolutional layer is 512;
  • the image size parameter of the seventh convolutional layer is 4096 ⁇ (M/32) ⁇ (N/32); the number of channels of the seventh convolutional layer is 512.
  • The present application designs an end-to-end visual odometry algorithm to obtain the estimated pose.
  • First, to better extract the feature information of the image sequence, the present application designs a jump-fusion-FCNN network framework.
  • Feature information of the image sequence at different strides is obtained through three different paths, which takes into account both the detail information and the structural information of the image, and the feature information of the three paths is merged following the fusion idea.
  • Second, the present invention uses an LSTM-based recurrent neural network to perform sequential modeling of the dynamic changes and correlations among the feature information, and then outputs the estimated pose.
  • the first path focuses on the structural information of the image, and the obtained feature map is robust.
  • the third path fully considers the detailed information of the image, and the obtained feature map is more refined.
  • the feature map obtained by the second path is used to balance the results of the above two paths.
  • the feature maps obtained by the three paths are merged to obtain feature fusion information as the input of the RNN network layer.
  • obtaining pose estimation information according to fused image features includes:
  • the fused image features are input into the long short-term memory neural network to obtain the pose estimation information of the camera device.
  • The Long Short-Term Memory (LSTM) network has memory cells and gate control functions; it can discard or retain the hidden-layer state of the previous time step to update the hidden-layer state of the current time step, and then output the estimated pose at the current time step.
  • LSTM enables the RNN network to have memory function and strong learning ability.
  • At time t−1, the hidden-layer state of the LSTM is denoted h_{t−1} and the memory cell is denoted c_{t−1}.
  • The input at time t is x_t;
  • the updated hidden-layer state and memory cell are then defined as follows,
  • sigmoid and tanh are two activation functions, W represents the corresponding weight matrix, and b represents the bias vector.
  • the LSTM network consists of two network layers, LSTM1 and LSTM2.
  • the hidden layer state of LSTM1 is used as the input of LSTM2.
  • Each LSTM network layer contains 1000 hidden units and outputs the estimated pose corresponding to the current time step, i.e., a 6-degree-of-freedom pose vector.
  • the loss function of the network is defined as follows:
  • N denotes the number of image sequences in the sample dataset; the loss compares the estimated pose and the ground-truth pose of the image at time j in the i-th sequence relative to the image at the previous time step.
  • ‖·‖₂ denotes the 2-norm of the matrix, and α > 0 is a constant.
  • The pose estimation of the visual odometry is thus transformed into solving for the optimal network parameters δ*, from which the pose estimation information of the camera device is finally obtained.
  • The present application also provides an end-to-end visual odometry device, which includes an image acquisition module, a grayscale transformation processing module, a fusion module, a feature extraction module and a pose estimation module. The image acquisition module is configured to acquire the current frame image information and the image information of the previous frame of the current frame provided by the camera device; the grayscale transformation processing module is configured to perform grayscale transformation processing on the current frame image information and the image information of the previous frame of the current frame, respectively, so as to acquire the brightness image information of the current frame and the brightness image information of the previous frame of the current frame; the fusion module is configured to fuse the current frame image information with the brightness image information of the current frame to acquire the current frame fused image information, and to fuse the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to acquire the fused image information of the previous frame of the current frame; the feature extraction module is configured to perform feature extraction on the current frame fused image information and the fused image information of the previous frame of the current frame by using the jump-fusion-FCNN method to obtain fused image features; and the pose estimation module is configured to obtain the pose estimation information of the camera device according to the fused image features.
  • The present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • When the processor executes the computer program, the above-described end-to-end visual odometry method based on image fusion and FCNN-LSTM is implemented.
  • The present application also provides a computer-readable storage medium, which stores a computer program.
  • When the computer program is executed by a processor, the above end-to-end visual odometry method can be implemented.
  • FIG. 2 is an exemplary structural diagram of an electronic device capable of implementing the end-to-end visual odometry method provided according to an embodiment of the present application.
  • the electronic device includes an input device 501, an input interface 502, a central processing unit 503, a memory 504, an output interface 505, and an output device 506.
  • the input interface 502, the central processing unit 503, the memory 504, and the output interface 505 are interconnected through a bus 507, and the input device 501 and the output device 506 are connected to the bus 507 through the input interface 502 and the output interface 505, respectively, and then connected to other components of the electronic device.
  • The input device 501 receives input information from the outside and transmits the input information to the central processing unit 503 through the input interface 502; the central processing unit 503 processes the input information based on the computer-executable instructions stored in the memory 504 to generate output information, temporarily or permanently stores the output information in the memory 504, and then transmits the output information to the output device 506 through the output interface 505; the output device 506 outputs the output information to the outside of the electronic device for use by the user.
  • The electronic device shown in FIG. 2 may also be implemented to include: a memory storing computer-executable instructions; and one or more processors which, when executing the computer-executable instructions, can implement the end-to-end visual odometry method described in conjunction with FIG. 1.
  • The electronic device shown in FIG. 2 can be implemented to include: a memory 504 configured to store executable program code; and one or more processors 503 configured to run the executable program code stored in the memory 504 to execute the end-to-end visual odometry method based on image fusion and FCNN-LSTM in the above embodiment.
  • a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and media can be implemented by any method or technology to store information.
  • Information can be computer-readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • the embodiments of the present application may be provided as methods, systems or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment or an embodiment in combination with software and hardware. Moreover, the present application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
  • Each block in the flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function.
  • In some alternative implementations, the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two successively shown blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the processor referred to in this embodiment may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the memory can be used to store computer programs and/or modules.
  • the processor implements various functions of the device/terminal equipment by running or executing the computer programs and/or modules stored in the memory, and calling the data stored in the memory.
  • the memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), etc.
  • The memory may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another volatile solid-state storage device.
  • If the modules/units integrated in the device/terminal equipment are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • Based on this understanding, the present invention may implement all or part of the processes in the above method embodiments by instructing the relevant hardware through a computer program.
  • the computer program is executed by the processor, the steps of the above-mentioned various method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code can be in source code form, object code form, executable file or some intermediate form.
  • The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like.
  • the embodiments of the present application may be provided as methods, systems or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment or an embodiment in combination with software and hardware. Moreover, the present application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses an end-to-end visual odometry method and device, belonging to the technical field of autonomous driving. The end-to-end visual odometry method comprises: acquiring image information of a current frame and image information of the previous frame; acquiring brightness image information of the current frame and brightness image information of the previous frame; acquiring fused image information of the current frame; acquiring fused image information of the previous frame of the current frame; performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of a jump-fusion-FCNN method so as to acquire fused image features; and acquiring pose estimation information of a camera device according to the fused image features. The method of the present application can enhance the contrast of the image and provide more detail information, thereby improving the accuracy of image feature extraction and reducing the error in the pose estimation process.

Description

End-to-end visual odometry method and device
Technical Field
The present application relates to the field of autonomous driving technology, and in particular to an end-to-end visual odometry method and an end-to-end visual odometry device.
Background Art
Simultaneous Localization and Mapping (SLAM) is one of the important research directions in the field of computer vision. In research related to autonomous driving, SLAM is one of the key core technologies. A SLAM system has to perform a large number of pose estimation tasks. Visual odometry (VO) is the front end of the SLAM framework; its purpose is to analyze and process image sequences from the on-board navigation video using computer vision techniques and to output the estimated pose of the vehicle. VO takes the image sequences collected at adjacent moments as input, generates a preliminarily optimized local map while estimating the vehicle motion, and provides it to the back end for further optimization. Traditional VO methods mainly include the feature-point method and the direct method. The feature-point method extracts feature points of the image sequence and builds a geometric model through feature matching to estimate the motion of the vehicle. The direct method usually estimates the vehicle motion between adjacent image sequences based on the photometric-invariance assumption. The accuracy of VO pose estimation affects the overall trajectory accuracy of the SLAM system. However, traditional feature extraction algorithms are easily affected by noise, lighting conditions and viewing angle, and their robustness is poor. In addition, the types of feature points extracted by such algorithms are relatively limited, which affects the accuracy of subsequent feature matching and, in turn, the accuracy of the output pose estimation.
With the maturity of imaging technology and the rapid development of computer vision, VO methods have been studied in depth and widely applied. At present, deep learning plays an increasingly important role in computer vision; it has strong learning capability and the ability to extract deeper and more abstract features, and has become one of the most important feature extraction approaches in VO. Deep-learning feature extraction methods can learn the intrinsic correlations between image sequences and extract feature points with excellent performance. Deep VO is a widely used end-to-end VO algorithm. It is a supervised learning method that can directly estimate the corresponding pose of the vehicle from the input image sequence.
However, for images collected in low-light or unevenly lit scenes, good performance still cannot be guaranteed, because the image contrast is low and dynamic motion detail features are lacking.
Therefore, a technical solution is desired that solves, or at least mitigates, the above-mentioned deficiencies of the prior art.
Summary of the Invention
The object of the present invention is to provide an end-to-end visual odometry method to solve at least one of the above technical problems.
One aspect of the present invention provides an end-to-end visual odometry method for obtaining pose estimation information of a camera device on a vehicle, the end-to-end visual odometry method comprising:
acquiring image information of a current frame and image information of the frame previous to the current frame, both provided by the camera device;
performing grayscale transformation processing on the image information of the current frame and on the image information of the previous frame of the current frame, respectively, so as to obtain brightness image information of the current frame and brightness image information of the previous frame of the current frame;
fusing the image information of the current frame with the brightness image information of the current frame to obtain fused image information of the current frame; fusing the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to obtain fused image information of the previous frame of the current frame; performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of a jump-fusion-FCNN method to obtain fused image features;
obtaining pose estimation information of the camera device according to the fused image features.
Optionally, performing grayscale transformation processing on the image information of the current frame so as to obtain the brightness image information of the current frame includes:
obtaining the current-frame source image sequence from the image information of the current frame;
transforming the current-frame source image sequence into grayscale space and partitioning the pixels of the current-frame image information into sets, so that the pixels are divided into three sets: a current-frame dark pixel set, a current-frame medium pixel set and a current-frame bright pixel set;
calculating the exposure of each pixel in each set;
performing grayscale transformation on the current-frame source image sequence according to the exposure and expanding the grayscale values of under-exposed pixels, so as to obtain the brightness image information of the current frame;
performing grayscale transformation processing on the image information of the previous frame of the current frame so as to obtain the brightness image information of the previous frame of the current frame includes:
obtaining the image sequence of the previous frame of the current frame from the image information of the previous frame of the current frame;
transforming the image sequence of the previous frame of the current frame into grayscale space and partitioning the pixels of the image information of the previous frame of the current frame into sets, so that the pixels are divided into three sets: a dark pixel set, a medium pixel set and a bright pixel set of the previous frame of the current frame;
calculating the exposure of each pixel in each set;
performing grayscale transformation on the image information of the previous frame of the current frame according to the exposure and expanding the grayscale values of under-exposed pixels, so as to obtain the brightness image information of the previous frame of the current frame.
Optionally, fusing the image information of the current frame with the brightness image information of the current frame to obtain the fused image information of the current frame includes:
fusing the image information of the current frame and the brightness image information of the current frame using the following formula:
Fusion(I, I′) = ω_p · I + (1 − ω_p) · I′, where ω_p denotes the weight at the position of pixel p in the current-frame image information, I is the current-frame source image sequence, I′ is the current-frame brightness image information, and Fusion(I, I′) denotes the current-frame fused image information;
where G(x) denotes a Gaussian filter, F and F⁻¹ denote the Fourier transform and its inverse, respectively, H_{n×n} denotes an n×n matrix in which every element equals 1/n², the real part and the imaginary part of the complex matrix are taken, respectively, I′_i(p) denotes the pixel value of pixel p after expansion, I(p) denotes the grayscale value of pixel p, and SM(I) is the saliency map.
Optionally, performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of the jump-fusion-FCNN method to obtain the fused image features includes:
obtaining an FCNN neural network model, the FCNN neural network model including five pooling layers and seven convolutional layers, where the five pooling layers are referred to as the first, second, third, fourth and fifth pooling layers, and the seven convolutional layers are referred to as the first, second, third, fourth, fifth, sixth and seventh convolutional layers;
superimposing the fused image information of the current frame and the fused image information of the previous frame of the current frame to form final input image information;
inputting the final input image information into the FCNN neural network model, so that the final input image information is processed in sequence by the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the fourth pooling layer, the fifth convolutional layer, the fifth pooling layer, the sixth convolutional layer and the seventh convolutional layer;
generating a first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
generating a second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
generating a third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
fusing the first path feature, the second path feature and the third path feature to obtain the fused image features.
Optionally, the first, second, third, fourth and fifth pooling layers each have different parameters;
generating the first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer includes:
downsampling the data processed by the third pooling layer by a factor of 4 and downsampling the data processed by the fourth pooling layer by a factor of 2;
summing the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the three different depths to obtain the first path feature.
Optionally, generating the second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer includes:
downsampling the data processed by the second pooling layer by a factor of 8, the data processed by the third pooling layer by a factor of 4, and the data processed by the fourth pooling layer by a factor of 2;
summing the 8×-downsampled data, the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the four different depths to obtain the second path feature.
Optionally, generating the third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer includes:
downsampling the data processed by the first pooling layer by a factor of 16, the data processed by the second pooling layer by a factor of 8, the data processed by the third pooling layer by a factor of 4, and the data processed by the fourth pooling layer by a factor of 2;
summing the 16×-downsampled data, the 8×-downsampled data, the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the five different depths to obtain the third path feature.
Optionally, the parameters of the pooling layers include an image size parameter and a number of channels; the parameters of the convolutional layers include an image size parameter and a number of channels;
the image size parameter of the first pooling layer is (M/2)×(N/2), and the number of channels of the first pooling layer is 64;
the image size parameter of the second pooling layer is (M/4)×(N/4), and the number of channels of the second pooling layer is 128;
the image size parameter of the third pooling layer is (M/8)×(N/8), and the number of channels of the third pooling layer is 256;
the image size parameter of the fourth pooling layer is (M/16)×(N/16), and the number of channels of the fourth pooling layer is 256;
the image size parameter of the fifth pooling layer is (M/32)×(N/32), and the number of channels of the fifth pooling layer is 512;
the image size parameter of the sixth convolutional layer is 4096×(M/32)×(N/32), and the number of channels of the sixth convolutional layer is 512;
the image size parameter of the seventh convolutional layer is 4096×(M/32)×(N/32), and the number of channels of the seventh convolutional layer is 512.
Optionally, obtaining the pose estimation information according to the fused image features includes:
inputting the fused image features into a long short-term memory neural network, so as to obtain the pose estimation information of the camera device.
The present application also provides an end-to-end visual odometry device, the end-to-end visual odometry device comprising:
an image acquisition module, configured to acquire the image information of the current frame and the image information of the previous frame of the current frame provided by the camera device;
a grayscale transformation processing module, configured to perform grayscale transformation processing on the image information of the current frame and on the image information of the previous frame of the current frame, respectively, so as to obtain the brightness image information of the current frame and the brightness image information of the previous frame of the current frame;
a fusion module, configured to fuse the image information of the current frame with the brightness image information of the current frame to obtain the fused image information of the current frame, and to fuse the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to obtain the fused image information of the previous frame of the current frame;
a feature extraction module, configured to perform feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of the jump-fusion-FCNN method to obtain fused image features;
a pose estimation module, configured to obtain the pose estimation information of the camera device according to the fused image features.
Beneficial effects
The end-to-end visual odometry method of the present application obtains the brightness image of the source image sequence by grayscale transformation, and designs an image fusion algorithm based on spectral residual theory to merge the image sequence with its brightness image, which enhances the contrast of the image and provides more detail information. To improve the accuracy of image feature extraction and reduce the error in the pose estimation process, the present application designs a feature extraction algorithm based on jump-fusion-FCNN: the traditional fully convolutional neural network (FCNN) is improved, a jump-fusion-FCNN network model is proposed, and three different paths are constructed for feature extraction. In each path, prediction results of different depths are fused by downsampling to obtain a feature map. The three feature maps are merged to obtain the fused image features, taking into account both the structural information and the detail information of the image.
Brief Description of the Drawings
FIG. 1 is a schematic flow chart of an end-to-end visual odometry method according to an embodiment of the present application.
FIG. 2 is a schematic diagram of an electronic device capable of implementing the end-to-end visual odometry method according to an embodiment of the present application.
FIG. 3 is a schematic diagram of the architecture of an end-to-end visual odometry method according to an embodiment of the present application.
Detailed Description of the Embodiments
To make the purpose, technical solutions and advantages of the implementation of the present application clearer, the technical solutions in the embodiments of the present application are described in more detail below with reference to the accompanying drawings. In the drawings, identical or similar reference numerals denote identical or similar elements or elements with identical or similar functions throughout. The described embodiments are only some, not all, of the embodiments of the present application. The embodiments described below with reference to the drawings are exemplary and are intended to explain the present application, and shall not be construed as limiting it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application. The embodiments of the present application are described in detail below with reference to the drawings.
FIG. 1 is a schematic flow chart of an end-to-end visual odometry method according to an embodiment of the present application.
The end-to-end visual odometry method of the present application is used to obtain pose estimation information of a camera device on a vehicle.
The end-to-end visual odometry method shown in FIG. 1 and FIG. 3 includes:
Step 1: acquiring image information of a current frame and image information of the frame previous to the current frame, both provided by the camera device;
Step 2: performing grayscale transformation processing on the image information of the current frame and on the image information of the previous frame of the current frame, respectively, so as to obtain brightness image information of the current frame and brightness image information of the previous frame of the current frame;
Step 3: fusing the image information of the current frame with the brightness image information of the current frame to obtain fused image information of the current frame;
Step 4: fusing the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to obtain fused image information of the previous frame of the current frame; performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of the jump-fusion-FCNN method to obtain fused image features;
Step 5: obtaining pose estimation information of the camera device according to the fused image features.
The end-to-end visual odometry method of the present application obtains the brightness image of the source image sequence by grayscale transformation, and designs an image fusion algorithm based on spectral residual theory to merge the image sequence with its brightness image, which enhances the contrast of the image and provides more detail information. To improve the accuracy of image feature extraction and reduce the error in the pose estimation process, the present application designs a feature extraction algorithm based on jump-fusion-FCNN: the traditional fully convolutional neural network (FCNN) is improved, a jump-fusion-FCNN network model is proposed, and three different paths are constructed for feature extraction. In each path, prediction results of different depths are fused by downsampling to obtain a feature map. The three feature maps are merged to obtain the fused image features, taking into account both the structural information and the detail information of the image.
In this implementation, performing grayscale transformation processing on the image information of the current frame so as to obtain the brightness image information of the current frame includes:
obtaining the current-frame source image sequence from the image information of the current frame;
transforming the current-frame source image sequence into grayscale space and partitioning the pixels of the current-frame image information into sets, so that the pixels are divided into three sets: a current-frame dark pixel set, a current-frame medium pixel set and a current-frame bright pixel set;
calculating the exposure of each pixel in each set;
performing grayscale transformation on the current-frame source image sequence according to the exposure and expanding the grayscale values of under-exposed pixels, so as to obtain the brightness image information of the current frame.
Specifically, the source image sequence is first transformed into grayscale space, and the pixels in the source image I are divided into a dark class (I_D), a medium class (I_M) and a bright class (I_B). Assuming p is a pixel in the source image I, p is classified by the following formula,
where
I_D denotes the dark pixel set, I_M the medium pixel set and I_B the bright pixel set, I(p) denotes the grayscale value of pixel p, and τ₁ and τ₂ are two thresholds, which can be obtained by the multi-threshold Otsu algorithm.
Then, the exposure of the three classes of pixels (dark, medium and bright) is computed to judge whether each pixel is well exposed. The exposure E(p) of an arbitrary pixel p is computed by the following formula.
For p ∈ I_i, with i = D, M, B, the reference exposure value of class I_i pixels is used, and σ_i (i = D, M, B) denotes the reference standard deviation of class I_i pixels, which can be set to 32, 64 and 32, respectively. The closer the grayscale value of a pixel is to its reference exposure value, the better the pixel is exposed. Usually, E(p) ≥ 0.8 indicates that pixel p is well exposed; otherwise, pixel p is under-exposed and its grayscale value needs to be expanded.
Finally, according to the exposure of the pixels, grayscale transformation is performed on the current-frame source image sequence and the grayscale values of under-exposed pixels are expanded, computed as follows.
I′_i(p) = I(p)·F_i(p) for p ∈ I_i, with i = D, M, B, where I′_i(p) denotes the pixel value of pixel p after expansion and F_i(p) denotes the expansion factor, computed by the following formula.
For p ∈ I_i, with i = D, M, B; one quantity denotes the grayscale value of a well-exposed pixel p in class I_i, and the other denotes the grayscale value of an under-exposed pixel p in class I_i.
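To make the grayscale transformation concrete, the following Python sketch classifies pixels with the multi-threshold Otsu algorithm and expands under-exposed pixels. Because the exact expressions for the exposure E(p) and the expansion factor F_i(p) appear only as images in the original publication, the Gaussian-style exposure score, the reference exposure values and the ratio-based gain used below are assumptions standing in for the patented formulas, not a reproduction of them.

```python
import numpy as np
from skimage.filters import threshold_multiotsu  # multi-threshold Otsu, as named in the text

REF_EXPOSURE = {"D": 85.0, "M": 128.0, "B": 170.0}   # assumed reference exposure values (not given in the text)
REF_SIGMA    = {"D": 32.0, "M": 64.0, "B": 32.0}     # reference standard deviations 32/64/32 from the text

def brightness_image(gray: np.ndarray) -> np.ndarray:
    """Split pixels into dark/medium/bright sets and expand under-exposed ones."""
    tau1, tau2 = threshold_multiotsu(gray.astype(np.uint8), classes=3)   # thresholds tau_1, tau_2
    out = gray.astype(np.float32).copy()
    classes = {"D": gray <= tau1,
               "M": (gray > tau1) & (gray <= tau2),
               "B": gray > tau2}
    for name, mask in classes.items():
        mu, sigma = REF_EXPOSURE[name], REF_SIGMA[name]
        exposure = np.exp(-((gray - mu) ** 2) / (2.0 * sigma ** 2))      # assumed stand-in for E(p)
        under = mask & (exposure < 0.8)                                  # E(p) >= 0.8 means well exposed
        if not np.any(under):
            continue
        well = mask & ~under
        well_mean = out[well].mean() if np.any(well) else mu
        under_mean = max(float(out[under].mean()), 1.0)
        gain = well_mean / under_mean                                    # assumed stand-in for F_i(p)
        out[under] = np.clip(out[under] * gain, 0.0, 255.0)              # expand under-exposed grey values
    return out
```

The function takes a single-channel grayscale frame and returns its brightness image; in the described pipeline it would be applied to both the current frame and the previous frame before fusion.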
In this embodiment, spectral residual theory is used to perform saliency detection on the source image and its brightness image, so as to fuse the two images.
Specifically, fusing the image information of the current frame with the brightness image information of the current frame to obtain the fused image information of the current frame includes:
fusing the image information of the current frame and the brightness image information of the current frame using the following formula:
Fusion(I, I′) = ω_p · I + (1 − ω_p) · I′, where ω_p denotes the weight at the position of pixel p in the current-frame image information, I is the current-frame source image sequence, I′ is the current-frame brightness image information, and Fusion(I, I′) denotes the current-frame fused image information;
where G(x) denotes a Gaussian filter, F and F⁻¹ denote the Fourier transform and its inverse, respectively, H_{n×n} denotes an n×n matrix in which every element equals 1/n², the real part and the imaginary part of the complex matrix are taken, respectively, I′_i(p) denotes the pixel value of pixel p after expansion, I(p) denotes the grayscale value of pixel p, and SM(I) is the saliency map.
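The spectral-residual saliency detection and the weighted fusion can be sketched as follows. The construction of SM(I) follows the standard spectral-residual recipe implied by the symbols defined above (F, F⁻¹, H_{n×n}, G(x)); the exact formula for the weight ω_p is not reproduced in this publication, so the normalised-saliency weight below is an assumption.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(img: np.ndarray, n: int = 3, sigma: float = 2.5) -> np.ndarray:
    """Saliency map SM(I) via the spectral residual of the log-amplitude spectrum."""
    spectrum = np.fft.fft2(img.astype(np.float32))       # F(I)
    log_amp = np.log1p(np.abs(spectrum))                 # log-amplitude spectrum
    phase = np.angle(spectrum)                           # phase spectrum
    residual = log_amp - uniform_filter(log_amp, size=n)  # subtract local mean, i.e. H_{n x n} smoothing
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2   # F^-1, magnitude squared
    return gaussian_filter(sal, sigma=sigma)             # G(x) smoothing

def fuse(I: np.ndarray, I_prime: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Fusion(I, I') = w_p * I + (1 - w_p) * I', with an assumed saliency-based w_p."""
    sm_i = spectral_residual_saliency(I)
    sm_ip = spectral_residual_saliency(I_prime)
    w = sm_i / (sm_i + sm_ip + eps)                      # per-pixel weight, assumed normalisation
    return w * I + (1.0 - w) * I_prime
```

Weighting each pixel by the relative saliency of the two inputs keeps salient structure from the source image while letting the expanded brightness image contribute in under-exposed regions, which matches the stated goal of enhancing contrast and detail.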
In this embodiment, performing grayscale transformation processing on the image information of the previous frame of the current frame so as to obtain the brightness image information of the previous frame of the current frame includes:
obtaining the image sequence of the previous frame of the current frame from the image information of the previous frame of the current frame;
transforming the image sequence of the previous frame of the current frame into grayscale space and partitioning the pixels of the image information of the previous frame of the current frame into sets, so that the pixels are divided into three sets: a dark pixel set, a medium pixel set and a bright pixel set of the previous frame of the current frame;
calculating the exposure of each pixel in each set;
performing grayscale transformation on the image information of the previous frame of the current frame according to the exposure and expanding the grayscale values of under-exposed pixels, so as to obtain the brightness image information of the previous frame of the current frame.
It can be understood that the method and the formulas used for obtaining the brightness image information of the current frame and for obtaining the brightness image information of the previous frame of the current frame are the same, and are not repeated here.
In this embodiment, performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of the jump-fusion-FCNN method to obtain the fused image features includes:
obtaining an FCNN neural network model, the FCNN neural network model including five pooling layers and seven convolutional layers, where the five pooling layers are referred to as the first, second, third, fourth and fifth pooling layers, and the seven convolutional layers are referred to as the first, second, third, fourth, fifth, sixth and seventh convolutional layers;
superimposing the fused image information of the current frame and the fused image information of the previous frame of the current frame to form final input image information;
inputting the final input image information into the FCNN neural network model, so that the final input image information is processed in sequence by the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the fourth pooling layer, the fifth convolutional layer, the fifth pooling layer, the sixth convolutional layer and the seventh convolutional layer;
generating a first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
generating a second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
generating a third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
fusing the first path feature, the second path feature and the third path feature to obtain the fused image features.
In this embodiment, the first, second, third, fourth and fifth pooling layers each have different parameters.
In this embodiment, generating the first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer includes:
downsampling the data processed by the third pooling layer by a factor of 4 and downsampling the data processed by the fourth pooling layer by a factor of 2;
summing the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the three different depths to obtain the first path feature.
In this embodiment, generating the second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer includes:
downsampling the data processed by the second pooling layer by a factor of 8, the data processed by the third pooling layer by a factor of 4, and the data processed by the fourth pooling layer by a factor of 2;
summing the 8×-downsampled data, the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the four different depths to obtain the second path feature.
In this embodiment, generating the third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer includes:
downsampling the data processed by the first pooling layer by a factor of 16, the data processed by the second pooling layer by a factor of 8, the data processed by the third pooling layer by a factor of 4, and the data processed by the fourth pooling layer by a factor of 2;
summing the 16×-downsampled data, the 8×-downsampled data, the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the five different depths to obtain the third path feature.
Referring to Table 1 below, in this embodiment the parameters of the pooling layers include an image size parameter and a number of channels, and the parameters of the convolutional layers include an image size parameter and a number of channels;
the image size parameter of the first pooling layer is (M/2)×(N/2), and the number of channels of the first pooling layer is 64;
the image size parameter of the second pooling layer is (M/4)×(N/4), and the number of channels of the second pooling layer is 128;
the image size parameter of the third pooling layer is (M/8)×(N/8), and the number of channels of the third pooling layer is 256;
the image size parameter of the fourth pooling layer is (M/16)×(N/16), and the number of channels of the fourth pooling layer is 256;
the image size parameter of the fifth pooling layer is (M/32)×(N/32), and the number of channels of the fifth pooling layer is 512;
the image size parameter of the sixth convolutional layer is 4096×(M/32)×(N/32), and the number of channels of the sixth convolutional layer is 512;
the image size parameter of the seventh convolutional layer is 4096×(M/32)×(N/32), and the number of channels of the seventh convolutional layer is 512.
It can be understood that the other convolutional layers may have their image size parameters and numbers of channels set as needed.
Table 1:
The present application designs an end-to-end visual odometry algorithm to obtain the estimated pose. First, to better extract the feature information of the image sequence, the present application designs a jump-fusion-FCNN network framework. Feature information of the image sequence at different strides is obtained through three different paths, which takes into account both the detail information and the structural information of the image, and the feature information of the three paths is merged following the fusion idea. Second, the present invention uses an LSTM-based recurrent neural network to perform sequential modeling of the dynamic changes and correlations among the feature information, and then outputs the estimated pose.
The first path focuses on the structural information of the image, and the feature map it produces is robust. The third path fully considers the detail information of the image, and the feature map it produces is more refined. The feature map produced by the second path is used to balance the results of the other two paths. The feature maps obtained by the three paths are merged to obtain the fused feature information, which serves as the input of the RNN network layer.
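A minimal PyTorch sketch of such a three-path "jump-fusion" fully convolutional feature extractor is given below. Only the pooling-layer channel counts (64, 128, 256, 256, 512), their relative resolutions from Table 1, and the composition of the three paths are taken from the text; the kernel sizes, the 512-channel conv6/conv7 head, the 1×1 projections that make the element-wise additions dimensionally compatible, and the use of a 2-channel input for the two superimposed fused images are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JumpFusionFCNN(nn.Module):
    """Sketch of a three-path 'jump-fusion' fully convolutional feature extractor."""

    def __init__(self, in_ch: int = 2, feat_ch: int = 512):
        super().__init__()
        chans = [64, 128, 256, 256, 512]          # pool1..pool5 channels from Table 1
        stages, prev = [], in_ch
        for c in chans:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, c, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))                 # conv_k followed by pool_k: halves H and W
            prev = c
        self.stages = nn.ModuleList(stages)       # outputs at 1/2 .. 1/32 of the input resolution
        self.conv6 = nn.Sequential(nn.Conv2d(512, feat_ch, kernel_size=1), nn.ReLU(inplace=True))
        self.conv7 = nn.Conv2d(feat_ch, feat_ch, kernel_size=1)
        # 1x1 projections so that pool1..pool4 outputs can be added to conv7's output.
        self.proj = nn.ModuleList([nn.Conv2d(c, feat_ch, kernel_size=1) for c in chans[:4]])

    def _path(self, pools, conv7, idxs):
        """Downsample the selected pooling outputs to conv7's resolution and add them up."""
        out = conv7
        for i in idxs:   # pool3 (1/8) -> factor 4, pool4 (1/16) -> factor 2, etc.
            out = out + F.adaptive_avg_pool2d(self.proj[i](pools[i]), conv7.shape[-2:])
        return out

    def forward(self, x):
        pools = []
        for stage in self.stages:
            x = stage(x)
            pools.append(x)                        # pool1 .. pool5
        conv7 = self.conv7(self.conv6(pools[-1]))
        path1 = self._path(pools, conv7, [2, 3])        # pool3, pool4 + conv7 (3 depths)
        path2 = self._path(pools, conv7, [1, 2, 3])     # pool2..pool4 + conv7 (4 depths)
        path3 = self._path(pools, conv7, [0, 1, 2, 3])  # pool1..pool4 + conv7 (5 depths)
        fused = path1 + path2 + path3                   # merge the three path features
        return fused.flatten(1)                         # fused image features for the LSTM head
```

For an M×N input, the k-th pooling stage produces an (M/2^k)×(N/2^k) map, matching Table 1; adaptive average pooling realises the fixed 16×/8×/4×/2× downsampling factors without hard-coding the input size.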
In this embodiment, obtaining the pose estimation information according to the fused image features includes:
inputting the fused image features into a long short-term memory neural network, so as to obtain the pose estimation information of the camera device.
Specifically, the current-frame features extracted from the current-frame fused image information by the FCNN are input into the RNN network, which performs sequential modeling of the dynamic changes and correlations between the features. A Long Short-Term Memory (LSTM) network has memory cells and gate control functions; it can discard or retain the hidden-layer state of the previous time step to update the hidden-layer state of the current time step, and then output the estimated pose at the current time step. The LSTM gives the RNN network a memory capability and strong learning ability.
At time t−1, the hidden-layer state of the LSTM is denoted h_{t−1} and the memory cell is denoted c_{t−1}. Assuming the input at time t is x_t, the updated hidden-layer state and memory cell are defined as follows,
where sigmoid and tanh are the two activation functions, W denotes the corresponding weight matrices and b denotes the bias vectors.
The LSTM network consists of two network layers, LSTM1 and LSTM2; the hidden-layer state of LSTM1 serves as the input of LSTM2. Each LSTM network layer contains 1000 hidden units and outputs the estimated pose corresponding to the current time step, i.e., a 6-degree-of-freedom pose vector.
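A minimal PyTorch sketch of the pose head described above: two stacked LSTM layers with 1000 hidden units each, followed by a regression to a 6-degree-of-freedom pose vector. The final linear layer and the arrangement of per-frame fused features into a time sequence are assumptions.

```python
import torch
import torch.nn as nn

class PoseLSTM(nn.Module):
    """Two stacked LSTM layers (LSTM1 feeding LSTM2), 1000 hidden units each,
    followed by an assumed linear layer regressing a 6-DoF pose per time step."""

    def __init__(self, feat_dim: int, hidden: int = 1000):
        super().__init__()
        self.rnn = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                           num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 6)          # 6-degree-of-freedom pose vector

    def forward(self, feats, state=None):
        # feats: (batch, time, feat_dim) fused image features, one per frame pair
        out, state = self.rnn(feats, state)
        return self.head(out), state              # (batch, time, 6)

# Example: poses, _ = PoseLSTM(feat_dim=fused.shape[-1])(fused.unsqueeze(1))
```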
Parameter optimization
Based on the changes in the translation distance and in the orientation of the pose coordinates, the loss function of the network is defined as follows,
where N denotes the number of image sequences in the sample dataset, and the estimated pose and the ground-truth pose of the image at time j in the i-th sequence, relative to the image at the previous time step, are compared. ‖·‖₂ denotes the 2-norm of the matrix, and α > 0 is a constant.
Therefore, the pose estimation of the visual odometry is transformed into solving for the optimal network parameters δ*, from which the pose estimation information of the camera device is finally obtained.
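Since the loss expression itself appears only as an image in the original, the sketch below assumes the usual DeepVO-style formulation suggested by the description: a squared 2-norm on the relative translation plus an α-weighted squared 2-norm on the relative orientation, averaged over the dataset.

```python
import torch

def vo_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 100.0) -> torch.Tensor:
    """Assumed DeepVO-style loss over sequences of relative 6-DoF poses.

    pred, gt: (N, T, 6) tensors; columns 0..2 hold the relative translation and
    columns 3..5 the relative orientation of frame j w.r.t. frame j-1. The
    squared 2-norm terms and the constant alpha > 0 follow the description;
    the exact weighting in the patent's formula is not reproduced here.
    """
    trans_err = torch.sum((pred[..., :3] - gt[..., :3]) ** 2, dim=-1)
    rot_err = torch.sum((pred[..., 3:] - gt[..., 3:]) ** 2, dim=-1)
    return torch.mean(trans_err + alpha * rot_err)
```

Minimising this objective over the network parameters is what "solving for the optimal network parameters δ*" amounts to in practice.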
The present application also provides an end-to-end visual odometry device. The end-to-end visual odometry device includes an image acquisition module, a grayscale transformation processing module, a fusion module, a feature extraction module and a pose estimation module. The image acquisition module is configured to acquire the image information of the current frame and the image information of the previous frame of the current frame provided by the camera device; the grayscale transformation processing module is configured to perform grayscale transformation processing on the image information of the current frame and on the image information of the previous frame of the current frame, respectively, so as to obtain the brightness image information of the current frame and the brightness image information of the previous frame of the current frame; the fusion module is configured to fuse the image information of the current frame with the brightness image information of the current frame to obtain the fused image information of the current frame, and to fuse the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to obtain the fused image information of the previous frame of the current frame; the feature extraction module is configured to perform feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of the jump-fusion-FCNN method to obtain fused image features; and the pose estimation module is configured to obtain the pose estimation information of the camera device according to the fused image features.
It can be understood that the above description of the method applies equally to the device.
The present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the above end-to-end visual odometry method based on image fusion and FCNN-LSTM is implemented.
The present application also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above end-to-end visual odometry method can be implemented.
FIG. 2 is an exemplary structural diagram of an electronic device capable of implementing the end-to-end visual odometry method provided according to an embodiment of the present application.
As shown in FIG. 2, the electronic device includes an input device 501, an input interface 502, a central processing unit 503, a memory 504, an output interface 505 and an output device 506. The input interface 502, the central processing unit 503, the memory 504 and the output interface 505 are interconnected through a bus 507, and the input device 501 and the output device 506 are connected to the bus 507 through the input interface 502 and the output interface 505, respectively, and thereby to the other components of the electronic device. Specifically, the input device 501 receives input information from the outside and transmits it to the central processing unit 503 through the input interface 502; the central processing unit 503 processes the input information based on the computer-executable instructions stored in the memory 504 to generate output information, stores the output information temporarily or permanently in the memory 504, and then transmits it to the output device 506 through the output interface 505; the output device 506 outputs the output information to the outside of the electronic device for use by the user.
That is, the electronic device shown in FIG. 2 may also be implemented to include: a memory storing computer-executable instructions; and one or more processors which, when executing the computer-executable instructions, can implement the end-to-end visual odometry method described in conjunction with FIG. 1.
In one embodiment, the electronic device shown in FIG. 2 may be implemented to include: a memory 504 configured to store executable program code; and one or more processors 503 configured to run the executable program code stored in the memory 504 to execute the end-to-end visual odometry method based on image fusion and FCNN-LSTM in the above embodiment.
In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces and memory.
The memory may include non-persistent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
In addition, it is obvious that the word "comprising" does not exclude other units or steps. A plurality of units, modules or devices recited in the device claims may also be implemented by one unit or an overall device through software or hardware.
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two successively shown blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The processor referred to in this embodiment may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be used to store computer programs and/or modules. The processor implements the various functions of the device/terminal equipment by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another volatile solid-state storage device.
In this embodiment, if the modules/units integrated in the device/terminal equipment are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the present invention may implement all or part of the processes of the above embodiment methods by instructing the relevant hardware through a computer program; when the computer program is executed by a processor, the steps of each of the above method embodiments can be implemented. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in the jurisdiction. Although the present application has been disclosed above with preferred embodiments, they are not intended to limit the present application; any person skilled in the art may make possible variations and modifications without departing from the spirit and scope of the present application, and therefore the scope of protection of the present application shall be subject to the scope defined by the claims of the present application.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
In addition, it is obvious that the word "comprising" does not exclude other units or steps. A plurality of units, modules or devices recited in the device claims may also be implemented by one unit or an overall device through software or hardware.
Although the present invention has been described in detail above with general descriptions and specific embodiments, it is obvious to those skilled in the art that some modifications or improvements can be made on the basis of the present invention. Therefore, such modifications or improvements made without departing from the spirit of the present invention all fall within the scope of protection claimed by the present invention.

Claims (8)

  1. An end-to-end visual odometry method for obtaining pose estimation information of a camera device on a vehicle, characterized in that the end-to-end visual odometry method comprises:
    acquiring image information of a current frame and image information of the frame previous to the current frame, both provided by the camera device;
    performing grayscale transformation processing on the image information of the current frame and on the image information of the previous frame of the current frame, respectively, so as to obtain brightness image information of the current frame and brightness image information of the previous frame of the current frame;
    fusing the image information of the current frame with the brightness image information of the current frame to obtain fused image information of the current frame;
    fusing the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to obtain fused image information of the previous frame of the current frame;
    performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of a jump-fusion-FCNN method to obtain fused image features;
    obtaining pose estimation information of the camera device according to the fused image features; wherein
    fusing the image information of the current frame with the brightness image information of the current frame to obtain the fused image information of the current frame comprises:
    fusing the image information of the current frame and the brightness image information of the current frame using the following formula:
    Fusion(I, I′) = ω_p · I + (1 − ω_p) · I′, wherein
    ω_p denotes the weight at the position of pixel p in the current-frame image information, I is the current-frame source image sequence, I′ is the current-frame brightness image information, and Fusion(I, I′) denotes the current-frame fused image information;
    wherein G(x) denotes a Gaussian filter, F and F⁻¹ denote the Fourier transform and its inverse, respectively, H_{n×n} denotes an n×n matrix in which every element equals 1/n², the real part and the imaginary part of the complex matrix are taken, respectively, and SM(I) is the saliency map;
    performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of the jump-fusion-FCNN method to obtain the fused image features comprises:
    obtaining an FCNN neural network model, the FCNN neural network model comprising five pooling layers and seven convolutional layers, wherein the five pooling layers are referred to as the first, second, third, fourth and fifth pooling layers, and the seven convolutional layers are referred to as the first, second, third, fourth, fifth, sixth and seventh convolutional layers;
    superimposing the fused image information of the current frame and the fused image information of the previous frame of the current frame to form final input image information;
    inputting the final input image information into the FCNN neural network model, so that the final input image information is processed in sequence by the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the fourth pooling layer, the fifth convolutional layer, the fifth pooling layer, the sixth convolutional layer and the seventh convolutional layer;
    generating a first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
    generating a second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
    generating a third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
    fusing the first path feature, the second path feature and the third path feature to obtain the fused image features.
  2. The end-to-end visual odometry method according to claim 1, characterized in that performing grayscale transformation processing on the image information of the current frame so as to obtain the brightness image information of the current frame comprises:
    obtaining the current-frame source image sequence from the image information of the current frame;
    transforming the current-frame source image sequence into grayscale space and partitioning the pixels of the current-frame image information into sets, so that the pixels are divided into three sets: a current-frame dark pixel set, a current-frame medium pixel set and a current-frame bright pixel set;
    calculating the exposure of each pixel in each set;
    performing grayscale transformation on the current-frame source image sequence according to the exposure and expanding the grayscale values of under-exposed pixels, so as to obtain the brightness image information of the current frame;
    performing grayscale transformation processing on the image information of the previous frame of the current frame so as to obtain the brightness image information of the previous frame of the current frame comprises:
    obtaining the image sequence of the previous frame of the current frame from the image information of the previous frame of the current frame;
    transforming the image sequence of the previous frame of the current frame into grayscale space and partitioning the pixels of the image information of the previous frame of the current frame into sets, so that the pixels are divided into three sets: a dark pixel set, a medium pixel set and a bright pixel set of the previous frame of the current frame;
    calculating the exposure of each pixel in each set;
    performing grayscale transformation on the image information of the previous frame of the current frame according to the exposure and expanding the grayscale values of under-exposed pixels, so as to obtain the brightness image information of the previous frame of the current frame.
  3. The end-to-end visual odometry method according to claim 2, characterized in that the first, second, third, fourth and fifth pooling layers each have different parameters;
    generating the first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer comprises:
    downsampling the data processed by the third pooling layer by a factor of 4 and downsampling the data processed by the fourth pooling layer by a factor of 2;
    summing the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the three different depths to obtain the first path feature.
  4. The end-to-end visual odometry method according to claim 3, characterized in that generating the second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer comprises:
    downsampling the data processed by the second pooling layer by a factor of 8, the data processed by the third pooling layer by a factor of 4, and the data processed by the fourth pooling layer by a factor of 2;
    summing the 8×-downsampled data, the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the four different depths to obtain the second path feature.
  5. The end-to-end visual odometry method according to claim 4, characterized in that generating the third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer comprises: downsampling the data processed by the first pooling layer by a factor of 16, the data processed by the second pooling layer by a factor of 8, the data processed by the third pooling layer by a factor of 4, and the data processed by the fourth pooling layer by a factor of 2;
    summing the 16×-downsampled data, the 8×-downsampled data, the 4×-downsampled data and the 2×-downsampled data element-wise with the data processed by the seventh convolutional layer, and merging the prediction results of the five different depths to obtain the third path feature.
  6. The end-to-end visual odometry method according to claim 5, characterized in that the parameters of the pooling layers include an image size parameter and a number of channels, and the parameters of the convolutional layers include an image size parameter and a number of channels;
    the image size parameter of the first pooling layer is (M/2)×(N/2), and the number of channels of the first pooling layer is 64;
    the image size parameter of the second pooling layer is (M/4)×(N/4), and the number of channels of the second pooling layer is 128;
    the image size parameter of the third pooling layer is (M/8)×(N/8), and the number of channels of the third pooling layer is 256;
    the image size parameter of the fourth pooling layer is (M/16)×(N/16), and the number of channels of the fourth pooling layer is 256;
    the image size parameter of the fifth pooling layer is (M/32)×(N/32), and the number of channels of the fifth pooling layer is 512;
    the image size parameter of the sixth convolutional layer is 4096×(M/32)×(N/32), and the number of channels of the sixth convolutional layer is 512;
    the image size parameter of the seventh convolutional layer is 4096×(M/32)×(N/32), and the number of channels of the seventh convolutional layer is 512.
  7. The end-to-end visual odometry method according to claim 6, characterized in that obtaining the pose estimation information according to the fused image features comprises:
    inputting the fused image features into a long short-term memory neural network, so as to obtain the pose estimation information of the camera device.
  8. An end-to-end visual odometry device for implementing the end-to-end visual odometry method according to any one of claims 1 to 7, characterized in that the end-to-end visual odometry device comprises:
    an image acquisition module, configured to acquire image information of a current frame and image information of the frame previous to the current frame, both provided by a camera device;
    a grayscale transformation processing module, configured to perform grayscale transformation processing on the image information of the current frame and on the image information of the previous frame of the current frame, respectively, so as to obtain brightness image information of the current frame and brightness image information of the previous frame of the current frame;
    a fusion module, configured to fuse the image information of the current frame with the brightness image information of the current frame to obtain fused image information of the current frame, and to fuse the image information of the previous frame of the current frame with the brightness image information of the previous frame of the current frame to obtain fused image information of the previous frame of the current frame;
    a feature extraction module, configured to perform feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of a jump-fusion-FCNN method to obtain fused image features;
    a pose estimation module, configured to obtain pose estimation information of the camera device according to the fused image features; wherein
    fusing the image information of the current frame with the brightness image information of the current frame to obtain the fused image information of the current frame comprises:
    fusing the image information of the current frame and the brightness image information of the current frame using the following formula:
    Fusion(I, I′) = ω_p · I + (1 − ω_p) · I′, wherein
    ω_p denotes the weight at the position of pixel p in the current-frame image information, I is the current-frame source image sequence, I′ is the current-frame brightness image information, and Fusion(I, I′) denotes the current-frame fused image information;
    wherein G(x) denotes a Gaussian filter, F and F⁻¹ denote the Fourier transform and its inverse, respectively, H_{n×n} denotes an n×n matrix in which every element equals 1/n², the real part and the imaginary part of the complex matrix are taken, respectively, and SM(I) is the saliency map;
    performing feature extraction on the fused image information of the current frame and the fused image information of the previous frame of the current frame by means of the jump-fusion-FCNN method to obtain the fused image features comprises:
    obtaining an FCNN neural network model, the FCNN neural network model comprising five pooling layers and seven convolutional layers, wherein the five pooling layers are referred to as the first, second, third, fourth and fifth pooling layers, and the seven convolutional layers are referred to as the first, second, third, fourth, fifth, sixth and seventh convolutional layers;
    superimposing the fused image information of the current frame and the fused image information of the previous frame of the current frame to form final input image information;
    inputting the final input image information into the FCNN neural network model, so that the final input image information is processed in sequence by the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the fourth pooling layer, the fifth convolutional layer, the fifth pooling layer, the sixth convolutional layer and the seventh convolutional layer;
    generating a first path feature according to the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
    generating a second path feature according to the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
    generating a third path feature according to the data processed by the first pooling layer, the data processed by the second pooling layer, the data processed by the third pooling layer, the data processed by the fourth pooling layer and the data processed by the seventh convolutional layer;
    fusing the first path feature, the second path feature and the third path feature to obtain the fused image features.
PCT/CN2023/091529 2022-10-18 2023-04-28 End-to-end visual odometry method and device WO2024082602A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211269544.9A CN115358962B (zh) 2022-10-18 2022-10-18 End-to-end visual odometry method and device
CN202211269544.9 2022-10-18

Publications (1)

Publication Number Publication Date
WO2024082602A1 true WO2024082602A1 (zh) 2024-04-25

Family

ID=84007720

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/091529 WO2024082602A1 (zh) 2022-10-18 2023-04-28 End-to-end visual odometry method and device

Country Status (2)

Country Link
CN (1) CN115358962B (zh)
WO (1) WO2024082602A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115358962B (zh) * 2022-10-18 2023-01-10 中国第一汽车股份有限公司 End-to-end visual odometry method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065885A1 (en) * 2017-08-29 2019-02-28 Beijing Samsung Telecom R&D Center Object detection method and system
CN110246147A (zh) * 2019-05-14 2019-09-17 中国科学院深圳先进技术研究院 Visual-inertial odometry method, visual-inertial odometry device and mobile device
CN111080699A (zh) * 2019-12-11 2020-04-28 中国科学院自动化研究所 Monocular visual odometry method and system based on deep learning
CN112648994A (zh) * 2020-12-14 2021-04-13 首都信息发展股份有限公司 Camera pose estimation method and device based on deep visual odometry and an IMU
CN115358962A (zh) * 2022-10-18 2022-11-18 中国第一汽车股份有限公司 End-to-end visual odometry method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11288818B2 (en) * 2019-02-19 2022-03-29 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN111127557B (zh) * 2019-12-13 2022-12-13 中国电子科技集团公司第二十研究所 Visual SLAM front-end pose estimation method based on deep learning
CN114612556A (zh) * 2022-03-01 2022-06-10 北京市商汤科技开发有限公司 Training method for a visual-inertial odometry model, pose estimation method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065885A1 (en) * 2017-08-29 2019-02-28 Beijing Samsung Telecom R&D Center Object detection method and system
CN110246147A (zh) * 2019-05-14 2019-09-17 中国科学院深圳先进技术研究院 Visual-inertial odometry method, visual-inertial odometry device and mobile device
CN111080699A (zh) * 2019-12-11 2020-04-28 中国科学院自动化研究所 Monocular visual odometry method and system based on deep learning
CN112648994A (zh) * 2020-12-14 2021-04-13 首都信息发展股份有限公司 Camera pose estimation method and device based on deep visual odometry and an IMU
CN115358962A (zh) * 2022-10-18 2022-11-18 中国第一汽车股份有限公司 End-to-end visual odometry method and device

Also Published As

Publication number Publication date
CN115358962A (zh) 2022-11-18
CN115358962B (zh) 2023-01-10

Similar Documents

Publication Publication Date Title
Tang et al. Learning guided convolutional network for depth completion
Uittenbogaard et al. Privacy protection in street-view panoramas using depth and multi-view imagery
Park et al. High-precision depth estimation with the 3d lidar and stereo fusion
CN110349215B (zh) Camera pose estimation method and device
CN111696110B (zh) Scene segmentation method and system
CN112288628B (zh) Aerial image stitching acceleration method and system based on optical-flow tracking and frame-sampling mapping
CN112183675B (zh) Tracking method for low-resolution targets based on a Siamese network
WO2024077935A1 (zh) Vehicle positioning method and device based on visual SLAM
WO2024082602A1 (zh) End-to-end visual odometry method and device
CN111382647B (zh) Picture processing method, apparatus, device and storage medium
CN111914756A (zh) Video data processing method and device
CN116486288A (zh) Aerial target counting and detection method based on a lightweight density-estimation network
CN114926514B (zh) Method and device for registering event images and RGB images
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
CN112270748B (zh) Image-based three-dimensional reconstruction method and device
CN114677422A (zh) Depth information generation method, image blurring method and video blurring method
CN117132737B (zh) Three-dimensional building model construction method, system and device
CN116977200A (zh) Processing method and apparatus for a video denoising model, computer device and storage medium
CN112288817B (zh) Image-based three-dimensional reconstruction processing method and device
CN115410133A (zh) Video dense prediction method and device
CN116188535A (zh) Video tracking method, apparatus, device and storage medium based on optical-flow estimation
Zhang et al. Depth Monocular Estimation with Attention-based Encoder-Decoder Network from Single Image
CN114372944B (zh) Multi-modal and multi-scale fusion candidate-region generation method and related device
Lazcano et al. Anisotropic Operator Based on Adaptable Metric-Convolution Stage-Depth Filtering Applied to Depth Completion
CN114596580B (zh) Multi-person target recognition method, system, device and medium