CN112465021B - Pose track estimation method based on image frame interpolation method - Google Patents

Pose track estimation method based on image frame interpolation method

Info

Publication number
CN112465021B
CN112465021B (application CN202011352019.4A)
Authority
CN
China
Prior art keywords
image
interpolation
semantic
frame
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011352019.4A
Other languages
Chinese (zh)
Other versions
CN112465021A (en)
Inventor
梁志伟
郭强
周鼎宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202011352019.4A priority Critical patent/CN112465021B/en
Publication of CN112465021A publication Critical patent/CN112465021A/en
Application granted granted Critical
Publication of CN112465021B publication Critical patent/CN112465021B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a pose track estimation method based on an image frame interpolation method, in which a new frame is inserted between two image frames and semantics are used as a representation of the invariant scene that, together with the feature points, constrains the pose and trajectory estimation. Tracking loss is reduced by increasing the number of feature-point matches between frames, and fusing semantic information reduces the influence of dynamic-object feature points and constrains the matching of feature points, improving the accuracy of pose estimation and trajectory estimation. Experiments on public data sets show that the method maintains high accuracy, is robust to moving objects and sparse textures, and achieves good results in improving the recognition accuracy of the visual odometer.

Description

Pose track estimation method based on image frame interpolation method
Technical Field
The invention relates to the technical field of computer vision, in particular to a pose track estimation method based on an image frame interpolation method.
Background
The goal of visual odometry is to estimate the motion of the camera from the captured images. Two methods are commonly used at present: the feature point method and the direct method. The feature point method is currently the mainstream; it obtains good results when the camera moves fast, the illumination change is not obvious and the environment is varied, but feature points are easily lost in places with little visual change, such as a tunnel, leading to poor results. The direct method does not need to extract features, but it is not suitable for environments where the camera motion is fast. The core of visual odometry is the data association problem, since it establishes pixel-level correspondences between images. These associated pixels are used to construct a three-dimensional map of the scene and to track the pose of the current camera. Such local tracking and mapping introduces small errors in each frame; the errors become larger if the two images are taken too far apart, and distant objects may show significant changes in appearance when viewed up close. This mainly occurs when the frame rate of the camera is too low and there is no invariant representation of the features.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a pose track estimation method based on an image frame interpolation method, which solves the problem in the prior art that the recognition accuracy of visual-odometry pose and trajectory estimation from captured images is low.
The technical scheme adopted by the invention is as follows: a pose track estimation method based on an image frame interpolation method, as shown in FIG. 1, comprises the following steps:
S01 feature detection: the system acquires a series of images and then performs ORB feature detection on them;
S02 semantic recognition: if the number of feature-point matches obtained between the previous and current frames meets the threshold requirement, semantic recognition is performed on the image; otherwise, S021 image frame interpolation is performed;
S021 image frame interpolation: image frame interpolation is performed between two adjacent frames, and detection and matching of the feature points are then performed again. When images are acquired, too fast camera motion or too low a camera frame rate causes the common-view region between two adjacent frames to be too small; applying a video frame interpolation technique solves this problem. When few feature points of the two image frames are matched, image frame interpolation increases the commonly recognizable region between two adjacent frames, thereby increasing the number of matched feature points;
S03 image information fusion: after semantic recognition, the semantic image information and the feature-point image information are fused, and the feature points detected on dynamic objects are removed; if the number of feature points does not meet the threshold requirement after the dynamic-object feature points are removed, S021 image frame interpolation is performed;
S04 pose estimation: after the threshold requirement is met, pose estimation is finally carried out (the overall pipeline is sketched below).
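The flow of steps S01-S04 can be summarized with the following Python sketch. It assumes OpenCV's ORB detector and brute-force matcher; interpolate_frame, detect_dynamic_mask and estimate_pose are hypothetical placeholders for the frame-interpolation network, the YOLO-based dynamic-object detector and the pose optimizer described later, and the threshold MIN_MATCHES is illustrative.

```python
import cv2

MIN_MATCHES = 100  # illustrative threshold on the number of feature-point matches

orb = cv2.ORB_create(nfeatures=2000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def match_features(img_a, img_b):
    """S01: ORB feature detection and brute-force matching between two frames."""
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return kp_a, kp_b, []
    return kp_a, kp_b, matcher.match(des_a, des_b)

def track_pair(prev_img, curr_img, interpolate_frame, detect_dynamic_mask, estimate_pose):
    """One tracking step: S01 -> (S021 if needed) -> S02/S03 -> S04."""
    # S01: detect and match ORB features between the two frames.
    kp_a, kp_b, matches = match_features(prev_img, curr_img)
    target = curr_img
    if len(matches) < MIN_MATCHES:
        # S021: interpolate an intermediate frame to enlarge the co-visible
        # region and match against it instead.
        target = interpolate_frame(prev_img, curr_img)
        kp_a, kp_b, matches = match_features(prev_img, target)
    # S02/S03: semantic recognition on the matched frame, then drop matches
    # whose keypoints fall on dynamic objects (people, vehicles, animals, ...).
    mask = detect_dynamic_mask(target)  # boolean H x W array, True on dynamic objects
    matches = [m for m in matches
               if not mask[int(kp_b[m.trainIdx].pt[1]), int(kp_b[m.trainIdx].pt[0])]]
    if len(matches) < MIN_MATCHES:
        # Still below the threshold after removing dynamic points:
        # interpolate again as in the flow chart (omitted here for brevity).
        pass
    # S04: pose estimation from the remaining static-object matches.
    return estimate_pose(kp_a, kp_b, matches)
```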
Further, in the S021 image interpolation process, a deep convolutional neural network method is used to estimate an appropriate convolution kernel to synthesize each output pixel in the interpolated image.
Further preferably, the output coefficients of the convolution kernel are non-negative and sum to 1.
Furthermore, in the S021 image frame interpolation process, a color loss function between the interpolated pixel color and the ground-truth color is fused with a gradient loss function, so that the generated image is sharper.
Further, in the S02 semantic recognition process, the YOLO algorithm is used to extract semantic information and divide objects into static objects and dynamic objects; the feature points detected on dynamic objects are removed, the feature points of static objects are retained, and a new loss function is constructed, which adds a semantic loss term as a constraint to the classical feature-point loss function.
Further, in the S03 image information fusion process, after the feature points detected on dynamic objects are removed using the semantic image information, the feature points of the static objects are used to fuse the semantic error and the reprojection error, thereby improving the accuracy of pose estimation.
Furthermore, in the S04 pose estimation process, different weights are applied to the semantic error and the reprojection error in the joint optimization function, improving the robustness of the system.
Further, in the S04 pose estimation process, an expectation-maximization algorithm is used to minimize the error function, ensuring the accuracy of the pose estimation.
Compared with the prior art, the invention has the beneficial effects that:
according to the pose track estimation method based on the image frame interpolation method, a new frame is inserted between two image frames, semantics are used as representation of an invariant scene, pose track estimation is constrained together with feature points, tracking loss is reduced by increasing the number of feature point matching between the frames, influence of dynamic object feature points and matching of constraint feature points are reduced by fusing semantic information, and accuracy of pose estimation and track estimation is improved. Experiments on the public data set show that the method keeps higher precision, has strong robustness on the conditions of moving objects and sparse textures, and obtains good results in the aspect of improving the identification precision of the visual odometer.
Drawings
FIG. 1 is a flow chart of a pose trajectory estimation method based on an image frame interpolation method according to the present invention;
FIG. 2 illustrates the convolution pixel interpolation process according to an embodiment of the present invention;
FIG. 3 illustrates the convolution interpolation process according to an embodiment of the present invention;
FIG. 4 is a visual comparison of interpolated images with the added gradient loss in an embodiment of the present invention;
FIG. 5 illustrates the semantic segmentation and semantic probabilities in an embodiment of the present invention, where (a) is the semantic segmentation image, (b) is a binary image, and (c) and (d) are the semantic probabilities for σ = 10 and σ = 40, respectively; red indicates 1 and blue indicates 0;
FIG. 6 shows the absolute trajectory errors on the KITTI 05 and KITTI 07 sequences in the detailed description;
FIG. 7 shows the absolute trajectory error curves of the TUM data set sequences fr1_xyz, fr1_floor and fr1_long_office_household under ORB-SLAM and the present algorithm, in accordance with an embodiment.
Detailed Description
Reference will now be made in detail to the present embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present invention and should not be construed as limiting the present invention.
1 image interpolation
When images are acquired, too fast camera motion or too low a camera frame rate leads to too little overlap (common view) between two adjacent frames; applying a video frame interpolation technique can effectively solve this problem.
As a preferred approach, a robust video frame interpolation method is used that employs a deep convolutional neural network to achieve frame interpolation without explicitly dividing it into multiple steps. The method treats pixel interpolation as a convolution over corresponding image patches in the two input frames and uses a deep fully convolutional neural network to estimate a spatially adaptive convolution kernel. Specifically, for a pixel (x, y) in the interpolated frame, the deep neural network takes two receptive field blocks R_1 and R_2 centered on that pixel as input and estimates a convolution kernel K. The convolution kernel is convolved with the input patches P_1 and P_2 to synthesize the output pixel, as shown in FIG. 2.
1.1 principle of the Algorithm
Given two image frames I_1 and I_2, the goal is to temporally interpolate a frame Î in the middle of the two input frames. Motion estimation and pixel synthesis are combined into one step, and pixel interpolation is performed as a local convolution over patches in the input images I_1 and I_2. As shown in FIG. 3, by convolving an appropriate kernel K with the input patches P_1(x, y) and P_2(x, y), both centered at (x, y) in their respective input images, the color of the pixel (x, y) in the target image to be interpolated is obtained. The convolution kernel K captures the motion and resampling coefficients required for pixel synthesis.
Estimating an appropriate convolution kernel is crucial; a deep convolutional neural network is used to estimate the kernel that synthesizes each output pixel of the interpolated image. The convolution kernel of each pixel varies with the local motion and image structure, which yields high-quality interpolation results. The deep neural network used for kernel estimation is described below.
1.2 convolution kernel estimation
As a preferred approach, a fully convolutional neural network is used to estimate the convolution kernel of a single output pixel; its structure is detailed in Table 1. Specifically, to estimate the convolution kernel K of the output pixel (x, y), the neural network takes the receptive field blocks R_1(x, y) and R_2(x, y) as input; R_1(x, y) and R_2(x, y) are centered at (x, y) in the respective input images. The patches P_1 and P_2 convolved with the output kernel to generate the color of the output pixel (x, y) are centered at the same location as the receptive fields but are smaller in size, as shown in FIG. 2. The larger receptive field blocks are used to better handle the aperture problem in motion estimation. In the implementation, the default receptive field size is 79 × 79 pixels, the convolution patch size is 41 × 41, and the kernel size is 41 × 82 for the two-patch convolution. The same convolution kernel is applied to each of the three color channels.
TABLE 1 convolutional neural network architecture
As shown in Table 1, the convolutional neural network consists of several convolutional layers and down-convolution layers used in place of max-pooling layers. Rectified linear units are used as activation functions, and batch normalization is used for regularization. The network can be trained end-to-end using widely available video data, which provides a sufficiently large training set. Data augmentation is also used extensively, by flipping the training samples horizontally and vertically and by reversing their order. Because the network is fully convolutional, it is not limited to a fixed-size input, and the shift-and-stitch technique can be used to generate the kernels of multiple pixels simultaneously, which speeds up computation.
One key constraint is that the coefficients of the output convolution kernel should be non-negative and sum to 1. The final convolutional layer is therefore followed by a spatial softmax layer to output the convolution kernels.
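As a rough numerical sketch of the per-pixel synthesis just described (not the patent's trained network), the raw 41 × 82 kernel predicted for one output pixel is passed through a softmax so its coefficients are non-negative and sum to 1, split into two 41 × 41 halves, and applied to the two input patches; the same kernel is shared by the three color channels. The variable names and random inputs are illustrative.

```python
import numpy as np

def synthesize_pixel(patch1, patch2, raw_kernel):
    """Synthesize one output pixel from two 41x41x3 input patches and the
    raw 41x82 kernel predicted by the network for that pixel."""
    # Spatial softmax: coefficients become non-negative and sum to 1.
    k = np.exp(raw_kernel - raw_kernel.max())
    k /= k.sum()
    k1, k2 = k[:, :41], k[:, 41:]          # one 41x41 kernel per input patch
    # The same kernel is applied to each of the three color channels.
    return np.stack([
        (patch1[:, :, c] * k1).sum() + (patch2[:, :, c] * k2).sum()
        for c in range(3)
    ])

# usage: one pixel of the interpolated frame
p1 = np.random.rand(41, 41, 3)
p2 = np.random.rand(41, 41, 3)
raw = np.random.randn(41, 82)
print(synthesize_pixel(p1, p2, raw))       # RGB color of the interpolated pixel
```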
1.3 loss function
For clarity, the notation is first defined. The i-th training example includes two input receptive field blocks R_{i,1} and R_{i,2} centered at (x_i, y_i); the corresponding input patches P_{i,1} and P_{i,2}, which are smaller than the receptive field blocks and centered at the same position; and the ground-truth color C̃_i and ground-truth gradient G̃_i at (x_i, y_i) in the interpolated frame. For simplicity, (x_i, y_i) is omitted from the definitions of the loss functions.
One possible loss function for the deep convolutional neural network is the difference between the interpolated pixel color and the ground-truth color:
E_c = Σ_i ‖ [P_{i,1}  P_{i,2}] * K_i − C̃_i ‖_1   (1)
where the index i denotes the i-th training example and K_i is the convolution kernel output by the neural network. Experiments show that such a color loss, even with an ℓ1 norm, leads to blurred results, as shown in FIG. 4. Since differentiation is also a convolution, the associative property of convolution is used to address this problem, under the assumption that the kernel varies slowly within a local region: the gradients of the input patches are first computed and then convolved with the estimated kernel, which yields the gradient of the interpolated image at the pixel of interest. Since a pixel (x, y) has eight neighboring pixels, eight different gradients are computed by finite differences and all of them are incorporated into a gradient loss function:
E_g = Σ_i Σ_{k=1}^{8} ‖ [G^k_{i,1}  G^k_{i,2}] * K_i − G̃^k_i ‖_1   (2)
where k denotes one of the eight directions in which the gradient is computed, G^k_{i,1} and G^k_{i,2} are the gradients of the input patches P_{i,1} and P_{i,2}, and G̃^k_i is the ground-truth gradient. The above color and gradient losses are combined as the final loss E_c + λ·E_g; λ = 1 was found to work well and is used. This color-plus-gradient loss produces sharper interpolation results, as shown in FIG. 4.
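The color-plus-gradient training loss can be sketched for a single pixel and a single color channel as follows; the eight finite-difference directions, the ℓ1 distances and λ = 1 follow the description above, while the array shapes and the use of np.roll for the finite differences are illustrative simplifications.

```python
import numpy as np

# Offsets of the eight neighbour directions used for the finite-difference gradients.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def finite_diff(patch, dy, dx):
    """Finite difference of a 2D patch along one neighbour direction
    (np.roll is a simplification that wraps around at the borders)."""
    return patch - np.roll(patch, shift=(dy, dx), axis=(0, 1))

def apply_kernel(p1, p2, k):
    """Convolve the two patches with the two halves of the estimated kernel K_i."""
    k1, k2 = k[:, :p1.shape[1]], k[:, p1.shape[1]:]
    return float((p1 * k1).sum() + (p2 * k2).sum())

def color_plus_gradient_loss(p1, p2, k, gt_patch, lam=1.0):
    """E_c + lambda * E_g for a single training pixel; gt_patch is the
    ground-truth patch around (x_i, y_i), its centre pixel the ground-truth colour."""
    cy, cx = gt_patch.shape[0] // 2, gt_patch.shape[1] // 2
    e_c = abs(apply_kernel(p1, p2, k) - gt_patch[cy, cx])
    e_g = 0.0
    for dy, dx in OFFSETS:
        # Gradient of the interpolated pixel: convolve the patch gradients with
        # the same kernel (associativity of convolution).
        g_hat = apply_kernel(finite_diff(p1, dy, dx), finite_diff(p2, dy, dx), k)
        e_g += abs(g_hat - finite_diff(gt_patch, dy, dx)[cy, cx])
    return e_c + lam * e_g

# usage with illustrative shapes: 41x41 patches and a 41x82 kernel
p1, p2 = np.random.rand(41, 41), np.random.rand(41, 41)
k = np.random.rand(41, 82); k /= k.sum()
print(color_plus_gradient_loss(p1, p2, k, np.random.rand(41, 41)))
```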
2 semantic fusion
In a visual odometer based on feature points or optical flow, moving objects in the image can strongly affect the whole system, and changes in illumination and viewpoint affect feature-point extraction and optical-flow estimation. Semantic information, however, can serve as an invariant scene representation: although changes in viewpoint, illumination and scale affect the low-level appearance of an object, they do not affect its semantic representation. Moreover, the semantic information of an image can identify movable objects (people, vehicles, animals and the like) and helps remove the influence of dynamic objects, so the semantic information of the image is integrated into the system.
2.1 extracting semantic information
Semantic information is acquired using YOLO, which integrates the individual steps of target detection into a single neural network, so that the network predicts all bounding boxes of all classes from the features of the whole image (attending to the whole image and all targets in it), achieving end-to-end training and real-time detection. As shown in FIG. 5(a), different categories can be represented by different colors.
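A possible way to fuse the YOLO detections with the ORB feature points is sketched below. The list of dynamic classes and the (class_name, x1, y1, x2, y2) detection format are assumptions; any YOLO implementation that returns class labels and bounding boxes could feed these functions.

```python
import numpy as np

# Classes treated as dynamic objects when fusing semantic and feature information
# (the description names people, vehicles and animals; this list is illustrative).
DYNAMIC_CLASSES = {"person", "car", "truck", "bus", "bicycle", "motorbike",
                   "dog", "cat", "horse"}

def dynamic_mask_from_detections(image_shape, detections):
    """Build a boolean mask of dynamic-object pixels from YOLO detections.

    `detections` is assumed to be a list of (class_name, x1, y1, x2, y2)
    bounding boxes in pixel coordinates, however the detector is invoked.
    """
    h, w = image_shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for cls, x1, y1, x2, y2 in detections:
        if cls in DYNAMIC_CLASSES:
            mask[max(0, int(y1)):min(h, int(y2)),
                 max(0, int(x1)):min(w, int(x2))] = True
    return mask

def keep_static_keypoints(keypoints, mask):
    """Discard ORB keypoints that fall inside dynamic-object regions."""
    return [kp for kp in keypoints
            if not mask[int(kp.pt[1]), int(kp.pt[0])]]
```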
2.2 semantic visual odometer framework
First, consider a standard window-based visual odometry system. Given a set of input images I = {I_k}, the visual odometer, using a given set of corresponding observations Z_{i,k}, jointly optimizes a set of camera poses T = {T_k}, T_k ∈ SE(3), and map points X = {X_i}. An observation can be defined as a keypoint location in the image. The base odometry objective function is therefore
E_base = Σ_{k,i} e_base(k, i)   (3)
For each input image I_k, a dense pixel-wise semantic segmentation S_k is required, in which each pixel is labeled with one of the |C| classes of the set C; each map point is therefore also associated with a category variable Z_i ∈ C. p(Z_i = c | X_i) is the probability that the point P_i located at position X_i belongs to class c. Each point P_i is assigned a label probability vector w_i, where w_i^(c) is the probability that point P_i belongs to class c. To incorporate the semantic constraints into the odometry optimization function, a semantic cost function is defined:
E_sem = Σ_{k,i} e_sem(k, i)   (4)
where each term relates the camera pose T_k and the point P_i (represented by its label Z_i and position X_i) to the semantic image observation S_k. The base and semantic costs are optimized in a joint function
E_joint = E_base + λ·E_sem   (5)
where λ weights the different terms, as described in the following sections.
2.3 semantic cost function
Following a probabilistic approach, an observation likelihood model p(S_k | T_k, X_i, Z_i = c) is first defined, which links the semantic observation S_k to the camera pose T_k and the point P_i. The intuition behind the observation model is that if the pixel corresponding to the projection π(T_k, X_i) of X_i into S_k is labeled c, then the semantic observation likelihood p(S_k | T_k, X_i, Z_i = c) should be high. This probability should decrease with the distance of π(T_k, X_i) to the nearest region labeled c, which is encoded with a distance transform DT_B(p), where p ∈ ℝ² is a pixel position and B is the binary image on which the distance transform is defined, as shown in FIG. 5. More precisely, a binary image B_k^(c) is computed for each semantic class c, such that the pixels labeled c in S_k have value 1 and all other pixels have value 0 (FIG. 5(b)). A distance transform DT_k^(c)(p) = DT_{B_k^(c)}(p) is then defined on this binary image (FIG. 5(c)). Using DT_k^(c), the observation likelihood is defined as
p(S_k | T_k, X_i, Z_i = c) ∝ exp(−DT_k^(c)(π(T_k, X_i))² / (2σ²))   (6)
where π is the projection operator from world coordinates to image space and σ represents the uncertainty of the semantic image classification. For brevity, the normalization factor that makes the probabilities sum to 1 is omitted. For a point labeled c, the likelihood decreases with the distance to the nearest image region labeled c. Intuitively, maximizing the likelihood corresponds to adjusting the camera pose and point position so that the point projection moves toward the correctly labeled image region.
Using the observation likelihood (equation (6)), the semantic cost term is defined as
e_sem(k, i) = Σ_{c∈C} w_i^(c) · DT_k^(c)(π(T_k, X_i))²   (7)
where w_i^(c) is the probability that P_i belongs to class c ∈ C. Intuitively, given a semantic image S_k and a point P_i, the semantic cost e_sem(k, i) is a weighted average of 2D distances: each distance DT_k^(c) from the point projection π(T_k, X_i) to the nearest region of class c is weighted by the probability w_i^(c) that P_i belongs to class c. For example, if P_i carries a highly certain car label, the cost is the distance from the point projection to the nearest region of S_k labeled car. If P_i has equal probabilities for the sidewalk and road labels, the cost is lowest on the boundary between the two classes.
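The class-wise distance transforms and the semantic cost of equation (7) can be sketched with OpenCV as follows; proj_uv is the already-computed projection π(T_k, X_i) of a point into the image, weights holds its label probabilities w_i^(c), and the squared-distance form follows the reconstruction of equations (6) and (7) above. Names are illustrative.

```python
import cv2
import numpy as np

def class_distance_transforms(seg, classes):
    """Distance transform DT_k^(c): distance of every pixel to the nearest
    region labelled c in the semantic segmentation `seg` (H x W class ids)."""
    dts = {}
    for c in classes:
        # Pixels of class c are set to 0 so their distance is 0; all other
        # pixels receive the distance to the nearest class-c pixel.
        src = (seg != c).astype(np.uint8)
        dts[c] = cv2.distanceTransform(src, cv2.DIST_L2, 3)
    return dts

def semantic_cost(proj_uv, weights, dts):
    """e_sem(k, i): weighted sum of squared distances for one point whose
    projection into the image is `proj_uv` and whose label probabilities
    are given by `weights` (dict: class -> probability)."""
    u, v = int(round(proj_uv[0])), int(round(proj_uv[1]))
    return sum(w * dts[c][v, u] ** 2 for c, w in weights.items())
```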
The label probability vector w_i of a point P_i is computed by taking all of its observations into account. Specifically, if P_i is observed by a set of cameras T^i, then
w_i^(c) = α · Π_{T_k ∈ T^i} p(S_k | T_k, X_i, Z_i = c)   (8)
where the constant α ensures Σ_{c∈C} w_i^(c) = 1. This rule allows the label vector w_i to be incrementally refined as semantic observations accumulate. If the observations share the same mode, i.e. their maximum lies in the same class, the element-wise multiplication and normalization make w_i converge to a single mode corresponding to the true label.
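The incremental label update of equation (8) then reduces to an element-wise product followed by renormalization. The sketch below reuses the class-wise distance transforms from the previous example and assumes a Gaussian likelihood as in equation (6); σ = 10 or 40 matches the FIG. 5 examples, and all names are illustrative.

```python
import numpy as np

def observation_likelihoods(proj_uv, dts, classes, sigma=10.0):
    """p(S_k | T_k, X_i, Z_i = c) for all classes, up to a constant factor,
    computed from the class-wise distance transforms as in equation (6)."""
    u, v = int(round(proj_uv[0])), int(round(proj_uv[1]))
    d2 = np.array([dts[c][v, u] ** 2 for c in classes])
    return np.exp(-d2 / (2.0 * sigma ** 2))   # normalisation handled in the update

def update_label_probabilities(w, likelihoods):
    """Incrementally refine the label vector w_i (equation (8)): multiply
    element-wise by the per-class likelihoods of a new semantic observation
    and renormalise so the entries sum to 1."""
    w = np.asarray(w, dtype=float) * np.asarray(likelihoods, dtype=float)
    return w / w.sum()
```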
2.4 optimizing the objective function
The visual semantic odometer uses EM (expectation maximization) to minimize the error function E_joint. The optimization proceeds as follows:
(1) In the E-step, P_i and T_k are held constant and w_i^(c) is computed by equation (8);
(2) In the M-step, w_i^(c) is held constant and the three-dimensional points P_i and camera poses T_k are optimized. Owing to the sparsity of E_sem, the optimization in the M-step can be carried out quickly. It should be noted that if only semantic information is used to optimize the three-dimensional points and camera poses, the constraints it provides are weak, because the probability distribution inside an object boundary is uniform. To avoid this, E_sem is optimized as follows:
(1) semantic constraints are optimized together with the basic visual odometer;
(2) optimizing a camera pose using a plurality of points and semantic constraints;
(3) because the semantic cost constraint is weak, the three-dimensional points are not optimized in the basic system (i.e. the original bundle adjustment cost function); only the camera poses are optimized to reduce drift;
(4) with frequent semantic optimization, the visual semantic odometer reduces the probability that a three-dimensional point is re-projected onto the wrong object. A sketch of this E-step/M-step alternation is given below.
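In the sketch, semantic_likelihood and optimize_poses are hypothetical placeholders for the accumulated per-class likelihoods of equation (8) and for the pose-only optimization of the joint cost, respectively; the fixed iteration count is illustrative.

```python
import numpy as np

def em_semantic_optimization(points, poses, semantic_likelihood, optimize_poses,
                             n_iters=5):
    """Alternating minimisation of E_joint = E_base + lambda * E_sem.

    points  : list of dicts {"X": 3D position, "w": label probability vector}
    poses   : camera poses in whatever representation the base odometer uses
    semantic_likelihood(point, poses) : per-class likelihoods of the point
        accumulated over the frames observing it (equation (8)) -- placeholder
    optimize_poses(points, poses)     : M-step, refines only the camera poses
        under the joint cost -- placeholder
    """
    for _ in range(n_iters):
        # E-step: hold points and poses fixed, update the label vectors w_i.
        for p in points:
            w = np.asarray(p["w"], dtype=float) * semantic_likelihood(p, poses)
            p["w"] = w / w.sum()
        # M-step: hold the label vectors fixed and refine only the camera poses
        # (semantic constraints alone are too weak to move the 3D points).
        poses = optimize_poses(points, poses)
    return points, poses
```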
3 Results and analysis of the experiments
The algorithm platform is configured with an Intel i7-4720HQ CPU with a clock frequency of 2.6 GHz and 16 GB of memory; no GPU acceleration is used, and the operating system is Ubuntu 18.04. The KITTI and TUM data sets are used, and the results are compared with ORB-SLAM.
Table 2 reports the root mean square error (RMSE) of the algorithm presented here and of ORB-SLAM on the KITTI and TUM data sets, as well as the mean processing time per frame. As can be seen from Table 2, compared with ORB-SLAM the accuracy is significantly improved, owing to the image frame interpolation and semantic constraints introduced here, and the difference in processing time is small even though the amount of computation increases.
TABLE 2 RMSE and mean processing time
FIG. 6 shows the absolute trajectory errors of the KITTI 05 and 07 sequences under ORB-SLAM and the proposed algorithm; it can be seen intuitively that the error between the pose estimate of the proposed algorithm and the real trajectory is small.
FIG. 7 shows the absolute trajectory error curves of the TUM sequences fr1_xyz, fr1_floor and fr1_long_office_household under ORB-SLAM and the proposed algorithm. On sequences with complex environments and many feature points, such as fr1_xyz and fr1_long_office_household, ORB-SLAM and the proposed algorithm perform essentially the same, but on the low-texture sequence fr1_floor the proposed algorithm clearly outperforms ORB-SLAM.

Claims (8)

1. The pose track estimation method based on the image frame interpolation method is characterized by comprising the following steps of:
s01 feature detection: the system acquires a series of pictures and then carries out ORB feature detection on the pictures;
s02 semantic recognition: acquiring the number of feature-point matches between the previous and current frames, and performing semantic recognition on the image if the threshold requirement is met; otherwise, performing S021 image frame interpolation;
the S021 image frame inserting process: performing image frame interpolation between two adjacent frames, and then performing detection matching of the feature points again;
and S03 image information fusion: after semantic recognition, fusing semantic image information and feature point image information, and removing feature points detected on the dynamic object; if the number of the characteristic points does not meet the threshold requirement after the characteristic points of the dynamic object are removed, carrying out S021 image frame interpolation;
s04, after the threshold requirement is met, finally carrying out pose estimation;
the S021 image frame interpolation is performed between two adjacent frames, and the detection and matching of the feature points are then performed again, the operation steps comprising: treating pixel interpolation as a convolution over corresponding image patches in the two input image frames, and using a deep fully convolutional neural network to estimate a spatially adaptive convolution kernel; specifically, for a pixel (x, y) in the interpolated frame, the deep neural network takes two receptive field blocks R_1 and R_2 centered on the pixel as input and estimates a convolution kernel K; the convolution kernel is convolved with the input patches P_1 and P_2 to synthesize the output pixel;
given two image frames I_1 and I_2, a frame Î is temporally interpolated in the middle of the two input frames; motion estimation and pixel synthesis are combined into one step, and pixel interpolation is performed as a local convolution over patches in the input images I_1 and I_2; by convolving an appropriate kernel K with the input patches P_1(x, y) and P_2(x, y), the color of the pixel (x, y) in the target image to be interpolated is obtained, the input patches being centered at (x, y) in the respective input images; the convolution kernel K captures the motion and resampling coefficients required for pixel synthesis.
2. The pose track estimation method based on the image frame interpolation method according to claim 1, wherein in the S021 image frame interpolation process, a deep convolutional neural network is used to estimate a suitable convolution kernel to synthesize each output pixel of the interpolated image.
3. The pose track estimation method based on the image frame interpolation method according to claim 1, wherein the output coefficients of the convolution kernel are non-negative and sum to 1.
4. The pose track estimation method based on the image frame interpolation method according to claim 1, wherein in the S021 image frame interpolation process, a color loss function between the interpolated pixel color and the ground-truth color is fused with a gradient loss function.
5. The pose track estimation method based on the image frame interpolation method according to claim 1, wherein in the S02 semantic recognition process, the YOLO algorithm is used to extract semantic information and divide objects into static objects and dynamic objects; the feature points detected on the dynamic objects are removed, the feature points of the static objects are retained, and a new loss function is constructed.
6. The pose track estimation method based on the image frame interpolation method according to claim 1, wherein in the S03 image information fusion process, after the feature points detected on the dynamic objects are removed using the semantic image information, the feature points of the static objects are used to fuse the semantic error and the reprojection error.
7. The pose track estimation method based on the image frame interpolation method according to claim 1, wherein in the S04 pose estimation process, different weights are applied to the semantic error and the reprojection error in the joint optimization function.
8. The pose track estimation method based on the image frame interpolation method according to claim 1, wherein in the S04 pose estimation process, an expectation-maximization algorithm is used to minimize the error function.
CN202011352019.4A 2020-11-27 2020-11-27 Pose track estimation method based on image frame interpolation method Active CN112465021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011352019.4A CN112465021B (en) 2020-11-27 2020-11-27 Pose track estimation method based on image frame interpolation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011352019.4A CN112465021B (en) 2020-11-27 2020-11-27 Pose track estimation method based on image frame interpolation method

Publications (2)

Publication Number Publication Date
CN112465021A CN112465021A (en) 2021-03-09
CN112465021B true CN112465021B (en) 2022-08-05

Family

ID=74808007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011352019.4A Active CN112465021B (en) 2020-11-27 2020-11-27 Pose track estimation method based on image frame interpolation method

Country Status (1)

Country Link
CN (1) CN112465021B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077516B (en) * 2021-04-28 2024-02-23 深圳市人工智能与机器人研究院 Pose determining method and related equipment
CN113671522B (en) * 2021-07-07 2023-06-27 中国人民解放军战略支援部队信息工程大学 Dynamic environment laser SLAM method based on semantic constraint
CN113705431B (en) * 2021-08-26 2023-08-08 山东大学 Track instance level segmentation and multi-motion visual mileage measurement method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104374395A (en) * 2014-03-31 2015-02-25 南京邮电大学 Graph-based vision SLAM (simultaneous localization and mapping) method
CN111462135A (en) * 2020-03-31 2020-07-28 华东理工大学 Semantic mapping method based on visual S L AM and two-dimensional semantic segmentation
CN111582232A (en) * 2020-05-21 2020-08-25 南京晓庄学院 SLAM method based on pixel-level semantic information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Visual SLAM algorithm with minimized photometric-error prior; Han Jianying et al.; Journal of Chinese Computer Systems; 2020-10-15 (No. 10); full text *

Also Published As

Publication number Publication date
CN112465021A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN112435325B (en) VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN108665496B (en) End-to-end semantic instant positioning and mapping method based on deep learning
CN108986136B (en) Binocular scene flow determination method and system based on semantic segmentation
CN109387204B (en) Mobile robot synchronous positioning and composition method facing indoor dynamic environment
Zou et al. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency
CN112465021B (en) Pose track estimation method based on image frame interpolation method
CN110381268B (en) Method, device, storage medium and electronic equipment for generating video
CN109685045B (en) Moving target video tracking method and system
CN111815665B (en) Single image crowd counting method based on depth information and scale perception information
CN101120382A (en) Method for tracking moving object in video acquired of scene with camera
CN104766065B (en) Robustness foreground detection method based on various visual angles study
CN102156995A (en) Video movement foreground dividing method in moving camera
CN105809716B (en) Foreground extraction method integrating superpixel and three-dimensional self-organizing background subtraction method
CN104182968B (en) The fuzzy moving-target dividing method of many array optical detection systems of wide baseline
CN107403451B (en) Self-adaptive binary characteristic monocular vision odometer method, computer and robot
EP3252713A1 (en) Apparatus and method for performing 3d estimation based on locally determined 3d information hypotheses
WO2016165064A1 (en) Robust foreground detection method based on multi-view learning
CN110910421A (en) Weak and small moving object detection method based on block characterization and variable neighborhood clustering
CN114782628A (en) Indoor real-time three-dimensional reconstruction method based on depth camera
Djelouah et al. N-tuple color segmentation for multi-view silhouette extraction
CN113436251A (en) Pose estimation system and method based on improved YOLO6D algorithm
CN113421210A (en) Surface point cloud reconstruction method based on binocular stereo vision
Knorr et al. A modular scheme for 2D/3D conversion of TV broadcast
CN114612545A (en) Image analysis method and training method, device, equipment and medium of related model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant