CN114820792A - Camera positioning method based on mixed attention - Google Patents

Camera positioning method based on mixed attention

Info

Publication number
CN114820792A
Authority
CN
China
Prior art keywords
attention
data set
module
feature
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210466169.0A
Other languages
Chinese (zh)
Inventor
宋霄罡
李宏娟
梁莉
黑新宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202210466169.0A priority Critical patent/CN114820792A/en
Publication of CN114820792A publication Critical patent/CN114820792A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a camera positioning method based on mixed attention, comprising the following steps: step 1, constructing a convolutional neural network for camera positioning based on non-local self-attention; step 2, training the neural network established in step 1; step 3, testing the network trained in step 2. Tests show that the positioning accuracy of the invention on the 7Scenes and Oxford RobotCar data sets is significantly improved.

Description

Camera positioning method based on mixed attention
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, and relates to a camera positioning method based on mixed attention.
Background
Position information is critical to a wide variety of applications, from virtual reality to unmanned aerial vehicles to autonomous driving. One particularly promising direction of investigation is the problem of camera pose regression, or localization, i.e. recovering the three-dimensional position and orientation of the camera from an image or set of images. The most advanced visual localization methods are geometry-based and image-retrieval-based; they mainly rely on establishing matches between 2D pixel locations and 3D points in the scene, followed by pose estimation with a PnP solver. Performance degrades when such systems are deployed in the field, because handcrafted features are susceptible to poor global matching under illumination changes, blur, and scene dynamics. Recent deep-learning-based localization methods can automatically extract features and directly recover the absolute camera pose from a single image, without manually constructing a map or a landmark feature database. Since they automatically learn local features, feature matching, and outlier filtering, they can handle large scenes with complex geometry and appearance that changes over time. However, the most advanced learning-based feature methods have many parameters and multiple complex components, which may require considerable experience to tune, and training feature-based localization methods end to end remains challenging due to complexity and stability issues. Therefore, an accurate and stable end-to-end pose estimation method is needed that directly estimates the camera pose with a convolutional neural network from only a single input image.
Disclosure of Invention
The invention aims to provide a camera positioning method based on mixed attention that achieves accurate and stable positioning in dynamic scenes.
The technical scheme adopted by the invention is a camera positioning method based on mixed attention that uses monocular images as input. First, a visual encoder extracts the features required for the pose regression task; ResNet34 is adopted as the basis of the encoder because the residual network allows deeper layers of the neural network to be trained, yielding more stable and more accurate positioning than other architectures. Then, channel attention and non-local self-attention are combined to aggregate global image information at the channel level and from the picture context, ignoring dynamic objects and uninformative features and screening out the features beneficial to camera pose regression. The method is specifically implemented according to the following steps:
step 1, constructing a convolutional neural network based on non-local self-attention camera positioning;
step 2, training the neural network established in the step 1;
and 3, testing the network trained in the step 2.
The invention is also characterized in that:
the convolutional neural network in the step 1 comprises a feature coding module, a mixed attention module and a pose regression module, and is implemented according to the following steps:
step 1.1, inputting an image into a network, and performing downsampling by a feature coding module to extract features;
step 1.2, capturing channel-level and spatial-level dependencies on the feature map through the channel attention and non-local self-attention modules, and outputting an attention weight map that encodes these dependencies;
step 1.3, inputting the calculated attention weight into a pose regressor to regress the pose of the camera;
wherein the step 1.1 is implemented according to the following steps:
step 1.1.1, inputting an RGB image and resizing the picture to 256 × 256, i.e. the size of the picture input into the network is 256 × 256 × 3;
step 1.1.2, performing one ordinary 7 × 7 convolution on the input image, adjusting the image size to 128 × 128 and the number of channels to 64, followed by batch normalization and ReLU activation;
step 1.1.3, passing the feature map obtained in step 1.1.2 into the residual convolution blocks for 16 residual convolutions with 3 × 3 kernels; the output picture size is 8 × 8 and the number of channels is 512;
step 1.1.4, applying average pooling and a fully connected operation to the feature map obtained in step 1.1.3, finally outputting a 2048-channel feature map;
wherein the residual convolution block in step 1.1.3 is constructed as follows: channel expansion is first performed through a 3 × 3 convolution, followed by BN and ReLU activation; feature extraction is then performed through a 3 × 3 convolution, again followed by BN and ReLU activation;
wherein the step 1.2 is implemented according to the following steps:
step 1.2.1, passing the feature map obtained by the feature extraction module into the mixed attention module, and constructing channel attention and non-local self-attention in parallel;
step 1.2.2, passing the feature map output by the feature extraction module into the channel attention module, and aggregating the global features of the image at the feature-channel level;
step 1.2.3, passing the feature map output in step 1.2.2 into the non-local self-attention module, capturing long-range dependencies between picture features, and finally outputting a feature map with 2048 channels;
wherein the step 1.3 is implemented according to the following steps:
step 1.3.1, inputting the 2048-dimensional feature map obtained in step 1.2.3 into the pose regressor and constructing a multilayer perceptron module;
step 1.3.2, inputting the feature map into a fully connected layer to obtain a feature map of size 1 × 2048;
step 1.3.3, inputting the obtained feature map into two separate fully connected layers to obtain two three-dimensional feature vectors representing translation and rotation respectively;
step 1.3.4, concatenating the two obtained three-dimensional vectors to finally obtain a six-dimensional pose vector;
the data set of the network training in the step 2 is divided into an indoor data set and an outdoor data set, the indoor data set is 7Scenes, the outdoor data set is Oxford RobotCar, and the method is implemented according to the following steps:
step 2.1, loading a data set and initializing weight parameters;
step 2.2, segmenting data set data, using 70% of images for training and 30% of images for estimation;
step 2.3, outputting a training loss value after every 5 epochs by adopting an L1 loss function;
step 2.4, setting the initial learning rate to be 5e-5, and training in a mode of automatic decline of the learning rate;
step 2.5, stopping training and storing the model when the loss value is not reduced after the training reaches 600 epoch;
wherein the step 2.2 is implemented according to the following steps:
firstly, inputting a training set into a network according to a preset batch, then, setting a picture resize in a data set to be 256 pixels, normalizing the images to enable the pixel intensity to be within a range of (-1,1), and setting the brightness, the contrast and the saturation to be 0.7 and the hue to be 0.5 on an Oxford RobotCar data set;
wherein the step 3 is implemented according to the following steps:
step 3.1, loading a test picture in the data set, and setting a regression dimension of the pose of the camera;
step 3.2, loading the trained model parameters and reading a test data set;
step 3.3, feeding each frame of the data set into the camera regression model and performing regression prediction;
and step 3.4, calculating the translation and rotation errors of the regressed pose.
The invention has the beneficial effects that:
the camera positioning method based on mixed attention of the invention uses monocular images as input, learns to reject dynamic objects and lighting conditions to obtain better performance, and can efficiently operate in indoor and outdoor data sets. In addition, an end-to-end pose estimation algorithm framework is convenient to improve and has a promoted space, and the work of the text makes a meaningful exploration for the application of deep learning in the field of visual SLAM.
Drawings
FIG. 1 is a block diagram of the mixed attention-based camera positioning network of the present invention;
FIG. 2 is a block diagram of the residual block in the feature encoder of the mixed attention-based camera positioning network of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a camera positioning method based on mixed attention, which is implemented by the following steps:
Step 1, constructing a camera positioning network based on mixed attention: image features are extracted by the feature encoder, the extracted features are passed into the channel attention module and the non-local self-attention module to screen out geometrically robust features, and the screened features are input into the pose regressor to regress the translation and rotation vectors; the specific structure of the network is shown in FIG. 1;
The network structure is divided into 3 modules: 1) a feature encoding module; 2) a mixed attention module; 3) a pose regression module;
After an image is input into the network, downsampling is first performed by the feature coding module to extract features, spatial and channel feature information is then fused by the mixed attention module, and finally the fused features are input into the pose regression module to guide pose regression;
Step 2, network training: the method builds the network structure with the PyTorch framework, uses the L1 function as the loss function, optimizes the training parameters with the Adam algorithm, and adopts an early-stopping strategy during training to prevent overfitting and achieve the best training effect;
step 3, network testing: and inputting the test image into a network to obtain a pose estimation result, calculating loss values of translation and rotation, and evaluating the network performance.
1) A feature encoding module;
the module is used for extracting the linear features of the image from low dimension to high dimension abstract features, most parameters and calculated amount of the pose regression network come from the module, in order to ensure the accuracy and light weight and simultaneously extract the deep features, a network ResNet34 suitable for classification and segmentation is used as the backbone of the network, and the network can extract the deeper features;
ResNet has 2 basic blocks, one is a residual block, and the input and output dimensions are the same, so multiple can be connected in series. Another basic block is a convolution block, the input and output dimensions are different, and therefore they cannot be connected in series, and its role is to change the dimension of the feature vector, because CNN is to gradually convert the input image into a feature map with a small size and a deep depth, generally using a uniform and relatively small convolution kernel, but as the depth of the network increases, the number of output channels increases, and the networking becomes more and more complex, so it is necessary to convert the dimension with the convolution block before entering the residual block, so that the network can continuously connect the residual block, and the detailed structure is as shown in table 1 below.
The characteristic coding module is implemented according to the following steps:
step 1.1.1, inputting an image and resizing the picture to 256 × 256, i.e. the size of the picture input into the network is 256 × 256 × 3;
step 1.1.2, performing one ordinary convolution on the input image, downsampling the spatial dimensions h × w once, adjusting the number of channels to 64, and applying BN and ReLU activation;
step 1.1.3, passing the feature map obtained in step 1.1.2 into the residual convolution blocks for 16 residual convolutions, obtaining a 512-channel feature map;
step 1.1.4, applying one average pooling operation and a fully connected operation to the feature map obtained in step 1.1.3, finally outputting a feature map of size 1 × 1 × 2048;
The residual block is constructed as follows:
channel expansion is first performed through a 3 × 3 convolution, followed by BN and ReLU activation; feature extraction is then performed through a 3 × 3 convolution, again followed by BN and ReLU activation; the detailed structure is shown in FIG. 2;
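By way of illustration, the following is a minimal PyTorch sketch of the feature encoder and residual block described above; it assumes a torchvision ResNet34 trunk followed by global average pooling and a fully connected projection to 2048 channels, and the class and variable names are illustrative rather than part of the filed implementation.

```python
import torch.nn as nn
import torchvision.models as models

class BasicResidualBlock(nn.Module):
    """Residual block of FIG. 2: two 3x3 conv + BN + ReLU layers with an
    identity shortcut (the ResNet34 trunk below already contains such blocks)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                  # identity shortcut

class FeatureEncoder(nn.Module):
    """ResNet34 trunk (7x7 conv + 16 residual blocks) followed by global
    average pooling and a fully connected projection to 2048 channels."""
    def __init__(self, out_dim=2048):
        super().__init__()
        backbone = models.resnet34(weights=None)
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, out_dim)          # 512 -> 2048 channels

    def forward(self, x):                          # x: (B, 3, 256, 256)
        f = self.trunk(x)                          # (B, 512, 8, 8)
        f = self.pool(f).flatten(1)                # (B, 512)
        return self.fc(f)                          # (B, 2048)
```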
Table 1 Feature encoder architecture
2) A mixed attention module; this module comprises a channel attention part and a non-local self-attention part and is used to extract feature dependencies at the channel level and the position level, so as to screen out features beneficial to camera pose regression;
the mixed attention module is implemented by the following steps:
step 1.2.1, passing the features extracted by the feature coding module into the mixed attention module, and constructing channel attention and non-local self-attention in parallel;
step 1.2.2, passing the feature map output by the feature extraction module into the channel attention module, and aggregating the global features of the image at the feature-channel level;
step 1.2.3, passing the feature map output in step 1.2.2 into the non-local self-attention module, capturing long-range dependencies between picture features, and finally outputting a feature map with 2048 channels;
The specific operation of step 1.2.2 is as follows:
firstly, the features passed into the channel attention module are compressed along the channel dimension with a reduction factor of 16, giving a feature map with 128 channels; this 128-channel feature map is then fed into a fully connected layer, finally yielding a feature map with 2048 channels.
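As an illustration only, a squeeze-and-excitation-style sketch of this channel attention step is given below, using the reduction factor of 16 mentioned above; the exact wiring of the filed module may differ.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze to a channel descriptor, reduce
    2048 -> 128 channels, expand back to 2048, and re-weight the input."""
    def __init__(self, channels=2048, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # squeeze: (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),      # 2048 -> 128
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),      # 128 -> 2048
            nn.Sigmoid(),
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c))                 # per-channel weights
        return x * w.view(b, c, 1, 1)                        # re-weighted features
```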
The specific operation of step 1.2.3 is as follows:
firstly, the feature map output in step 1.2.2 is passed into the non-local self-attention module; an eight-fold reduction is first applied to the number of channels to reduce the computation, the similarity between features is then computed by matrix multiplication and followed by a softmax operation, and finally the output channels are restored by a 1 × 1 convolution so that the input and output scales are exactly the same. The essence of this step is that each output position is a weighted average over all other positions, and the softmax operation further highlights what they have in common;
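A hedged sketch of such a non-local block is given below: channels are reduced eight-fold by 1 × 1 convolutions, pairwise similarities are computed by matrix multiplication and normalized with softmax, and a final 1 × 1 convolution restores the channel count; the projection names (theta, phi, g) follow the common non-local-network convention and are not taken from the filing.

```python
import torch
import torch.nn as nn

class NonLocalAttention(nn.Module):
    def __init__(self, channels=2048, reduction=8):
        super().__init__()
        inner = channels // reduction                        # 2048 -> 256
        self.theta = nn.Conv2d(channels, inner, 1)           # query projection
        self.phi = nn.Conv2d(channels, inner, 1)             # key projection
        self.g = nn.Conv2d(channels, inner, 1)               # value projection
        self.out = nn.Conv2d(inner, channels, 1)             # restore channel count

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).view(b, -1, h * w)                 # (B, C', N)
        k = self.phi(x).view(b, -1, h * w)
        v = self.g(x).view(b, -1, h * w)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, N, N) similarities
        y = (v @ attn.transpose(1, 2)).view(b, -1, h, w)     # weighted average over positions
        return x + self.out(y)                               # output scale equals input scale
```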
3) A pose regression module; this module passes the feature map output by the mixed attention module into a multilayer perceptron for fully connected operations and outputs two three-dimensional vectors representing position and orientation respectively;
The pose estimation module is constructed according to the following steps (a minimal code sketch is given after step 1.3.4):
step 1.3.1, taking the 2048-dimensional feature map of fused features output by the attention module and constructing a multilayer perceptron module;
step 1.3.2, inputting the feature map into a fully connected layer to obtain a feature map of size 1 × 2048;
step 1.3.3, inputting the obtained feature map into two separate fully connected layers to obtain two three-dimensional feature vectors representing translation and rotation;
step 1.3.4, concatenating the two obtained three-dimensional vectors to finally obtain a six-dimensional pose vector;
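A minimal sketch of such a pose regressor, assuming the 2048-dimensional fused feature described above, is shown below; the hidden layer and head names are illustrative.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    """MLP head: one fully connected layer on the 2048-D feature, then two
    3-D heads (translation, rotation) concatenated into a 6-D pose vector."""
    def __init__(self, in_dim=2048):
        super().__init__()
        self.fc = nn.Linear(in_dim, in_dim)
        self.fc_t = nn.Linear(in_dim, 3)           # translation head
        self.fc_r = nn.Linear(in_dim, 3)           # rotation head

    def forward(self, f):                          # f: (B, 2048)
        h = torch.relu(self.fc(f))                 # 1 x 2048 feature
        t = self.fc_t(h)                           # (B, 3) translation
        r = self.fc_r(h)                           # (B, 3) rotation
        return torch.cat([t, r], dim=1)            # (B, 6) pose vector
```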
The data set for training the network in step 2 is divided into an indoor data set and an outdoor data set; the indoor data set is the 7Scenes data set and the outdoor data set is the Oxford RobotCar data set, whose details are given in Table 2 below. Training is specifically implemented according to the following steps (a training sketch follows step 2.5):
step 2.1, loading a data set and initializing weight parameters;
step 2.2, splitting the data set, using 70% of the images for training and 30% for evaluation;
step 2.3, outputting a training loss value every 5 epochs using an L1 loss function;
step 2.4, setting the initial learning rate to 5e-5 and training with an automatically decaying learning rate;
step 2.5, stopping training and saving the model when the loss value no longer decreases after training reaches 600 epochs;
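The training procedure of steps 2.1 to 2.5 can be summarized by the following hedged PyTorch sketch (L1 loss, Adam, initial learning rate 5e-5, automatic learning-rate decay, loss reported every 5 epochs, model saved when the loss improves); `model` and `train_loader` are assumed to exist, and the scheduler settings are illustrative rather than taken from the filing.

```python
import torch

def train(model, train_loader, device="cuda", max_epochs=600):
    model.to(device)
    criterion = torch.nn.L1Loss()                                    # L1 loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)        # initial lr 5e-5
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5)
    best = float("inf")
    for epoch in range(1, max_epochs + 1):
        total = 0.0
        for images, poses in train_loader:                           # poses: (B, 6) ground truth
            images, poses = images.to(device), poses.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), poses)                   # L1 loss on the 6-D pose
            loss.backward()
            optimizer.step()
            total += loss.item()
        scheduler.step(total)                                        # automatic lr decay
        if epoch % 5 == 0:
            print(f"epoch {epoch}: training loss {total:.4f}")       # report every 5 epochs
        if total < best:                                             # keep the best model
            best = total
            torch.save(model.state_dict(), "best_model.pth")
    return model
```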
Table 2 Training and testing sequences on the Oxford RobotCar dataset
The specific operation of step 2.2 is as follows:
firstly, the training set is fed into the network in preset batches; the pictures in the data set are then resized to 256 pixels and the images are normalized so that pixel intensities lie in the range (-1, 1); on the Oxford RobotCar data set, brightness, contrast and saturation jitter are set to 0.7 and hue jitter to 0.5. This augmentation step helps improve the generalization ability of the model under various weather and climate conditions;
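A hedged torchvision sketch of this preprocessing is given below (resize to 256 pixels, normalization to (-1, 1), and colour jitter applied only for the outdoor Oxford RobotCar data); the normalization constants are an assumption, chosen simply to map [0, 1] pixels into (-1, 1).

```python
from torchvision import transforms

def build_transform(outdoor=False):
    # Colour jitter only for Oxford RobotCar: brightness/contrast/saturation 0.7, hue 0.5
    augment = [transforms.ColorJitter(brightness=0.7, contrast=0.7,
                                      saturation=0.7, hue=0.5)] if outdoor else []
    return transforms.Compose(augment + [
        transforms.Resize(256),                         # resize pictures to 256 pixels
        transforms.ToTensor(),                          # pixels in [0, 1]
        transforms.Normalize(mean=[0.5, 0.5, 0.5],      # map intensities to (-1, 1)
                             std=[0.5, 0.5, 0.5]),
    ])
```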
Step 3 tests the predictions of the network trained in step 2 to evaluate the network performance, and is implemented according to the following steps (an error-computation sketch follows step 3.4):
step 3.1, loading a test picture in the data set, and setting a regression dimension of the pose of the camera;
step 3.2, loading the trained model parameters and reading a test data set;
step 3.3, feeding each frame of the data set into the camera regression model and performing regression prediction;
step 3.4, calculating the translation and rotation errors of the regressed pose;
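For illustration, the translation and rotation errors of step 3.4 can be computed as in the sketch below, which assumes the 3-D rotation part of the 6-D pose is an axis-angle (rotation) vector; the filing does not spell out the rotation parameterization, so this is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_errors(pred, gt):
    """pred, gt: length-6 arrays [tx, ty, tz, rx, ry, rz] (rotation part assumed axis-angle)."""
    t_err = np.linalg.norm(pred[:3] - gt[:3])                 # translation error (metres)
    r_pred = Rotation.from_rotvec(pred[3:])
    r_gt = Rotation.from_rotvec(gt[3:])
    r_err = np.degrees((r_pred.inv() * r_gt).magnitude())     # rotation error (degrees)
    return t_err, r_err
```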
the following illustrates the effect of the invention on the test set:
table 3 summarizes the performance of all methods on the 7Scenes dataset, and it is clear that the method of the present invention is superior to other monocular image-based methods, with a 53% improvement in position accuracy and a 19% improvement in rotation accuracy over the baseline PoseNet based on monocular images, especially achieving the best performance in Scenes with highly repetitive textures (such as Chess and Stairs); this is a significant improvement over the prior art, and still achieves greater accuracy than the baseline in other conventional scenarios.
Table 3 Network performance comparison on the 7Scenes dataset
Table 4 shows a quantitative comparison of PoseNet, MapNet, LsG and our method. Because the training and test sequences are captured at different times and under different conditions, PoseNet has difficulty coping with these changes and outputs a large number of inaccurate, outlying estimates. MapNet produces more accurate results and removes many outliers by introducing the relative poses between successive frames as an additional constraint; however, larger areas may contain more locally similar appearances, which reduces the ability to localize. By employing content augmentation, LsG alleviates this problem to some extent, although its accuracy still degrades. In contrast, the model of the present invention addresses these challenges more effectively by taking content and motion into account, with an 83% improvement in position accuracy and a 76% improvement in rotation accuracy compared to PoseNet+.
Table 4 Network performance comparison on the Oxford RobotCar dataset
Camera positioning is a challenging task in computer vision due to the high variability of scene dynamics and environmental appearance. The invention relates to a camera positioning method based on mixed attention: ResNet34 is used as the backbone network in the feature coding module to extract deeper features, and a mixed attention module is introduced in the pose regression process to weight the channel features and the picture context features, screen out geometrically stable features, and reduce the influence of dynamic objects and illumination changes. Finally, the weighted features are input into the pose regressor to guide pose regression. Experimental analysis shows that the positioning accuracy of the model is significantly improved on both outdoor and indoor data sets.

Claims (9)

1. A camera positioning method based on mixed attention is characterized by comprising the following steps:
step 1, constructing a convolutional neural network based on non-local self-attention camera positioning;
step 2, training the neural network established in the step 1;
and 3, testing the network trained in the step 2.
2. The method according to claim 1, wherein the convolutional neural network of step 1 comprises three parts, namely a feature coding module, a mixed attention module and a pose regression module, and is implemented by the following steps:
step 1.1, inputting an image into a network, and performing downsampling by a feature coding module to extract features;
step 1.2, capturing channel-level and spatial-level dependencies on the feature map through the channel attention and non-local self-attention modules, and outputting an attention weight map that encodes these dependencies;
and step 1.3, inputting the calculated attention weight into a pose regressor for regressing the pose of the camera.
3. The mixed attention-based camera positioning method according to claim 2, wherein the step 1.1 is specifically implemented according to the following steps:
step 1.1.1, inputting an RGB image and resizing the picture to 256 × 256, i.e. the size of the picture input into the network is 256 × 256 × 3;
step 1.1.2, performing one ordinary 7 × 7 convolution on the input image, adjusting the image size to 128 × 128 and the number of channels to 64, followed by batch normalization and ReLU activation;
step 1.1.3, passing the feature map obtained in step 1.1.2 into the residual convolution blocks for 16 residual convolutions with 3 × 3 kernels; the output picture size is 8 × 8 and the number of channels is 512;
and step 1.1.4, applying average pooling and a fully connected operation to the feature map obtained in step 1.1.3, finally outputting a 2048-channel feature map.
4. The mixed attention-based camera positioning method according to claim 3, wherein the residual convolution block in step 1.1.3 is constructed as follows: channel expansion is first performed through a 3 × 3 convolution, followed by BN and ReLU activation; feature extraction is then performed through a 3 × 3 convolution, again followed by BN and ReLU activation.
5. The mixed attention-based camera positioning method according to claim 2, wherein the step 1.2 is specifically implemented according to the following steps:
step 1.2.1, passing the feature map obtained by the feature extraction module into the mixed attention module, and constructing channel attention and non-local self-attention in parallel;
step 1.2.2, passing the feature map output by the feature extraction module into the channel attention module, and aggregating the global features of the image at the feature-channel level;
and step 1.2.3, passing the feature map output in step 1.2.2 into the non-local self-attention module, capturing long-range dependencies between picture features, and finally outputting a feature map with 2048 channels.
6. The mixed attention-based camera positioning method according to claim 2 or 5, wherein the step 1.3 is implemented according to the following steps:
step 1.3.1, inputting the 2048-dimensional feature map obtained in step 1.2.3 into the pose regressor and constructing a multilayer perceptron module;
step 1.3.2, inputting the feature map into a fully connected layer to obtain a feature map of size 1 × 2048;
step 1.3.3, inputting the obtained feature map into two separate fully connected layers to obtain two three-dimensional feature vectors representing translation and rotation respectively;
and step 1.3.4, concatenating the two obtained three-dimensional vectors to finally obtain a six-dimensional pose vector.
7. The method according to claim 1, wherein the data set for the network training in step 2 is divided into an indoor data set and an outdoor data set, the indoor data set being 7Scenes and the outdoor data set being Oxford RobotCar, and the training is implemented according to the following steps:
step 2.1, loading a data set and initializing weight parameters;
step 2.2, splitting the data set, using 70% of the images for training and 30% for evaluation;
step 2.3, outputting a training loss value every 5 epochs using an L1 loss function;
step 2.4, setting the initial learning rate to 5e-5 and training with an automatically decaying learning rate;
and step 2.5, stopping training and saving the model when the loss value no longer decreases after training reaches 600 epochs.
8. The mixed attention-based camera positioning method according to claim 7, wherein the step 2.2 is specifically implemented as follows:
firstly, the training set is fed into the network in preset batches; the pictures in the data set are then resized to 256 pixels and the images are normalized so that pixel intensities lie in the range (-1, 1); on the Oxford RobotCar data set, brightness, contrast and saturation jitter are set to 0.7 and hue jitter to 0.5.
9. The mixed attention-based camera positioning method according to claim 1, wherein the step 3 is specifically implemented according to the following steps:
step 3.1, loading a test picture in the data set, and setting a regression dimension of the pose of the camera;
step 3.2, loading the trained model parameters and reading a test data set;
step 3.3, feeding each frame of the data set into the camera regression model and performing regression prediction;
and step 3.4, calculating the translation and rotation errors of the regressed pose.
CN202210466169.0A 2022-04-29 2022-04-29 Camera positioning method based on mixed attention Pending CN114820792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210466169.0A CN114820792A (en) 2022-04-29 2022-04-29 Camera positioning method based on mixed attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210466169.0A CN114820792A (en) 2022-04-29 2022-04-29 Camera positioning method based on mixed attention

Publications (1)

Publication Number Publication Date
CN114820792A true CN114820792A (en) 2022-07-29

Family

ID=82508973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210466169.0A Pending CN114820792A (en) 2022-04-29 2022-04-29 Camera positioning method based on mixed attention

Country Status (1)

Country Link
CN (1) CN114820792A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750148A (en) * 2021-01-13 2021-05-04 浙江工业大学 Multi-scale target perception tracking method based on twin network
CN113223181A (en) * 2021-06-02 2021-08-06 广东工业大学 Weak texture object pose estimation method
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN114170304A (en) * 2021-11-04 2022-03-11 西安理工大学 Camera positioning method based on multi-head self-attention and replacement attention

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN112750148A (en) * 2021-01-13 2021-05-04 浙江工业大学 Multi-scale target perception tracking method based on twin network
CN113223181A (en) * 2021-06-02 2021-08-06 广东工业大学 Weak texture object pose estimation method
CN114170304A (en) * 2021-11-04 2022-03-11 西安理工大学 Camera positioning method based on multi-head self-attention and replacement attention

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
机器之心: "Has Transformer become the new ruler? FAIR et al. redesign a purely convolutional ConvNet whose performance surpasses it", 《HTTPS://MP.WEIXIN.QQ.COM/S/VUIHXKMNEAVKXHFPCH7BBW》 *
荣昕萌 (Rong Xinmeng): "Research on image denoising methods based on non-local attention and channel attention", China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》) *

Similar Documents

Publication Publication Date Title
CN112233038B (en) True image denoising method based on multi-scale fusion and edge enhancement
US20200250436A1 (en) Video object segmentation by reference-guided mask propagation
CN111915530B (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN108648216B (en) Visual odometer implementation method and system based on optical flow and deep learning
CN111626960A (en) Image defogging method, terminal and computer storage medium
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN114170304B (en) Camera positioning method based on multi-head self-attention and replacement attention
CN112651423A (en) Intelligent vision system
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113762358A (en) Semi-supervised learning three-dimensional reconstruction method based on relative deep training
CN118097150B (en) Small sample camouflage target segmentation method
CN114463218A (en) Event data driven video deblurring method
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN115035172B (en) Depth estimation method and system based on confidence grading and inter-stage fusion enhancement
CN110889868A (en) Monocular image depth estimation method combining gradient and texture features
CN117876452A (en) Self-supervision depth estimation method and system based on moving object pose estimation
CN114359626A (en) Visible light-thermal infrared obvious target detection method based on condition generation countermeasure network
CN114155165A (en) Image defogging method based on semi-supervision
CN113850158A (en) Video feature extraction method
CN115035173B (en) Monocular depth estimation method and system based on inter-frame correlation
CN114820792A (en) Camera positioning method based on mixed attention
CN118279206A (en) Image processing method and device
Jiang et al. Attention-based self-supervised learning monocular depth estimation with edge refinement
CN114821438A (en) Video human behavior identification method and system based on multipath excitation
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20220729