CN114820792A - Camera positioning method based on mixed attention - Google Patents

Camera positioning method based on mixed attention

Info

Publication number
CN114820792A
Authority
CN
China
Prior art keywords
attention
data set
module
feature
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210466169.0A
Other languages
Chinese (zh)
Inventor
宋霄罡
李宏娟
梁莉
黑新宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202210466169.0A priority Critical patent/CN114820792A/en
Publication of CN114820792A publication Critical patent/CN114820792A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a camera positioning method based on mixed attention, comprising the following steps: step 1, constructing a convolutional neural network for camera positioning based on non-local self-attention; step 2, training the neural network established in step 1; step 3, testing the network trained in step 2. Tests show that the positioning accuracy of the invention on the 7Scenes and Oxford RobotCar data sets is significantly improved.

Description

Camera positioning method based on mixed attention
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, and relates to a camera positioning method based on mixed attention.
Background
Position information is critical to a wide variety of applications, from virtual reality to unmanned aerial vehicles to autonomous driving. One particularly promising direction of investigation is the problem of camera pose regression, or localization, i.e. recovering the three-dimensional position and orientation of the camera from an image or set of images. The most advanced visual localization methods are geometry-based and image-retrieval-based; they mainly rely on establishing matches between 2D pixel locations and 3D points in the scene, followed by pose estimation with a PnP solver. Performance degrades when such systems are deployed in the field, because handcrafted features are susceptible to poor global matching under illumination changes, blur, and scene dynamics. Recent deep-learning-based localization methods can automatically extract features and directly recover the absolute camera pose from a single image, without manually constructing a map or a landmark feature database. Since they automatically learn local features, feature matching, and outlier filtering, they can handle large scenes with complex geometry and appearance that changes over time. However, the most advanced learning-based feature methods have many parameters and multiple complex components, which may require considerable experience to tune, and training feature-based localization methods end to end remains challenging due to complexity and stability issues. Therefore, an accurate and stable end-to-end pose estimation method is needed that directly estimates the camera pose with a convolutional neural network from only a single input image.
Disclosure of Invention
The invention aims to provide a camera positioning method based on mixed attention that achieves accurate and stable positioning in dynamic scenes.
The technical scheme adopted by the invention is a camera positioning method based on mixed attention that uses monocular images as input. First, a visual encoder extracts the features required for the pose regression task; ResNet34 is adopted as the basis of the encoder because the residual network allows deeper layers of the neural network to be trained, yielding more stable and more accurate positioning than other architectures. Then, channel attention and non-local self-attention are combined to aggregate global image information at the channel level and from the picture context, ignoring dynamic objects and uninformative features and screening out the features beneficial to camera pose regression. The method is specifically implemented according to the following steps:
step 1, constructing a convolutional neural network based on non-local self-attention camera positioning;
step 2, training the neural network established in the step 1;
and 3, testing the network trained in the step 2.
The invention is also characterized in that:
the convolutional neural network in the step 1 comprises a feature coding module, a mixed attention module and a pose regression module, and is implemented according to the following steps:
step 1.1, inputting an image into a network, and performing downsampling by a feature coding module to extract features;
step 1.2, capturing channel-level and spatial-level dependencies on the feature map through the channel attention and non-local self-attention modules, and outputting an attention weight map that encodes these dependencies;
step 1.3, inputting the calculated attention weight into a pose regressor to regress the pose of the camera;
wherein the step 1.1 is implemented according to the following steps:
step 1.1.1, inputting an RGB image and resizing the picture to 256 × 256, i.e. the size of the picture input into the network is 256 × 256 × 3;
step 1.1.2, performing one ordinary 7 × 7 convolution on the input image, adjusting the image size to 128 × 128 and the number of channels to 64, followed by batch normalization and ReLU activation;
step 1.1.3, passing the feature map obtained in step 1.1.2 into the residual convolution blocks for 16 residual convolutions with 3 × 3 kernels; the output picture size is 8 × 8 and the number of channels is 512;
step 1.1.4, applying average pooling and a fully connected operation to the feature map obtained in step 1.1.3, finally outputting a 2048-channel feature map;
wherein the residual convolution block in step 1.1.3 is constructed as follows: channel expansion is first performed through a 3 × 3 convolution, followed by BN and ReLU activation; feature extraction is then performed through a 3 × 3 convolution, again followed by BN and ReLU activation;
wherein the step 1.2 is implemented according to the following steps:
step 1.2.1, passing the feature map obtained by the feature extraction module into the mixed attention module, and constructing channel attention and non-local self-attention in parallel;
step 1.2.2, passing the feature map output by the feature extraction module into the channel attention module, and aggregating the global features of the image at the feature-channel level;
step 1.2.3, passing the feature map output in step 1.2.2 into the non-local self-attention module, capturing long-range dependencies between picture features, and finally outputting a feature map with 2048 channels;
wherein the step 1.3 is implemented according to the following steps:
step 1.3.1, inputting the 2048-dimensional feature map obtained in step 1.2.3 into the pose regressor and constructing a multilayer perceptron module;
step 1.3.2, inputting the feature map into a fully connected layer to obtain a feature map of size 1 × 2048;
step 1.3.3, inputting the obtained feature map into two separate fully connected layers to obtain two three-dimensional feature vectors representing translation and rotation respectively;
step 1.3.4, concatenating the two obtained three-dimensional vectors to finally obtain a six-dimensional pose vector;
the data set of the network training in the step 2 is divided into an indoor data set and an outdoor data set, the indoor data set is 7Scenes, the outdoor data set is Oxford RobotCar, and the method is implemented according to the following steps:
step 2.1, loading a data set and initializing weight parameters;
step 2.2, segmenting data set data, using 70% of images for training and 30% of images for estimation;
step 2.3, outputting a training loss value after every 5 epochs by adopting an L1 loss function;
step 2.4, setting the initial learning rate to be 5e-5, and training in a mode of automatic decline of the learning rate;
step 2.5, stopping training and storing the model when the loss value is not reduced after the training reaches 600 epoch;
wherein the step 2.2 is implemented according to the following steps:
firstly, inputting a training set into a network according to a preset batch, then, setting a picture resize in a data set to be 256 pixels, normalizing the images to enable the pixel intensity to be within a range of (-1,1), and setting the brightness, the contrast and the saturation to be 0.7 and the hue to be 0.5 on an Oxford RobotCar data set;
wherein the step 3 is implemented according to the following steps:
step 3.1, loading a test picture in the data set, and setting a regression dimension of the pose of the camera;
step 3.2, loading the trained model parameters and reading a test data set;
step 3.3, feeding each frame of the data set into the camera regression model and performing regression prediction;
and step 3.4, calculating the translation and rotation errors of the regressed pose.
The invention has the beneficial effects that:
the camera positioning method based on mixed attention of the invention uses monocular images as input, learns to reject dynamic objects and lighting conditions to obtain better performance, and can efficiently operate in indoor and outdoor data sets. In addition, an end-to-end pose estimation algorithm framework is convenient to improve and has a promoted space, and the work of the text makes a meaningful exploration for the application of deep learning in the field of visual SLAM.
Drawings
FIG. 1 is a block diagram of the mixed attention-based camera positioning network of the present invention;
FIG. 2 is a block diagram of the residual block in the feature encoder of the mixed attention-based camera positioning network of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a camera positioning method based on mixed attention, which is implemented by the following steps:
Step 1, constructing a camera positioning network based on mixed attention: image features are extracted by the feature encoder, the extracted features are passed into the channel attention module and the non-local self-attention module to screen out geometrically robust features, and the screened features are input into the pose regressor to regress the translation and rotation vectors; the specific structure of the network is shown in FIG. 1;
The network structure is divided into 3 modules: 1) a feature encoding module; 2) a mixed attention module; 3) a pose regression module;
After an image is input into the network, downsampling is first performed by the feature coding module to extract features, spatial and channel feature information is then fused by the mixed attention module, and finally the fused features are input into the pose regression module to guide pose regression;
Step 2, network training: the method builds the network structure with the PyTorch framework, uses the L1 function as the loss function, optimizes the training parameters with the Adam algorithm, and adopts an early-stopping strategy during training to prevent overfitting and achieve the best training effect;
step 3, network testing: and inputting the test image into a network to obtain a pose estimation result, calculating loss values of translation and rotation, and evaluating the network performance.
1) A feature encoding module;
the module is used for extracting the linear features of the image from low dimension to high dimension abstract features, most parameters and calculated amount of the pose regression network come from the module, in order to ensure the accuracy and light weight and simultaneously extract the deep features, a network ResNet34 suitable for classification and segmentation is used as the backbone of the network, and the network can extract the deeper features;
ResNet has 2 basic blocks, one is a residual block, and the input and output dimensions are the same, so multiple can be connected in series. Another basic block is a convolution block, the input and output dimensions are different, and therefore they cannot be connected in series, and its role is to change the dimension of the feature vector, because CNN is to gradually convert the input image into a feature map with a small size and a deep depth, generally using a uniform and relatively small convolution kernel, but as the depth of the network increases, the number of output channels increases, and the networking becomes more and more complex, so it is necessary to convert the dimension with the convolution block before entering the residual block, so that the network can continuously connect the residual block, and the detailed structure is as shown in table 1 below.
The characteristic coding module is implemented according to the following steps:
step 1.1.1, inputting an image and resizing the picture to 256 × 256, i.e. the size of the picture input into the network is 256 × 256 × 3;
step 1.1.2, performing one ordinary convolution on the input image, downsampling the spatial dimensions h × w once, adjusting the number of channels to 64, and applying BN and ReLU activation;
step 1.1.3, passing the feature map obtained in step 1.1.2 into the residual convolution blocks for 16 residual convolutions, obtaining a 512-channel feature map;
step 1.1.4, applying one average pooling operation and a fully connected operation to the feature map obtained in step 1.1.3, finally outputting a feature map of size 1 × 1 × 2048;
The residual block is constructed as follows:
channel expansion is first performed through a 3 × 3 convolution, followed by BN and ReLU activation; feature extraction is then performed through a 3 × 3 convolution, again followed by BN and ReLU activation; the detailed structure is shown in FIG. 2;
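By way of illustration, the following is a minimal PyTorch sketch of the feature encoder and residual block described above; it assumes a torchvision ResNet34 trunk followed by global average pooling and a fully connected projection to 2048 channels, and the class and variable names are illustrative rather than part of the filed implementation.

```python
import torch.nn as nn
import torchvision.models as models

class BasicResidualBlock(nn.Module):
    """Residual block of FIG. 2: two 3x3 conv + BN + ReLU layers with an
    identity shortcut (the ResNet34 trunk below already contains such blocks)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                  # identity shortcut

class FeatureEncoder(nn.Module):
    """ResNet34 trunk (7x7 conv + 16 residual blocks) followed by global
    average pooling and a fully connected projection to 2048 channels."""
    def __init__(self, out_dim=2048):
        super().__init__()
        backbone = models.resnet34(weights=None)
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, out_dim)          # 512 -> 2048 channels

    def forward(self, x):                          # x: (B, 3, 256, 256)
        f = self.trunk(x)                          # (B, 512, 8, 8)
        f = self.pool(f).flatten(1)                # (B, 512)
        return self.fc(f)                          # (B, 2048)
```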
Table 1 Feature encoder architecture
2) A mixed attention module; this module comprises a channel attention part and a non-local self-attention part and is used to extract feature dependencies at the channel level and the position level, so as to screen out features beneficial to camera pose regression;
the mixed attention module is implemented by the following steps:
step 1.2.1, passing the features extracted by the feature coding module into the mixed attention module, and constructing channel attention and non-local self-attention in parallel;
step 1.2.2, passing the feature map output by the feature extraction module into the channel attention module, and aggregating the global features of the image at the feature-channel level;
step 1.2.3, passing the feature map output in step 1.2.2 into the non-local self-attention module, capturing long-range dependencies between picture features, and finally outputting a feature map with 2048 channels;
The specific operation of step 1.2.2 is as follows:
firstly, the features passed into the channel attention module are compressed along the channel dimension with a reduction factor of 16, giving a feature map with 128 channels; this 128-channel feature map is then fed into a fully connected layer, finally yielding a feature map with 2048 channels.
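As an illustration only, a squeeze-and-excitation-style sketch of this channel attention step is given below, using the reduction factor of 16 mentioned above; the exact wiring of the filed module may differ.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze to a channel descriptor, reduce
    2048 -> 128 channels, expand back to 2048, and re-weight the input."""
    def __init__(self, channels=2048, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # squeeze: (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),      # 2048 -> 128
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),      # 128 -> 2048
            nn.Sigmoid(),
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c))                 # per-channel weights
        return x * w.view(b, c, 1, 1)                        # re-weighted features
```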
The specific operation of step 1.2.3 is as follows:
firstly, the feature map output in step 1.2.2 is passed into the non-local self-attention module; an eight-fold reduction is first applied to the number of channels to reduce the computation, the similarity between features is then computed by matrix multiplication and followed by a softmax operation, and finally the output channels are restored by a 1 × 1 convolution so that the input and output scales are exactly the same. The essence of this step is that each output position is a weighted average over all other positions, and the softmax operation further highlights what they have in common;
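A hedged sketch of such a non-local block is given below: channels are reduced eight-fold by 1 × 1 convolutions, pairwise similarities are computed by matrix multiplication and normalized with softmax, and a final 1 × 1 convolution restores the channel count; the projection names (theta, phi, g) follow the common non-local-network convention and are not taken from the filing.

```python
import torch
import torch.nn as nn

class NonLocalAttention(nn.Module):
    def __init__(self, channels=2048, reduction=8):
        super().__init__()
        inner = channels // reduction                        # 2048 -> 256
        self.theta = nn.Conv2d(channels, inner, 1)           # query projection
        self.phi = nn.Conv2d(channels, inner, 1)             # key projection
        self.g = nn.Conv2d(channels, inner, 1)               # value projection
        self.out = nn.Conv2d(inner, channels, 1)             # restore channel count

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).view(b, -1, h * w)                 # (B, C', N)
        k = self.phi(x).view(b, -1, h * w)
        v = self.g(x).view(b, -1, h * w)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, N, N) similarities
        y = (v @ attn.transpose(1, 2)).view(b, -1, h, w)     # weighted average over positions
        return x + self.out(y)                               # output scale equals input scale
```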
3) A pose regression module; this module passes the feature map output by the mixed attention module into a multilayer perceptron for fully connected operations and outputs two three-dimensional vectors representing position and orientation respectively;
The pose estimation module is constructed according to the following steps (a minimal code sketch is given after step 1.3.4):
step 1.3.1, taking the 2048-dimensional feature map of fused features output by the attention module and constructing a multilayer perceptron module;
step 1.3.2, inputting the feature map into a fully connected layer to obtain a feature map of size 1 × 2048;
step 1.3.3, inputting the obtained feature map into two separate fully connected layers to obtain two three-dimensional feature vectors representing translation and rotation;
step 1.3.4, concatenating the two obtained three-dimensional vectors to finally obtain a six-dimensional pose vector;
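A minimal sketch of such a pose regressor, assuming the 2048-dimensional fused feature described above, is shown below; the hidden layer and head names are illustrative.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    """MLP head: one fully connected layer on the 2048-D feature, then two
    3-D heads (translation, rotation) concatenated into a 6-D pose vector."""
    def __init__(self, in_dim=2048):
        super().__init__()
        self.fc = nn.Linear(in_dim, in_dim)
        self.fc_t = nn.Linear(in_dim, 3)           # translation head
        self.fc_r = nn.Linear(in_dim, 3)           # rotation head

    def forward(self, f):                          # f: (B, 2048)
        h = torch.relu(self.fc(f))                 # 1 x 2048 feature
        t = self.fc_t(h)                           # (B, 3) translation
        r = self.fc_r(h)                           # (B, 3) rotation
        return torch.cat([t, r], dim=1)            # (B, 6) pose vector
```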
The data set for training the network in step 2 is divided into an indoor data set and an outdoor data set; the indoor data set is the 7Scenes data set and the outdoor data set is the Oxford RobotCar data set, whose details are given in Table 2 below. Training is specifically implemented according to the following steps (a training sketch follows step 2.5):
step 2.1, loading a data set and initializing weight parameters;
step 2.2, splitting the data set, using 70% of the images for training and 30% for evaluation;
step 2.3, outputting a training loss value every 5 epochs using an L1 loss function;
step 2.4, setting the initial learning rate to 5e-5 and training with an automatically decaying learning rate;
step 2.5, stopping training and saving the model when the loss value no longer decreases after training reaches 600 epochs;
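The training procedure of steps 2.1 to 2.5 can be summarized by the following hedged PyTorch sketch (L1 loss, Adam, initial learning rate 5e-5, automatic learning-rate decay, loss reported every 5 epochs, model saved when the loss improves); `model` and `train_loader` are assumed to exist, and the scheduler settings are illustrative rather than taken from the filing.

```python
import torch

def train(model, train_loader, device="cuda", max_epochs=600):
    model.to(device)
    criterion = torch.nn.L1Loss()                                    # L1 loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)        # initial lr 5e-5
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5)
    best = float("inf")
    for epoch in range(1, max_epochs + 1):
        total = 0.0
        for images, poses in train_loader:                           # poses: (B, 6) ground truth
            images, poses = images.to(device), poses.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), poses)                   # L1 loss on the 6-D pose
            loss.backward()
            optimizer.step()
            total += loss.item()
        scheduler.step(total)                                        # automatic lr decay
        if epoch % 5 == 0:
            print(f"epoch {epoch}: training loss {total:.4f}")       # report every 5 epochs
        if total < best:                                             # keep the best model
            best = total
            torch.save(model.state_dict(), "best_model.pth")
    return model
```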
Table 2 Training and testing sequences on the Oxford RobotCar dataset
The specific operation of step 2.2 is as follows:
firstly, the training set is fed into the network in preset batches; the pictures in the data set are then resized to 256 pixels and the images are normalized so that pixel intensities lie in the range (-1, 1); on the Oxford RobotCar data set, brightness, contrast and saturation jitter are set to 0.7 and hue jitter to 0.5. This augmentation step helps improve the generalization ability of the model under various weather and climate conditions;
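A hedged torchvision sketch of this preprocessing is given below (resize to 256 pixels, normalization to (-1, 1), and colour jitter applied only for the outdoor Oxford RobotCar data); the normalization constants are an assumption, chosen simply to map [0, 1] pixels into (-1, 1).

```python
from torchvision import transforms

def build_transform(outdoor=False):
    # Colour jitter only for Oxford RobotCar: brightness/contrast/saturation 0.7, hue 0.5
    augment = [transforms.ColorJitter(brightness=0.7, contrast=0.7,
                                      saturation=0.7, hue=0.5)] if outdoor else []
    return transforms.Compose(augment + [
        transforms.Resize(256),                         # resize pictures to 256 pixels
        transforms.ToTensor(),                          # pixels in [0, 1]
        transforms.Normalize(mean=[0.5, 0.5, 0.5],      # map intensities to (-1, 1)
                             std=[0.5, 0.5, 0.5]),
    ])
```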
Step 3 tests the predictions of the network trained in step 2 to evaluate the network performance, and is implemented according to the following steps (an error-computation sketch follows step 3.4):
step 3.1, loading a test picture in the data set, and setting a regression dimension of the pose of the camera;
step 3.2, loading the trained model parameters and reading a test data set;
step 3.3, feeding each frame of the data set into the camera regression model and performing regression prediction;
step 3.4, calculating the translation and rotation errors of the regressed pose;
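For illustration, the translation and rotation errors of step 3.4 can be computed as in the sketch below, which assumes the 3-D rotation part of the 6-D pose is an axis-angle (rotation) vector; the filing does not spell out the rotation parameterization, so this is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_errors(pred, gt):
    """pred, gt: length-6 arrays [tx, ty, tz, rx, ry, rz] (rotation part assumed axis-angle)."""
    t_err = np.linalg.norm(pred[:3] - gt[:3])                 # translation error (metres)
    r_pred = Rotation.from_rotvec(pred[3:])
    r_gt = Rotation.from_rotvec(gt[3:])
    r_err = np.degrees((r_pred.inv() * r_gt).magnitude())     # rotation error (degrees)
    return t_err, r_err
```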
the following illustrates the effect of the invention on the test set:
table 3 summarizes the performance of all methods on the 7Scenes dataset, and it is clear that the method of the present invention is superior to other monocular image-based methods, with a 53% improvement in position accuracy and a 19% improvement in rotation accuracy over the baseline PoseNet based on monocular images, especially achieving the best performance in Scenes with highly repetitive textures (such as Chess and Stairs); this is a significant improvement over the prior art, and still achieves greater accuracy than the baseline in other conventional scenarios.
Table 3 Network performance comparison on the 7Scenes dataset
Table 4 shows a quantitative comparison of PoseNet, MapNet, LsG and our method. Because the training and test sequences are captured at different times and under different conditions, PoseNet has difficulty coping with these changes and outputs a large number of inaccurate, outlying estimates. MapNet produces more accurate results and removes many outliers by introducing the relative poses between successive frames as an additional constraint; however, larger areas may contain more locally similar appearances, which reduces the ability to localize. By employing content augmentation, LsG alleviates this problem to some extent, although its accuracy still degrades. In contrast, the model of the present invention addresses these challenges more effectively by taking content and motion into account, with an 83% improvement in position accuracy and a 76% improvement in rotation accuracy compared to PoseNet+.
Table 4 Network performance comparison on the Oxford RobotCar dataset
Camera positioning is a challenging task in computer vision due to the high variability of scene dynamics and environmental appearance. The invention relates to a camera positioning method based on mixed attention: ResNet34 is used as the backbone network in the feature coding module to extract deeper features, and a mixed attention module is introduced in the pose regression process to weight the channel features and the picture context features, screen out geometrically stable features, and reduce the influence of dynamic objects and illumination changes. Finally, the weighted features are input into the pose regressor to guide pose regression. Experimental analysis shows that the positioning accuracy of the model is significantly improved on both outdoor and indoor data sets.

Claims (9)

1. A camera positioning method based on mixed attention is characterized by comprising the following steps:
step 1, constructing a convolutional neural network based on non-local self-attention camera positioning;
step 2, training the neural network established in the step 1;
and 3, testing the network trained in the step 2.
2. The method according to claim 1, wherein the convolutional neural network of step 1 comprises three parts, namely a feature coding module, a mixed attention module and a pose regression module, and is implemented by the following steps:
step 1.1, inputting an image into a network, and performing downsampling by a feature coding module to extract features;
step 1.2, capturing channel-level and spatial-level dependencies on the feature map through the channel attention and non-local self-attention modules, and outputting an attention weight map that encodes these dependencies;
and step 1.3, inputting the calculated attention weight into a pose regressor for regressing the pose of the camera.
3. The mixed attention-based camera positioning method according to claim 2, wherein the step 1.1 is specifically implemented according to the following steps:
step 1.1.1, inputting an RGB image and resizing the picture to 256 × 256, i.e. the size of the picture input into the network is 256 × 256 × 3;
step 1.1.2, performing one ordinary 7 × 7 convolution on the input image, adjusting the image size to 128 × 128 and the number of channels to 64, followed by batch normalization and ReLU activation;
step 1.1.3, passing the feature map obtained in step 1.1.2 into the residual convolution blocks for 16 residual convolutions with 3 × 3 kernels; the output picture size is 8 × 8 and the number of channels is 512;
and step 1.1.4, applying average pooling and a fully connected operation to the feature map obtained in step 1.1.3, finally outputting a 2048-channel feature map.
4. The mixed attention-based camera positioning method according to claim 3, wherein the residual convolution block in step 1.1.3 is constructed as follows: channel expansion is first performed through a 3 × 3 convolution, followed by BN and ReLU activation; feature extraction is then performed through a 3 × 3 convolution, again followed by BN and ReLU activation.
5. The mixed attention-based camera positioning method according to claim 2, wherein the step 1.2 is specifically implemented according to the following steps:
step 1.2.1, passing the feature map obtained by the feature extraction module into the mixed attention module, and constructing channel attention and non-local self-attention in parallel;
step 1.2.2, passing the feature map output by the feature extraction module into the channel attention module, and aggregating the global features of the image at the feature-channel level;
and step 1.2.3, passing the feature map output in step 1.2.2 into the non-local self-attention module, capturing long-range dependencies between picture features, and finally outputting a feature map with 2048 channels.
6. The mixed attention-based camera positioning method according to claim 2 or 5, wherein the step 1.3 is implemented according to the following steps:
step 1.3.1, inputting the 2048-dimensional feature map obtained in step 1.2.3 into the pose regressor and constructing a multilayer perceptron module;
step 1.3.2, inputting the feature map into a fully connected layer to obtain a feature map of size 1 × 2048;
step 1.3.3, inputting the obtained feature map into two separate fully connected layers to obtain two three-dimensional feature vectors representing translation and rotation respectively;
and step 1.3.4, concatenating the two obtained three-dimensional vectors to finally obtain a six-dimensional pose vector.
7. The method according to claim 1, wherein the data set for the network training in step 2 is divided into an indoor data set and an outdoor data set, the indoor data set being 7Scenes and the outdoor data set being Oxford RobotCar, and the training is implemented according to the following steps:
step 2.1, loading a data set and initializing weight parameters;
step 2.2, splitting the data set, using 70% of the images for training and 30% for evaluation;
step 2.3, outputting a training loss value every 5 epochs using an L1 loss function;
step 2.4, setting the initial learning rate to 5e-5 and training with an automatically decaying learning rate;
and step 2.5, stopping training and saving the model when the loss value no longer decreases after training reaches 600 epochs.
8. The mixed attention-based camera positioning method according to claim 7, wherein the step 2.2 is specifically implemented as follows:
firstly, the training set is fed into the network in preset batches; the pictures in the data set are then resized to 256 pixels and the images are normalized so that pixel intensities lie in the range (-1, 1); on the Oxford RobotCar data set, brightness, contrast and saturation jitter are set to 0.7 and hue jitter to 0.5.
9. The mixed attention-based camera positioning method according to claim 1, wherein the step 3 is specifically implemented according to the following steps:
step 3.1, loading a test picture in the data set, and setting a regression dimension of the pose of the camera;
step 3.2, loading the trained model parameters and reading a test data set;
step 3.3, feeding each frame of the data set into the camera regression model and performing regression prediction;
and step 3.4, calculating the translation and rotation errors of the regressed pose.
CN202210466169.0A 2022-04-29 2022-04-29 Camera positioning method based on mixed attention Pending CN114820792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210466169.0A CN114820792A (en) 2022-04-29 2022-04-29 Camera positioning method based on mixed attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210466169.0A CN114820792A (en) 2022-04-29 2022-04-29 Camera positioning method based on mixed attention

Publications (1)

Publication Number Publication Date
CN114820792A true CN114820792A (en) 2022-07-29

Family

ID=82508973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210466169.0A Pending CN114820792A (en) 2022-04-29 2022-04-29 Camera positioning method based on mixed attention

Country Status (1)

Country Link
CN (1) CN114820792A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750148A (en) * 2021-01-13 2021-05-04 浙江工业大学 Multi-scale target perception tracking method based on twin network
CN113223181A (en) * 2021-06-02 2021-08-06 广东工业大学 Weak texture object pose estimation method
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN114170304A (en) * 2021-11-04 2022-03-11 西安理工大学 Camera positioning method based on multi-head self-attention and replacement attention

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN112750148A (en) * 2021-01-13 2021-05-04 浙江工业大学 Multi-scale target perception tracking method based on twin network
CN113223181A (en) * 2021-06-02 2021-08-06 广东工业大学 Weak texture object pose estimation method
CN114170304A (en) * 2021-11-04 2022-03-11 西安理工大学 Camera positioning method based on multi-head self-attention and replacement attention

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
机器之心: "Has Transformer become the new ruler? FAIR et al. redesign a purely convolutional ConvNet whose performance surpasses it", 《HTTPS://MP.WEIXIN.QQ.COM/S/VUIHXKMNEAVKXHFPCH7BBW》 *
荣昕萌 (Rong Xinmeng): "Research on image denoising methods based on non-local attention and channel attention", China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》) *

Similar Documents

Publication Publication Date Title
CN112233038B (en) True image denoising method based on multi-scale fusion and edge enhancement
US20200250436A1 (en) Video object segmentation by reference-guided mask propagation
CN111915530B (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN108648216B (en) Visual odometer implementation method and system based on optical flow and deep learning
CN111626960A (en) Image defogging method, terminal and computer storage medium
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN114170304B (en) Camera positioning method based on multi-head self-attention and replacement attention
CN112651423A (en) Intelligent vision system
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113762358A (en) Semi-supervised learning three-dimensional reconstruction method based on relative deep training
CN118097150B (en) Small sample camouflage target segmentation method
CN114463218A (en) Event data driven video deblurring method
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN115035172B (en) Depth estimation method and system based on confidence grading and inter-stage fusion enhancement
CN110889868A (en) Monocular image depth estimation method combining gradient and texture features
CN117876452A (en) Self-supervision depth estimation method and system based on moving object pose estimation
CN114359626A (en) Visible light-thermal infrared obvious target detection method based on condition generation countermeasure network
CN114155165A (en) Image defogging method based on semi-supervision
CN113850158A (en) Video feature extraction method
CN115035173B (en) Monocular depth estimation method and system based on inter-frame correlation
CN114820792A (en) Camera positioning method based on mixed attention
CN118279206A (en) Image processing method and device
Jiang et al. Attention-based self-supervised learning monocular depth estimation with edge refinement
CN114821438A (en) Video human behavior identification method and system based on multipath excitation
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20220729