CN113222033A

CN113222033A - Monocular image estimation method based on multi-classification regression model and self-attention mechanism

Info

Publication number: CN113222033A
Application number: CN202110547074.7A
Authority: CN
Inventors: 李阳; 赵明乐
Original assignee: Beijing Digital Research Technology Development Co ltd
Current assignee: Beijing Digital Research Technology Development Co ltd
Priority date: 2021-05-19
Filing date: 2021-05-19
Publication date: 2021-08-06

Abstract

The invention discloses a monocular image estimation method based on a multi-classification regression model and an attention mechanism, which comprises the steps of firstly, replacing convolution in a convolution unit block by an input image through an image encoder, and replacing the convolution by using hole convolution; after the encoding of an image encoder, acquiring pixel-level context information according to a self-attention model, firstly enabling an input feature map to pass through a single-layer neural network and a ReLU function, and then performing global average pooling on the input feature map to acquire global context information of the image; entering scene depth soft inference, classifying input image pixels into depth classes, and performing ordered regression on depth values; by using the data provided by the probability map to obtain a depth value that is accurately smooth, an inferred value of depth for the pixel at the location is obtained. The invention reduces the grid effect caused by repeatedly using the same cavity convolution kernel by utilizing the sequential classification logistic regression model, the self-attention mechanism and the deep neural network to carry out the depth design of the monocular image scene.

Description

Monocular image estimation method based on multi-classification regression model and self-attention mechanism

Technical Field

The invention relates to the technical field of visual positioning, in particular to a monocular image estimation method based on a multi-classification regression model and an attention mechanism.

Background

With the high-speed development of scientific technology, the spatial resolution of images which can be acquired is higher and better, however, the application of images acquired by a common optical camera in some fields is still very limited, for example, in a smart phone with a face recognition function published in 2019, matching and recognition are performed only by a single front-mounted optical camera, so that a leak that the mobile phone can be successfully unlocked by using a prepared owner's picture is caused. This is because when the monocular image reduces the three-dimensional information to the two-dimensional image information, the depth information of the scene is lacking, and the camera cannot distinguish whether the scene is a three-dimensional real person or a two-dimensional portrait.

Depth information plays an important role in many application scenarios, such as Virtual Reality (VR) and Augmented Reality (AR), which have caused a huge heat tide. One key link of VR and AR is the reconstruction of three-dimensional scenes, which necessarily requires the participation of depth information. The accurate depth information enables the VR to "falsely" so that the AR-generated object can be perfectly integrated into the real world. In addition, to achieve a fully immersive experience, it is necessary to dispense with an operating handle or the like, to allow a person to directly interact with an object of the VR/AR, and to accurately recognize and track gestures and movements of the person, and to assist with depth information. Depth information also plays an important role in the field of various intelligent unmanned vehicles including automatic vehicles, And the current unmanned vehicles are generally equipped with a plurality of laser radars And cameras to realize functions such as obstacle detection, synchronous positioning And map construction (SLAM) (2017, review of unmanned vehicle environment perception technology, university of Catharan science, Wangsheng, Daxiang, Xuning, Zhang Peng Fei, 40(1), 1-6). In addition to the above-mentioned scenes, depth information is an important basic subject in photogrammetry and computer vision, and has a great application value in many fields such as intelligent medical treatment, security monitoring, visual navigation, and intelligent robots.

In recent years, deep learning has been widely used in many fields including computer vision, natural language processing, and artificial intelligence, and it was the first to break through in the image processing field in recent years (2012. classification of image networks based on deep convolutional neural networks. development of neural information processing systems, crimany, Sutskever, i., sinton, ke. E. (pp. 1097 + 1105)). The convolutional neural network in deep learning is an important means for extracting abstract features in an image, and the problem of depth estimation for researching a scene by using a deep learning method has gradually become a mainstream approach for solving the problem.

The monocular image depth estimation method based on the ordered multi-classification logistic regression model and the self-attention mechanism fully utilizes the multi-classification logistic regression model and the self-attention mechanism to carry out scene depth estimation on the image obtained by the common monocular camera, has numerous application scenes, and has strong theoretical significance and practical application value.

The traditional monocular depth estimation method is solved by the technologies of photogrammetry and the like. However, the monocular image inherently lacks reliable depth clues such as motion and stereoscopic vision relationship, and the recovery of the original depth in the three-dimensional space is inherently an ill-posed problem, and the true depth of a point on the monocular image can theoretically have infinite solutions. With the continuous development of the deep learning technology, monocular image depth estimation by using a deep learning neural network gradually becomes a mainstream method.

Based on a Convolutional Neural Network (CNN), the depth estimation of a monocular image by using the Convolutional Neural Network is firstly proposed, and a Convolutional Neural Network structure comprising two scales is designed. The two tandem CNNs divide the whole depth estimation process into two steps, namely, the rough estimation of the scene global depth on the whole image and the fine estimation of the roughly estimated depth map through the image local feature optimization. Finally, quite accurate depth estimation results are obtained, and the work initiates the way of deep learning in the field of monocular image depth estimation.

After that, many researchers design different neural network structures or perform optimization and improvement on the depth estimation of the monocular image by using new constraint conditions and a loss function on the basis of work of the Eigen team. For example, after one year, the Eigen team itself proposes a new network architecture, unifies three tasks of depth estimation, surface normal prediction and semantic annotation into a three-level neural network, and raises the resolution of the result to half of the input image. The up-sampling is realized by adopting a deeper residual error network and designing a small convolution instead of a large convolution, so that the up-sampling is more efficient, a novel loss function is provided, and a better result can be obtained. Long Short-Term Memory (LSTM) is used for acquiring image global information by a cyclic network, and is mixed with a general convolutional neural network to realize end-to-end monocular image depth estimation. (2016, overview of virtual reality system, software guide, Yangxi, Liu Xiaoling, 15(4), 35-38) the CNN and the CRF are unified in one frame, two convolution neural networks respectively correspond to an item containing depth information in the superpixel and an item related to the relation of adjacent superpixels in the energy function, and the maximum posterior probability of the two convolution neural networks is calculated. CNN combines with random forest, sets relative convolution neural network at each layer of nodes of binary tree to convolute the upper layer output, and judges the next trend is to transmit the output to left sub-node or right sub-node according to the output result of sub-network, thus greatly reducing the layer number of each CNN. The European computer vision conference is chartered, J., Cao, Y., Song., Y., Lao, R. (pp. 53-69)), and a transverse sharing unit for transferring information between networks is designed, so that two independent sub-convolution neural networks of a depth estimation network and a semantic segmentation network contain output results of each other, and the whole network training is constrained by the same loss function. (2018, deep prediction classification network based on deep learning, lie, R, skillful, K, Shen, C, Cao, Z, celluloid, H, Nao, L, arXiv preprinting arXiv: 1807.03959.) originally continuous image depth is discretized into classes in a certain depth range, the regression problem of depth estimation is converted into a classification problem, classification is realized by using a fully-convoluted depth residual error network, and finally a final depth estimation value is obtained by using a conditional random field optimization result.

When a depth learning method is used for carrying out scene depth estimation on a monocular image, basically, high-dimensional features contained in the image need to be extracted through a depth convolution neural network, and as the depth of the network is deepened and convolution layers are increased, the resolution of the image is sharply reduced after multiple convolutions. If multiple deconvolution were to be used in the network structure, the amount of parameters in the last layers of the network would increase dramatically, greatly increasing the time cost of training and computation (2019, context-aggregated network based on monocular depth estimation of attention, old, y., zhao, h., hu, Z, arXiv preprinted arXiv: 1901.10137.). Therefore, the resolution of the depth map finally obtained by many methods at present is only 1/4 to 1/2 levels of the input image.

Secondly, the supervised learning method needs to input a large number of pictures with real depth value labels to the neural network, and the real values are used as training constraints to carry out backward propagation on the neural network, so as to optimize parameters. However, accurate depth information is not readily available, so researchers rely heavily on public data sets. However, the high-quality public data with depth labels is still limited, and in reality, it is not easy to obtain the depth value corresponding to the scene as compared with the picture.

((2016, October). deep three-dimensional: fully automatic 2D to 3D video conversion using a deep convolutional neural network, at the European computer vision conference (pp. 842-) 857. Chammesring grid, xi, J., Girceck, Fall Hadi.) we propose a method for generating a new perspective map with a certain parallax using a deep convolutional neural network to realize 2D to 3D, on the basis of which many researchers start to train the neural network using left and right views, such as ((2016, October), unsupervised cnn: geometric rescue for single viewpoint depth estimation, at the European computer vision conference (pp. 740- & 756), Chalmsring grid, Gage, BG, V.K., Kanlo, Reed.) propose an unsupervised framework that generates a depth map by fully convolution of the neural network at the encoding stage, in the decoding stage, a right image is reconstructed by using a traditional binocular camera ranging principle, the input right view image is compared, a reconstruction error is used as a target function, and a network is trained reversely. In the process, a real depth map of a scene is not needed to be used as supervision, and only a left view and a right view which have known relations are needed. (2017, unsupervised monocular depth estimation based on left-right consistency, IEEE computer vision and pattern recognition conference discourse set (pp. 270-279), Goldel-Mkovada, Brostol) use a similar method, but utilize the left view to produce disparity maps for the left and right two views simultaneously, and improve the quality of the final output result by introducing the loss of consistency of the left and right views. (2018, extracting cross-domain stereo network learning monocular depth, European computer vision conference discourse set (ECCV) (pp. 484-. (2017, a semi-supervised depth learning method for monocular depth map prediction, an IEEE computer vision and pattern recognition conference set (pp.6647-6655), Kutzenkov, studler, Reebei) tries to use the sparse depth obtained by the sensor as a reference standard, and the depth estimation of the monocular image is realized together in a semi-supervised mode.

One advantage of the deep learning method over the ordinary parameterized machine learning method is that it does not need to explicitly give the relation between the target and the input, although the parameters to be learned are fixed. Deep learning methods therefore typically exhibit end-to-end (end-to-end) characteristics. In a common depth learning model for monocular image scene depth estimation, whether the depth estimation is in pixel units or in a super-pixel block unit after segmentation, the depth always corresponds to a continuous depth interval, so that a huge parameter space is generated, the convergence rate of network training is reduced, a larger training data volume is often needed, and especially the time cost and the data cost in a training stage are high.

Secondly, since the scene depth estimation problem of monocular images was solved using the depth learning method, the relative error of the estimated depth values has been reduced from the first 0.215 in 2014 to the lowest 0.1 or less at present (2018, depth of in-depth study: loss of monocular depth estimation and semantic boosters and attention-driven, european computer vision conference corpus (ECCV) (pp. 53-69), jiao, j., cao, y., song, y., liu, R.), which has been a significant progress with respect to the previous error much greater than 0.2, and a lifting margin of up to 50% lets researchers see the hope of further optimizing the accuracy of the algorithm lifting results. However, the value still has a certain difference from the relative error of the current binocular image depth estimation algorithm below 0.05.

Aiming at some problems and defects existing in the existing method, a neural network framework based on deep learning is provided, and the scene depth is deduced from a single monocular image in an end-to-end mode.

Aiming at the long tail characteristic of depth value distribution in a monocular image, a non-equidistant discretization mode is adopted to discretize the depth, a continuous regression problem is converted into a multi-classification problem based on an ordered multi-classification logistic regression principle, and a soft inference calculation mode of the depth value is provided according to a probability graph output by a network, so that the obtained depth graph is smoother, and the result is more accurate.

The method is characterized in that a depth order regression network of a self-attention model is combined, aiming at the problems that the image resolution is reduced, detail characteristic information and image global information cannot be well stored and the like caused by a convolutional neural network, a concept of cavity convolution is introduced, and a cavity convolution unit block formed by different convolution kernels is designed according to the bottleneck design widely used at present. Based on the idea of feature fusion between large and small scales and long and short distances in deep learning, an attention mechanism is introduced, the characteristic that a self-attention model can be used for efficiently correlating long-distance features is utilized, pixel-level feature expression is achieved, the feature expression is fused with the global features of the image, context information of the image is provided outside a deep convolutional neural network, and errors of depth estimation are reduced.

Disclosure of Invention

Aiming at the technical problems in the related art, the invention provides a monocular image estimation method based on a multi-classification regression model and an attention mechanism, which can overcome the defects of the prior art.

In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:

the monocular image estimation method based on the multi-classification regression model and the self-attention mechanism comprises the following steps of:

s1, firstly, an input image is convoluted by replacing 3x3 in the last two convolution unit blocks through an image encoder, the 3x3 of the third convolution unit block is convoluted and replaced by 3x3 hole convolution with the sparse rate of 1,2 and 3 in sequence, three blocks are determined as one group and 8 groups are determined in total, and the 3x3 convolution is replaced by 3x3 hole convolution with the sparse rate of 1,2 and 5 in sequence through the fourth convolution unit block;

s2, after encoding by an image encoder, obtaining pixel-level context information according to a self-attention model to obtain a query element Q and a key element K, firstly passing an input feature map through a single-layer neural network and a ReLU function, and then obtaining global context information of the image by performing global average pooling on the input feature map;

s3, after the input image passes through an encoder and context information is acquired, the soft inference of scene depth is carried out, the pixels of the input image are classified into depth classes by using polynomial logic classification, and then the depth value is subjected to ordered regression by using a common softmax function as a loss function;

and S4, obtaining a probability map of each depth class on the input image, and obtaining an accurate and smooth depth value by integrating the depth values of a plurality of depth classes by using probability data provided by the probability map when the depth is inferred, and then inferring the depth on the position by the depth of two adjacent classes with the maximum probability to obtain the depth inferred value of the pixel on the position.

Further, in step S2, the context information includes context information at pixel level and context information at image level.

Further, in step S2, the global average pooling is to convert the input high-dimensional feature map into one-dimensional feature vectors, copy a vector for each input feature, associate the two feature vectors with one convolutional layer, and output the globally averaged feature map, where the output feature map needs to pass through one convolutional layer and two anti-convolutional layers.

Further, the output feature map is a feature map in which the detail features and the global features of the input image are aggregated.

Furthermore, the kernel sizes of the convolutional layers passed by the output feature map are all 1 × 1, and the step size is 1.

Further, in step S2, the pixel-level context information is obtained according to an attention-only mechanism, the key to the calculation of the attention-only mechanism is to find the key-value pairs in the image and obtain the weights of the key-value pairs, and the final output is a weighted sum in the high-dimensional feature space and input to the final classifier.

The invention has the beneficial effects that: in the monocular image scene depth estimation problem, the depth discretization mode that the depth interval is increased along with the increase of the depth value is designed through the long tail distribution characteristics of the scene depth on the image, the original regression problem is converted into a multi-classification problem, and the overfitting phenomenon easily caused by the use of a large amount of accurate depth map training is reduced. According to the strict ordering of the scene depth, the ordered regression is used for replacing the general multi-classification problem, the final depth value is calculated by using a soft inference mode considering the probability distribution of multi-depth intervals, the error caused by the depth discretization is reduced, and a smoother scene depth estimation result is obtained; the hole convolution replaces the common dense convolution, the receptive field of the network is increased, the image characteristics with larger scale are kept, and the grid effect caused by repeatedly using the same hole convolution kernel is reduced. On the premise of not increasing extra parameters remarkably, the performance of the network on image feature extraction is improved, and therefore the final scene depth estimation is optimized; based on an attention mechanism, an image context information synthesis module fusing pixel level features and image level features is designed in a depth order regression network. The self-attention model can effectively keep the relevance between the related detailed features at a longer distance, and the global information of the image ensures that the unique overall information of the image is stored after passing through the deep convolutional neural network. And aiming at the ordered regression problem of the depth, a loss function for guiding self-attention calculation is provided, so that the whole network can fully utilize the characteristic information of the image on different levels, and the accuracy of the final scene depth estimation result is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a schematic diagram of a depth-ordered regression network framework based on a multi-classification regression model and a monocular image estimation method based on an auto-attention mechanism in combination with an auto-attention model according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a network structure of an encoder portion of the monocular image estimation method based on the multi-classification regression model and the auto-attention mechanism according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of an attention calculation module of the monocular image estimation method based on the multi-classification regression model and the attention mechanism according to the embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention belong to the protection scope of the present invention, and for the convenience of understanding the above technical solutions of the present invention, the above technical solutions of the present invention are described in detail below by specific use modes.

As shown in fig. 1, the present invention proposes a general framework algorithm flow similar to the estimation of scene depth by monocular image, wherein the input of the whole algorithm is a monocular RGB image, and a depth map corresponding to the input image is output in an end-to-end (end-to-end) form, and the map passes through three parts from input to output, and the first part is an image editor composed of a depth convolution network for converting the image into a high-dimensional feature space; the second part is a module comprising the context information of the self-attention representation image, and the depth estimation precision is improved by integrating the pixel-level self-attention and the overall image information obtained by global average pooling; and the third part is a depth inference module, a group of probabilities of each pixel on the corresponding depth class is obtained after the first two parts are processed, and final depth map output can be obtained through the depth inference module.

The technical scheme uses a ResNet-101 image encoder, when in use, 3 multiplied by 3 convolution in the last two convolution unit blocks is replaced, 3 multiplied by 3 convolution of the third convolution unit block is replaced by 3 multiplied by 3 hole convolution with the sparse rates of 1,2 and 3 in sequence, three blocks are in one group, and 8 groups are in total; the fourth convolution unit block replaces the 3 × 3 convolution with a 3 × 3 hole convolution with the sparsity rates of 1,2, and 5 in this order.

The network structure of the image encoder is shown in fig. 2, in which the network layers with parameters have 103 layers in total, and the size of the feature map output finally is 1/8 of the input image.

Then an image is input to a module for acquiring image context information, the module is complementary with a convolutional neural network, the context information is divided into two parts of pixel-level context information and image-level context information, the two types of context information are respectively solved through two independent sub-networks, and then the two types of context information are connected in series.

Wherein the context information at the pixel level is obtained from an attention model. As shown in FIG. 3, the key to the calculation of the attention mechanism in the self-attention model is to find the key-value pairs in the image and obtain the weight for each key-value pair. The final output is a weighted sum in the high dimensional feature space and input to the final classifier. Wherein the weight of the key-value pairs is obtained by examining the correlation of the query with the key. The introduction of the self-attention mechanism is to be able to quickly correlate long-range detail features that are far apart.

Then, firstly, a query element Q and a key element K are obtained, and the two elements are respectively realized by the following two mappings:

。

wherein

Is to inquire the channel number of the key element Q and the key element K and order the number of the channels to reduce the calculation amount

. Since the scale of self-attention is the pixel scale, the input feature map is passed through a simple single-layer neural network, with a 1 × 1 convolution, a batch normalization layer and the ReLU function as the activation function.

Wherein the context information of the image level is obtained by performing global average pooling on the input feature maps. The input high-dimensional feature map is converted into a one-dimensional feature vector through global average pooling, namely, each input feature map contributes to each item in the output, and then a vector is copied for each input feature, so that the features of a plurality of channels are mixed in the image-level context semantic feature map, and then the two (the context information of a pixel level and the context information of an image set) are related through a convolution layer.

The feature map is an output feature map in which the detail features and the global features of the input image are integrated. The feature map output by the attention module is then passed through one convolutional layer and two anti-convolutional layers.

The core size of the convolutional layer is 1 × 1, the step is 1, the number of the characteristic channels of the convolutional layer is reduced by half to 1024, and the purpose is to reduce the parameter space and improve the calculation efficiency. The purpose of the deconvolution layer is to increase the resolution of the image, since in the image encoder of the first part of the network the resolution of the image is reduced to 1/8 for the input image and the output depth map can reach the resolution level of the input map, two deconvolution layers are used to increase the image resolution to 1/4 and 1, respectively, of the original.

After an input image passes through an image encoder and a context information module, pixels of the image are classified into depth classes by using a plurality of logical classifications, and a common softmax function is used as a loss function to carry out ordered regression on depth values. After obtaining a probability map for each pixel to each depth class on the image, at the time of final depth inference, depth values of multiple depth classes may be synthesized by using probability information provided by the probability map to obtain more accurate and relatively smoother depth values, such that:

。

wherein𝜆_𝑖Namely, it is𝑖The depth values of the location pixels accumulate the area under the probability curve. The depth at that location is inferred by the depth of the two adjacent classes with the highest probability. Order to

，（

Representing a rounding-down operation), then get𝑖Depth inference value for location pixel:

。

then, in the ordered regression network for scene depth estimation through the monocular image, the probability map for inferring the depth is obtained by associating the pixel-level context information with the image global context information based on the attention mechanism, so that in addition to describing the loss function of the final predicted depth value and the real depth value, a loss function for attention is added. Loss function of the whole networkL _totalThe following were used:

。

wherein the content of the first and second substances,L _ordandL _attthe ordered regression loss and the attention loss are expressed separately,α _ordandα _attis the factor that adjusts for both losses. Loss of order regression of imagesL _ordIs the average of the order loss at each pixel in the image:

。

wherein the content of the first and second substances,𝑊and𝐻respectively representing the width and height of the image,𝑊×𝐻 = 𝑁。Ψ_𝑖is represented in𝑖Ordered regression loss of loxels. The logic loss function of multiple classifications is not used, but the whole process is regarded as the superposition of two classifications one by one according to the orderliness of the classifications, and each classification only judges whether the sample is larger than the second classification𝑘And (4) class. This allows for significant errors in those values that do not fit into the depth sequence, which can be changed during subsequent training:

。

wherein

Is the depth class in which the true depth value of the i-position pixel is located.

Representing the depth class of prediction for sample x and parameter set Θ

Is greater than that of𝑘The probability of a class. Probability map for network output𝑦_𝑖，

. This probability can be calculated by the softmax function:

。

for attention loss, we consider it to be the average of the attention loss over each pixel in the image:

for the attention loss on each pixel, the difference between its prediction and the true value is described using relative entropy, also called K-L divergence. For attention, the actual object under investigation is the weight coefficientW _i,jThe distribution is unknown, so the relative entropy is taken as a loss function of attention:

。

wherein the weight coefficientW _i,jCalculated by the following formula, and the reference valueW ^* _i,jThe following can be calculated in the same way:

。

in summary, according to the technical scheme of the invention, in the monocular image scene depth estimation problem, a depth discretization mode that the depth interval increases with the increase of the depth value is designed according to the long tail distribution characteristic of the scene depth on the image, so that the original regression problem is converted into a multi-classification problem, and the overfitting phenomenon easily caused by the use of a large amount of accurate depth map training is reduced. According to the strict ordering of the scene depth, the ordered regression is used for replacing the general multi-classification problem, the final depth value is calculated by using a soft inference mode considering the probability distribution of multi-depth intervals, the error caused by the depth discretization is reduced, and a smoother scene depth estimation result is obtained; the hole convolution replaces the common dense convolution, the receptive field of the network is increased, the image characteristics with larger scale are kept, and the grid effect caused by repeatedly using the same hole convolution kernel is reduced. On the premise of not increasing extra parameters remarkably, the performance of the network on image feature extraction is improved, and therefore the final scene depth estimation is optimized; based on an attention mechanism, an image context information synthesis module fusing pixel level features and image level features is designed in a depth order regression network. The self-attention model can effectively keep the relevance between the related detailed features at a longer distance, and the global information of the image ensures that the unique overall information of the image is stored after passing through the deep convolutional neural network. And aiming at the ordered regression problem of the depth, a loss function for guiding self-attention calculation is provided, so that the whole network can fully utilize the characteristic information of the image on different levels, and the accuracy of the final scene depth estimation result is improved.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. The monocular image estimation method based on the multi-classification regression model and the self-attention mechanism is characterized by comprising the following steps of:

2. The monocular image estimating method based on the multi-classification regression model and the self-attention mechanism according to claim 1, in step S2, the context information includes context information at a pixel level and context information at an image level.

3. The method for monocular image estimation based on multi-classification regression model and auto-attention mechanism as claimed in claim 1, in step S2, the pooling by global average is to convert the input high-dimensional feature map into one-dimensional feature vector, copy it into a vector for each input feature, associate it with each other by a convolution layer, and output the feature map after global averaging, wherein the feature map output needs to pass through one convolution layer and two deconvolution layers.

4. The monocular image estimating method based on the multi-classification regression model and the self-attention mechanism as claimed in claim 3, wherein the output feature map is a feature map in which the detail features and the global features of the input image are aggregated.

5. The method of claim 3, wherein the output feature map passes through convolutional layers with a kernel size of 1x1 and a step size of 1.

6. The method for monocular image estimation based on multi-classification regression model and auto-attention mechanism as claimed in claim 1, wherein the step S2, the obtaining of the pixel level context information is based on the auto-attention mechanism, the computing key of the auto-attention mechanism is to find the key value pairs in the image and obtain the weights of the key value pairs, the final output is a weighted sum in the high-dimensional feature space, and the weighted sum is inputted to the final classifier.